This is about a serious error on your side. I will do my best to explain it in detail and show how it can be reproduced.
Description of the error:
Millions of trades are missing from the Binance futures data offered at:
https://data.binance.vision/?prefix=data/futures/um/monthly/trades/BTCUSDT/
Not every month is affected, but some months are missing literally millions of trades.
I contacted support about this issue, but they did not understand it properly. They even told me that the data does not follow any particular order and that the trade IDs are sometimes not sequential.
Here I provide irrefutable evidence and an easy way to visualize the problem, using the data of one of the affected months.
Replication of the issue:
Since I was told that the trade data is not always in chronological order, my code first sorts the trades by timestamp.
To make the issue easy to check, we are going to resample the 2023-01 trade data to hourly bars (you can choose any timeframe you prefer).
We are going to use Python 3 and pandas.
Please note that if you have less than 10 GB of RAM available, you might not be able to complete the resampling.
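If memory is tight, one possible workaround is to count trades per hour in chunks instead of loading the whole month at once. The following is only a minimal sketch: the function name `hourly_trade_counts` and the default `chunksize` are my own choices for illustration, and it assumes the column layout id,price,qty,quote_qty,time,is_buyer_maker. Because counts from separate chunks are simply added together, the result does not depend on the order of the rows in the file.

```python
import pandas as pd


def hourly_trade_counts(data_path, chunksize=1_000_000):
    """Count trades per hour while reading the CSV in chunks (hypothetical helper)."""
    # Detect an optional header row, as the trade files may or may not have one.
    with open(data_path, "r") as f:
        has_header = f.readline().split(",")[0] == "id"

    counts = None
    # Only column 4 (the millisecond timestamp) is needed for counting.
    for chunk in pd.read_csv(data_path, header=None, usecols=[4],
                             skiprows=1 if has_header else 0,
                             chunksize=chunksize):
        hours = pd.to_datetime(chunk[4], unit="ms").dt.floor("h")
        part = hours.value_counts()
        counts = part if counts is None else counts.add(part, fill_value=0)

    counts = counts.sort_index().astype("int64")
    # Reindex over the full hourly range so hours without any trade show up as 0.
    full_range = pd.date_range(counts.index.min(), counts.index.max(), freq="h")
    return counts.reindex(full_range, fill_value=0)
```

Calling `hourly_trade_counts("BTCUSDT-trades-2023-01.csv")` returns a Series indexed by hour. Hours with zero or unusually few trades are the candidates for missing data.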
Steps:
1- Download the data from this link:
https://data.binance.vision/data/futures/um/monthly/trades/BTCUSDT/BTCUSDT-trades-2023-01.zip
2- Verify the checksum to confirm the file's integrity.
3- Decompress the data.
4- Use the provided code to resample it.
5- Open the resulting output_timeframe.csv file
6- Observe the large gaps. (You can also compare the result with data from the API.)
The gaps are visible from line 549 of this file onwards. To see them in more detail, you can resample to seconds.
7- Compensate me for discovering this issue and taking the time to report it.
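For step 2, the checksum can be verified with a few lines of Python. This is a minimal sketch that assumes the `.CHECKSUM` file published next to the archive holds a SHA-256 digest in the usual `sha256sum` layout (hex digest, whitespace, file name). The function names are my own and are not part of any Binance tool.

```python
import hashlib


def sha256_of_file(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in blocks to save memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_checksum_file(zip_path, checksum_path):
    """Compare the archive against the digest stored in the checksum file."""
    # Assumed format: "<hex digest>  <file name>" on the first line.
    with open(checksum_path, "r") as f:
        expected = f.read().split()[0].lower()
    return sha256_of_file(zip_path) == expected
```

If `matches_checksum_file("BTCUSDT-trades-2023-01.zip", "BTCUSDT-trades-2023-01.zip.CHECKSUM")` returns `True`, the download is intact, so any gaps found later cannot be blamed on a corrupted download.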
Code:
import pandas as pd
from datetime import datetime


def t_to_dt(milliseconds_timestamp):
    # Helper: convert a millisecond epoch timestamp to a naive UTC datetime
    seconds_timestamp = milliseconds_timestamp / 1000
    datetime_object = datetime.utcfromtimestamp(seconds_timestamp)
    return datetime_object


def tick_to_kline(interval, data_path):
    # Check if the first row contains a header, and if yes, skip it.
    with open(data_path, 'r') as file:
        first_line = file.readline()
    skip_rows = 1 if first_line.split(',')[0] == "id" else 0

    # Load data
    # Structure: id,price,qty,quote_qty,time,is_buyer_maker
    columns_to_load = [1, 2, 3, 4]
    df = pd.read_csv(data_path, header=None, usecols=columns_to_load, skiprows=skip_rows)
    df[4] = pd.to_datetime(df[4], unit='ms')

    # Rename columns
    df.rename(columns={1: 'price', 2: 'volume', 3: 'quote_volume', 4: 'timestamp'}, inplace=True)

    # Set the timestamp column as the DataFrame index
    df.set_index('timestamp', inplace=True)

    # Sort it chronologically; a stable sort keeps the file order of trades
    # that share the same millisecond, so 'first' and 'last' stay meaningful
    df.sort_index(ascending=True, inplace=True, kind='stable')
    print(df.head())

    # Resample the data
    ohlc_dict = {
        'price': ['first', 'max', 'min', 'last'],
        'volume': 'sum',
        'quote_volume': 'sum'
    }
    candlestick_data = df.resample(interval).agg(ohlc_dict)

    # Replace the datetime index with integer timestamps
    candlestick_data.index = (candlestick_data.index.astype(int) / 10**6).astype(int) * 1000
    print(candlestick_data.info())
    candlestick_data.to_csv(f'output_{interval}.csv', index=True, header=False)


# Interval keys: 1D (day), 1H (hour), 1T (minute), 1S (second)
# pandas 2.2+ prefers the lowercase aliases 1h, 1min, 1s
# Set the output interval
interval = '1H'
# Specify the location of the trades file
data_path = "BTCUSDT-trades-2023-01.csv"
tick_to_kline(interval, data_path)

![binance_issue|510x500](upload://h2ckyuum7xInqH1SVSmbQX9iQEk.png)
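As a complementary check that does not depend on resampling, the trade IDs themselves can be inspected. The sketch below is only an illustration under one loudly stated assumption: that a complete file contains consecutive IDs, so any jump larger than 1 marks missing trades. Support disputes this, so treat the output as a list of candidate gaps to compare against the API. The function name `find_id_gaps` and the output column names are my own.

```python
import pandas as pd


def find_id_gaps(data_path):
    """List jumps in the trade id sequence (hypothetical helper)."""
    # Detect an optional header row.
    with open(data_path, "r") as f:
        has_header = f.readline().split(",")[0] == "id"

    # Assumed layout: id,price,qty,quote_qty,time,is_buyer_maker
    df = pd.read_csv(data_path, header=None, usecols=[0, 4],
                     skiprows=1 if has_header else 0)
    df.columns = ["id", "time"]
    df = df.sort_values("id").reset_index(drop=True)

    # Pair every trade with its predecessor in id order.
    before = df.iloc[:-1].reset_index(drop=True)
    after = df.iloc[1:].reset_index(drop=True)
    missing = after["id"] - before["id"] - 1
    mask = missing > 0

    return pd.DataFrame({
        "last_id_before": before.loc[mask, "id"],
        "first_id_after": after.loc[mask, "id"],
        "missing_trades": missing[mask],
        "gap_start": pd.to_datetime(before.loc[mask, "time"], unit="ms"),
        "gap_end": pd.to_datetime(after.loc[mask, "time"], unit="ms"),
    }).reset_index(drop=True)
```

Running `find_id_gaps("BTCUSDT-trades-2023-01.csv")` gives one row per jump. If the `gap_start` and `gap_end` times line up with the empty hours in the resampled file, that would support the claim that the trades are absent from the file, not merely out of order.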