Monthly trade data of BTCUSDT (USD-M Futures) at data.binance.vision is corrupted

Trader_alpha · January 17, 2024, 4:30pm

This report describes a serious error on your side. I will do my best to explain it in detail and show how it can be reproduced.

Description of the error:

Millions of trades are missing from the Binance futures data offered at:

https://data.binance.vision/?prefix=data/futures/um/monthly/trades/BTCUSDT/

Not every month is corrupted, but some months are missing literally millions of trades.

I contacted support about this issue, but they were unable to understand it properly. They even told me that the data does not follow any particular order and that the trade IDs are sometimes not sequential.

I will provide here irrefutable evidence and an easy way to visualize the problem using the data of one of the corrupted months.

Replication of the issue:

Since I was told that the trade data can sometimes be out of chronological order, my code sorts the trades by timestamp first, so that they are processed in chronological order.

To make the issue easy to check, we are going to resample the trade data for 2023-01 to hourly bars (you can choose your preferred timeframe).

We are going to use Python 3 and pandas.

Please note that if you have less than 10 GB of RAM available, you might not be able to complete the resampling.
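If memory is the limiting factor, one possible workaround is to read the file in chunks and only count trades per hour, which is already enough to expose empty hours. The following is a minimal sketch, not part of my original script: the function name `hourly_trade_counts` and the chunk size are my own choices, and the column layout is assumed to be the one listed in the structure comment of the code further down (id, price, qty, quote_qty, time, is_buyer_maker).

```python
import pandas as pd

MS_PER_HOUR = 3_600_000


def hourly_trade_counts(csv_path, chunksize=1_000_000):
    """Count trades per UTC hour while reading the CSV in chunks."""
    names = ["id", "price", "qty", "quote_qty", "time", "is_buyer_maker"]
    parts = []
    reader = pd.read_csv(csv_path, header=None, names=names,
                         usecols=["time"], chunksize=chunksize)
    for chunk in reader:
        # A header row (if present) becomes NaN here and is dropped.
        ms = pd.to_numeric(chunk["time"], errors="coerce").dropna().astype("int64")
        # Floor each timestamp to the start of its hour using integer arithmetic.
        hour_start = pd.to_datetime(ms // MS_PER_HOUR * MS_PER_HOUR, unit="ms")
        parts.append(hour_start.value_counts())
    # Merge the per-chunk counts; groupby returns the hours in sorted order.
    return pd.concat(parts).groupby(level=0).sum()


# Example (path is the decompressed monthly file from the steps below):
# counts = hourly_trade_counts("BTCUSDT-trades-2023-01.csv")
# print(counts[counts < 100])  # suspiciously quiet hours
```

Hours with no trades at all do not appear in the result, so a missing hour in the index is itself a sign of a gap.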

Steps:
1- Download the data from this link:
https://data.binance.vision/data/futures/um/monthly/trades/BTCUSDT/BTCUSDT-trades-2023-01.zip
2- Verify the checksum to confirm the file's integrity.
3- Decompress the data.
4- Use the code provided below to resample it.
5- Open the resulting output file (the code names it output_<interval>.csv, so output_1H.csv for hourly bars).
6- Observe the big gaps (you can also compare them with data from the API).
From line 549 onward you can see the gaps in this file. To see them in more detail, you can resample to seconds.

7- Compensate me for discovering this issue and taking my time to report it.
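For step 2, the integrity check can be sketched as below. This assumes that the `.CHECKSUM` file published next to the zip contains a SHA-256 hex digest followed by the file name (the usual `sha256sum` layout); please verify that against the file you actually download. The function names are my own.

```python
import hashlib


def sha256_of_file(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_checksum(zip_path, checksum_path):
    """Compare the zip's digest with the first token of the checksum file.

    Assumption: the checksum file looks like '<hex digest>  <file name>'.
    """
    with open(checksum_path, "r") as fh:
        expected = fh.read().split()[0].lower()
    return sha256_of_file(zip_path) == expected


# Example:
# matches_checksum("BTCUSDT-trades-2023-01.zip",
#                  "BTCUSDT-trades-2023-01.zip.CHECKSUM")
```

A matching checksum only shows that the download is intact. It says nothing about whether trades are missing inside the file, which is the actual problem here.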

Code:

import pandas as pd
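from datetime import datetime, timezone


def t_to_dt(milliseconds_timestamp):
    # Convert an epoch timestamp in milliseconds to a timezone-aware UTC datetime.
    # (Helper for manual inspection; tick_to_kline below does not call it.)
    seconds_timestamp = milliseconds_timestamp / 1000
    return datetime.fromtimestamp(seconds_timestamp, tz=timezone.utc)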

def tick_to_kline(interval, data_path):

    # Check whether the first row is a header; if so, skip it.
    with open(data_path, 'r') as file:
        first_field = file.readline().split(',')[0]
    skip_rows = 1 if first_field == "id" else 0

    # Load only price, qty, quote_qty and time (columns 1 to 4).
    columns_to_load = [1, 2, 3, 4]
    df = pd.read_csv(data_path, header=None, usecols=columns_to_load, skiprows=skip_rows)

    # Convert the millisecond epoch timestamps to datetimes.
    df[4] = pd.to_datetime(df[4], unit='ms')

    #Rename columns
    df.rename(columns={1:'price',2:'volume',3:'quote_volume',4:'timestamp'}, inplace=True)

    # Set the timestamp column as the DataFrame index
    df.set_index('timestamp', inplace=True)
    # Sort it
    df.sort_index(ascending=True,inplace=True)

    print(df.head())

    # Resample the data 

    ohlc_dict = {
        'price': ['first', 'max', 'min', 'last'],
        'volume': 'sum',
        'quote_volume':'sum'
    }

    candlestick_data = df.resample(interval).agg(ohlc_dict)

    # Express the bar open times as epoch milliseconds.
    candlestick_data.index = (candlestick_data.index - pd.Timestamp(0)) // pd.Timedelta(milliseconds=1)

    candlestick_data.info()

    candlestick_data.to_csv(f'output_{interval}.csv', index=True, header=False)


# Structure: id,price,qty,quote_qty,time,is_buyer_maker
# Interval keys: 1D (day), 1H (hour), 1T (minute), 1S (second)
# Note: pandas 2.2 and later deprecate the 'H', 'T' and 'S' aliases in favour of 'h', 'min' and 's'.

# Set the output interval
interval = '1H'
# Specify the location of the trades file
data_path = "BTCUSDT-trades-2023-01.csv"

tick_to_kline(interval, data_path)

![binance_issue|510x500](upload://h2ckyuum7xInqH1SVSmbQX9iQEk.png)
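Instead of scrolling through the output file by eye, the empty bars can be listed programmatically. This is a rough sketch that assumes the output has the seven columns the script above writes, in that order and without a header (bar time, first/max/min/last price, volume, quote volume). The column names and the function name `empty_buckets` are mine.

```python
import pandas as pd


def empty_buckets(kline_csv):
    """Return the bar times of all bars that contain no trades."""
    cols = ["open_time", "open", "high", "low", "close", "volume", "quote_volume"]
    k = pd.read_csv(kline_csv, header=None, names=cols)
    # A bar without trades has no prices (NaN) and a summed volume of 0.
    is_empty = k["open"].isna() | (k["volume"] == 0)
    return k.loc[is_empty, "open_time"].tolist()


# Example:
# for t in empty_buckets("output_1H.csv"):
#     print(t)
```

For a contract as liquid as BTCUSDT, an hour with zero trades should essentially never happen, so every entry in this list is a candidate gap.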

dino · January 18, 2024, 4:13am

Thanks for the feedback. Could you please point out the missing trade records in the files, for example "trade ID xxx is not found in the CSV file"?

thanks
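A list of missing trade IDs could be produced along these lines. This is only a sketch: it assumes that trade IDs within one symbol are expected to increase by exactly 1, which is the very assumption under discussion in this thread, and the function name `missing_id_ranges` is hypothetical.

```python
import pandas as pd


def missing_id_ranges(ids):
    """Return (first_missing, last_missing) for every hole in a set of trade IDs."""
    s = (pd.Series(list(ids), dtype="int64")
           .drop_duplicates()
           .sort_values()
           .reset_index(drop=True))
    # A step larger than 1 between neighbouring IDs marks a hole.
    jump = s.diff() > 1
    starts = s.shift(1)[jump].astype("int64") + 1
    ends = s[jump] - 1
    return list(zip(starts.tolist(), ends.tolist()))


# Example on the decompressed monthly file (column 0 is the trade id):
# raw = pd.read_csv("BTCUSDT-trades-2023-01.csv", header=None, usecols=[0])[0]
# ids = pd.to_numeric(raw, errors="coerce").dropna().astype("int64")
# print(missing_id_ranges(ids)[:20])
```

This only finds holes inside one file. Trades missing at the very start or end of a month would have to be checked against the neighbouring months' files.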

Khun · January 20, 2024, 5:38pm

Try aggTrades instead of trades and see if it works.
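If you go the aggTrades route, the same continuity check is possible because each aggregated row covers a range of trade IDs. The sketch below assumes the aggTrades CSV has the columns agg_trade_id, price, quantity, first_trade_id, last_trade_id, transact_time, is_buyer_maker; please check the actual file before relying on that. The function name is hypothetical.

```python
import pandas as pd


def agg_trade_gaps(agg):
    """Return (expected_first_id, actual_first_id) wherever consecutive
    aggregated trades do not cover a contiguous range of trade IDs."""
    agg = agg.sort_values("first_trade_id").reset_index(drop=True)
    # Each row should start right after the previous row's last trade ID.
    expected = agg["last_trade_id"].shift(1) + 1
    # The first row has no predecessor, so its NaN expectation is ignored.
    bad = (agg["first_trade_id"] != expected) & expected.notna()
    return list(zip(expected[bad].astype("int64").tolist(),
                    agg.loc[bad, "first_trade_id"].tolist()))


# Example (file name and column names are assumptions):
# cols = ["agg_trade_id", "price", "quantity", "first_trade_id",
#         "last_trade_id", "transact_time", "is_buyer_maker"]
# agg = pd.read_csv("BTCUSDT-aggTrades-2023-01.csv", header=None, names=cols,
#                   skiprows=1)  # skiprows=1 only if the file has a header row
# print(agg_trade_gaps(agg)[:20])
```

If this reports no gaps for a month where the trades file shows holes, that would suggest the problem is in how the trades files were exported rather than in the underlying trade history.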