I've encountered an inconsistency in the data returned by the API endpoints for K-lines ('/api/v3/klines') and aggregated trades ('/api/v3/aggTrades'). Specifically, the trade counts and volumes they report do not match. For aggregated trades, I make multiple requests, up to the final trade of the minute, to ensure alignment.
For instance, a request for a one-minute K-line typically covers the period from 13:00:00.000 to 13:00:59.999. I make sure my aggregated-trades function scans this entire interval, issuing multiple requests as necessary. However, I've noticed that, occasionally and across different trading pairs, the trade counts do not match.
Has anyone else experienced this issue? What could be the underlying cause?
P.S. I have seen some similar topics but found no answers/solutions.
This issue was also discussed in the context of using WebSockets for K-line data: users noticed inconsistencies that were partly attributed to which streams they subscribed to and how those streams handle real-time data updates.
In short, after reading those conversations as well, it becomes quite clear that aggregated trades are not entirely reliable and have issues that need to be investigated.
In my own tests, the discrepancies between K-lines and aggregated trades (both in number of trades and in volume) are statistically significant, which makes the aggregates unreliable.
If one wants to analyse the trade distribution, it is better to rely on individual trades ('/api/v3/historicalTrades' and the WebSocket streams).
It’s a real pity because aggregated trades retain a good amount of information with a much smaller volume of data.
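For reference, here is a minimal sketch of how one could page through '/api/v3/historicalTrades' by trade ID. The API key value is a placeholder (the endpoint requires the X-MBX-APIKEY header), and the helper name and starting ID are just examples, not code from my actual setup:

import requests

BINANCE_API = 'https://api.binance.com'
HEADERS = {'X-MBX-APIKEY': 'YOUR_API_KEY'}  # placeholder: historicalTrades requires an API key

def fetch_historical_trades(symbol, from_id, limit=1000):
    # One page of raw (non-aggregated) trades, starting at a given trade ID.
    response = requests.get(
        f'{BINANCE_API}/api/v3/historicalTrades',
        params={'symbol': symbol, 'fromId': from_id, 'limit': limit},
        headers=HEADERS,
    )
    response.raise_for_status()
    return response.json()

# Example usage: page forward from some starting trade ID (placeholder value).
page = fetch_historical_trades('BTCUSDT', from_id=1_000_000)
next_from_id = page[-1]['id'] + 1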
This is the output of my code; I am currently testing this range on BTCUSDT.
All trades are within the specified time range. 2024-04-09 13:04:00, 2024-04-09 16:03:59.999000
k trades: 478732.0, k vol 10851.719830000013
agg trades: 473406, agg vol 10733.94963999999
k trades and k vol are obtained through the K-line endpoint; agg trades and agg vol through the aggTrades endpoint.
This range has a high volume of trades, which is where the discrepancies happen more frequently. I am running other tests.
@Luca_D thanks for the example! That really helps.
How do you iterate through aggtrades? You should use startTime and endTime only to establish the aggtrade ID range and use fromId for pagination. Otherwise, multiple aggtrades at the same timestamp might be missing.
Here’s my script in Python that iterates over the range you mentioned and shows the same results for klines and aggtrades:
#!/usr/bin/env python3
import requests
import time
from decimal import Decimal
BINANCE_API = 'https://api.binance.com'
SYMBOL = 'BTCUSDT'
START_TIME = 1712667840000 # 2024-04-09 13:04:00.000
END_TIME = 1712678639999 # 2024-04-09 16:03:59.999
# Get klines, we can fetch them in one go
response = requests.get(f'{BINANCE_API}/api/v3/klines?symbol={SYMBOL}&interval=1m&startTime={START_TIME}&endTime={END_TIME}')
klines = response.json()
assert len(klines) == 180
assert klines[0][0] == 1712667840000
assert klines[-1][6] == 1712678639999
kline_num_trades = sum([k[8] for k in klines])
kline_volume = sum([Decimal(k[5]) for k in klines])
print(f'klines: {kline_num_trades} trades, {kline_volume} volume')
# Get aggtrades, in batches
response = requests.get(f'{BINANCE_API}/api/v3/aggTrades?symbol={SYMBOL}&startTime={START_TIME}&endTime={END_TIME}')
agg_trades = response.json()
while agg_trades[-1]['T'] < END_TIME + 1:
    next_id = agg_trades[-1]['a'] + 1
    response = requests.get(f'{BINANCE_API}/api/v3/aggTrades?symbol={SYMBOL}&fromId={next_id}&limit=1000')
    agg_trades += response.json()
    time.sleep(0.1)  # be gentle to the API
# Cut off the overshoot by time at the end
agg_trades = [t for t in agg_trades if t['T'] <= END_TIME]
assert len(agg_trades) == 381806
assert agg_trades[0]['a'] == 2959684856
assert agg_trades[-1]['a'] == 2960066661
aggtrade_num_trades = sum([t['l'] - t['f'] + 1 for t in agg_trades])
aggtrade_volume = sum([Decimal(t['q']) for t in agg_trades])
print(f'aggtrades: {aggtrade_num_trades} trades, {aggtrade_volume} volume')
Yes, it’s most likely this part that had an error.
If the API response ends with a run of multiple aggtrades at the same millisecond (think of somebody placing a big order that traded at multiple price levels), some of them might get cut off by the limit. Whether that happens depends on the exact trades, which explains why it appears random.
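To make that concrete, here is a toy illustration (made-up IDs, timestamps, and a deliberately tiny limit, not real API data) of how resuming by startTime after a full batch silently drops a trade that shares the last millisecond:

LIMIT = 3
all_trades = [
    {'a': 100, 'T': 1000},
    {'a': 101, 'T': 1005},
    {'a': 102, 'T': 1005},
    {'a': 103, 'T': 1005},  # same millisecond as the two above
    {'a': 104, 'T': 1010},
]

def fetch(start_time, limit=LIMIT):
    # Simulates the API: trades at or after start_time, truncated to the limit.
    return [t for t in all_trades if t['T'] >= start_time][:limit]

batch = fetch(1000)                      # ids 100, 101, 102 (limit hit mid-millisecond)
next_batch = fetch(batch[-1]['T'] + 1)   # startTime = 1006 -> only id 104
# Aggtrade 103 at T=1005 is never returned; pagination by fromId
# (or an overlapping startTime plus de-duplication, see below) avoids this.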
It is still possible to use a loop with just startTime and endTime like you do, but the trick is to query the trades with some overlap:
params['startTime'] = trades[-1]['T'] # note: no adjustment by "+ 1"
and then skip the trades at the front that you have already seen, based on the aggtrade ID:
last_aggtrade_id = aggregated_trades[-1]['a']
trades = [t for t in trades if t['a'] > last_aggtrade_id]
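Putting those two pieces together, a rough sketch of such a loop could look like this (it reuses the constants from my script above; the exact variable names and the 1000 limit are assumptions, not your actual code):

aggregated_trades = []
params = {'symbol': SYMBOL, 'startTime': START_TIME, 'endTime': END_TIME, 'limit': 1000}
while True:
    response = requests.get(f'{BINANCE_API}/api/v3/aggTrades', params=params)
    trades = response.json()
    if aggregated_trades:
        # Drop the overlap: aggtrades already collected in an earlier batch.
        last_aggtrade_id = aggregated_trades[-1]['a']
        trades = [t for t in trades if t['a'] > last_aggtrade_id]
    if not trades:
        break  # nothing new left in the window
    aggregated_trades += trades
    # Overlap on purpose: restart from the last seen timestamp, without "+ 1".
    params['startTime'] = aggregated_trades[-1]['T']
    time.sleep(0.1)  # be gentle to the API
# Caveat: this assumes no single millisecond contains more than `limit` aggtrades;
# pagination by fromId does not need that assumption.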