Questions about tdigest* KQL functions

Question

Kyle Burney 20

In the documentation for the tdigest function, it shows sample outputs in the following format but doesn't explain what each of the nested arrays represent.

[[n],[a,b,c],[d,e,f]]

It appears the second array is every unique sample that was in the input and the third array is the number of occurrences for each of them. Is that correct and what does the first array represent?

Also, since the digest produced contains every unique sample and doesn't do any centroid compression or calculate/return trimmed-means, this doesn't appear to be a true t-digest and increases the query time. Are there plans to optimize the implementation?

Accepted answer

Answer 1

Hi Kyle Burney,

Thanks for reaching out to Microsoft Q&A.

The output of the tdigest function in Azure Data Explorer has the following structure:

[[n],[a,b,c],[d,e,f]]

Each nested array represents a different component of the t-digest aggregation result:

First Array ([n]): This array contains a single value, which represents the total number of samples processed. In the context of the t-digest function, this indicates how many data points were aggregated.
Second Array ([a,b,c]): This array lists every unique sample that was present in the input data. It reflects the distinct values that contributed to the aggregation.
Third Array ([d,e,f]): This array shows the count of occurrences for each of the unique samples listed in the second array. Each element corresponds to the frequency of the respective sample in the input data.
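
One way to test this reading of the format is to export a digest as JSON and check whether the first array really equals the sum of the third. The sketch below is only an illustration under that assumption: the helper names `split_digest` and `head_matches_total` are hypothetical, the sample strings are made up, and nothing here reflects the documented internals of tdigest.

```python
import json
from typing import List, Tuple


def split_digest(raw: str) -> Tuple[float, List[float], List[float]]:
    """Split a digest shaped like [[n],[a,b,c],[d,e,f]] into its three parts.

    Assumes the layout described above; the real internal format may differ.
    """
    head, values, counts = json.loads(raw)
    if len(values) != len(counts):
        raise ValueError("second and third arrays differ in length")
    return head[0], values, counts


def head_matches_total(raw: str) -> bool:
    """Return True if the first array equals the sum of the third array."""
    head, _, counts = split_digest(raw)
    return head == sum(counts)


# Made-up digests, not real tdigest output.
consistent = "[[6],[1.0,2.0,3.0],[1,2,3]]"
inconsistent = "[[12],[1.0,2.0,3.0],[1,2,3]]"
print(head_matches_total(consistent))
print(head_matches_total(inconsistent))
```

If the check fails on real output, the first array is probably not a sample count, and the interpretation above should be treated with caution.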

Your concern that the tdigest function does not perform centroid compression or return trimmed means, and is therefore not a true t-digest, is valid. A true t-digest algorithm maintains a set of centroids that approximate the distribution of the data, allowing more efficient quantile estimation with bounded error. The current function in Azure Data Explorer seems to preserve all unique samples and their counts, which could indeed lead to longer query times and larger data structures, especially with large datasets.
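
For comparison, here is a minimal sketch of what centroid compression means. It is a simplified greedy merge in the spirit of t-digest, not the Azure Data Explorer implementation and not the reference algorithm. The function name `compress`, the parameter `delta`, and the size bound `4 * n * q * (1 - q) / delta` are assumptions chosen for illustration.

```python
from typing import List, Tuple

Centroid = Tuple[float, int]  # (mean, weight)


def compress(samples: List[float], delta: float = 20.0) -> List[Centroid]:
    """Greedily merge sorted samples into (mean, weight) centroids.

    A centroid sitting at quantile q may hold at most about
    4 * n * q * (1 - q) / delta samples (never fewer than 1), so
    centroids stay small near the tails and grow toward the median.
    """
    data = sorted(samples)
    n = len(data)
    centroids: List[Centroid] = []
    seen = 0  # samples already placed in closed centroids
    mean, weight = 0.0, 0
    for x in data:
        if weight > 0:
            q = (seen + weight / 2) / n  # approximate quantile of the open centroid
            limit = max(1.0, 4 * n * q * (1 - q) / delta)
            if weight + 1 > limit:
                # The open centroid is full: close it and start a new one.
                centroids.append((mean, weight))
                seen += weight
                mean, weight = 0.0, 0
        weight += 1
        mean += (x - mean) / weight  # running mean of the open centroid
    if weight:
        centroids.append((mean, weight))
    return centroids


digest = compress([float(i) for i in range(10_000)])
print(len(digest), "centroids for 10000 unique samples")
```

With compression of this kind, the digest size depends on `delta` rather than on the number of unique input values, which is the property the question says is missing.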

The docs give no specific details about future optimizations for this implementation, but feedback channels are available for users to raise needs and suggest improvements. I would suggest keeping an eye on the Azure Data Explorer release notes for updates.

https://learn.microsoft.com/ro-ro/azure/data-explorer/kusto/query/tdigest-aggregation-function

Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.

Kyle Burney 20 Reputation points

2024-07-20T05:29:37.94+00:00

If the first array represents the total number of samples processed, there appears to be a problem with the function. It is currently returning [[12],...] for any input, no matter how many samples it has. Earlier it was doing the same thing but only returning [[7],...].

Vinodh247 34,666 Reputation points MVP Volunteer Moderator

2024-07-20T09:37:11.9233333+00:00

I agree! This should not be the expected behavior for the tdigest function. There seems to be an issue with the function in Azure Data Explorer. I would suggest you submit your feedback to Microsoft under the Azure Data Explorer team tag.
