Hi Kyle Burney,
Thanks for reaching out to Microsoft Q&A.
The output format of the tdigest function in Azure Data Explorer is structured as below: [[n],[a,b,c],[d,e,f]]. Each nested array represents a different component of the t-digest aggregation result:
- First Array ([n]): This array contains a single value, which represents the total number of samples processed. In the context of the t-digest function, this indicates how many data points were aggregated.
- Second Array ([a,b,c]): This array lists every unique sample that was present in the input data. It reflects the distinct values that contributed to the aggregation.
- Third Array ([d,e,f]): This array shows the count of occurrences for each of the unique samples listed in the second array. Each element corresponds to the frequency of the respective sample in the input data.
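To make the three-array layout concrete, here is a minimal Python sketch that splits such a structure into its parts and cross-checks the counts. It assumes the layout is exactly as described in this reply, which is not confirmed by the official documentation. The function name summarize_tdigest_output and the sample string are hypothetical and only for illustration.

```python
import json


def summarize_tdigest_output(serialized):
    """Split a [[n],[values...],[counts...]] structure into its parts.

    Assumption: the layout is the one described in this reply
    (sample count, unique values, per-value counts).
    """
    data = json.loads(serialized) if isinstance(serialized, str) else serialized
    first, values, counts = data
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    return {
        "sample_count": first[0],            # the single value in the first array
        "pairs": list(zip(values, counts)),  # (unique sample, occurrences)
        "total_count": sum(counts),          # should match sample_count if the layout holds
    }


# Hypothetical data, not real tdigest output.
example = summarize_tdigest_output("[[5],[1.0,2.0,3.0],[2,2,1]]")
print(example)
```

If the description above holds, sample_count and total_count should agree for any real output; a mismatch would suggest the first array means something else.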
Your concern that the tdigest function does not perform centroid compression or return trimmed means, and is therefore not a true t-digest, is valid. A true t-digest algorithm maintains a set of centroids that approximate the distribution of the data, allowing more efficient quantile estimation with bounded error. The current function in Azure Data Explorer seems to focus on preserving all unique samples and their counts, which could indeed lead to longer query times and larger data structures, especially with large datasets.
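For comparison, centroid compression can be sketched roughly as follows in Python. This is a simplified illustration, not the Azure Data Explorer implementation and not the full t-digest algorithm. It assumes the size bound 4·N·q·(1−q)/compression from the original t-digest formulation, and the function name compress and the compression parameter are hypothetical.

```python
def compress(pairs, compression=20):
    """Merge sorted (value, count) pairs into weighted centroids.

    Simplified sketch: a centroid may grow while its weight stays within
    4 * N * q * (1 - q) / compression, where q is its approximate quantile.
    Centroids near the tails therefore stay small, those near the median grow.
    """
    pairs = sorted(pairs)
    total = sum(count for _, count in pairs)
    centroids = []  # each entry is [mean, weight]
    seen = 0        # weight held by all centroids before the last one
    for value, count in pairs:
        if centroids:
            mean, weight = centroids[-1]
            # Quantile of the centre of the centroid if this sample were merged in.
            q = (seen + (weight + count) / 2) / total
            limit = max(1.0, 4 * total * q * (1 - q) / compression)
            if weight + count <= limit:
                new_weight = weight + count
                # Weighted running mean of the merged centroid.
                new_mean = mean + (value - mean) * count / new_weight
                centroids[-1] = [new_mean, new_weight]
                continue
            seen += weight
        centroids.append([float(value), count])
    return centroids


# 1000 distinct samples, each seen once (hypothetical input).
centroids = compress([(v, 1) for v in range(1000)])
print(len(centroids), "centroids instead of 1000 unique samples")
```

The point of the sketch is the contrast with the output described above: a compressed digest stores far fewer entries than there are unique samples, at the cost of approximate rather than exact quantiles.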
The documentation gives no specific details about future optimizations for this implementation, but feedback mechanisms are in place for users to express their needs and suggest improvements. I would suggest keeping an eye on the Azure Data Explorer release notes for updates.
https://learn.microsoft.com/ro-ro/azure/data-explorer/kusto/query/tdigest-aggregation-function
Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.