Questions about tdigest* KQL functions

Kyle Burney 20 Reputation points
2024-07-19T18:04:48.3133333+00:00

In the documentation for the tdigest function, sample outputs are shown in the following format, but there is no explanation of what each nested array represents.

[[n],[a,b,c],[d,e,f]]

It appears the second array is every unique sample that was in the input and the third array is the number of occurrences for each of them. Is that correct and what does the first array represent?

Also, since the digest produced contains every unique sample and doesn't do any centroid compression or calculate/return trimmed-means, this doesn't appear to be a true t-digest and increases the query time. Are there plans to optimize the implementation?

Azure Data Explorer

Accepted answer
  1. Vinodh247 22,696 Reputation points
    2024-07-20T01:18:23.39+00:00

    Hi Kyle Burney,

    Thanks for reaching out to Microsoft Q&A.

    The output of the tdigest function in Azure Data Explorer has the following structure:

    [[n],[a,b,c],[d,e,f]]

    Each nested array represents a different component of the t-digest aggregation result:

    1. First Array ([n]): This array contains a single value, which represents the total number of samples processed. In the context of the t-digest function, this indicates how many data points were aggregated.
    2. Second Array ([a,b,c]): This array lists every unique sample that was present in the input data. It reflects the distinct values that contributed to the aggregation.
    3. Third Array ([d,e,f]): This array shows the count of occurrences for each of the unique samples listed in the second array. Each element corresponds to the frequency of the respective sample in the input data.
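
    As a rough illustration of the layout described above, here is a minimal Python sketch that splits a serialized digest into its three arrays and computes a nearest-rank percentile from the value/count pairs. The helper names (parse_digest, nearest_rank_percentile) and the sample string are hypothetical, and the sketch assumes the interpretation given in this answer; it is not how Azure Data Explorer evaluates percentiles internally.

```python
import json
import math


def parse_digest(serialized):
    """Split a '[[n],[values...],[counts...]]' string into its three parts.

    Assumes the layout described in this answer: a one-element header,
    then parallel arrays of values and occurrence counts.
    """
    header, values, counts = json.loads(serialized)
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    return header[0], values, counts


def nearest_rank_percentile(values, counts, p):
    """Nearest-rank percentile (0 < p <= 100) over value/count pairs."""
    pairs = sorted(zip(values, counts))
    total = sum(count for _, count in pairs)
    target = math.ceil(p * total / 100)  # rank of the requested sample
    running = 0
    for value, count in pairs:
        running += count
        if running >= target:
            return value
    return pairs[-1][0]


# Hypothetical digest: values 10, 20, 30 seen 2, 1 and 2 times.
n, values, counts = parse_digest("[[5],[10,20,30],[2,1,2]]")
print(n, values, counts)
print(nearest_rank_percentile(values, counts, 50))  # 20
print(nearest_rank_percentile(values, counts, 90))  # 30
```

    Because every unique value is kept, the percentile here is exact, but the arrays grow with the number of distinct inputs.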

    Your concern that the tdigest function does not perform centroid compression or return trimmed means, and is therefore not a true t-digest, is valid. A true t-digest maintains a set of centroids that approximate the distribution of the data, which allows efficient quantile estimation with bounded error. The current function in Azure Data Explorer seems to preserve all unique samples and their counts, which could indeed lead to longer query times and larger data structures, especially with large datasets.
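
    To make the idea of centroid compression concrete, here is a simplified Python sketch that greedily merges adjacent (mean, weight) pairs. The size limit 4 * total * q * (1 - q) / compression is an assumption chosen in the spirit of t-digest, so that centroids stay small near the tails and grow toward the median; the function name compress and the compression parameter are hypothetical, and this is neither the exact algorithm of any t-digest library nor what Azure Data Explorer does.

```python
def compress(centroids, compression=20):
    """Greedily merge adjacent (mean, weight) centroids.

    A merged centroid may hold at most
    4 * total * q * (1 - q) / compression samples, where q is the
    approximate quantile of its midpoint (an illustrative bound).
    """
    if not centroids:
        return []
    centroids = sorted(centroids)
    total = sum(weight for _, weight in centroids)
    merged = []
    seen = 0  # total weight of centroids already closed
    cur_mean, cur_weight = centroids[0]
    for mean, weight in centroids[1:]:
        proposed = cur_weight + weight
        q = (seen + proposed / 2) / total
        limit = 4 * total * q * (1 - q) / compression
        if proposed <= limit:
            # Fold the next point into the current centroid (weighted mean).
            cur_mean = (cur_mean * cur_weight + mean * weight) / proposed
            cur_weight = proposed
        else:
            # Close the current centroid and start a new one.
            merged.append((cur_mean, cur_weight))
            seen += cur_weight
            cur_mean, cur_weight = mean, weight
    merged.append((cur_mean, cur_weight))
    return merged


# 1000 unique samples, each seen once, as in the uncompressed layout.
raw = [(float(i), 1) for i in range(1000)]
small = compress(raw, compression=20)
print(len(raw), "->", len(small), "centroids")
```

    The total weight is preserved while the number of stored pairs shrinks, which is the trade-off the question is asking about: less state and faster merges in exchange for approximate rather than exact quantiles.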

    The documentation does not give any specific details about future optimizations for this implementation, but feedback mechanisms are in place for users to express their needs and suggest improvements. I suggest keeping an eye on the Azure Data Explorer release notes for updates.

    https://learn.microsoft.com/ro-ro/azure/data-explorer/kusto/query/tdigest-aggregation-function


