autocluster plugin
Applies to: ✅ Microsoft Fabric ✅ Azure Data Explorer
autocluster finds common patterns of discrete attributes (dimensions) in the data. It then reduces the results of the original query, whether it's 100 or 100,000 rows, to a few patterns. The plugin was developed to help analyze failures (such as exceptions or crashes) but can potentially work on any filtered dataset. The plugin is invoked with the evaluate operator.
Note
autocluster is largely based on the Seed-Expand algorithm from the following paper: Algorithms for Telemetry Data Mining using Discrete Attributes.
Syntax
T | evaluate autocluster ([SizeWeight [, WeightColumn [, NumSeeds [, CustomWildcard [, ... ]]]]])
Learn more about syntax conventions.
Parameters
The parameters must be supplied in the order shown in the syntax. To use the default value for a parameter, pass the tilde string '~' in its place. For more information, see Examples.
Name | Type | Required | Description |
---|---|---|---|
T | string | ✔️ | The input tabular expression. |
SizeWeight | double | | A double between 0 and 1 that controls the balance between generic (high coverage) and informative (many shared values) patterns. Increasing this value typically reduces the number of patterns while expanding the coverage of each. Conversely, decreasing this value generates more specific patterns, with more shared values and a smaller percentage coverage. The default is 0.5. Patterns are scored by a weighted geometric mean of the two criteria, with weight SizeWeight for generic and 1-SizeWeight for informative. |
WeightColumn | string | | Considers each row in the input according to the specified weight. Each row has a default weight of 1. The argument must be the name of a numeric integer column. A common use of a weight column is to account for sampling, bucketing, or aggregation of the data that is already embedded into each row. |
NumSeeds | int | | Determines the number of initial local search points. Depending on the structure of the data, changing the number of seeds affects the quantity or quality of the results. Increasing the number of seeds can improve results at the cost of a slower query. Decreasing it below five yields negligible performance improvements, while increasing it above 50 rarely generates more patterns. The default is 25. |
CustomWildcard | string | | A type literal that sets the wildcard value for a specific type in the results table, indicating no restriction on this column. The default is null, which represents an empty string. If the default value also occurs as a legitimate value in the data, use a different wildcard value, such as *. You can include multiple custom wildcards by adding them consecutively. |
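As an illustration of WeightColumn, the following is a minimal sketch that pre-aggregates the StormEvents sample table and passes the resulting count as the row weight. The column name RowCount is an assumption chosen for this example, not a name the plugin requires.

```kusto
// Sketch: each aggregated row stands for RowCount original rows,
// so RowCount is passed as the weight column (name is illustrative).
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| summarize RowCount = count() by State, EventType, Damage
| evaluate autocluster(0.6, 'RowCount')
```

Because every aggregated row carries the number of original rows it stands for, the reported counts and percentages should reflect the original rows rather than the number of aggregated rows.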
Returns
The autocluster plugin usually returns a small set of patterns. The patterns capture portions of the data with shared common values across multiple discrete attributes. Each pattern in the results is represented by a row.
The first column is the segment ID. The next two columns are the count and percentage of rows from the original query that the pattern captures. The remaining columns are from the original query. Each holds either a specific value from the column or a wildcard value (null by default), which means that the column's value varies within the pattern.
The patterns aren't distinct: they might overlap, and they usually don't cover all the original rows. Some rows might not fall under any pattern.
Tip
Use where and project in the input pipe to reduce the data to just what you're interested in.
When you find an interesting row, you might want to drill into it further by adding its specific values to your where filter.
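The drill-down described in the tip might look like the following sketch. It assumes that an earlier run over the StormEvents sample table produced a pattern that pins State to TEXAS; that value is added to the where filter, and the now-constant column is dropped from the projection.

```kusto
// Sketch: narrow the input to one pattern's specific value, then re-cluster.
StormEvents
| where monthofyear(StartTime) == 5
| where State == "TEXAS"          // value taken from the pattern of interest
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| project EventType, Damage       // State is constant now, so it's omitted
| evaluate autocluster(0.6)
```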
Examples
Using evaluate
T | evaluate autocluster()
Using autocluster
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State , EventType , Damage
| evaluate autocluster(0.6)
Output
SegmentId | Count | Percent | State | EventType | Damage |
---|---|---|---|---|---|
0 | 2278 | 38.7 | | Hail | NO |
1 | 512 | 8.7 | | Thunderstorm Wind | YES |
2 | 898 | 15.3 | TEXAS | | |
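To see how Count and Percent relate to the input, a pattern can be recomputed by hand. The following sketch does so for the Hail / NO pattern. The column names Total, Matching, and Percent are chosen for illustration only.

```kusto
// Sketch: count the input rows that match one pattern's specific values
// and express that as a percentage of all rows fed to the plugin.
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| summarize Total = count(), Matching = countif(EventType == "Hail" and Damage == "NO")
| extend Percent = round(100.0 * Matching / Total, 1)
```

State doesn't appear in the filter because the pattern holds a wildcard for that column.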
Using custom wildcards
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State , EventType , Damage
| evaluate autocluster(0.2, '~', '~', '*')
Output
SegmentId | Count | Percent | State | EventType | Damage |
---|---|---|---|---|---|
0 | 2278 | 38.7 | * | Hail | NO |
1 | 512 | 8.7 | * | Thunderstorm Wind | YES |
2 | 898 | 15.3 | TEXAS | * | * |
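Because custom wildcards are type literals that can be listed consecutively, a query with columns of more than one type can set one wildcard per type. The following sketch assumes that InjuriesDirect is an int column in the StormEvents sample table and that -1 never occurs in it as a real value.

```kusto
// Sketch: '*' is the wildcard for string columns, int(-1) for int columns.
// Assumption: InjuriesDirect is typed int and never holds -1.
StormEvents
| where monthofyear(StartTime) == 5
| project State, EventType, InjuriesDirect
| evaluate autocluster(0.2, '~', '~', '*', int(-1))
```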