Machine learning capability in Azure Data Explorer
Azure Data Explorer is a Big Data analytics platform. It's used to monitor service health, QoS, or malfunctioning devices. The built-in anomaly detection and forecasting functions check for anomalous behavior. Once such a pattern is detected, a Root Cause Analysis (RCA) is run to mitigate or resolve the anomaly.
The diagnosis process is complex and lengthy, and done by domain experts. The process includes:
- Fetching and joining additional data from different sources for the same time frame
- Looking for changes in the distribution of values on multiple dimensions
- Charting additional variables
- Other techniques based on domain knowledge and intuition
Since these diagnosis scenarios are common in Azure Data Explorer, machine learning plugins are available to make the diagnosis phase easier, and shorten the duration of the RCA.
Azure Data Explorer has three Machine Learning plugins: autocluster, basket, and diffpatterns. All plugins implement clustering algorithms. The autocluster and basket plugins cluster a single record set, and the diffpatterns plugin clusters the differences between two record sets.
Clustering a single record set
A common scenario involves a data set selected by specific criteria, such as:
- Time window that shows anomalous behavior
- High temperature device readings
- Long duration commands
- Top spending users
You want a fast and easy way to find common patterns (segments) in the data. A pattern is a subset of the data set whose records share the same values over multiple dimensions (categorical columns).
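To make the idea of a pattern concrete, the following sketch counts how many records share the same values on two dimensions and expresses each group as a percentage of the record set. The readings table and its values are made up for illustration and are not part of the demo data.

```kusto
// Hypothetical sample data: three categorical dimensions per record.
let readings = datatable(Region:string, ScaleUnit:string, Device:string)
[
    "eau", "su7", "d1",
    "eau", "su7", "d2",
    "eau", "su7", "d3",
    "weu", "su2", "d4",
    "scus", "su5", "d5"
];
let total = toscalar(readings | count);
readings
// Group records that share the same values on two dimensions.
| summarize Count = count() by Region, ScaleUnit
// Coverage: the share of the whole record set that falls in each group.
| extend PercentOfRows = round(100.0 * Count / total, 2)
| order by Count desc
```

Here the (eau, su7) group covers three of the five records. Doing this by hand for every combination of dimensions quickly becomes impractical, which is the search the clustering plugins automate.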
The following query builds and shows a time series of service exceptions over the period of a week, in ten-minute bins:
let min_t = toscalar(demo_clustering1 | summarize min(PreciseTimeStamp));
let max_t = toscalar(demo_clustering1 | summarize max(PreciseTimeStamp));
demo_clustering1
| make-series num=count() on PreciseTimeStamp from min_t to max_t step 10m
| render timechart with(title="Service exceptions over a week, 10 minutes resolution")
The service exception count correlates with the overall service traffic. You can clearly see the daily pattern for business days, Monday to Friday. There's a rise in service exception counts at mid-day, and drops in counts during the night. Flat low counts are visible over the weekend. Exception spikes can be detected using time series anomaly detection in Azure Data Explorer.
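One way to flag such spikes is the series_decompose_anomalies() function. The sketch below applies it to the same ten-minute series. The threshold of 1.5 and the 'linefit' trend setting are illustrative choices, not values taken from this walkthrough.

```kusto
let min_t = toscalar(demo_clustering1 | summarize min(PreciseTimeStamp));
let max_t = toscalar(demo_clustering1 | summarize max(PreciseTimeStamp));
demo_clustering1
| make-series num=count() on PreciseTimeStamp from min_t to max_t step 10m
// Arguments: series, anomaly threshold, seasonality (-1 = autodetect), trend model.
| extend (anomalies, score, baseline) = series_decompose_anomalies(num, 1.5, -1, 'linefit')
| render anomalychart with(anomalycolumns=anomalies, title="Service exceptions with flagged anomalies")
```

The anomalies column marks the points that deviate from the expected baseline, which gives a starting time window for the diagnosis steps that follow.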
The second spike in the data occurs on Tuesday afternoon. The following query is used to further diagnose it and verify whether it's a sharp spike. The query redraws the chart around the spike at a higher resolution: an eight-hour window in one-minute bins. You can then study its borders.
let min_t=datetime(2016-08-23 11:00);
demo_clustering1
| make-series num=count() on PreciseTimeStamp from min_t to min_t+8h step 1m
| render timechart with(title="Zoom on the 2nd spike, 1 minute resolution")
You'll see a narrow two-minute spike from 15:00 to 15:02. In the following query, count the exceptions in this two-minute window:
let min_peak_t=datetime(2016-08-23 15:00);
let max_peak_t=datetime(2016-08-23 15:02);
demo_clustering1
| where PreciseTimeStamp between(min_peak_t..max_peak_t)
| count
In the following query, sample 20 exceptions out of 972:
let min_peak_t=datetime(2016-08-23 15:00);
let max_peak_t=datetime(2016-08-23 15:02);
demo_clustering1
| where PreciseTimeStamp between(min_peak_t..max_peak_t)
| take 20
Use autocluster() for single record set clustering
Even though there are fewer than a thousand exceptions, it's still hard to find common segments, since there are multiple values in each column. You can use the autocluster() plugin to instantly extract a short list of common segments and find the interesting clusters within the spike's two minutes, as seen in the following query:
let min_peak_t=datetime(2016-08-23 15:00);
let max_peak_t=datetime(2016-08-23 15:02);
demo_clustering1
| where PreciseTimeStamp between(min_peak_t..max_peak_t)
| evaluate autocluster()
The results show that the most dominant segment contains 65.74% of the total exception records and shares four dimensions. The next segment is much less common. It contains only 9.67% of the records, and shares three dimensions. The other segments are even less common.
Autocluster uses a proprietary algorithm for mining multiple dimensions and extracting interesting segments. "Interesting" means that each segment has significant coverage of both the record set and the feature set. The segments are also divergent, meaning that each one is different from the others. One or more of these segments may be relevant for the RCA process. To minimize segment review and assessment, autocluster extracts only a small segment list.
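The plugin also accepts optional arguments. As a sketch, the query below limits the input to the dimension columns named in this walkthrough and passes a SizeWeight of 0.6 as the first argument. Both the column selection and the value 0.6 are assumptions chosen for illustration.

```kusto
let min_peak_t=datetime(2016-08-23 15:00);
let max_peak_t=datetime(2016-08-23 15:02);
demo_clustering1
| where PreciseTimeStamp between(min_peak_t..max_peak_t)
// Keep only the categorical dimensions of interest (an illustrative choice).
| project Region, ScaleUnit, DeploymentId, ServiceHost
// SizeWeight (default 0.5): higher values favor broader, higher-coverage segments.
| evaluate autocluster(0.6)
```

Projecting the dimensions first keeps high-cardinality columns, such as timestamps, out of the clustering input.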
Use basket() for single record set clustering
You can also use the basket() plugin, as seen in the following query:
let min_peak_t=datetime(2016-08-23 15:00);
let max_peak_t=datetime(2016-08-23 15:02);
demo_clustering1
| where PreciseTimeStamp between(min_peak_t..max_peak_t)
| evaluate basket()
Basket implements the "Apriori" algorithm for item set mining. It extracts all segments whose coverage of the record set is above a threshold (default 5%). More segments are extracted than with autocluster, including similar ones, such as segments 0 and 1, or segments 2 and 3.
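If the default threshold returns too many near-duplicate segments, you can pass a higher threshold as the first argument. The sketch below uses 0.1 (10% coverage), a value picked only for illustration.

```kusto
let min_peak_t=datetime(2016-08-23 15:00);
let max_peak_t=datetime(2016-08-23 15:02);
demo_clustering1
| where PreciseTimeStamp between(min_peak_t..max_peak_t)
// Threshold (default 0.05): minimum fraction of records a segment must cover.
| evaluate basket(0.1)
```

A higher threshold trades detail for a shorter list, since only segments covering at least that fraction of the records are returned.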
Both plugins are powerful and easy to use. Their limitation is that they cluster a single record set in an unsupervised manner with no labels. It's unclear whether the extracted patterns characterize the selected record set, anomalous records, or the global record set.
Clustering the difference between two record sets
The diffpatterns() plugin overcomes the limitation of autocluster and basket. Diffpatterns takes two record sets and extracts the main segments that differ between them. One set usually contains the anomalous records being investigated, the same set that autocluster and basket analyze. The other set contains the reference records, the baseline.
In the query below, diffpatterns finds interesting clusters within the spike's two minutes that are different from the clusters within the baseline. The baseline window is defined as the eight minutes from 14:50 to 14:58, shortly before the spike started at 15:00. You extend the records with a binary column (AB) that specifies whether a record belongs to the baseline or to the anomalous set.
Diffpatterns implements a supervised learning algorithm, where the two class labels are generated by the anomalous versus baseline flag (AB).
let min_peak_t=datetime(2016-08-23 15:00);
let max_peak_t=datetime(2016-08-23 15:02);
let min_baseline_t=datetime(2016-08-23 14:50);
let max_baseline_t=datetime(2016-08-23 14:58); // Leave a gap between the baseline and the spike to avoid the transition zone.
let splitime=(max_baseline_t+min_peak_t)/2.0;
demo_clustering1
| where (PreciseTimeStamp between(min_baseline_t..max_baseline_t)) or
        (PreciseTimeStamp between(min_peak_t..max_peak_t))
| extend AB=iff(PreciseTimeStamp > splitime, 'Anomaly', 'Baseline')
| evaluate diffpatterns(AB, 'Anomaly', 'Baseline')
The most dominant segment is the same segment that was extracted by autocluster. Its coverage of the two-minute anomalous window is also 65.74%. However, its coverage of the eight-minute baseline window is only 1.7%. The difference is 64.04%. This difference seems to be related to the anomalous spike. To verify this assumption, split the original chart into the records that belong to this problematic segment, and records from the other segments. See the query below.
let min_t = toscalar(demo_clustering1 | summarize min(PreciseTimeStamp));
let max_t = toscalar(demo_clustering1 | summarize max(PreciseTimeStamp));
demo_clustering1
| extend seg = iff(Region == "eau" and ScaleUnit == "su7" and DeploymentId == "b5d1d4df547d4a04ac15885617edba57" and ServiceHost == "e7f60c5d-4944-42b3-922a-92e98a8e7dec", "Problem", "Normal")
| make-series num=count() on PreciseTimeStamp from min_t to max_t step 10m by seg
| render timechart
This chart shows that the spike on Tuesday afternoon was caused by exceptions from this specific segment, discovered by using the diffpatterns plugin.
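The coverage figures for the segment can also be cross-checked by hand. The following sketch labels each record as baseline or anomaly, flags the records that match the segment's four dimension values, and computes the matching percentage per window. The InSegment, Total, Matching, and PercentMatching names are introduced here for illustration only.

```kusto
let min_peak_t=datetime(2016-08-23 15:00);
let max_peak_t=datetime(2016-08-23 15:02);
let min_baseline_t=datetime(2016-08-23 14:50);
let max_baseline_t=datetime(2016-08-23 14:58);
demo_clustering1
| where (PreciseTimeStamp between(min_baseline_t..max_baseline_t)) or
        (PreciseTimeStamp between(min_peak_t..max_peak_t))
// The two windows don't overlap, so the peak start time separates them.
| extend AB = iff(PreciseTimeStamp >= min_peak_t, 'Anomaly', 'Baseline')
// Flag the records that match all four dimension values of the segment.
| extend InSegment = Region == "eau" and ScaleUnit == "su7"
    and DeploymentId == "b5d1d4df547d4a04ac15885617edba57"
    and ServiceHost == "e7f60c5d-4944-42b3-922a-92e98a8e7dec"
| summarize Total = count(), Matching = countif(InSegment) by AB
| extend PercentMatching = round(100.0 * Matching / Total, 2)
```

The two PercentMatching values should be close to the coverage percentages that diffpatterns reports for this segment, and their difference is what makes the segment stand out.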
The Azure Data Explorer Machine Learning plugins are helpful for many scenarios. The autocluster and basket plugins implement unsupervised learning algorithms and are easy to use. Diffpatterns implements a supervised learning algorithm and, although more complex, it's more powerful for extracting differentiation segments for RCA.
These plugins are used interactively in ad-hoc scenarios and in automatic near real-time monitoring services. In Azure Data Explorer, time series anomaly detection is followed by a diagnosis process. The process is highly optimized to meet necessary performance standards.