Best practices for using the Multivariate Anomaly Detection API

This article provides guidance on recommended practices to follow when using the multivariate Anomaly Detection (MVAD) APIs. It covers:

  • API usage: Learn how to use MVAD without errors.
  • Data engineering: Learn how to best prepare your data so that MVAD performs with better accuracy.
  • Common pitfalls: Learn how to avoid common pitfalls that customers encounter.
  • FAQ: Learn answers to frequently asked questions.

API usage

Follow the instructions in this section to avoid errors while using MVAD. If you still get errors, please refer to the full list of error codes for explanations and actions to take.

Input parameters

Required parameters

These three parameters are required in training and inference API requests:

  • source - The link to your zip file in Azure Blob Storage, with a Shared Access Signature (SAS).
  • startTime - The start time of data used for training or inference. If it's earlier than the actual earliest timestamp in the data, the actual earliest timestamp will be used as the starting point.
  • endTime - The end time of data used for training or inference, which must be later than or equal to startTime. If endTime is later than the actual latest timestamp in the data, the actual latest timestamp will be used as the ending point. If endTime equals startTime, inference runs on one single data point, which is common in streaming scenarios.

Optional parameters for training API

Other parameters for training API are optional:

  • slidingWindow - How many data points are used to determine anomalies. An integer between 28 and 2,880. The default value is 300. If slidingWindow is k for model training, then at least k points should be accessible from the source file during inference to get valid results.

    MVAD takes a segment of data points to decide if the next data point is an anomaly. The length of the segment is slidingWindow. Please keep two things in mind when choosing a slidingWindow value:

    1. The properties of your data: whether it's periodic and the sampling rate. When your data is periodic, you could set the length of 1 - 3 cycles as the slidingWindow. When your data is at a high frequency (small granularity) like minute-level or second-level, you could set a relatively higher value of slidingWindow.
    2. The trade-off between training/inference time and potential performance impact. A larger slidingWindow may cause longer training/inference time. There is no guarantee that a larger slidingWindow will lead to accuracy gains. A small slidingWindow may make it difficult for the model to converge to an optimal solution. For example, it is hard to detect anomalies when slidingWindow has only two points.
  • alignMode - How to align multiple variables (time series) on timestamps. There are two options for this parameter, Inner and Outer, and the default value is Outer.

    This parameter is critical when there is misalignment between timestamp sequences of the variables. The model needs to align the variables onto the same timestamp sequence before further processing.

    Inner means the model will report detection results only on timestamps on which every variable has a value, i.e. the intersection of all variables. Outer means the model will report detection results on timestamps on which any variable has a value, i.e. the union of all variables.

    Here is an example to explain the different alignMode values.


    Variable-1

    timestamp value
    2020-11-01 1
    2020-11-02 2
    2020-11-04 4
    2020-11-05 5


    Variable-2

    timestamp value
    2020-11-01 1
    2020-11-02 2
    2020-11-03 3
    2020-11-04 4

    Inner join two variables

    timestamp Variable-1 Variable-2
    2020-11-01 1 1
    2020-11-02 2 2
    2020-11-04 4 4

    Outer join two variables

    timestamp Variable-1 Variable-2
    2020-11-01 1 1
    2020-11-02 2 2
    2020-11-03 nan 3
    2020-11-04 4 4
    2020-11-05 5 nan
  • fillNAMethod - How to fill nan in the merged table. There might be missing values in the merged table and they should be properly handled. We provide several methods to fill them up. The options are Linear, Previous, Subsequent, Zero, and Fixed and the default value is Linear.

    Option Method
    Linear Fill nan values by linear interpolation
    Previous Propagate last valid value to fill gaps. Example: [1, 2, nan, 3, nan, 4] -> [1, 2, 2, 3, 3, 4]
    Subsequent Use next valid value to fill gaps. Example: [1, 2, nan, 3, nan, 4] -> [1, 2, 3, 3, 4, 4]
    Zero Fill nan values with 0.
    Fixed Fill nan values with a specified valid value that should be provided in paddingValue.
  • paddingValue - Padding value is used to fill nan when fillNAMethod is Fixed and must be provided in that case. In other cases it is optional.

  • displayName - This is an optional parameter which is used to identify models. For example, you can use it to mark parameters, data sources, and any other meta data about the model and its input data. The default value is an empty string.
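
The alignMode and fillNAMethod options described above can be tried out locally before uploading any data. The following is a minimal sketch in plain Python that mirrors the documented behaviour; it is not the service's implementation. The helper names align and fill_na are hypothetical, and leaving leading or trailing gaps unfilled is an assumption of this sketch.

```python
import math

def align(series_by_name, mode="Outer"):
    """Merge {name: {timestamp: value}} onto one shared timestamp sequence."""
    key_sets = [set(series) for series in series_by_name.values()]
    if mode == "Inner":
        stamps = set.intersection(*key_sets)  # timestamps every variable has
    elif mode == "Outer":
        stamps = set.union(*key_sets)         # timestamps any variable has
    else:
        raise ValueError(f"unknown alignMode: {mode}")
    return {
        ts: [series.get(ts, math.nan) for series in series_by_name.values()]
        for ts in sorted(stamps)
    }

def fill_na(values, method="Linear", padding_value=None):
    """Fill nan entries of one column; gaps at the edges may stay nan."""
    out = list(values)
    missing = [i for i, v in enumerate(out) if math.isnan(v)]
    if method == "Zero":
        for i in missing:
            out[i] = 0
    elif method == "Fixed":
        if padding_value is None:
            raise ValueError("paddingValue is required when fillNAMethod is Fixed")
        for i in missing:
            out[i] = padding_value
    elif method == "Previous":
        for i in missing:                 # ascending, so fills propagate forward
            if i > 0:
                out[i] = out[i - 1]
    elif method == "Subsequent":
        for i in reversed(missing):       # descending, so fills propagate backward
            if i < len(out) - 1:
                out[i] = out[i + 1]
    elif method == "Linear":
        valid = [i for i, v in enumerate(out) if not math.isnan(v)]
        for a, b in zip(valid, valid[1:]):
            for k in range(a + 1, b):
                out[k] = out[a] + (out[b] - out[a]) * (k - a) / (b - a)
    else:
        raise ValueError(f"unknown fillNAMethod: {method}")
    return out

variable_1 = {"2020-11-01": 1, "2020-11-02": 2, "2020-11-04": 4, "2020-11-05": 5}
variable_2 = {"2020-11-01": 1, "2020-11-02": 2, "2020-11-03": 3, "2020-11-04": 4}
inner = align({"Variable-1": variable_1, "Variable-2": variable_2}, "Inner")
outer = align({"Variable-1": variable_1, "Variable-2": variable_2}, "Outer")
gaps = [1, 2, math.nan, 3, math.nan, 4]
print(list(inner))                  # ['2020-11-01', '2020-11-02', '2020-11-04']
print(fill_na(gaps, "Previous"))    # [1, 2, 2, 3, 3, 4]
print(fill_na(gaps, "Subsequent"))  # [1, 2, 3, 3, 4, 4]
```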

Input data schema

MVAD detects anomalies from a group of metrics, and we call each metric a variable or a time series.

  • You can download the sample data file provided by Microsoft to check the accepted schema.

  • Each variable must have two and only two fields, timestamp and value, and should be stored in a comma-separated values (CSV) file.

  • The column names of the CSV file should be precisely timestamp and value, case-sensitive.

  • The timestamp values should conform to ISO 8601; the value could be integers or decimals with any number of decimal places. A good example of the content of a CSV file:

    timestamp value
    2019-04-01T00:00:00Z 5
    2019-04-01T00:01:00Z 3.6
    2019-04-01T00:02:00Z 4
    ... ...


    If your timestamps have hours, minutes, and/or seconds, ensure that they're properly rounded before calling the APIs.

    For example, if your data frequency is supposed to be one data point every 30 seconds, but you're seeing timestamps like "12:00:01" and "12:00:28", it's a strong signal that you should pre-process the timestamps to new values like "12:00:00" and "12:00:30".

    For details, refer to the Timestamp round-up section of this article.

  • The name of the csv file will be used as the variable name and should be unique. For example, "temperature.csv" and "humidity.csv".

  • Variables for training and variables for inference should be consistent. For example, if you are using series_1, series_2, series_3, series_4, and series_5 for training, you should provide exactly the same variables for inference.

  • CSV files should be compressed into a zip file and uploaded to an Azure blob container. The zip file can have whatever name you want.
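
The schema rules above can be checked before a file is zipped. The following is a minimal sketch, assuming the content of one variable's CSV is available as a string; the function name check_variable_csv is hypothetical. Note that datetime.fromisoformat accepts only a subset of ISO 8601, so a timestamp rejected here is not necessarily rejected by the service.

```python
import csv
import io
from datetime import datetime

def check_variable_csv(text):
    """Return a list of schema problems found in one variable's CSV content."""
    rows = list(csv.reader(io.StringIO(text)))
    # Column names must be exactly "timestamp" and "value" (case-sensitive).
    if not rows or rows[0] != ["timestamp", "value"]:
        return ["header must be exactly: timestamp,value"]
    problems = []
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 fields, got {len(row)}")
            continue
        stamp, value = row
        try:
            # fromisoformat() rejects a trailing "Z" before Python 3.11.
            datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"line {line_no}: timestamp not parsed: {stamp!r}")
        try:
            float(value)
        except ValueError:
            problems.append(f"line {line_no}: value is not a number: {value!r}")
    return problems

good = "timestamp,value\n2019-04-01T00:00:00Z,5\n2019-04-01T00:01:00Z,3.6\n"
bad_header = "Timestamp,Value\n2019-04-01T00:00:00Z,5\n"
bad_row = "timestamp,value\nyesterday,5\n"
print(check_variable_csv(good))  # []
```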

Folder structure

A common mistake in data preparation is extra folders in the zip file. For example, after decompressing the zip file to a new folder ./series, the correct path to the CSV files is ./series/series_1.csv, whereas a wrong path would be ./series/foo/bar/series_1.csv.

A correct example of the directory tree after decompressing the zip file in Windows

└── series
    ├── series_1.csv
    ├── series_2.csv
    ├── series_3.csv
    ├── series_4.csv
    └── series_5.csv

An incorrect example of the directory tree after decompressing the zip file in Windows

└── series
    └── series
        ├── series_1.csv
        ├── series_2.csv
        ├── series_3.csv
        ├── series_4.csv
        └── series_5.csv
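
One way to avoid extra folders is to write each CSV at the root of the archive programmatically, and to check an existing archive for nested entries before uploading it. The following is a minimal sketch using Python's zipfile module; the function names pack_variables and nested_entries are hypothetical, and the sketch writes to an in-memory buffer rather than to Azure Blob Storage.

```python
import io
import zipfile

def pack_variables(csv_by_name, target):
    """Write each variable's CSV at the archive root, with no enclosing folder."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in csv_by_name.items():
            archive.writestr(f"{name}.csv", text)

def nested_entries(source):
    """List archive members that sit inside a sub-folder (zip paths use "/")."""
    with zipfile.ZipFile(source) as archive:
        return [name for name in archive.namelist() if "/" in name]

sample = "timestamp,value\n2019-04-01T00:00:00Z,5\n"

flat = io.BytesIO()
pack_variables({"series_1": sample, "series_2": sample}, flat)
print(nested_entries(flat))    # []

# An archive built the wrong way, with the CSV inside a folder.
nested = io.BytesIO()
with zipfile.ZipFile(nested, "w") as archive:
    archive.writestr("series/series_1.csv", sample)
print(nested_entries(nested))  # ['series/series_1.csv']
```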

Data engineering

Now you're able to run your code with MVAD APIs without any error. What could be done to improve your model accuracy?

Data quality

  • As the model learns normal patterns from historical data, the training data should represent the overall normal state of the system. It's hard for the model to learn these patterns if the training data is full of anomalies. Empirically, an abnormal rate of 1% or below gives good accuracy.
  • In general, the missing value ratio of training data should be under 20%. Too much missing data may end up with automatically filled values (usually linear values or constant values) being learned as normal patterns. That may result in real (not missing) data points being detected as anomalies.

Data quantity

  • The underlying model of MVAD has millions of parameters. It needs a minimum number of data points to learn an optimal set of parameters. The empirical rule is that you need to provide 5,000 or more data points (timestamps) per variable to train the model for good accuracy. In general, the more training data, the better the accuracy. However, if you're not able to accrue that much data, we still encourage you to experiment with less data and see if the compromised accuracy is still acceptable.

  • Every time you call the inference API, ensure that the source data file contains just enough data points. That is normally slidingWindow + the number of data points that really need inference results. For example, in a streaming case where you want to run inference on ONE new timestamp each time, the data file could contain only the leading slidingWindow plus ONE data point; then you could move on and create another zip file with the same number of data points (slidingWindow + 1), moved ONE step to the "right", and submit it for another inference job.

    Anything beyond that, or "before" the leading sliding window, won't impact the inference result at all and may only degrade performance. Anything below that may lead to a NotEnoughInput error.
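
The "just enough data" rule can be applied mechanically when preparing an inference file. The following is a minimal sketch, assuming the rows of one variable are held in a list sorted by ascending timestamp; the function name trim_for_inference is hypothetical.

```python
def trim_for_inference(rows, sliding_window, targets=1):
    """Keep the last `targets` rows plus the `sliding_window` rows before them."""
    needed = sliding_window + targets
    if len(rows) < needed:
        # Too little data is what the text above associates with NotEnoughInput.
        raise ValueError(f"need at least {needed} rows, got {len(rows)}")
    return rows[-needed:]

history = list(range(1000))             # stand-in for 1,000 sorted data points
window = trim_for_inference(history, sliding_window=300, targets=1)
print(len(window), window[0])           # 301 699
```

For the next streaming step, append the new data point to the history and call the function again; the returned slice moves one step to the right.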

Timestamp round-up

In a group of variables (time series), each variable may be collected from an independent source. The timestamps of different variables may be inconsistent with each other and with the known frequencies. Here's a simple example.


Variable-1

timestamp value
12:00:01 1.0
12:00:35 1.5
12:01:02 0.9
12:01:31 2.2
12:02:08 1.3


Variable-2

timestamp value
12:00:03 2.2
12:00:37 2.6
12:01:09 1.4
12:01:34 1.7
12:02:04 2.0

We have two variables collected from two sensors which send one data point every 30 seconds. However, the sensors aren't sending data points at a strict even frequency, but sometimes earlier and sometimes later. Because MVAD will take into consideration correlations between different variables, timestamps must be properly aligned so that the metrics can correctly reflect the condition of the system. In the above example, timestamps of variable 1 and variable 2 must be properly 'rounded' to their frequency before alignment.

Let's see what happens if they're not pre-processed. If we set alignMode to be Outer (which means union of two sets), the merged table will be

timestamp Variable-1 Variable-2
12:00:01 1.0 nan
12:00:03 nan 2.2
12:00:35 1.5 nan
12:00:37 nan 2.6
12:01:02 0.9 nan
12:01:09 nan 1.4
12:01:31 2.2 nan
12:01:34 nan 1.7
12:02:04 nan 2.0
12:02:08 1.3 nan

nan indicates missing values. Obviously, the merged table isn't what you might have expected. Variable 1 and variable 2 interleave, and the MVAD model can't extract information about correlations between them. If we set alignMode to Inner, the merged table will be empty as there's no common timestamp in variable 1 and variable 2.

Therefore, the timestamps of variable 1 and variable 2 should be pre-processed (rounded to the nearest 30-second timestamps) and the new time series are


Variable-1

timestamp value
12:00:00 1.0
12:00:30 1.5
12:01:00 0.9
12:01:30 2.2
12:02:00 1.3


Variable-2

timestamp value
12:00:00 2.2
12:00:30 2.6
12:01:00 1.4
12:01:30 1.7
12:02:00 2.0

Now the merged table is more reasonable.

timestamp Variable-1 Variable-2
12:00:00 1.0 2.2
12:00:30 1.5 2.6
12:01:00 0.9 1.4
12:01:30 2.2 1.7
12:02:00 1.3 2.0

Values of different variables at close timestamps are well aligned, and the MVAD model can now extract correlation information.
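
The rounding step can be sketched in a few lines of Python. This is one possible pre-processing approach under stated assumptions, not a prescribed method: the function names are hypothetical, the frequency is assumed to divide a day evenly, and if two raw points round to the same slot the later one overwrites the earlier one.

```python
from datetime import datetime, timedelta

def round_to_frequency(stamp, seconds):
    """Round a datetime to the nearest multiple of `seconds` since midnight."""
    midnight = stamp.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = (stamp - midnight).total_seconds()
    steps = int(elapsed / seconds + 0.5)  # round half up
    return midnight + timedelta(seconds=steps * seconds)

def round_series(series, seconds=30):
    """Map [(HH:MM:SS, value)] to {rounded HH:MM:SS: value}."""
    rounded = {}
    for text, value in series:
        stamp = datetime.strptime(text, "%H:%M:%S")
        rounded[round_to_frequency(stamp, seconds).strftime("%H:%M:%S")] = value
    return rounded

variable_1 = [("12:00:01", 1.0), ("12:00:35", 1.5), ("12:01:02", 0.9),
              ("12:01:31", 2.2), ("12:02:08", 1.3)]
variable_2 = [("12:00:03", 2.2), ("12:00:37", 2.6), ("12:01:09", 1.4),
              ("12:01:34", 1.7), ("12:02:04", 2.0)]

v1, v2 = round_series(variable_1), round_series(variable_2)
# Union of timestamps, as with alignMode Outer; None marks a missing value.
merged = {ts: (v1.get(ts), v2.get(ts)) for ts in sorted(set(v1) | set(v2))}
for ts, pair in merged.items():
    print(ts, *pair)
```

With the sample data above, the merged result has five rows and no missing values, matching the merged table in this section.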


Limitations

There are some limitations in both the training and inference APIs. Be aware of them to avoid errors.

General limitations

  • Sliding window: 28-2880 timestamps, default is 300. For periodic data, set the length of 2-4 cycles as the sliding window.
  • Variable numbers: For training and batch inference, at most 301 variables.

Training limitations

  • Timestamps: At most 1,000,000. Too few timestamps may decrease model quality. We recommend having more than 5,000 timestamps.
  • Granularity: The minimum granularity is per_second.

Batch inference limitations

  • Timestamps: At most 20,000, at least 1 sliding window length.

Streaming inference limitations

  • Timestamps: At most 2880, at least 1 sliding window length.
  • Detecting timestamps: From 1 to 10.
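
The limits listed above can be checked on the client side before a request is sent. The following is a minimal sketch; the function name check_limits and the mode labels "training", "batch", and "streaming" are hypothetical, and the numbers are copied from the lists above.

```python
MAX_TIMESTAMPS = {"training": 1_000_000, "batch": 20_000, "streaming": 2_880}

def check_limits(mode, sliding_window, variable_count, timestamp_count):
    """Return a list of limit violations for one request (empty if none)."""
    if mode not in MAX_TIMESTAMPS:
        raise ValueError(f"unknown mode: {mode}")
    problems = []
    if not 28 <= sliding_window <= 2880:
        problems.append("slidingWindow must be between 28 and 2880")
    # The variable limit is stated for training and batch inference only.
    if mode in ("training", "batch") and variable_count > 301:
        problems.append("at most 301 variables")
    if timestamp_count > MAX_TIMESTAMPS[mode]:
        problems.append(f"at most {MAX_TIMESTAMPS[mode]} timestamps for {mode}")
    # Inference needs at least one sliding window of timestamps.
    if mode != "training" and timestamp_count < sliding_window:
        problems.append("at least one sliding window of timestamps is required")
    return problems

print(check_limits("training", 300, 5, 6000))   # []
print(check_limits("streaming", 300, 5, 3000))  # ['at most 2880 timestamps for streaming']
```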

Model quality

How to deal with false positives and false negatives in real scenarios?

We have provided severity which indicates the significance of anomalies. False positives may be filtered out by setting up a threshold on the severity. Sometimes too many false positives may appear when there are pattern shifts in the inference data. In such cases a model may need to be retrained on new data. If the training data contains too many anomalies, there could be false negatives in the detection results. This is because the model learns patterns from the training data and anomalies may bring bias to the model. Thus proper data cleaning may help reduce false negatives.
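
Filtering on severity can be sketched as follows. This is a minimal illustration, not the service's logic: it assumes the detection results have already been loaded into a list of dictionaries with the keys timestamp, isAnomaly, and severity (that layout, the function name, and the threshold values are assumptions). It also groups consecutive flagged points into runs, so that short isolated spikes can optionally be dropped.

```python
def significant_anomalies(results, min_severity, min_run=1):
    """Group consecutive points flagged above min_severity; keep long-enough runs."""
    runs, current = [], []
    for point in results:
        if point["isAnomaly"] and point["severity"] >= min_severity:
            current.append(point["timestamp"])
            continue
        if current and len(current) >= min_run:
            runs.append(current)
        current = []
    if current and len(current) >= min_run:
        runs.append(current)
    return runs

results = [
    {"timestamp": "t1", "isAnomaly": True, "severity": 0.9},
    {"timestamp": "t2", "isAnomaly": True, "severity": 0.8},
    {"timestamp": "t3", "isAnomaly": False, "severity": 0.0},
    {"timestamp": "t4", "isAnomaly": True, "severity": 0.1},
    {"timestamp": "t5", "isAnomaly": True, "severity": 0.7},
]
print(significant_anomalies(results, min_severity=0.5))             # [['t1', 't2'], ['t5']]
print(significant_anomalies(results, min_severity=0.5, min_run=2))  # [['t1', 't2']]
```

A suitable severity threshold and minimum run length depend on your data and should be tuned against known incidents.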

How to estimate which model is best to use according to training loss and validation loss?

Generally speaking, it's hard to decide which model is the best without a labeled dataset. However, we can use the training and validation losses to make a rough estimate and discard bad models. First, observe whether the training losses converge. Divergent losses often indicate poor model quality. Second, loss values may help identify whether underfitting or overfitting occurs. Models that are underfitting or overfitting may not have the desired performance. Third, although the definition of the loss function doesn't reflect the detection performance directly, loss values may be an auxiliary tool to estimate model quality. A low loss value is a necessary condition for a good model, so we may discard models with high loss values.

Common pitfalls

Apart from the error code table, we've learned from customers about some common pitfalls when using the MVAD APIs. This table will help you avoid these issues.

Pitfall | Consequence | Explanation and solution
Timestamps in training data and/or inference data weren't rounded to align with the respective data frequency of each variable. | The timestamps of the inference results aren't as expected: either too few timestamps or too many timestamps. | Please refer to Timestamp round-up.
Too many anomalous data points in the training data | Model accuracy is impacted negatively because it treats anomalous data points as normal patterns during training. | Empirically, keeping the abnormal rate at or below 1% will help.
Too little training data | Model accuracy is compromised. | Empirically, training an MVAD model requires 5,000 or more data points (timestamps) per variable to keep a good accuracy.
Taking all data points with isAnomaly=true as anomalies | Too many false positives | You should use both isAnomaly and severity (or score) to sift out anomalies that aren't severe and (optionally) use grouping to check the duration of the anomalies to suppress random noise. Please refer to the FAQ section below for the difference between severity and score.
Sub-folders are zipped into the data file for training or inference. | The CSV data files inside sub-folders are ignored during training and/or inference. | No sub-folders are allowed in the zip file. Please refer to Folder structure for details.
Too much data in the inference data file: for example, compressing all historical data in the inference data zip file | You may not see any errors, but you'll experience degraded performance when you try to upload the zip file to Azure Blob as well as when you try to run inference. | Please refer to Data quantity for details.
Creating Anomaly Detector resources on Azure regions that don't support MVAD yet and calling MVAD APIs | You'll get a "resource not found" error while calling the MVAD APIs. | During the preview stage, MVAD is available in limited regions only. Please bookmark What's new in Anomaly Detector to keep up to date with MVAD region roll-outs. You can also file a GitHub issue or contact us to request specific regions.


FAQ

How does MVAD sliding window work?

Let's use two examples to learn how MVAD's sliding window works. Suppose you have set slidingWindow = 1,440, and your input data is at one-minute granularity.

  • Streaming scenario: You want to predict whether the ONE data point at "2021-01-02T00:00:00Z" is anomalous. Your startTime and endTime will be the same value ("2021-01-02T00:00:00Z"). Your inference data source, however, must contain at least 1,440 + 1 timestamps, because MVAD uses the leading data before the target data point ("2021-01-02T00:00:00Z") to decide whether the target is an anomaly. The length of the needed leading data is slidingWindow, or 1,440 in this case. Since 1,440 = 60 * 24, your input data must start no later than "2021-01-01T00:00:00Z".

  • Batch scenario: You have multiple target data points to predict. Your endTime will be greater than your startTime. Inference in such scenarios is performed in a "moving window" manner. For example, MVAD will use data from 2021-01-01T00:00:00Z to 2021-01-01T23:59:00Z (inclusive) to determine whether data at 2021-01-02T00:00:00Z is anomalous. Then it moves forward and uses data from 2021-01-01T00:01:00Z to 2021-01-02T00:00:00Z (inclusive) to determine whether data at 2021-01-02T00:01:00Z is anomalous. It moves on in the same manner (taking 1,440 data points to compare) until the last timestamp specified by endTime (or the actual latest timestamp). Therefore, your inference data source must contain data starting from startTime - slidingWindow and ideally contains slidingWindow + (endTime - startTime) data points in total.
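
The arithmetic in both scenarios can be written down directly. The following is a minimal sketch, assuming evenly spaced data with a known granularity; the function names earliest_required and required_points are hypothetical. The point count treats both startTime and endTime as inference targets, which gives 1,440 + 1 in the streaming case.

```python
from datetime import datetime, timedelta, timezone

def earliest_required(start_time, sliding_window, granularity):
    """Latest acceptable first timestamp: startTime minus one sliding window."""
    return start_time - sliding_window * granularity

def required_points(start_time, end_time, sliding_window, granularity):
    """Leading window plus every target from startTime to endTime inclusive."""
    targets = (end_time - start_time) // granularity + 1
    return sliding_window + targets

minute = timedelta(minutes=1)
start = datetime(2021, 1, 2, tzinfo=timezone.utc)

print(earliest_required(start, 1440, minute).isoformat())  # 2021-01-01T00:00:00+00:00
print(required_points(start, start, 1440, minute))         # 1441
```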

What's the difference between severity and score?

Normally we recommend using severity as the filter to sift out 'anomalies' that aren't so important to your business. Depending on your scenario and data pattern, the less important anomalies often have relatively low severity values, or standalone (discontinuous) high severity values like random spikes.

If you need more sophisticated rules than thresholds against severity or the duration of continuous high severity values, you may want to use score to build more powerful filters. Understanding how MVAD uses score to determine anomalies may help:

We consider whether a data point is anomalous from both global and local perspectives. If the score at a timestamp is higher than a certain threshold, the timestamp is marked as an anomaly. If the score is lower than the threshold but is relatively high within a segment, it's also marked as an anomaly.

Next steps