DataDriftDetector Class

Defines a data drift monitor that can be used to run data drift jobs in Azure Machine Learning.

The DataDriftDetector class enables you to identify drift between a given baseline dataset and a target dataset. A DataDriftDetector object is created in a workspace by specifying the baseline and target datasets directly. For more information, see https://aka.ms/datadrift.

DataDriftDetector constructor.

The DataDriftDetector constructor is used to retrieve a cloud representation of a DataDriftDetector object associated with the provided workspace.

Inheritance
builtins.object
DataDriftDetector

Constructor

DataDriftDetector(workspace, name=None, baseline_dataset=None, target_dataset=None, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)

Parameters

workspace
Workspace
Required

The workspace in which to create the DataDriftDetector object.

name
str
default value: None

A unique name for the DataDriftDetector object.

baseline_dataset
TabularDataset
default value: None

Dataset to compare the target dataset against.

target_dataset
TabularDataset
default value: None

Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series.

compute_target
ComputeTarget or str
default value: None

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified.

frequency
str
default value: None

Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month".

feature_list
list[str]
default value: None

Optional whitelisted features to run the datadrift detection on. DataDriftDetector jobs will run on all features if feature_list is not specified. The feature list can contain characters, numbers, dashes, and whitespaces. The length of the list must be less than 200.

alert_config
AlertConfiguration
default value: None

Optional configuration object for DataDriftDetector alerts.

drift_threshold
float
default value: None

Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default).

latency
int
default value: None

Delay in hours for data to appear in dataset.

Remarks

A DataDriftDetector object represents a data drift job definition that can be used to run three job run types:

  • an adhoc run for analyzing a specific day's worth of data; see the run method.

  • a scheduled run in a pipeline; see the enable_schedule method.

  • a backfill run to see how data changes over time; see the backfill method.
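As a rough sketch of how the three run types map to method calls, the helper below triggers each of them in turn. It assumes `monitor` is an existing DataDriftDetector object; the helper name `start_all_run_types` and the `backfill_days` parameter are hypothetical and not part of the SDK.

```python
from datetime import datetime, timedelta


def start_all_run_types(monitor, target_date, backfill_days=7):
    """Trigger each of the three run types on an existing detector.

    `monitor` is assumed to be a DataDriftDetector; this helper is
    illustrative only and not part of the SDK.
    """
    # 1. Adhoc run: analyze a single day's worth of data.
    adhoc_run = monitor.run(target_date)

    # 2. Scheduled run: create the recurring pipeline schedule.
    monitor.enable_schedule()

    # 3. Backfill run: analyze a historical window ending at target_date.
    backfill_run = monitor.backfill(target_date - timedelta(days=backfill_days),
                                    target_date)
    return adhoc_run, backfill_run
```

In practice you would usually pick only the run type you need rather than starting all three.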

The following example shows the typical pattern for creating a dataset-based DataDriftDetector object. It assumes that ws is an existing Workspace and that baseline and target are existing TabularDataset objects.


   from azureml.datadrift import DataDriftDetector, AlertConfiguration

   alert_config = AlertConfiguration(['user@contoso.com']) # replace with your email to receive alerts from the scheduled pipeline after enabling

   monitor = DataDriftDetector.create_from_datasets(ws, 'weather-monitor', baseline, target,
                                                         compute_target='cpu-cluster',         # compute target for scheduled pipeline and backfills
                                                         frequency='Week',                     # how often to analyze target data
                                                         feature_list=None,                    # list of features to detect drift on
                                                         drift_threshold=None,                 # threshold from 0 to 1 for email alerting
                                                         latency=0,                            # SLA in hours for target data to arrive in the dataset
                                                         alert_config=alert_config)            # email addresses to send alert

Full sample is available from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datadrift-tutorial/datadrift-tutorial.ipynb

The DataDriftDetector constructor retrieves an existing data drift object associated with the workspace.

Methods

backfill

Run a backfill job over a specified start and end date.

See https://aka.ms/datadrift for details on data drift backfill runs.

NOTE: Backfill is only supported on dataset-based DataDriftDetector objects.

create_from_datasets

Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset.

delete

Delete the schedule for the DataDriftDetector object.

disable_schedule

Disable the schedule for the DataDriftDetector object.

enable_schedule

Create a schedule to run a dataset-based DataDriftDetector job.

get_by_name

Retrieve a unique DataDriftDetector object for a given workspace and name.

get_output

Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window.

list

Get a list of DataDriftDetector objects for the specified workspace and optional dataset.

NOTE: Passing in only the workspace parameter returns all DataDriftDetector objects defined in the workspace.

run

Run a single point in time data drift analysis.

show

Show the data drift trend in a given time range.

By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks.

update

Update the schedule associated with the DataDriftDetector object.

Optional parameter values can be set to None; if they are omitted, they keep their existing values.

backfill

Run a backfill job over a specified start and end date.

See https://aka.ms/datadrift for details on data drift backfill runs.

NOTE: Backfill is only supported on dataset-based DataDriftDetector objects.

backfill(start_date, end_date, compute_target=None, create_compute_target=False)

Parameters

start_date
datetime
Required

The start date of the backfill job.

end_date
datetime
Required

The end date of the backfill job, inclusive.

compute_target
ComputeTarget or str
default value: None

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if none is specified.

create_compute_target
bool
default value: False

Indicates whether an Azure Machine Learning compute target is automatically created.

Returns

A DataDriftDetector run.

Return type

Run
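Because end_date is inclusive, a backfill over a whole calendar month passes the first and last day of that month. The following is a minimal sketch under that assumption; `monitor` is assumed to be an existing DataDriftDetector object, and the helper name `backfill_month` is hypothetical.

```python
import calendar
from datetime import datetime


def backfill_month(monitor, year, month, compute_target=None):
    """Backfill one calendar month on an existing detector.

    `monitor` is assumed to be a DataDriftDetector; the helper itself is
    hypothetical. Both boundary dates are passed because end_date is inclusive.
    """
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    start_date = datetime(year, month, 1)
    end_date = datetime(year, month, last_day)
    return monitor.backfill(start_date, end_date, compute_target=compute_target)
```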

create_from_datasets

Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset.

static create_from_datasets(workspace, name, baseline_dataset, target_dataset, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)

Parameters

workspace
Workspace
Required

The workspace to create the DataDriftDetector in.

name
str
Required

A unique name for the DataDriftDetector object.

baseline_dataset
TabularDataset
Required

Dataset to compare the target dataset against.

target_dataset
TabularDataset
Required

Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series.

compute_target
ComputeTarget or str
default value: None

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified.

frequency
str
default value: None

Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month".

feature_list
list[str]
default value: None

Optional whitelisted features to run the datadrift detection on. DataDriftDetector jobs will run on all features if feature_list is not specified. The feature list can contain characters, numbers, dashes, and whitespaces. The length of the list must be less than 200.

alert_config
AlertConfiguration
default value: None

Optional configuration object for DataDriftDetector alerts.

drift_threshold
float
default value: None

Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default).

latency
int
default value: None

Delay in hours for data to appear in dataset.

Returns

A DataDriftDetector object.

Return type

DataDriftDetector

Exceptions

KeyError, TypeError, ValueError
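The documented parameter constraints can be checked before calling create_from_datasets. The sketch below is not the SDK's own validation: the helper name `check_detector_args` is hypothetical, raising ValueError for every violation is an assumption, and the threshold bounds are treated as inclusive because the documentation does not say otherwise.

```python
ALLOWED_FREQUENCIES = ('Day', 'Week', 'Month')
DEFAULT_DRIFT_THRESHOLD = 0.2  # documented value used when None is passed


def check_detector_args(frequency=None, feature_list=None, drift_threshold=None):
    """Validate arguments against the documented constraints.

    Returns the effective drift threshold. Which exception the SDK raises
    for each violation is not documented; ValueError is an assumption.
    """
    if frequency is not None and frequency not in ALLOWED_FREQUENCIES:
        raise ValueError('frequency must be one of %s' % (ALLOWED_FREQUENCIES,))
    if feature_list is not None and len(feature_list) >= 200:
        raise ValueError('feature_list must contain fewer than 200 features')
    if drift_threshold is None:
        return DEFAULT_DRIFT_THRESHOLD
    if not 0 <= drift_threshold <= 1:
        raise ValueError('drift_threshold must be between 0 and 1')
    return drift_threshold
```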

Remarks

Dataset-based DataDriftDetectors enable you to calculate data drift between a baseline dataset, which must be a TabularDataset, and a target dataset, which must be a time series dataset. A time series dataset is simply a TabularDataset with the fine_grain_timestamp property. The DataDriftDetector can then run adhoc or scheduled jobs to determine if the target dataset has drifted from the baseline dataset.


   from azureml.core import Workspace, Dataset
   from azureml.datadrift import DataDriftDetector, AlertConfiguration

   ws = Workspace.from_config()
   baseline = Dataset.get_by_name(ws, 'my_baseline_dataset')
   target = Dataset.get_by_name(ws, 'my_target_dataset')

   detector = DataDriftDetector.create_from_datasets(workspace=ws,
                                                     name="my_unique_detector_name",
                                                     baseline_dataset=baseline,
                                                     target_dataset=target,
                                                     compute_target='my_compute_target',
                                                     frequency="Day",
                                                     feature_list=['my_feature_1', 'my_feature_2'],
                                                     alert_config=AlertConfiguration(email_addresses=['user@contoso.com']),
                                                     drift_threshold=0.3,
                                                     latency=1)

delete

Delete the schedule for the DataDriftDetector object.

delete(wait_for_completion=True)

Parameters

wait_for_completion
bool
default value: True

Whether to wait for the delete operation to complete.

disable_schedule

Disable the schedule for the DataDriftDetector object.

disable_schedule(wait_for_completion=True)

Parameters

wait_for_completion
bool
default value: True

Whether to wait for the disable operation to complete.

enable_schedule

Create a schedule to run a dataset-based DataDriftDetector job.

enable_schedule(create_compute_target=False, wait_for_completion=True)

Parameters

create_compute_target
bool
default value: False

Indicates whether an Azure Machine Learning compute target is created automatically.

wait_for_completion
bool
default value: True

Whether to wait for the enable operation to complete.

get_by_name

Retrieve a unique DataDriftDetector object for a given workspace and name.

static get_by_name(workspace, name)

Parameters

workspace
Workspace
Required

The workspace where the DataDriftDetector was created.

name
str
Required

The name of the DataDriftDetector object to return.

Returns

A DataDriftDetector object.

Return type

DataDriftDetector

get_output

Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window.

get_output(start_time=None, end_time=None, run_id=None)

Parameters

start_time
datetime, optional
default value: None

The start time of the results window in UTC. If None (the default) is specified, the start time is set to the 10th most recent cycle. For example, if the frequency of the data drift schedule is day, start_time is 10 days ago; if the frequency is week, start_time is 10 weeks ago.

end_time
datetime, optional
default value: None

The end time of the results window in UTC. If None (the default) is specified, then the current day UTC is used as the end time.

run_id
str, optional
default value: None

A specific run ID.

Returns

A tuple of a list of drift results and a list of individual dataset and columnar metrics.

Return type

tuple(list, list)

Remarks

This method returns a tuple of drift results and metrics for either a time window or a run ID, depending on the type of run: an adhoc run, a scheduled run, or a backfill run.

  • To retrieve adhoc run results, there is only one way: run_id should be a valid GUID.

  • To retrieve scheduled runs and backfill run results, there are two different ways: either assign a valid GUID to run_id or assign a specific start_time and/or end_time (inclusive) while keeping run_id as None.

  • If run_id, start_time, and end_time are not None in the same method call, a parameter validation exception is raised.

NOTE: Specify either start_time and end_time parameters or the run_id parameter, but not both.

There can be multiple results for the same target date (for dataset-based drift, the target date is the target dataset start date), so duplicate results must be identified and handled. For dataset-based drift, results with the same target date are considered duplicates. The get_output method deduplicates them by one rule: it always picks the most recently generated result.
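The "latest result wins" rule can be sketched as follows. This illustrates the rule only and is not the SDK's implementation: the helper name `dedup_latest` and the `generated_at` key are hypothetical, since the page does not say which field records when a result was generated.

```python
def dedup_latest(results):
    """Keep one result per target date, preferring the latest generated one.

    Each result is assumed to be a dict with a 'start_date' key (the target
    date) and a hypothetical 'generated_at' key that orders the results.
    """
    latest = {}
    for result in results:
        key = result['start_date']
        if key not in latest or result['generated_at'] > latest[key]['generated_at']:
            latest[key] = result
    # Return results ordered by target date for readability.
    return [latest[key] for key in sorted(latest)]
```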

The get_output method can be used to retrieve all or some of the outputs of scheduled runs in a specific time range between start_time and end_time (boundaries included). You can also limit the results to an individual adhoc run by specifying the run_id.

Use the following guidelines to help interpret results returned from the get_output method:

  • The filtering principle is "overlapping": a result is picked up as long as there is an overlap between the actual result time (dataset-based: target dataset [start date, end date]) and the given [start_time, end_time].

  • If there are multiple outputs for one target date because the drift calculation was executed several times against that day, then only the latest output will be picked by default.

  • Because there are multiple types of data drift instance, the contents of the results can vary.
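The "overlapping" filter above amounts to a closed-interval intersection test. A minimal sketch, with the hypothetical helper name `overlaps`:

```python
from datetime import date


def overlaps(result_start, result_end, start_time, end_time):
    """Return True if [result_start, result_end] intersects [start_time, end_time].

    Both intervals are treated as closed (boundaries included).
    """
    return result_start <= end_time and start_time <= result_end
```

A result that only touches the boundary of the requested window still counts as overlapping, which matches the "boundary included" wording.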

For dataset-based results, the output will look like:


   results : [{'drift_type': 'DatasetBased',
               'result':[{'has_drift': True, 'drift_threshold': 0.3,
                          'start_date': '2019-04-03', 'end_date': '2019-04-04',
                          'base_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                          'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9'}]}]
   metrics : [{'drift_type': 'DatasetBased',
               'metrics': [{'schema_version': '0.1',
                            'start_date': '2019-04-03', 'end_date': '2019-04-04',
                            'baseline_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                            'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9',
                            'dataset_metrics': [{'name': 'datadrift_coefficient', 'value': 0.53459}],
                            'column_metrics': [{'feature1': [{'name': 'datadrift_contribution',
                                                              'value': 288.0},
                                                             {'name': 'wasserstein_distance',
                                                              'value': 4.858040000000001},
                                                             {'name': 'energy_distance',
                                                              'value': 2.7204799576545313}]}]}]}]
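To pull the dataset-level drift coefficient out of the metrics structure, a small traversal is enough. The sketch below assumes the nested shape shown above; the helper name `drift_coefficients` is hypothetical.

```python
def drift_coefficients(metrics):
    """Map each start_date to its 'datadrift_coefficient' value.

    `metrics` is assumed to have the nested shape shown above.
    """
    coefficients = {}
    for group in metrics:
        for entry in group['metrics']:
            for metric in entry['dataset_metrics']:
                if metric['name'] == 'datadrift_coefficient':
                    coefficients[entry['start_date']] = metric['value']
    return coefficients
```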

list

Get a list of DataDriftDetector objects for the specified workspace and optional dataset.

NOTE: Passing in only the workspace parameter returns all DataDriftDetector objects defined in the workspace.

static list(workspace, baseline_dataset=None, target_dataset=None)

Parameters

workspace
Workspace
Required

The workspace where the DataDriftDetector objects were created.

baseline_dataset
TabularDataset
default value: None

Baseline dataset to filter the return list.

target_dataset
TabularDataset
default value: None

Target dataset to filter the return list.

Returns

A list of DataDriftDetector objects.

Return type

list

run

Run a single point in time data drift analysis.

run(target_date, compute_target=None, create_compute_target=False, feature_list=None, drift_threshold=None)

Parameters

target_date
datetime
Required

The target date of scoring data in UTC.

compute_target
ComputeTarget or str
default value: None

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. If not specified, a compute target is created automatically.

create_compute_target
bool
default value: False

Indicates whether an Azure Machine Learning compute target is created automatically.

feature_list
list[str]
default value: None

Optional whitelisted features to run the datadrift detection on.

drift_threshold
float
default value: None

Optional threshold to enable DataDriftDetector alerts on.

Returns

A DataDriftDetector run.

Return type

Run
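Adhoc run results can only be retrieved by run ID, so a common pattern is to start the run, wait for it, and then pass its ID to get_output. The sketch below assumes `monitor` is an existing DataDriftDetector object and that the returned Run exposes wait_for_completion() and an `id` attribute, as azureml.core.Run does; the helper name `run_and_fetch` is hypothetical.

```python
def run_and_fetch(monitor, target_date):
    """Start an adhoc run, wait for it, and return its results and metrics.

    `monitor` is assumed to be a DataDriftDetector. The returned Run is
    assumed to expose wait_for_completion() and an `id` attribute.
    """
    adhoc_run = monitor.run(target_date)
    adhoc_run.wait_for_completion()
    # Adhoc results can only be retrieved by run ID.
    results, metrics = monitor.get_output(run_id=adhoc_run.id)
    return results, metrics
```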

show

Show the data drift trend in a given time range.

By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks.

show(start_time=None, end_time=None)

Parameters

start_time
datetime, optional
default value: None

The start of the presentation time window in UTC. If None (the default), the window starts at the 10th most recent cycle.

end_time
datetime, optional
default value: None

The end of the presentation data time window in UTC. The default None means the current day.

Returns

A dictionary of all figures. The key is service_name.

Return type

dict
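The default "most recent 10 cycles" window can be approximated as follows. This is a sketch, not the SDK's own computation: it assumes the window is simply 10 times the cycle length counted back from the current time, and it leaves out "Month" because calendar months vary in length. The helper name `default_window` is hypothetical.

```python
from datetime import datetime, timedelta

# Cycle lengths for the frequencies handled by this sketch.
_CYCLE_LENGTH = {'Day': timedelta(days=1), 'Week': timedelta(weeks=1)}


def default_window(frequency, now, cycles=10):
    """Return (start_time, end_time) covering the most recent `cycles` cycles."""
    if frequency not in _CYCLE_LENGTH:
        raise ValueError('unsupported frequency in this sketch: %r' % frequency)
    return now - cycles * _CYCLE_LENGTH[frequency], now
```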

update

Update the schedule associated with the DataDriftDetector object.

Optional parameter values can be set to None; if they are omitted, they keep their existing values.

update(compute_target=Ellipsis, feature_list=Ellipsis, schedule_start=Ellipsis, alert_config=Ellipsis, drift_threshold=Ellipsis, wait_for_completion=True)

Parameters

compute_target
ComputeTarget or str
default value: Ellipsis

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if this parameter is not specified.

feature_list
list[str]
default value: Ellipsis

Whitelisted features to run the datadrift detection on.

schedule_start
datetime
default value: Ellipsis

The start time of the data drift schedule in UTC.

alert_config
AlertConfiguration
default value: Ellipsis

Optional configuration object for DataDriftDetector alerts.

drift_threshold
float
default value: Ellipsis

The threshold to enable DataDriftDetector alerts on.

wait_for_completion
bool
default value: True

Whether to wait for the enable/disable/delete operations to complete.

Returns

self

Return type

DataDriftDetector
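The Ellipsis defaults in the update signature let the method tell "argument omitted" apart from "argument explicitly set to None". The sketch below shows that sentinel pattern in isolation; it is not the SDK's implementation, and the helper name `merge_update` and the plain-dict settings are hypothetical.

```python
def merge_update(existing, compute_target=Ellipsis, feature_list=Ellipsis,
                 drift_threshold=Ellipsis):
    """Return a copy of `existing` with only explicitly passed values replaced.

    Ellipsis means "argument omitted, keep the current value"; None is a real
    value that overwrites the current one.
    """
    requested = {
        'compute_target': compute_target,
        'feature_list': feature_list,
        'drift_threshold': drift_threshold,
    }
    updated = dict(existing)
    for key, value in requested.items():
        if value is not Ellipsis:
            updated[key] = value
    return updated
```

With this pattern, passing feature_list=None clears the feature list, while leaving feature_list out keeps the current list.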

Attributes

alert_config

Get the alert configuration for the DataDriftDetector object.

Returns

An AlertConfiguration object.

Return type

AlertConfiguration

baseline_dataset

Get the baseline dataset associated with the DataDriftDetector object.

Returns

Dataset type of the baseline dataset.

Return type

TabularDataset

compute_target

Get the compute target attached to the DataDriftDetector object.

Returns

The compute target.

Return type

drift_threshold

Get the drift threshold for the DataDriftDetector object.

Returns

The drift threshold.

Return type

float

drift_type

Get the type of the DataDriftDetector. 'DatasetBased' is the only value supported for now.

Returns

The type of DataDriftDetector object.

Return type

str

enabled

Get the boolean value indicating whether the DataDriftDetector object is enabled.

Returns

A boolean value; True for enabled.

Return type

bool

feature_list

Get the list of whitelisted features for the DataDriftDetector object.

Returns

A list of feature names.

Return type

list[str]

frequency

Get the frequency of the DataDriftDetector schedule.

Returns

A string of either "Day", "Week", or "Month".

Return type

str

interval

Get the interval of the DataDriftDetector schedule.

Returns

An integer value of time unit.

Return type

int

latency

Get the latency of the DataDriftDetector schedule jobs (in hours).

Returns

The number of hours representing the latency.

Return type

int

name

Get the name of the DataDriftDetector object.

Returns

The DataDriftDetector name.

Return type

str

schedule_start

Get the start time of the schedule.

Returns

A datetime object of schedule start time in UTC.

Return type

datetime

state

Denotes the state of the DataDriftDetector schedule.

Returns

One of 'Disabled', 'Enabled', 'Deleted', 'Disabling', 'Enabling', 'Deleting', 'Failed', 'DisableFailed', 'EnableFailed', 'DeleteFailed'.

Return type

str
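When polling the schedule state, it can help to group the ten values. The grouping below is an interpretation of the state names, not something the SDK defines, and the names `TRANSITIONAL_STATES`, `FAILED_STATES`, and `describe_state` are hypothetical.

```python
# Grouping inferred from the state names; the SDK does not define these sets.
TRANSITIONAL_STATES = {'Disabling', 'Enabling', 'Deleting'}
FAILED_STATES = {'Failed', 'DisableFailed', 'EnableFailed', 'DeleteFailed'}


def describe_state(state):
    """Classify a schedule state string as 'in_progress', 'failed', or 'settled'."""
    if state in TRANSITIONAL_STATES:
        return 'in_progress'
    if state in FAILED_STATES:
        return 'failed'
    return 'settled'
```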

target_dataset

Get the target dataset associated with the DataDriftDetector object.

Returns

The dataset type of the target dataset.

Return type

TabularDataset

workspace

Get the workspace of the DataDriftDetector object.

Returns

The workspace the DataDriftDetector object was created in.

Return type

Workspace