DataDriftDetector Class

Reference

Defines a data drift monitor that can be used to run data drift jobs in Azure Machine Learning.

The DataDriftDetector class enables you to identify drift between a given baseline and target dataset. A DataDriftDetector object is created in a workspace by specifying the baseline and target datasets directly. For more information, see https://aka.ms/datadrift.

DataDriftDetector constructor.

The DataDriftDetector constructor is used to retrieve a cloud representation of a DataDriftDetector object associated with the provided workspace.

Inheritance: builtins.object -> DataDriftDetector

Constructor

DataDriftDetector(workspace, name=None, baseline_dataset=None, target_dataset=None, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)

Parameters

Name	Description
workspace Required	Workspace The workspace in which to create the DataDriftDetector object.
name	str A unique name for the DataDriftDetector object. default value: None
baseline_dataset	TabularDataset Dataset to compare the target dataset against. default value: None
target_dataset	TabularDataset Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series. default value: None
compute_target	ComputeTarget or str Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified. default value: None
frequency	str Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month". default value: None
feature_list	list[str] Optional whitelisted features to run the datadrift detection on. DataDriftDetector jobs will run on all features if `feature_list` is not specified. The feature list can contain characters, numbers, dashes, and whitespaces. The length of the list must be less than 200. default value: None
alert_config	AlertConfiguration Optional configuration object for DataDriftDetector alerts. default value: None
drift_threshold	float Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default). default value: None
latency	int Delay in hours for data to appear in dataset. default value: None

Remarks

A DataDriftDetector object represents a data drift job definition that can be used to run three job run types:

an adhoc run for analyzing a specific day's worth of data; see the run method.
a scheduled run in a pipeline; see the enable_schedule method.
a backfill run to see how data changes over time; see the backfill method.
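
As a rough sketch of how the three run types map onto method calls, the helper below accepts any object that exposes the `run`, `enable_schedule`, and `backfill` methods documented on this page. The names `start_all_run_types` and `RecordingDetector` are hypothetical; the stand-in class exists only so the snippet runs without a workspace. With a real monitor you would pass the DataDriftDetector object instead.

```python
from datetime import datetime, timedelta


def start_all_run_types(detector, target_date, backfill_days):
    """Trigger an adhoc run, a schedule, and a backfill on a detector-like object."""
    # Adhoc run: analyze a single target date.
    adhoc_run = detector.run(target_date)
    # Scheduled run: create the recurring schedule.
    detector.enable_schedule()
    # Backfill run: analyze a historical window that ends at the target date.
    start_date = target_date - timedelta(days=backfill_days)
    backfill_run = detector.backfill(start_date, target_date)
    return adhoc_run, backfill_run


class RecordingDetector:
    """Hypothetical stand-in that records calls instead of submitting jobs."""

    def __init__(self):
        self.calls = []

    def run(self, target_date, **kwargs):
        self.calls.append(("run", target_date))
        return "adhoc-run"

    def enable_schedule(self, **kwargs):
        self.calls.append(("enable_schedule",))

    def backfill(self, start_date, end_date, **kwargs):
        self.calls.append(("backfill", start_date, end_date))
        return "backfill-run"


stub = RecordingDetector()
runs = start_all_run_types(stub, datetime(2019, 4, 10), backfill_days=7)
print(runs)
print([call[0] for call in stub.calls])
```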

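The typical pattern for creating a DataDriftDetector is to call create_from_datasets, which creates a dataset-based DataDriftDetector object.

The following example shows how to create a dataset-based DataDriftDetector object. It assumes that `ws` is a Workspace and that `baseline` and `target` are the baseline and target TabularDataset objects.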


   from azureml.datadrift import DataDriftDetector, AlertConfiguration

   alert_config = AlertConfiguration(['user@contoso.com']) # replace with your email to receive alerts from the scheduled pipeline after enabling

   monitor = DataDriftDetector.create_from_datasets(ws, 'weather-monitor', baseline, target,
                                                         compute_target='cpu-cluster',         # compute target for scheduled pipeline and backfills
                                                         frequency='Week',                     # how often to analyze target data
                                                         feature_list=None,                    # list of features to detect drift on
                                                         drift_threshold=None,                 # threshold from 0 to 1 for email alerting
                                                         latency=0,                            # SLA in hours for target data to arrive in the dataset
                                                         alert_config=alert_config)            # email addresses to send alert

Full sample is available from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datadrift-tutorial/datadrift-tutorial.ipynb

The DataDriftDetector constructor retrieves an existing data drift object associated with the workspace.

Methods

backfill	Run a backfill job over a specified start and end date. See https://aka.ms/datadrift for details on data drift backfill runs. NOTE: Backfill is only supported on dataset-based DataDriftDetector objects.
create_from_datasets	Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset.
delete	Delete the schedule for the DataDriftDetector object.
disable_schedule	Disable the schedule for the DataDriftDetector object.
enable_schedule	Create a schedule to run dataset-based DataDriftDetector job.
get_by_name	Retrieve a unique DataDriftDetector object for a given workspace and name.
get_output	Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window.
list	Get a list of DataDriftDetector objects for the specified workspace and optional dataset. NOTE: Passing in only the `workspace` parameter will return all DataDriftDetector objects defined in the workspace.
run	Run a single point in time data drift analysis.
show	Show data drift trend in given time range. By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks.
update	Update the schedule associated with the DataDriftDetector object. Optional parameter values can be set to `None`; otherwise, they default to their existing values.

backfill

Run a backfill job over a specified start and end date.

See https://aka.ms/datadrift for details on data drift backfill runs.

NOTE: Backfill is only supported on dataset-based DataDriftDetector objects.

backfill(start_date, end_date, compute_target=None, create_compute_target=False)

Parameters

Name	Description
start_date Required	datetime The start date of the backfill job.
end_date Required	datetime The end date of the backfill job, inclusive.
compute_target	ComputeTarget or str Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if none is specified. default value: None
create_compute_target	bool Indicates whether an Azure Machine Learning compute target is automatically created. default value: False

Returns

Type	Description
Run	A DataDriftDetector run.

create_from_datasets

Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset.

static create_from_datasets(workspace, name, baseline_dataset, target_dataset, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)

Parameters

Name	Description
workspace Required	Workspace The workspace to create the DataDriftDetector in.
name Required	str A unique name for the DataDriftDetector object.
baseline_dataset Required	TabularDataset Dataset to compare the target dataset against.
target_dataset Required	TabularDataset Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series.
compute_target	ComputeTarget or str Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified. default value: None
frequency	str Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month". default value: None
feature_list	list[str] Optional whitelisted features to run the datadrift detection on. DataDriftDetector jobs will run on all features if `feature_list` is not specified. The feature list can contain characters, numbers, dashes, and whitespaces. The length of the list must be less than 200. default value: None
alert_config	AlertConfiguration Optional configuration object for DataDriftDetector alerts. default value: None
drift_threshold	float Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default). default value: None
latency	int Delay in hours for data to appear in dataset. default value: None

Returns

Type	Description
DataDriftDetector	A DataDriftDetector object.

Exceptions

Type	Description
KeyError, TypeError, ValueError

Remarks

Dataset-based DataDriftDetectors enable you to calculate data drift between a baseline dataset, which must be a TabularDataset, and a target dataset, which must be a time series dataset. A time series dataset is simply a TabularDataset with the fine_grain_timestamp property. The DataDriftDetector can then run adhoc or scheduled jobs to determine if the target dataset has drifted from the baseline dataset.


   from azureml.core import Workspace, Dataset
   from azureml.datadrift import DataDriftDetector, AlertConfiguration

   ws = Workspace.from_config()
   baseline = Dataset.get_by_name(ws, 'my_baseline_dataset')
   target = Dataset.get_by_name(ws, 'my_target_dataset')

   detector = DataDriftDetector.create_from_datasets(workspace=ws,
                                                     name="my_unique_detector_name",
                                                     baseline_dataset=baseline,
                                                     target_dataset=target,
                                                     compute_target='my_compute_target',
                                                     frequency="Day",
                                                     feature_list=['my_feature_1', 'my_feature_2'],
                                                     alert_config=AlertConfiguration(email_addresses=['user@contoso.com']),
                                                     drift_threshold=0.3,
                                                     latency=1)
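
The limits listed in the parameter table (supported frequencies, fewer than 200 features, a threshold between 0 and 1 with a fallback of 0.2) can be checked before calling the service. The following is a minimal sketch, not the SDK's own validation: `check_detector_settings` is a hypothetical helper, the bounds 0 and 1 are assumed to be inclusive, and raising `ValueError` is an assumption about how to report a violation.

```python
VALID_FREQUENCIES = ("Day", "Week", "Month")
DEFAULT_DRIFT_THRESHOLD = 0.2  # documented fallback when drift_threshold is None
MAX_FEATURES = 200             # documented: list length must be less than 200


def check_detector_settings(frequency=None, feature_list=None, drift_threshold=None):
    """Check the documented limits and return the effective drift threshold."""
    if frequency is not None and frequency not in VALID_FREQUENCIES:
        raise ValueError(f"frequency must be one of {VALID_FREQUENCIES}, got {frequency!r}")
    if feature_list is not None and len(feature_list) >= MAX_FEATURES:
        raise ValueError("feature_list must contain fewer than 200 entries")
    if drift_threshold is None:
        return DEFAULT_DRIFT_THRESHOLD
    # Assumption: both bounds are inclusive.
    if not 0 <= drift_threshold <= 1:
        raise ValueError("drift_threshold must be between 0 and 1")
    return drift_threshold


effective = check_detector_settings(frequency="Day",
                                    feature_list=["my_feature_1", "my_feature_2"],
                                    drift_threshold=0.3)
print(effective)
```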

delete

Delete the schedule for the DataDriftDetector object.

delete(wait_for_completion=True)

Parameters

Name	Description
wait_for_completion	bool Whether to wait for the delete operation to complete. default value: True

disable_schedule

Disable the schedule for the DataDriftDetector object.

disable_schedule(wait_for_completion=True)

Parameters

Name	Description
wait_for_completion	bool Whether to wait for the disable operation to complete. default value: True

enable_schedule

Create a schedule to run dataset-based DataDriftDetector job.

enable_schedule(create_compute_target=False, wait_for_completion=True)

Parameters

Name	Description
create_compute_target	bool Indicates whether an Azure Machine Learning compute target is created automatically. default value: False
wait_for_completion	bool Whether to wait for the enable operation to complete. default value: True

get_by_name

Retrieve a unique DataDriftDetector object for a given workspace and name.

static get_by_name(workspace, name)

Parameters

Name	Description
workspace Required	Workspace The workspace where the DataDriftDetector was created.
name Required	str The name of the DataDriftDetector object to return.

Returns

Type	Description
DataDriftDetector	A DataDriftDetector object.

get_output

Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window.

get_output(start_time=None, end_time=None, run_id=None)

Parameters

Name	Description
start_time	datetime, optional The start time of the results window in UTC. If None (the default) is specified, the start time is set to the 10th most recent cycle. For example, if the frequency of the data drift schedule is day, `start_time` is 10 days back; if the frequency is week, `start_time` is 10 weeks back. default value: None
end_time	datetime, optional The end time of the results window in UTC. If None (the default) is specified, then the current day UTC is used as the end time. default value: None
run_id	int, optional A specific run ID. default value: None

Returns

Type	Description
tuple(list, list)	A tuple of a list of drift results and a list of individual dataset and columnar metrics.
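
To make the default window concrete, the sketch below computes a start and end time from the schedule frequency, going back 10 cycles from the current UTC time. It is an illustration only: `default_window` is a hypothetical helper, and treating a "Month" cycle as 30 days is an assumption made here, not something the service is documented to do.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a "Month" cycle is approximated as 30 days for illustration only.
CYCLE_LENGTH = {
    "Day": timedelta(days=1),
    "Week": timedelta(weeks=1),
    "Month": timedelta(days=30),
}


def default_window(frequency, now=None, cycles=10):
    """Return (start_time, end_time) covering the most recent `cycles` cycles."""
    end_time = now or datetime.now(timezone.utc)
    start_time = end_time - cycles * CYCLE_LENGTH[frequency]
    return start_time, end_time


reference = datetime(2019, 4, 10, tzinfo=timezone.utc)
window_start, window_end = default_window("Week", now=reference)
print(window_start.isoformat(), window_end.isoformat())
```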

Remarks

This method returns a tuple of drift results and metrics for either a time window or a run ID, depending on the type of run: an adhoc run, a scheduled run, or a backfill run.

To retrieve adhoc run results, pass a valid GUID as run_id; this is the only supported way.
To retrieve scheduled run and backfill run results, either assign a valid GUID to run_id, or assign a specific start_time and/or end_time (inclusive) while keeping run_id as None.
If run_id, start_time, and end_time are not None in the same method call, a parameter validation exception is raised.

NOTE: Specify either start_time and end_time parameters or the run_id parameter, but not both.
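
The rule in the note can be expressed as a small argument check. This sketch takes the stricter reading (any time bound combined with a run ID is rejected); `resolve_query_mode` is a hypothetical name, and using `ValueError` is an assumption, since the exact exception type raised by the SDK is not stated here.

```python
def resolve_query_mode(start_time=None, end_time=None, run_id=None):
    """Decide whether a query is by run ID or by time window."""
    has_window = start_time is not None or end_time is not None
    if run_id is not None and has_window:
        # Assumption: ValueError stands in for the parameter validation exception.
        raise ValueError("Specify either run_id or start_time/end_time, not both")
    return "run_id" if run_id is not None else "time_window"


print(resolve_query_mode(run_id="4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d"))
print(resolve_query_mode(start_time="2019-04-03", end_time="2019-04-04"))
```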

Multiple results can exist for the same target date (for dataset-based drift, the target date is the target dataset start date), so duplicates must be identified and handled. For dataset-based drift, results with the same target date are considered duplicates. The get_output method deduplicates them by one rule: it always keeps the most recently generated result.
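
The "keep the latest" rule can be sketched in a few lines. The `generated_at` field below is hypothetical (the real result records shown on this page do not expose a generation timestamp), so treat this as an illustration of the rule rather than the method's actual implementation.

```python
def keep_latest_per_target_date(results):
    """Keep one result per start_date: the one with the newest generated_at."""
    latest = {}
    for result in results:
        key = result["start_date"]
        current = latest.get(key)
        # ISO 8601 strings in the same format compare correctly as text.
        if current is None or result["generated_at"] > current["generated_at"]:
            latest[key] = result
    return [latest[key] for key in sorted(latest)]


raw_results = [
    {"start_date": "2019-04-03", "generated_at": "2019-04-04T01:00:00", "has_drift": False},
    {"start_date": "2019-04-03", "generated_at": "2019-04-05T09:30:00", "has_drift": True},
    {"start_date": "2019-04-04", "generated_at": "2019-04-05T01:00:00", "has_drift": False},
]
deduplicated = keep_latest_per_target_date(raw_results)
print(deduplicated)
```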

The get_output method can be used to retrieve all outputs or partial outputs of scheduled runs in a specific time range between start_time and end_time (boundary included). You can also limit the results to an individual adhoc run by specifying the run_id.

Use the following guidelines to help interpret results returned from the get_output method:

Principle for filtering is "overlapping": as long as there is an overlap between the actual result time (dataset-based: target dataset [start date, end date]) and the given [start_time, end_time], then the result will be picked up.
If there are multiple outputs for one target date because the drift calculation was executed several times against that day, then only the latest output will be picked by default.
Because there are multiple types of data drift instances, the contents of the result can vary.
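
The "overlapping" principle is the standard closed-interval intersection test: a result is picked up when its [start date, end date] shares at least one day with the requested [start_time, end_time]. A minimal sketch, with `overlaps` as a hypothetical helper name:

```python
from datetime import date


def overlaps(result_start, result_end, query_start, query_end):
    """Return True when two closed date ranges share at least one point."""
    return result_start <= query_end and query_start <= result_end


# Result covers April 3-4; the query window ends on April 3, so they still overlap.
touching = overlaps(date(2019, 4, 3), date(2019, 4, 4), date(2019, 4, 1), date(2019, 4, 3))
# The query window ends on April 2, before the result starts.
disjoint = overlaps(date(2019, 4, 3), date(2019, 4, 4), date(2019, 4, 1), date(2019, 4, 2))
print(touching, disjoint)
```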

For dataset-based results, the output will look like:


   results : [{'drift_type': 'DatasetBased',
               'result':[{'has_drift': True, 'drift_threshold': 0.3,
                          'start_date': '2019-04-03', 'end_date': '2019-04-04',
                          'base_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                          'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9'}]}]
   metrics : [{'drift_type': 'DatasetBased',
               'metrics': [{'schema_version': '0.1',
                            'start_date': '2019-04-03', 'end_date': '2019-04-04',
                            'baseline_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                            'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9',
                            'dataset_metrics': [{'name': 'datadrift_coefficient', 'value': 0.53459}],
                            'column_metrics': [{'feature1': [{'name': 'datadrift_contribution',
                                                              'value': 288.0},
                                                             {'name': 'wasserstein_distance',
                                                              'value': 4.858040000000001},
                                                             {'name': 'energy_distance',
                                                              'value': 2.7204799576545313}]}]}]}]
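
One way to read this structure is to walk the nested lists and pick out the values of interest. The sketch below works on a trimmed copy of the sample metrics shown above; `drift_coefficients` and `column_metric` are hypothetical helper names, and the code assumes every entry follows the layout of the sample.

```python
sample_metrics = [{
    "drift_type": "DatasetBased",
    "metrics": [{
        "schema_version": "0.1",
        "start_date": "2019-04-03",
        "end_date": "2019-04-04",
        "dataset_metrics": [{"name": "datadrift_coefficient", "value": 0.53459}],
        "column_metrics": [{"feature1": [
            {"name": "datadrift_contribution", "value": 288.0},
            {"name": "wasserstein_distance", "value": 4.858040000000001},
            {"name": "energy_distance", "value": 2.7204799576545313},
        ]}],
    }],
}]


def drift_coefficients(metrics):
    """Return (start_date, coefficient) pairs from a get_output-style metrics list."""
    pairs = []
    for group in metrics:
        for entry in group["metrics"]:
            for metric in entry["dataset_metrics"]:
                if metric["name"] == "datadrift_coefficient":
                    pairs.append((entry["start_date"], metric["value"]))
    return pairs


def column_metric(metrics, feature, metric_name):
    """Return the values of one named metric for one feature."""
    values = []
    for group in metrics:
        for entry in group["metrics"]:
            for columns in entry["column_metrics"]:
                for metric in columns.get(feature, []):
                    if metric["name"] == metric_name:
                        values.append(metric["value"])
    return values


print(drift_coefficients(sample_metrics))
print(column_metric(sample_metrics, "feature1", "wasserstein_distance"))
```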

list

Get a list of DataDriftDetector objects for the specified workspace and optional dataset.

NOTE: Passing in only the workspace parameter will return all DataDriftDetector objects defined in the workspace.

static list(workspace, baseline_dataset=None, target_dataset=None)

Parameters

Name	Description
workspace Required	Workspace The workspace where the DataDriftDetector objects were created.
baseline_dataset	TabularDataset Baseline dataset to filter the return list. default value: None
target_dataset	TabularDataset Target dataset to filter the return list. default value: None

Returns

Type	Description
list[DataDriftDetector]	A list of DataDriftDetector objects.

run

Run a single point in time data drift analysis.

run(target_date, compute_target=None, create_compute_target=False, feature_list=None, drift_threshold=None)

Parameters

Name	Description
target_date Required	datetime The target date of scoring data in UTC.
compute_target	ComputeTarget or str Optional Azure Machine Learning ComputeTarget or ComputeTarget name. If not specified, a compute target is created automatically. default value: None
create_compute_target	bool Indicates whether an Azure Machine Learning compute target is created automatically. default value: False
feature_list	list[str] Optional whitelisted features to run the datadrift detection on. default value: None
drift_threshold	float Optional threshold to enable DataDriftDetector alerts on. default value: None

Returns

Type	Description
Run	A DataDriftDetector run.

show

Show data drift trend in given time range.

By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks.

show(start_time=None, end_time=None)

Parameters

Name	Description
start_time	datetime, optional The start of the presentation time window in UTC. The default None means to pick up the most recent 10th cycle's results. default value: None
end_time	datetime, optional The end of the presentation data time window in UTC. The default None means the current day. default value: None

Returns

Type	Description
dict()	A dictionary of all figures. The key is service_name.

update

Update the schedule associated with the DataDriftDetector object.

Optional parameter values can be set to None; otherwise, they default to their existing values.

update(compute_target=Ellipsis, feature_list=Ellipsis, schedule_start=Ellipsis, alert_config=Ellipsis, drift_threshold=Ellipsis, wait_for_completion=True)

Parameters

Name	Description
compute_target	ComputeTarget or str Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if this parameter is not specified. default value: Ellipsis
feature_list	list[str] Whitelisted features to run the datadrift detection on. default value: Ellipsis
schedule_start	datetime The start time of the data drift schedule in UTC. default value: Ellipsis
alert_config	AlertConfiguration Optional configuration object for DataDriftDetector alerts. default value: Ellipsis
drift_threshold	float The threshold to enable DataDriftDetector alerts on. default value: Ellipsis
wait_for_completion	bool Whether to wait for the enable/disable/delete operations to complete. default value: True

Returns

Type	Description
DataDriftDetector	self
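
The `Ellipsis` defaults in the signature let the method tell "argument omitted" apart from "argument explicitly set to None". The sketch below shows that sentinel pattern on a plain dictionary; `merge_settings` is a hypothetical helper written for illustration and is not how the SDK is known to implement `update`.

```python
def merge_settings(current, compute_target=..., feature_list=..., drift_threshold=...):
    """Return a copy of `current` where only explicitly passed values are replaced."""
    updates = {
        "compute_target": compute_target,
        "feature_list": feature_list,
        "drift_threshold": drift_threshold,
    }
    merged = dict(current)
    for key, value in updates.items():
        # Ellipsis means the caller omitted the argument, so keep the existing value.
        if value is not ...:
            merged[key] = value
    return merged


existing = {"compute_target": "cpu-cluster", "feature_list": ["a", "b"], "drift_threshold": 0.3}
# Passing None explicitly clears the threshold; the omitted settings are untouched.
updated = merge_settings(existing, drift_threshold=None)
print(updated)
```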

Attributes

alert_config

Get the alert configuration for the DataDriftDetector object.

Returns

Type	Description
AlertConfiguration	An AlertConfiguration object.

baseline_dataset

Get the baseline dataset associated with the DataDriftDetector object.

Returns

Type	Description
TabularDataset	Dataset type of the baseline dataset.

compute_target

Get the compute target attached to the DataDriftDetector object.

Returns

Type	Description
ComputeTarget	The compute target.

drift_threshold

Get the drift threshold for the DataDriftDetector object.

Returns

Type	Description
float	The drift threshold.

drift_type

Get the type of the DataDriftDetector. 'DatasetBased' is the only value supported for now.

Returns

Type	Description
str	The type of DataDriftDetector object.

enabled

Get the boolean value indicating whether the DataDriftDetector object is enabled.

Returns

Type	Description
bool	A boolean value; True for enabled.

feature_list

Get the list of whitelisted features for the DataDriftDetector object.

Returns

Type	Description
list[str]	A list of feature names.

frequency

Get the frequency of the DataDriftDetector schedule.

Returns

Type	Description
str	A string of either "Day", "Week", or "Month".

interval

Get the interval of the DataDriftDetector schedule.

Returns

Type	Description
int	An integer value of time unit.

latency

Get the latency of the DataDriftDetector schedule jobs (in hours).

Returns

Type	Description
int	The number of hours representing the latency.

name

Get the name of the DataDriftDetector object.

Returns

Type	Description
str	The DataDriftDetector name.

schedule_start

Get the start time of the schedule.

Returns

Type	Description
datetime	A datetime object of schedule start time in UTC.

state

Denotes the state of the DataDriftDetector schedule.

Returns

Type	Description
str	One of 'Disabled', 'Enabled', 'Deleted', 'Disabling', 'Enabling', 'Deleting', 'Failed', 'DisableFailed', 'EnableFailed', 'DeleteFailed'.
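
When polling a schedule, it can help to group these values. The grouping below is inferred from the state names alone (for example, that 'Enabling' is a transitional state that later becomes 'Enabled' or 'EnableFailed'); it is an assumption, not documented behavior, and `classify_state` is a hypothetical helper.

```python
# Assumption: these groupings are inferred from the state names only.
STABLE_STATES = {"Disabled", "Enabled", "Deleted"}
TRANSITIONAL_STATES = {"Disabling", "Enabling", "Deleting"}
FAILED_STATES = {"Failed", "DisableFailed", "EnableFailed", "DeleteFailed"}


def classify_state(state):
    """Map a schedule state string to 'stable', 'transitional', or 'failed'."""
    if state in STABLE_STATES:
        return "stable"
    if state in TRANSITIONAL_STATES:
        return "transitional"
    if state in FAILED_STATES:
        return "failed"
    raise ValueError(f"Unknown state: {state!r}")


print(classify_state("Enabling"))
print(classify_state("EnableFailed"))
```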

target_dataset

Get the target dataset associated with the DataDriftDetector object.

Returns

Type	Description
TabularDataset	The dataset type of the target dataset.

workspace

Get the workspace of the DataDriftDetector object.

Returns

Type	Description
Workspace	The workspace the DataDriftDetector object was created in.
