DataDriftDetector Class

Defines a data drift monitor that can be used to run data drift jobs in Azure Machine Learning.

The DataDriftDetector class enables you to identify drift between a given baseline and target dataset. A DataDriftDetector object is created in a workspace by specifying the baseline and target datasets directly. For more information, see https://aka.ms/datadrift.

DataDriftDetector constructor.

The DataDriftDetector constructor is used to retrieve a cloud representation of a DataDriftDetector object associated with the provided workspace.

Inheritance
builtins.object
DataDriftDetector

Constructor

DataDriftDetector(workspace, name=None, baseline_dataset=None, target_dataset=None, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)

Parameters

Name Description
workspace
Required

The workspace in which to create the DataDriftDetector object.

name
str

A unique name for the DataDriftDetector object.

Default value: None
baseline_dataset

Dataset to compare the target dataset against.

Default value: None
target_dataset

Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series.

Default value: None
compute_target

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified.

Default value: None
frequency
str

Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month".

Default value: None
feature_list

Optional whitelisted features to run the datadrift detection on. DataDriftDetector jobs will run on all features if feature_list is not specified. The feature list can contain characters, numbers, dashes, and whitespaces. The length of the list must be less than 200.

Default value: None
alert_config

Optional configuration object for DataDriftDetector alerts.

Default value: None
drift_threshold

Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default).

Default value: None
latency
int

Delay in hours for data to appear in dataset.

Default value: None

Remarks

A DataDriftDetector object represents a data drift job definition that can be used to run three job run types:

  • an adhoc run for analyzing a specific day's worth of data; see the run method.

  • a scheduled run in a pipeline; see the enable_schedule method.

  • a backfill run to see how data changes over time; see the backfill method.

The following example shows the typical pattern for creating a dataset-based DataDriftDetector object.


   from azureml.datadrift import DataDriftDetector, AlertConfiguration

   alert_config = AlertConfiguration(['user@contoso.com']) # replace with your email to receive alerts from the scheduled pipeline after enabling

   monitor = DataDriftDetector.create_from_datasets(ws, 'weather-monitor', baseline, target,
                                                         compute_target='cpu-cluster',         # compute target for scheduled pipeline and backfills
                                                         frequency='Week',                     # how often to analyze target data
                                                         feature_list=None,                    # list of features to detect drift on
                                                         drift_threshold=None,                 # threshold from 0 to 1 for email alerting
                                                         latency=0,                            # SLA in hours for target data to arrive in the dataset
                                                         alert_config=alert_config)            # email addresses to send alert

Full sample is available from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datadrift-tutorial/datadrift-tutorial.ipynb

The DataDriftDetector constructor retrieves an existing data drift object associated with the workspace.

Methods

backfill

Run a backfill job between a specified start date and end date.

See https://aka.ms/datadrift for details on data drift backfill runs.

NOTE: Backfill is only supported on dataset-based DataDriftDetector objects.

create_from_datasets

Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset.

delete

Delete the schedule for the DataDriftDetector object.

disable_schedule

Disable the schedule for the DataDriftDetector object.

enable_schedule

Create a schedule to run a dataset-based DataDriftDetector job.

get_by_name

Retrieve a unique DataDriftDetector object for a given workspace and name.

get_output

Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window.

list

Get a list of DataDriftDetector objects for the specified workspace and optional dataset.

NOTE: Passing in only the workspace parameter returns all DataDriftDetector objects defined in the workspace.

run

Run a single point in time data drift analysis.

show

Show data drift trend in given time range.

By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks.

update

Update the schedule associated with the DataDriftDetector object.

Optional parameter values can be set to None. Parameters that are not passed keep their existing values.

backfill

Run a backfill job between a specified start date and end date.

See https://aka.ms/datadrift for details on data drift backfill runs.

NOTE: Backfill is only supported on dataset-based DataDriftDetector objects.

backfill(start_date, end_date, compute_target=None, create_compute_target=False)

Parameters

Name Description
start_date
Required

The start date of the backfill job.

end_date
Required

The end date of the backfill job, inclusive.

compute_target

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if none is specified.

Default value: None
create_compute_target

Indicates whether an Azure Machine Learning compute target is automatically created.

Default value: False

Returns

Type Description
Run

A DataDriftDetector run.

create_from_datasets

Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset.

static create_from_datasets(workspace, name, baseline_dataset, target_dataset, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)

Parameters

Name Description
workspace
Required

The workspace to create the DataDriftDetector in.

name
Required
str

A unique name for the DataDriftDetector object.

baseline_dataset
Required

Dataset to compare the target dataset against.

target_dataset
Required

Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series.

compute_target

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified.

Default value: None
frequency
str

Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month".

Default value: None
feature_list

Optional whitelisted features to run the datadrift detection on. DataDriftDetector jobs will run on all features if feature_list is not specified. The feature list can contain characters, numbers, dashes, and whitespaces. The length of the list must be less than 200.

Default value: None
alert_config

Optional configuration object for DataDriftDetector alerts.

Default value: None
drift_threshold

Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default).

Default value: None
latency
int

Delay in hours for data to appear in dataset.

Default value: None

Returns

Type Description

A DataDriftDetector object.

Exceptions

Type Description
KeyError, TypeError, ValueError

Remarks

Dataset-based DataDriftDetectors enable you to calculate data drift between a baseline dataset, which must be a TabularDataset, and a target dataset, which must be a time series dataset. A time series dataset is simply a TabularDataset with the fine_grain_timestamp property. The DataDriftDetector can then run adhoc or scheduled jobs to determine if the target dataset has drifted from the baseline dataset.


   from azureml.core import Workspace, Dataset
   from azureml.datadrift import DataDriftDetector, AlertConfiguration

   ws = Workspace.from_config()
   baseline = Dataset.get_by_name(ws, 'my_baseline_dataset')
   target = Dataset.get_by_name(ws, 'my_target_dataset')

   detector = DataDriftDetector.create_from_datasets(workspace=ws,
                                                     name="my_unique_detector_name",
                                                     baseline_dataset=baseline,
                                                     target_dataset=target,
                                                     compute_target='my_compute_target',
                                                     frequency="Day",
                                                     feature_list=['my_feature_1', 'my_feature_2'],
                                                     alert_config=AlertConfiguration(email_addresses=['user@contoso.com']),
                                                     drift_threshold=0.3,
                                                     latency=1)
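
The documented limits on feature_list and drift_threshold can be checked locally before calling create_from_datasets. The following is a minimal sketch, not part of the SDK: check_detector_args is a hypothetical helper, and treating 0 and 1 as valid threshold values is an assumption, since the reference only says "between 0 and 1".

```python
DEFAULT_DRIFT_THRESHOLD = 0.2  # documented fallback when drift_threshold is None
MAX_FEATURES = 200             # documented: list length must be less than 200


def check_detector_args(feature_list=None, drift_threshold=None):
    """Hypothetical pre-flight check; returns the effective drift threshold."""
    if feature_list is not None and len(feature_list) >= MAX_FEATURES:
        raise ValueError("feature_list must contain fewer than 200 features")
    if drift_threshold is None:
        return DEFAULT_DRIFT_THRESHOLD
    # Assumption: the bounds 0 and 1 are themselves accepted.
    if not 0 <= drift_threshold <= 1:
        raise ValueError("drift_threshold must be between 0 and 1")
    return drift_threshold
```

Calling check_detector_args(['my_feature_1', 'my_feature_2'], 0.3) returns 0.3, and calling it with no arguments returns the documented default of 0.2.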

delete

Delete the schedule for the DataDriftDetector object.

delete(wait_for_completion=True)

Parameters

Name Description
wait_for_completion

Whether to wait for the delete operation to complete.

Default value: True

disable_schedule

Disable the schedule for the DataDriftDetector object.

disable_schedule(wait_for_completion=True)

Parameters

Name Description
wait_for_completion

Whether to wait for the disable operation to complete.

Default value: True

enable_schedule

Create a schedule to run a dataset-based DataDriftDetector job.

enable_schedule(create_compute_target=False, wait_for_completion=True)

Parameters

Name Description
create_compute_target

Indicates whether an Azure Machine Learning compute target is created automatically.

Default value: False
wait_for_completion

Whether to wait for the enable operation to complete.

Default value: True

get_by_name

Retrieve a unique DataDriftDetector object for a given workspace and name.

static get_by_name(workspace, name)

Parameters

Name Description
workspace
Required

The workspace where the DataDriftDetector was created.

name
Required
str

The name of the DataDriftDetector object to return.

Returns

Type Description

A DataDriftDetector object.

get_output

Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window.

get_output(start_time=None, end_time=None, run_id=None)

Parameters

Name Description
start_time
datetime, optional

The start time of the results window in UTC. If None (the default), the window starts at the 10th most recent cycle. For example, if the frequency of the data drift schedule is day, the window covers the most recent 10 days. If the frequency is week, it covers the most recent 10 weeks.

Default value: None
end_time
datetime, optional

The end time of the results window in UTC. If None (the default) is specified, then the current day UTC is used as the end time.

Default value: None
run_id
int, optional

A specific run ID.

Default value: None

Returns

Type Description

A tuple of a list of drift results and a list of individual dataset and columnar metrics.

Remarks

This method returns a tuple of drift results and metrics for either a time window or a run ID, depending on the type of run: an adhoc run, a scheduled run, or a backfill run.

  • To retrieve adhoc run results, there is only one way: run_id should be a valid GUID.

  • To retrieve scheduled runs and backfill run results, there are two different ways: either assign a valid GUID to run_id or assign a specific start_time and/or end_time (inclusive) while keeping run_id as None.

  • If run_id, start_time, and end_time are not None in the same method call, a parameter validation exception is raised.

NOTE: Specify either start_time and end_time parameters or the run_id parameter, but not both.
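
The argument rules above can be sketched as a small check. This is an illustration only: check_get_output_args is a hypothetical helper, not an SDK function, and it takes the stricter reading of the NOTE, rejecting run_id combined with either time bound.

```python
from datetime import datetime


def check_get_output_args(start_time=None, end_time=None, run_id=None):
    """Hypothetical helper: report which lookup mode the arguments select."""
    has_window = start_time is not None or end_time is not None
    if run_id is not None and has_window:
        # Assumption: any mix of run_id with a time bound is rejected.
        raise ValueError("Specify run_id or start_time/end_time, not both")
    if run_id is not None:
        return "run_id"
    return "window" if has_window else "default"


mode = check_get_output_args(start_time=datetime(2019, 4, 3))
```

Here mode is "window". Passing no arguments gives "default", which corresponds to the documented fallback of the most recent 10 cycles.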

There can be multiple results for the same target date (for dataset-based drift, the target date is the target dataset start date), so duplicate results must be identified and handled. For dataset-based drift, results that share a target date are duplicates. The get_output method deduplicates them with one rule: it always picks the most recently generated result.
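
The "latest result wins" rule can be sketched in plain Python. This is not the SDK's implementation: the generated_at field is a hypothetical timestamp invented for the illustration, since the reference does not say which field records when a result was produced.

```python
def keep_latest_per_target_date(results):
    """Keep one result per start_date, preferring the latest 'generated_at'."""
    latest = {}
    for result in results:
        key = result["start_date"]
        # ISO-8601 strings in the same format compare chronologically.
        if key not in latest or result["generated_at"] > latest[key]["generated_at"]:
            latest[key] = result
    return sorted(latest.values(), key=lambda r: r["start_date"])


sample = [
    {"start_date": "2019-04-03", "generated_at": "2019-04-05T01:00:00", "has_drift": False},
    {"start_date": "2019-04-03", "generated_at": "2019-04-06T01:00:00", "has_drift": True},
    {"start_date": "2019-04-04", "generated_at": "2019-04-05T02:00:00", "has_drift": False},
]
kept = keep_latest_per_target_date(sample)
```

In this sample, the two entries for 2019-04-03 collapse into the one generated on 2019-04-06.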

The get_output method can be used to retrieve all or some of the outputs of scheduled runs in a specific time range between start_time and end_time (boundaries included). You can also limit the results to an individual adhoc run by specifying the run_id.

Use the following guidelines to help interpret results returned from the get_output method:

  • Principle for filtering is "overlapping": as long as there is an overlap between the actual result time (dataset-based: target dataset [start date, end date]) and the given [start_time, end_time], then the result will be picked up.

  • If there are multiple outputs for one target date because the drift calculation was executed several times against that day, then only the latest output will be picked by default.

  • Because there are multiple types of data drift instance, the contents of the results can vary.
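
The "overlapping" filter in the first guideline is an inclusive interval test. A minimal sketch follows; overlaps is a hypothetical helper name, and dates are used for brevity where the method itself takes datetime values.

```python
from datetime import date


def overlaps(result_start, result_end, start_time, end_time):
    """True if [result_start, result_end] and [start_time, end_time] share a point."""
    return result_start <= end_time and start_time <= result_end


# A result for 2019-04-03..2019-04-04 is picked up by a window starting 2019-04-04.
touching = overlaps(date(2019, 4, 3), date(2019, 4, 4), date(2019, 4, 4), date(2019, 4, 10))
# The same result is skipped by a window that starts one day later.
disjoint = overlaps(date(2019, 4, 3), date(2019, 4, 4), date(2019, 4, 5), date(2019, 4, 10))
```

Because both bounds are inclusive, a result that only touches the edge of the requested window still counts as overlapping.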

For dataset-based results, the output will look like:


   results : [{'drift_type': 'DatasetBased',
               'result':[{'has_drift': True, 'drift_threshold': 0.3,
                          'start_date': '2019-04-03', 'end_date': '2019-04-04',
                          'base_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                          'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9'}]}]
   metrics : [{'drift_type': 'DatasetBased',
               'metrics': [{'schema_version': '0.1',
                            'start_date': '2019-04-03', 'end_date': '2019-04-04',
                            'baseline_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                            'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9',
                            'dataset_metrics': [{'name': 'datadrift_coefficient', 'value': 0.53459}],
                            'column_metrics': [{'feature1': [{'name': 'datadrift_contribution',
                                                              'value': 288.0},
                                                             {'name': 'wasserstein_distance',
                                                              'value': 4.858040000000001},
                                                             {'name': 'energy_distance',
                                                              'value': 2.7204799576545313}]}]}]}]
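
The nested structure above can be walked with ordinary loops. The sketch below uses a trimmed copy of the sample output rather than a live get_output call, and the helper names drift_coefficients and drifted_dates are illustrative, not SDK functions.

```python
# Trimmed copy of the documented sample output (values copied from it).
results = [{'drift_type': 'DatasetBased',
            'result': [{'has_drift': True, 'drift_threshold': 0.3,
                        'start_date': '2019-04-03', 'end_date': '2019-04-04'}]}]
metrics = [{'drift_type': 'DatasetBased',
            'metrics': [{'start_date': '2019-04-03', 'end_date': '2019-04-04',
                         'dataset_metrics': [{'name': 'datadrift_coefficient',
                                              'value': 0.53459}],
                         'column_metrics': [{'feature1': [
                             {'name': 'wasserstein_distance',
                              'value': 4.858040000000001}]}]}]}]


def drift_coefficients(metrics):
    """Map each window's start_date to its datadrift_coefficient."""
    coefficients = {}
    for group in metrics:
        for window in group['metrics']:
            for metric in window['dataset_metrics']:
                if metric['name'] == 'datadrift_coefficient':
                    coefficients[window['start_date']] = metric['value']
    return coefficients


def drifted_dates(results):
    """List the start dates of windows flagged with has_drift."""
    return [r['start_date'] for group in results
            for r in group['result'] if r['has_drift']]
```

For the sample, drift_coefficients(metrics) maps '2019-04-03' to 0.53459, and drifted_dates(results) lists that same date because its has_drift flag is True.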

list

Get a list of DataDriftDetector objects for the specified workspace and optional dataset.

NOTE: Passing in only the workspace parameter returns all DataDriftDetector objects defined in the workspace.

static list(workspace, baseline_dataset=None, target_dataset=None)

Parameters

Name Description
workspace
Required

The workspace where the DataDriftDetector objects were created.

baseline_dataset

Baseline dataset to filter the return list.

Default value: None
target_dataset

Target dataset to filter the return list.

Default value: None

Returns

Type Description

A list of DataDriftDetector objects.

run

Run a single point in time data drift analysis.

run(target_date, compute_target=None, create_compute_target=False, feature_list=None, drift_threshold=None)

Parameters

Name Description
target_date
Required

The target date of scoring data in UTC.

compute_target

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. If not specified, a compute target is created automatically.

Default value: None
create_compute_target

Indicates whether an Azure Machine Learning compute target is created automatically.

Default value: False
feature_list

Optional whitelisted features to run the datadrift detection on.

Default value: None
drift_threshold

Optional threshold to enable DataDriftDetector alerts on.

Default value: None

Returns

Type Description
Run

A DataDriftDetector run.

show

Show data drift trend in given time range.

By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks.

show(start_time=None, end_time=None)

Parameters

Name Description
start_time
datetime, optional

The start of the presentation time window in UTC. If None (the default), the window starts at the 10th most recent cycle.

Default value: None
end_time
datetime, optional

The end of the presentation data time window in UTC. The default None means the current day.

Default value: None

Returns

Type Description
dict()

A dictionary of all figures. The key is service_name.
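
The default window of the most recent 10 cycles can be approximated as follows. This is a rough sketch rather than the SDK's logic: default_window is a hypothetical helper, and treating "Month" as 30 days is an assumption made only to keep the arithmetic simple.

```python
from datetime import datetime, timedelta

# Approximate cycle lengths; "Month" as 30 days is an assumption of this sketch.
CYCLE = {
    "Day": timedelta(days=1),
    "Week": timedelta(weeks=1),
    "Month": timedelta(days=30),
}


def default_window(frequency, now, cycles=10):
    """Return (start_time, end_time) covering the most recent `cycles` cycles."""
    end_time = now
    start_time = end_time - cycles * CYCLE[frequency]
    return start_time, end_time


start, end = default_window("Day", datetime(2019, 4, 14))
```

With a daily frequency and an end time of 2019-04-14, the window starts on 2019-04-04.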

update

Update the schedule associated with the DataDriftDetector object.

Optional parameter values can be set to None. Parameters that are not passed keep their existing values.

update(compute_target=Ellipsis, feature_list=Ellipsis, schedule_start=Ellipsis, alert_config=Ellipsis, drift_threshold=Ellipsis, wait_for_completion=True)

Parameters

Name Description
compute_target

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if this parameter is not specified.

Default value: Ellipsis
feature_list

Whitelisted features to run the datadrift detection on.

Default value: Ellipsis
schedule_start

The start time of the data drift schedule in UTC.

Default value: Ellipsis
alert_config

Optional configuration object for DataDriftDetector alerts.

Default value: Ellipsis
drift_threshold

The threshold to enable DataDriftDetector alerts on.

Default value: Ellipsis
wait_for_completion

Whether to wait for the enable/disable/delete operations to complete.

Default value: True

Returns

Type Description

self
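
The Ellipsis defaults in the update signature let the method tell "argument omitted" apart from "argument explicitly set to None". The sketch below illustrates that sentinel pattern in general terms; merge_settings is a hypothetical function and does not reproduce how the SDK stores or applies settings.

```python
def merge_settings(current, drift_threshold=..., feature_list=...):
    """Return a copy of `current` with only the explicitly passed values replaced."""
    updated = dict(current)
    passed = {"drift_threshold": drift_threshold, "feature_list": feature_list}
    for key, value in passed.items():
        if value is not ...:      # Ellipsis means the argument was omitted
            updated[key] = value  # None is a real value and clears the setting
    return updated


current = {"drift_threshold": 0.3, "feature_list": ["my_feature_1"]}
unchanged = merge_settings(current)
cleared = merge_settings(current, feature_list=None)
```

In this sketch, unchanged equals the original settings, while cleared keeps the threshold of 0.3 and sets feature_list to None.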

Attributes

alert_config

Get the alert configuration for the DataDriftDetector object.

Returns

Type Description

An AlertConfiguration object.

baseline_dataset

Get the baseline dataset associated with the DataDriftDetector object.

Returns

Type Description

Dataset type of the baseline dataset.

compute_target

Get the compute target attached to the DataDriftDetector object.

Returns

Type Description

The compute target.

drift_threshold

Get the drift threshold for the DataDriftDetector object.

Returns

Type Description

The drift threshold.

drift_type

Get the type of the DataDriftDetector. 'DatasetBased' is the only value supported for now.

Returns

Type Description
str

The type of DataDriftDetector object.

enabled

Get the boolean value indicating whether the DataDriftDetector object is enabled.

Returns

Type Description

A boolean value; True for enabled.

feature_list

Get the list of whitelisted features for the DataDriftDetector object.

Returns

Type Description

A list of feature names.

frequency

Get the frequency of the DataDriftDetector schedule.

Returns

Type Description
str

A string of either "Day", "Week", or "Month".

interval

Get the interval of the DataDriftDetector schedule.

Returns

Type Description
int

An integer value of time unit.

latency

Get the latency of the DataDriftDetector schedule jobs (in hours).

Returns

Type Description
int

The number of hours representing the latency.

name

Get the name of the DataDriftDetector object.

Returns

Type Description
str

The DataDriftDetector name.

schedule_start

Get the start time of the schedule.

Returns

Type Description

A datetime object of schedule start time in UTC.

state

Denotes the state of the DataDriftDetector schedule.

Returns

Type Description
str

One of 'Disabled', 'Enabled', 'Deleted', 'Disabling', 'Enabling', 'Deleting', 'Failed', 'DisableFailed', 'EnableFailed', 'DeleteFailed'.
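
When polling the state attribute, it can help to group the documented values. The grouping below is an interpretation based on the state names alone, not something the reference specifies, and classify_state is a hypothetical helper.

```python
# Grouping inferred from the state names; an assumption of this sketch.
SETTLED = {"Disabled", "Enabled", "Deleted"}
TRANSITIONAL = {"Disabling", "Enabling", "Deleting"}
FAILED = {"Failed", "DisableFailed", "EnableFailed", "DeleteFailed"}


def classify_state(state):
    """Return 'settled', 'transitional', or 'failed' for a documented state."""
    if state in SETTLED:
        return "settled"
    if state in TRANSITIONAL:
        return "transitional"
    if state in FAILED:
        return "failed"
    raise ValueError(f"Unknown DataDriftDetector state: {state!r}")
```

A polling loop could, for example, keep waiting while classify_state returns "transitional" and stop on the other two outcomes.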

target_dataset

Get the target dataset associated with the DataDriftDetector object.

Returns

Type Description

The dataset type of the target dataset.

workspace

Get the workspace of the DataDriftDetector object.

Returns

Type Description

The workspace the DataDriftDetector object was created in.