DataDriftDetector Class

Reference

Defines a data drift monitor that can be used to run data drift jobs in Azure Machine Learning.

The DataDriftDetector class enables you to identify drift between a given baseline and target dataset. A DataDriftDetector object is created in a workspace by specifying the baseline and target datasets directly. For more information, see https://aka.ms/datadrift.

DataDriftDetector constructor.

The DataDriftDetector constructor is used to retrieve a cloud representation of a DataDriftDetector object associated with the provided workspace.

Inheritance: builtins.object

DataDriftDetector

Constructor

DataDriftDetector(workspace, name=None, baseline_dataset=None, target_dataset=None, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)

Parameters

workspace: Workspace

Required

The workspace in which to create the DataDriftDetector object.

name: str

default value: None

A unique name for the DataDriftDetector object.

baseline_dataset: TabularDataset

default value: None

Dataset to compare the target dataset against.

target_dataset: TabularDataset

default value: None

Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series.

compute_target: ComputeTarget or str

default value: None

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified.

frequency: str

default value: None

Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month".

feature_list: list[str]

default value: None

Optional whitelisted features to run the data drift detection on. DataDriftDetector jobs run on all features if feature_list is not specified. Feature names can contain characters, numbers, dashes, and whitespace. The list must contain fewer than 200 items.

alert_config: AlertConfiguration

default value: None

Optional configuration object for DataDriftDetector alerts.

drift_threshold: float

default value: None

Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default).

latency: int

default value: None

Delay in hours for data to appear in dataset.
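The constraints stated for these parameters can be checked locally before a detector is created. The following is a minimal sketch in plain Python, not SDK code: the helper name check_detector_args is hypothetical, and treating the bounds 0 and 1 as inclusive is an assumption.

```python
ALLOWED_FREQUENCIES = {"Day", "Week", "Month"}
DEFAULT_DRIFT_THRESHOLD = 0.2   # value the reference says is used when None is passed
MAX_FEATURES = 200              # the list length must be less than this


def check_detector_args(frequency=None, feature_list=None, drift_threshold=None):
    """Hypothetical helper: validate arguments and return the effective threshold."""
    if frequency is not None and frequency not in ALLOWED_FREQUENCIES:
        raise ValueError(f"frequency must be one of {sorted(ALLOWED_FREQUENCIES)}")
    if feature_list is not None and len(feature_list) >= MAX_FEATURES:
        raise ValueError(f"feature_list must contain fewer than {MAX_FEATURES} items")
    if drift_threshold is None:
        return DEFAULT_DRIFT_THRESHOLD
    # Assumption: both bounds are inclusive.
    if not 0 <= drift_threshold <= 1:
        raise ValueError("drift_threshold must be between 0 and 1")
    return drift_threshold
```

For example, check_detector_args(frequency="Week") returns 0.2 because no threshold was supplied.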

Remarks

A DataDriftDetector object represents a data drift job definition that can be used to run three types of jobs:

an adhoc run for analyzing a specific day's worth of data; see the run method.
a scheduled run in a pipeline; see the enable_schedule method.
a backfill run to see how data changes over time; see the backfill method.
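The three run types map onto three method calls. The following is a minimal sketch, assuming monitor is an existing DataDriftDetector object (for example, one returned by create_from_datasets or get_by_name); the function name start_all_run_types is illustrative only and is not part of the SDK.

```python
def start_all_run_types(monitor, target_date, backfill_start, backfill_end):
    """Illustrative helper: trigger each of the three run types on a detector."""
    # Adhoc run: analyze a single day's worth of data.
    adhoc_run = monitor.run(target_date)

    # Scheduled run: create the schedule for recurring analysis.
    monitor.enable_schedule()

    # Backfill run: analyze how data changed between two dates (end date inclusive).
    backfill_run = monitor.backfill(backfill_start, backfill_end)

    return adhoc_run, backfill_run
```

The date arguments are datetime values, matching the target_date, start_date, and end_date parameters documented for run and backfill.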

The typical pattern for creating a dataset-based DataDriftDetector object is to call create_from_datasets.

The following example shows how to create a dataset-based DataDriftDetector object.


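   from azureml.datadrift import DataDriftDetector, AlertConfiguration

   # ws, baseline, and target are assumed to be an existing Workspace, baseline TabularDataset, and target time series dataset
   alert_config = AlertConfiguration(['user@contoso.com']) # replace with your email to receive alerts from the scheduled pipeline after enabling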

   monitor = DataDriftDetector.create_from_datasets(ws, 'weather-monitor', baseline, target,
                                                         compute_target='cpu-cluster',         # compute target for scheduled pipeline and backfills
                                                         frequency='Week',                     # how often to analyze target data
                                                         feature_list=None,                    # list of features to detect drift on
                                                         drift_threshold=None,                 # threshold from 0 to 1 for email alerting
                                                         latency=0,                            # SLA in hours for target data to arrive in the dataset
                                                         alert_config=alert_config)            # email addresses to send alert

The full sample is available at https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datadrift-tutorial/datadrift-tutorial.ipynb

The DataDriftDetector constructor retrieves an existing data drift object associated with the workspace.

Methods

backfill	Run a backfill job over a specified start and end date. See https://aka.ms/datadrift for details on data drift backfill runs. NOTE: Backfill is only supported on dataset-based DataDriftDetector objects.
create_from_datasets	Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset.
delete	Delete the schedule for the DataDriftDetector object.
disable_schedule	Disable the schedule for the DataDriftDetector object.
enable_schedule	Create a schedule to run a dataset-based DataDriftDetector job.
get_by_name	Retrieve a unique DataDriftDetector object for a given workspace and name.
get_output	Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window.
list	Get a list of DataDriftDetector objects for the specified workspace and optional dataset. NOTE: Passing in only the `workspace` parameter will return all DataDriftDetector objects defined in the workspace.
run	Run a single point-in-time data drift analysis.
show	Show the data drift trend in a given time range. By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks.
update	Update the schedule associated with the DataDriftDetector object. Optional parameter values can be set to `None`, otherwise they default to their existing values.

backfill

Run a backfill job over a specified start and end date.

See https://aka.ms/datadrift for details on data drift backfill runs.

NOTE: Backfill is only supported on dataset-based DataDriftDetector objects.

backfill(start_date, end_date, compute_target=None, create_compute_target=False)

Parameters

start_date: datetime

Required

The start date of the backfill job.

end_date: datetime

Required

The end date of the backfill job, inclusive.

compute_target: ComputeTarget or str

default value: None

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if none is specified.

create_compute_target: bool

default value: False

Indicates whether an Azure Machine Learning compute target is automatically created.

Returns

A DataDriftDetector run.

Return type

Run

create_from_datasets

Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset.

static create_from_datasets(workspace, name, baseline_dataset, target_dataset, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)

Parameters

workspace: Workspace

Required

The workspace to create the DataDriftDetector in.

name: str

Required

A unique name for the DataDriftDetector object.

baseline_dataset: TabularDataset

Required

Dataset to compare the target dataset against.

target_dataset: TabularDataset

Required

Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series.

compute_target: ComputeTarget or str

default value: None

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified.

frequency: str

default value: None

Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month".

feature_list: list[str]

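default value: None

Optional whitelisted features to run the data drift detection on. DataDriftDetector jobs run on all features if feature_list is not specified.

alert_config: AlertConfiguration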

default value: None

Optional configuration object for DataDriftDetector alerts.

drift_threshold: float

default value: None

Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default).

latency: int

default value: None

Delay in hours for data to appear in dataset.

Returns

A DataDriftDetector object.

Return type

DataDriftDetector

Exceptions

KeyError, TypeError, ValueError

Remarks

Dataset-based DataDriftDetectors enable you to calculate data drift between a baseline dataset, which must be a TabularDataset, and a target dataset, which must be a time series dataset. A time series dataset is simply a TabularDataset with the fine_grain_timestamp property. The DataDriftDetector can then run adhoc or scheduled jobs to determine if the target dataset has drifted from the baseline dataset.


   from azureml.core import Workspace, Dataset
   from azureml.datadrift import DataDriftDetector, AlertConfiguration

   ws = Workspace.from_config()
   baseline = Dataset.get_by_name(ws, 'my_baseline_dataset')
   target = Dataset.get_by_name(ws, 'my_target_dataset')

   detector = DataDriftDetector.create_from_datasets(workspace=ws,
                                                     name="my_unique_detector_name",
                                                     baseline_dataset=baseline,
                                                     target_dataset=target,
                                                     compute_target='my_compute_target',
                                                     frequency="Day",
                                                     feature_list=['my_feature_1', 'my_feature_2'],
                                                     alert_config=AlertConfiguration(email_addresses=['user@contoso.com']),
                                                     drift_threshold=0.3,
                                                     latency=1)

delete

Delete the schedule for the DataDriftDetector object.

delete(wait_for_completion=True)

Parameters

wait_for_completion: bool

default value: True

Whether to wait for the delete operation to complete.

disable_schedule

Disable the schedule for the DataDriftDetector object.

disable_schedule(wait_for_completion=True)

Parameters

wait_for_completion: bool

default value: True

Whether to wait for the disable operation to complete.

enable_schedule

Create a schedule to run a dataset-based DataDriftDetector job.

enable_schedule(create_compute_target=False, wait_for_completion=True)

Parameters

create_compute_target: bool

default value: False

Indicates whether an Azure Machine Learning compute target is created automatically.

wait_for_completion: bool

default value: True

Whether to wait for the enable operation to complete.

get_by_name

Retrieve a unique DataDriftDetector object for a given workspace and name.

static get_by_name(workspace, name)

Parameters

workspace: Workspace

Required

The workspace where the DataDriftDetector was created.

name: str

Required

The name of the DataDriftDetector object to return.

Returns

A DataDriftDetector object.

Return type

DataDriftDetector

get_output

Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window.

get_output(start_time=None, end_time=None, run_id=None)

Parameters

start_time: datetime, optional

default value: None

The start time of the results window in UTC. If None (the default) is specified, the window starts at the 10th most recent cycle's results. For example, if the frequency of the data drift schedule is day, then start_time is 10 days ago. If the frequency is week, then start_time is 10 weeks ago.

end_time: datetime, optional

default value: None

The end time of the results window in UTC. If None (the default) is specified, then the current day UTC is used as the end time.

run_id: int, optional

default value: None

A specific run ID.

Returns

A tuple of a list of drift results and a list of individual dataset and columnar metrics.

Return type

tuple(list, list)

Remarks

This method returns a tuple of drift results and metrics for either a time window or a run ID, depending on the type of run: an adhoc run, a scheduled run, or a backfill run.

To retrieve adhoc run results, there is only one way: set run_id to a valid GUID.
To retrieve scheduled run and backfill run results, there are two ways: either assign a valid GUID to run_id, or assign a specific start_time and/or end_time (inclusive) while keeping run_id as None.
If run_id, start_time, and end_time are all set to values other than None in the same method call, a parameter validation exception is raised.

NOTE: Specify either the start_time and end_time parameters or the run_id parameter, but not both.
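The rule in the note can be expressed as a small pre-check. This is a sketch only: check_get_output_args is a hypothetical helper, and the use of ValueError is an assumption, since the reference does not name the exception type that get_output raises.

```python
def check_get_output_args(start_time=None, end_time=None, run_id=None):
    """Hypothetical check mirroring the note: a time window or a run ID, not both."""
    has_window = start_time is not None or end_time is not None
    if run_id is not None and has_window:
        # Assumption: the real method may raise a different exception type.
        raise ValueError("Specify either start_time/end_time or run_id, not both")
    # Report which lookup mode the arguments select.
    return "run_id" if run_id is not None else "time_window"
```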

It's possible that there are multiple results for the same target date (for dataset-based drift, the target date is the target dataset start date). Therefore, it's necessary to identify and handle duplicate results. For dataset-based drift, results for the same target date are duplicates. The get_output method deduplicates them by one rule: always pick the most recently generated result.
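The "latest result wins" rule can be sketched in plain Python as follows. This is not the SDK's implementation: the generated_at field is a hypothetical stand-in for whatever timestamp the service uses to decide which result is newest, while start_date follows the result format used by get_output.

```python
def keep_latest_per_target_date(results):
    """Keep one result per target date, preferring the most recently generated."""
    latest = {}
    for result in results:
        key = result["start_date"]          # target date of the result
        current = latest.get(key)
        # 'generated_at' is a hypothetical field; ISO strings sort chronologically.
        if current is None or result["generated_at"] > current["generated_at"]:
            latest[key] = result
    return sorted(latest.values(), key=lambda r: r["start_date"])
```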

The get_output method can be used to retrieve all outputs or partial outputs of scheduled runs in a specific time range between start_time and end_time (boundaries included). You can also limit the results to an individual adhoc run by specifying its run_id.

Use the following guidelines to help interpret results returned from the get_output method:

The principle for filtering is overlap: as long as the actual result time (for dataset-based drift, the target dataset [start date, end date]) overlaps the given [start_time, end_time], the result is picked up.
If there are multiple outputs for one target date because the drift calculation was executed several times against that day, then only the latest output is picked by default.
Because there are multiple types of data drift instances, the contents of the results can vary.
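The overlap principle in the first guideline is the standard test for two closed intervals. A minimal sketch, with the function name overlaps chosen for illustration:

```python
from datetime import date


def overlaps(result_start, result_end, start_time, end_time):
    """True when [result_start, result_end] and [start_time, end_time] share a point."""
    # Closed intervals overlap unless one ends before the other begins.
    return result_start <= end_time and start_time <= result_end


# A result covering 3-4 April is picked up by a window that starts on 4 April...
touching = overlaps(date(2019, 4, 3), date(2019, 4, 4), date(2019, 4, 4), date(2019, 4, 10))
# ...but not by a window that starts on 5 April.
disjoint = overlaps(date(2019, 4, 3), date(2019, 4, 4), date(2019, 4, 5), date(2019, 4, 10))
```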

For dataset-based results, the output will look like:


   results : [{'drift_type': 'DatasetBased',
               'result':[{'has_drift': True, 'drift_threshold': 0.3,
                          'start_date': '2019-04-03', 'end_date': '2019-04-04',
                          'base_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                          'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9'}]}]
   metrics : [{'drift_type': 'DatasetBased',
               'metrics': [{'schema_version': '0.1',
                            'start_date': '2019-04-03', 'end_date': '2019-04-04',
                            'baseline_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                            'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9',
                            'dataset_metrics': [{'name': 'datadrift_coefficient', 'value': 0.53459}],
                            'column_metrics': [{'feature1': [{'name': 'datadrift_contribution',
                                                              'value': 288.0},
                                                             {'name': 'wasserstein_distance',
                                                              'value': 4.858040000000001},
                                                             {'name': 'energy_distance',
                                                              'value': 2.7204799576545313}]}]}]}]
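Assuming the metrics list has the nested shape of the sample above, the overall drift coefficient for each analyzed window can be pulled out with a short loop. The helper name drift_coefficients is hypothetical, and the sample data is a trimmed copy of the output shown in this section.

```python
def drift_coefficients(metrics):
    """Map each result's start_date to its datadrift_coefficient value."""
    coefficients = {}
    for entry in metrics:                      # one entry per drift type
        for item in entry["metrics"]:          # one item per analyzed window
            for metric in item["dataset_metrics"]:
                if metric["name"] == "datadrift_coefficient":
                    coefficients[item["start_date"]] = metric["value"]
    return coefficients


# Trimmed copy of the sample metrics from this section.
sample_metrics = [{'drift_type': 'DatasetBased',
                   'metrics': [{'schema_version': '0.1',
                                'start_date': '2019-04-03', 'end_date': '2019-04-04',
                                'dataset_metrics': [{'name': 'datadrift_coefficient',
                                                     'value': 0.53459}]}]}]

print(drift_coefficients(sample_metrics))  # prints {'2019-04-03': 0.53459}
```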

list

Get a list of DataDriftDetector objects for the specified workspace and optional dataset.

NOTE: Passing in only the workspace parameter will return all DataDriftDetector objects defined in the workspace.

static list(workspace, baseline_dataset=None, target_dataset=None)

Parameters

workspace: Workspace

Required

The workspace where the DataDriftDetector objects were created.

baseline_dataset: TabularDataset

default value: None

Baseline dataset to filter the return list.

target_dataset: TabularDataset

default value: None

Target dataset to filter the return list.

Returns

A list of DataDriftDetector objects.

Return type

list[DataDriftDetector]

run

Run a single point-in-time data drift analysis.

run(target_date, compute_target=None, create_compute_target=False, feature_list=None, drift_threshold=None)

Parameters

target_date: datetime

Required

The target date of scoring data in UTC.

compute_target: ComputeTarget or str

default value: None

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. If not specified, a compute target is created automatically.

create_compute_target: bool

default value: False

Indicates whether an Azure Machine Learning compute target is created automatically.

feature_list: list[str]

default value: None

Optional whitelisted features to run the datadrift detection on.

drift_threshold: float

default value: None

Optional threshold to enable DataDriftDetector alerts on.

Returns

A DataDriftDetector run.

Return type

Run

show

Show the data drift trend in a given time range.

By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks.

show(start_time=None, end_time=None)

Parameters

start_time: datetime, optional

default value: None

The start of the presentation time window in UTC. If None (the default) is specified, the window starts at the 10th most recent cycle's results.

end_time: datetime, optional

default value: None

The end of the presentation data time window in UTC. The default None means the current day.

Returns

A dictionary of all figures. The key is service_name.

Return type

dict()
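The default window of the most recent 10 cycles can be approximated locally to see which dates a call without arguments would cover. This is a sketch under assumptions, not the service's logic: default_window_start is a hypothetical helper, the reference only gives day and week examples, and the calendar-month handling for "Month" is a guess at reasonable behavior.

```python
import calendar
from datetime import datetime, timedelta


def default_window_start(end_time, frequency, cycles=10):
    """Hypothetical helper: go back `cycles` schedule periods from end_time."""
    if frequency == "Day":
        return end_time - timedelta(days=cycles)
    if frequency == "Week":
        return end_time - timedelta(weeks=cycles)
    if frequency == "Month":
        # Assumption: step back whole calendar months, clamping the day
        # to the length of the resulting month.
        total = end_time.year * 12 + (end_time.month - 1) - cycles
        year, month_index = divmod(total, 12)
        month = month_index + 1
        day = min(end_time.day, calendar.monthrange(year, month)[1])
        return end_time.replace(year=year, month=month, day=day)
    raise ValueError("frequency must be 'Day', 'Week', or 'Month'")
```

For a daily schedule and an end time of 11 April 2019, the sketch returns 1 April 2019 as the start of the window.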

update

Update the schedule associated with the DataDriftDetector object.

Optional parameter values can be set to None, otherwise they default to their existing values.

update(compute_target=Ellipsis, feature_list=Ellipsis, schedule_start=Ellipsis, alert_config=Ellipsis, drift_threshold=Ellipsis, wait_for_completion=True)

Parameters

compute_target: ComputeTarget or str

default value: Ellipsis

Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if this parameter is not specified.

feature_list: list[str]

default value: Ellipsis

Whitelisted features to run the datadrift detection on.

schedule_start: datetime

default value: Ellipsis

The start time of the data drift schedule in UTC.

alert_config: AlertConfiguration

default value: Ellipsis

Optional configuration object for DataDriftDetector alerts.

drift_threshold: float

default value: Ellipsis

The threshold to enable DataDriftDetector alerts on.

wait_for_completion: bool

default value: True

Whether to wait for the update operation to complete.

Returns

self

Return type

DataDriftDetector
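The Ellipsis defaults in the update signature make it possible to tell an omitted argument apart from an explicit None. The following sketch shows that general sentinel pattern in plain Python; it is not the SDK's implementation, and merge_settings and the settings dictionary are hypothetical.

```python
def merge_settings(current, compute_target=..., feature_list=..., drift_threshold=...):
    """Hypothetical helper: copy `current`, changing only the supplied fields."""
    supplied = {
        "compute_target": compute_target,
        "feature_list": feature_list,
        "drift_threshold": drift_threshold,
    }
    updated = dict(current)
    for name, value in supplied.items():
        if value is not ...:       # Ellipsis means "not supplied": keep the existing value
            updated[name] = value  # None is a real value and overwrites the old one
    return updated


existing = {"compute_target": "cpu-cluster", "feature_list": ["f1"], "drift_threshold": 0.3}
cleared = merge_settings(existing, feature_list=None)   # only feature_list changes
```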
