DataDriftDetector Class
Defines a data drift monitor that can be used to run data drift jobs in Azure Machine Learning.
The DataDriftDetector class enables you to identify drift between a given baseline and target dataset. A DataDriftDetector object is created in a workspace by specifying the baseline and target datasets directly. For more information, see https://aka.ms/datadrift.
DataDriftDetector constructor.
The DataDriftDetector constructor is used to retrieve a cloud representation of a DataDriftDetector object associated with the provided workspace.
- Inheritance: builtins.object -> DataDriftDetector
Constructor
DataDriftDetector(workspace, name=None, baseline_dataset=None, target_dataset=None, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)
Parameters
Name | Description |
---|---|
workspace (Required) | The workspace in which to create the DataDriftDetector object. |
name | A unique name for the DataDriftDetector object. Default value: None |
baseline_dataset | Dataset to compare the target dataset against. Default value: None |
target_dataset | Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series. Default value: None |
compute_target | Type: ComputeTarget or str. Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified. Default value: None |
frequency | Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month". Default value: None |
feature_list | Optional whitelisted features to run the datadrift detection on. DataDriftDetector jobs will run on all features if feature_list is not specified. Default value: None |
alert_config | Optional configuration object for DataDriftDetector alerts. Default value: None |
drift_threshold | Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default). Default value: None |
latency | Delay in hours for data to appear in dataset. Default value: None |
Remarks
A DataDriftDetector object represents a data drift job definition that can be used to run three job run types:
- An adhoc run for analyzing a specific day's worth of data; see the run method.
- A scheduled run in a pipeline; see the enable_schedule method.
- A backfill run to see how data changes over time; see the backfill method.
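As a rough sketch of how the three run types map onto method calls, the helper below takes an existing DataDriftDetector object and triggers each type in turn. The helper name run_all_job_types and its argument names are illustrative and not part of the SDK; only run, enable_schedule, and backfill come from this reference, and the date values are passed through unchanged.

```python
def run_all_job_types(monitor, target_date, backfill_start, backfill_end):
    """Trigger an adhoc run, enable the schedule, and start a backfill.

    `monitor` is assumed to be an existing DataDriftDetector object.
    This helper is hypothetical; it only chains the documented methods.
    """
    # Adhoc run: analyze a single target date.
    adhoc_run = monitor.run(target_date)
    # Scheduled run: turn on the recurring pipeline for this detector.
    monitor.enable_schedule()
    # Backfill run: analyze the range from backfill_start to backfill_end.
    backfill_run = monitor.backfill(backfill_start, backfill_end)
    return adhoc_run, backfill_run
```

In practice the monitor would come from create_from_datasets or get_by_name, and a compute target must be available for the jobs to run.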
The typical pattern for creating a DataDriftDetector is:
- To create a dataset-based DataDriftDetector object, use create_from_datasets
The following example shows how to create a dataset-based DataDriftDetector object.
from azureml.datadrift import DataDriftDetector, AlertConfiguration

# ws is an existing Workspace; baseline and target are existing datasets (target must be a time series).
alert_config = AlertConfiguration(['user@contoso.com'])  # replace with your email to receive alerts from the scheduled pipeline after enabling
monitor = DataDriftDetector.create_from_datasets(ws, 'weather-monitor', baseline, target,
                                                 compute_target='cpu-cluster',  # compute target for scheduled pipeline and backfills
                                                 frequency='Week',              # how often to analyze target data
                                                 feature_list=None,             # list of features to detect drift on
                                                 drift_threshold=None,          # threshold from 0 to 1 for email alerting
                                                 latency=0,                     # SLA in hours for target data to arrive in the dataset
                                                 alert_config=alert_config)     # email addresses to send alert
Full sample is available from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datadrift-tutorial/datadrift-tutorial.ipynb
The DataDriftDetector constructor retrieves an existing data drift object associated with the workspace.
Methods
backfill | Run a backfill job over a given specified start and end date. See https://aka.ms/datadrift for details on data drift backfill runs. NOTE: Backfill is only supported on dataset-based DataDriftDetector objects. |
create_from_datasets | Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset. |
delete | Delete the schedule for the DataDriftDetector object. |
disable_schedule | Disable the schedule for the DataDriftDetector object. |
enable_schedule | Create a schedule to run a dataset-based DataDriftDetector job. |
get_by_name | Retrieve a unique DataDriftDetector object for a given workspace and name. |
get_output | Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window. |
list | Get a list of DataDriftDetector objects for the specified workspace and optional dataset. NOTE: Passing in only the workspace parameter will return all DataDriftDetector objects defined in the workspace. |
run | Run a single point in time data drift analysis. |
show | Show data drift trend in given time range. By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks. |
update | Update the schedule associated with the DataDriftDetector object. Optional parameter values can be set to None, otherwise they default to their existing values. |
backfill
Run a backfill job over a given specified start and end date.
See https://aka.ms/datadrift for details on data drift backfill runs.
NOTE: Backfill is only supported on dataset-based DataDriftDetector objects.
backfill(start_date, end_date, compute_target=None, create_compute_target=False)
Parameters
Name | Description |
---|---|
start_date (Required) | The start date of the backfill job. |
end_date (Required) | The end date of the backfill job, inclusive. |
compute_target | Type: ComputeTarget or str. Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if none is specified. Default value: None |
create_compute_target | Indicates whether an Azure Machine Learning compute target is automatically created. Default value: False |
Returns
Type | Description |
---|---|
A DataDriftDetector run. |
create_from_datasets
Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset.
static create_from_datasets(workspace, name, baseline_dataset, target_dataset, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)
Parameters
Name | Description |
---|---|
workspace (Required) | The workspace to create the DataDriftDetector in. |
name (Required) | A unique name for the DataDriftDetector object. |
baseline_dataset (Required) | Dataset to compare the target dataset against. |
target_dataset (Required) | Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series. |
compute_target | Type: ComputeTarget or str. Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified. Default value: None |
frequency | Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month". Default value: None |
feature_list | Optional whitelisted features to run the datadrift detection on. DataDriftDetector jobs will run on all features if feature_list is not specified. Default value: None |
alert_config | Optional configuration object for DataDriftDetector alerts. Default value: None |
drift_threshold | Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default). Default value: None |
latency | Delay in hours for data to appear in dataset. Default value: None |
Returns
Type | Description |
---|---|
A DataDriftDetector object. |
Exceptions
Type | Description |
---|---|
KeyError, TypeError, ValueError | |
Remarks
Dataset-based DataDriftDetectors enable you to calculate data drift between a baseline dataset, which must be a TabularDataset, and a target dataset, which must be a time series dataset. A time series dataset is simply a TabularDataset with the fine_grain_timestamp property. The DataDriftDetector can then run adhoc or scheduled jobs to determine if the target dataset has drifted from the baseline dataset.
from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector, AlertConfiguration

ws = Workspace.from_config()
baseline = Dataset.get_by_name(ws, 'my_baseline_dataset')
target = Dataset.get_by_name(ws, 'my_target_dataset')

# compute_target accepts a ComputeTarget object or the name of one.
detector = DataDriftDetector.create_from_datasets(workspace=ws,
                                                  name="my_unique_detector_name",
                                                  baseline_dataset=baseline,
                                                  target_dataset=target,
                                                  compute_target='my_compute_target',
                                                  frequency="Day",
                                                  feature_list=['my_feature_1', 'my_feature_2'],
                                                  alert_config=AlertConfiguration(email_addresses=['user@contoso.com']),
                                                  drift_threshold=0.3,
                                                  latency=1)
delete
Delete the schedule for the DataDriftDetector object.
delete(wait_for_completion=True)
Parameters
Name | Description |
---|---|
wait_for_completion | Whether to wait for the delete operation to complete. Default value: True |
disable_schedule
Disable the schedule for the DataDriftDetector object.
disable_schedule(wait_for_completion=True)
Parameters
Name | Description |
---|---|
wait_for_completion | Whether to wait for the disable operation to complete. Default value: True |
enable_schedule
Create a schedule to run dataset-based DataDriftDetector job.
enable_schedule(create_compute_target=False, wait_for_completion=True)
Parameters
Name | Description |
---|---|
create_compute_target | Indicates whether an Azure Machine Learning compute target is created automatically. Default value: False |
wait_for_completion | Whether to wait for the enable operation to complete. Default value: True |
get_by_name
Retrieve a unique DataDriftDetector object for a given workspace and name.
static get_by_name(workspace, name)
Parameters
Name | Description |
---|---|
workspace (Required) | The workspace where the DataDriftDetector was created. |
name (Required) | The name of the DataDriftDetector object to return. |
Returns
Type | Description |
---|---|
A DataDriftDetector object. |
get_output
Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window.
get_output(start_time=None, end_time=None, run_id=None)
Parameters
Name | Description |
---|---|
start_time | Type: datetime, optional. The start time of the results window in UTC. If None (the default) is specified, the results of the 10th most recent cycle mark the start of the window. For example, if the frequency of the data drift schedule is day, then the window starts 10 days back. Default value: None |
end_time | Type: datetime, optional. The end time of the results window in UTC. If None (the default) is specified, then the current day UTC is used as the end time. Default value: None |
run_id | Type: int, optional. A specific run ID. Default value: None |
Returns
Type | Description |
---|---|
A tuple of a list of drift results and a list of individual dataset and columnar metrics. |
Remarks
This method returns a tuple of drift results and metrics for either a time window or a run ID, depending on the type of run: an adhoc run, a scheduled run, or a backfill run.
- To retrieve adhoc run results, there is only one way: run_id should be a valid GUID.
- To retrieve scheduled run and backfill run results, there are two different ways: either assign a valid GUID to run_id, or assign a specific start_time and/or end_time (inclusive) while keeping run_id as None.
- If run_id, start_time, and end_time are not None in the same method call, a parameter validation exception is raised.
NOTE: Specify either the start_time and end_time parameters or the run_id parameter, but not both.
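The argument rules above can be illustrated with a small standalone check. This is only a sketch of the stated rule, not the SDK's own validation: the function name check_get_output_args, the ValueError type, and the choice to reject run_id combined with either time bound are assumptions.

```python
from datetime import datetime
from typing import Optional


def check_get_output_args(start_time: Optional[datetime] = None,
                          end_time: Optional[datetime] = None,
                          run_id: Optional[str] = None) -> str:
    """Return the lookup mode the arguments select, or raise ValueError.

    Hypothetical helper mirroring the rule "run_id or a time window, not both".
    """
    has_window = start_time is not None or end_time is not None
    if run_id is not None and has_window:
        raise ValueError("Specify either run_id or start_time/end_time, not both.")
    if start_time is not None and end_time is not None and start_time > end_time:
        raise ValueError("start_time must not be later than end_time.")
    # With no run_id, the call falls back to a time window (possibly the default one).
    return "run_id" if run_id is not None else "time_window"
```

Calling such a check before get_output makes it easier to report the conflicting arguments in your own terms.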
It's possible that there are multiple results for the same target date (for dataset-based drift, the target date is the target dataset start date). Therefore, it's necessary to identify and handle duplicate results. For dataset-based drift, results for the same target date are duplicates. The get_output method deduplicates them by one rule: always pick the most recently generated result.
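The deduplication rule can be sketched in plain Python as follows. The generated_at field is hypothetical: this reference does not say which field records when a result was produced, so treat the key names and the helper as assumptions made for illustration.

```python
from typing import Dict, List


def keep_latest_per_target_date(results: List[dict]) -> List[dict]:
    """Keep one result per target date, preferring the most recently generated.

    Each result is assumed to carry 'start_date' (the target date) and a
    hypothetical 'generated_at' ISO 8601 timestamp string.
    """
    latest: Dict[str, dict] = {}
    for result in results:
        key = result["start_date"]
        current = latest.get(key)
        # ISO 8601 strings in the same format compare correctly as text.
        if current is None or result["generated_at"] >= current["generated_at"]:
            latest[key] = result
    # Return the survivors ordered by target date.
    return [latest[key] for key in sorted(latest)]
```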
The get_output method can be used to retrieve all outputs or partial outputs of scheduled runs in a specific time range between start_time and end_time (boundary included). You can also limit the results to an individual adhoc run by specifying the run_id.
Use the following guidelines to help interpret results returned from the get_output method:
- The principle for filtering is "overlapping": as long as there is an overlap between the actual result time (dataset-based: target dataset [start date, end date]) and the given [start_time, end_time], the result is picked up.
- If there are multiple outputs for one target date because the drift calculation was executed several times against that day, only the latest output is picked by default.
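The "overlapping" principle is the standard closed-interval overlap test. The sketch below shows it on result dictionaries shaped like the dataset-based output; the helper names are illustrative and the service-side implementation may differ.

```python
from datetime import date
from typing import List


def overlaps(result_start: date, result_end: date,
             start_time: date, end_time: date) -> bool:
    """True when [result_start, result_end] and [start_time, end_time] share a day."""
    return result_start <= end_time and start_time <= result_end


def filter_by_window(results: List[dict], start_time: date, end_time: date) -> List[dict]:
    """Keep results whose [start_date, end_date] overlaps the requested window."""
    return [
        r for r in results
        if overlaps(date.fromisoformat(r["start_date"]),
                    date.fromisoformat(r["end_date"]),
                    start_time, end_time)
    ]
```

Because both boundaries are inclusive, a result that ends on the same day the window starts is still selected.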
Because there can be multiple types of data drift instance, the contents of the result can vary.
For dataset-based results, the output will look like:
results : [{'drift_type': 'DatasetBased',
            'result': [{'has_drift': True, 'drift_threshold': 0.3,
                        'start_date': '2019-04-03', 'end_date': '2019-04-04',
                        'base_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                        'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9'}]}]
metrics : [{'drift_type': 'DatasetBased',
            'metrics': [{'schema_version': '0.1',
                         'start_date': '2019-04-03', 'end_date': '2019-04-04',
                         'baseline_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                         'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9',
                         'dataset_metrics': [{'name': 'datadrift_coefficient', 'value': 0.53459}],
                         'column_metrics': [{'feature1': [{'name': 'datadrift_contribution',
                                                           'value': 288.0},
                                                          {'name': 'wasserstein_distance',
                                                           'value': 4.858040000000001},
                                                          {'name': 'energy_distance',
                                                           'value': 2.7204799576545313}]}]}]}]
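To make the nesting of the metrics structure easier to follow, here is a minimal sketch that pulls the dataset-level coefficient and a single column metric out of data shaped like the sample above. The helper names are illustrative, and the sample is trimmed to the fields the helpers read.

```python
# Trimmed copy of the sample metrics shown above.
sample_metrics = [{
    "drift_type": "DatasetBased",
    "metrics": [{
        "start_date": "2019-04-03", "end_date": "2019-04-04",
        "dataset_metrics": [{"name": "datadrift_coefficient", "value": 0.53459}],
        "column_metrics": [{"feature1": [
            {"name": "datadrift_contribution", "value": 288.0},
            {"name": "wasserstein_distance", "value": 4.858040000000001},
            {"name": "energy_distance", "value": 2.7204799576545313},
        ]}],
    }],
}]


def dataset_drift_coefficients(metrics):
    """Return (start_date, end_date, datadrift_coefficient) for every entry."""
    rows = []
    for group in metrics:
        for entry in group["metrics"]:
            coefficient = next(
                (m["value"] for m in entry["dataset_metrics"]
                 if m["name"] == "datadrift_coefficient"),
                None,
            )
            rows.append((entry["start_date"], entry["end_date"], coefficient))
    return rows


def column_metric(entry, feature, metric_name):
    """Look up one named metric for one feature; None when it is absent."""
    for column in entry["column_metrics"]:
        for metric in column.get(feature, []):
            if metric["name"] == metric_name:
                return metric["value"]
    return None
```

With real data, the second element of the tuple returned by get_output would take the place of sample_metrics.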
list
Get a list of DataDriftDetector objects for the specified workspace and optional dataset.
NOTE: Passing in only the workspace parameter will return all DataDriftDetector objects defined in the workspace.
static list(workspace, baseline_dataset=None, target_dataset=None)
Parameters
Name | Description |
---|---|
workspace (Required) | The workspace where the DataDriftDetector objects were created. |
baseline_dataset | Baseline dataset to filter the return list. Default value: None |
target_dataset | Target dataset to filter the return list. Default value: None |
Returns
Type | Description |
---|---|
A list of DataDriftDetector objects. |
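As a small illustration of working with the returned list, the sketch below picks out the names of detectors whose schedule is enabled. It relies only on the name and enabled attributes documented on this page; the helper itself is hypothetical.

```python
def enabled_detector_names(detectors):
    """Return the sorted names of detectors whose `enabled` attribute is True."""
    return sorted(d.name for d in detectors if d.enabled)
```

A typical call would pass the result of DataDriftDetector.list(ws), where ws is an existing workspace.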
run
Run a single point in time data drift analysis.
run(target_date, compute_target=None, create_compute_target=False, feature_list=None, drift_threshold=None)
Parameters
Name | Description |
---|---|
target_date (Required) | The target date of scoring data in UTC. |
compute_target | Type: ComputeTarget or str. Optional Azure Machine Learning ComputeTarget or ComputeTarget name. If not specified, a compute target is created automatically. Default value: None |
create_compute_target | Indicates whether an Azure Machine Learning compute target is created automatically. Default value: False |
feature_list | Optional whitelisted features to run the datadrift detection on. Default value: None |
drift_threshold | Optional threshold to enable DataDriftDetector alerts on. Default value: None |
Returns
Type | Description |
---|---|
A DataDriftDetector run. |
show
Show data drift trend in given time range.
By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks.
show(start_time=None, end_time=None)
Parameters
Name | Description |
---|---|
start_time | Type: datetime, optional. The start of the presentation time window in UTC. The default None means to pick up the most recent 10th cycle's results. Default value: None |
end_time | Type: datetime, optional. The end of the presentation data time window in UTC. The default None means the current day. Default value: None |
Returns
Type | Description |
---|---|
dict() | A dictionary of all figures. The key is service_name. |
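The "most recent 10 cycles" default can be approximated locally when you want to pass an explicit start_time. The sketch below is an assumption about how such a window could be computed, not the service's actual logic; in particular, the handling of "Month" (stepping back calendar months and clamping the day) is a choice made here.

```python
import calendar
from datetime import datetime, timedelta


def default_start_time(end_time: datetime, frequency: str, cycles: int = 10) -> datetime:
    """Step back `cycles` periods of the given frequency from end_time."""
    if frequency == "Day":
        return end_time - timedelta(days=cycles)
    if frequency == "Week":
        return end_time - timedelta(weeks=cycles)
    if frequency == "Month":
        # Count months from year 0, subtract, then convert back to (year, month).
        month_index = end_time.year * 12 + (end_time.month - 1) - cycles
        year, month_zero_based = divmod(month_index, 12)
        month = month_zero_based + 1
        # Clamp the day so that, for example, the 31st maps to a valid date.
        last_day = calendar.monthrange(year, month)[1]
        return end_time.replace(year=year, month=month, day=min(end_time.day, last_day))
    raise ValueError('frequency must be "Day", "Week", or "Month"')
```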
update
Update the schedule associated with the DataDriftDetector object.
Optional parameter values can be set to None, otherwise they default to their existing values.
update(compute_target=Ellipsis, feature_list=Ellipsis, schedule_start=Ellipsis, alert_config=Ellipsis, drift_threshold=Ellipsis, wait_for_completion=True)
Parameters
Name | Description |
---|---|
compute_target | Type: ComputeTarget or str. Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if this parameter is not specified. Default value: Ellipsis |
feature_list | Whitelisted features to run the datadrift detection on. Default value: Ellipsis |
schedule_start | The start time of the data drift schedule in UTC. Default value: Ellipsis |
alert_config | Optional configuration object for DataDriftDetector alerts. Default value: Ellipsis |
drift_threshold | The threshold to enable DataDriftDetector alerts on. Default value: Ellipsis |
wait_for_completion | Whether to wait for the enable/disable/delete operations to complete. Default value: True |
Returns
Type | Description |
---|---|
self |
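The Ellipsis defaults in the signature suggest a sentinel pattern: Ellipsis means "argument not supplied, keep the existing value", which leaves None free to be passed as a real value. The sketch below shows that pattern on a plain dictionary; it is an illustration of the idea, not the SDK's implementation, and apply_update is a hypothetical name.

```python
def apply_update(current: dict, feature_list=..., drift_threshold=..., alert_config=...) -> dict:
    """Return a copy of `current` with only the explicitly supplied settings changed."""
    requested = {
        "feature_list": feature_list,
        "drift_threshold": drift_threshold,
        "alert_config": alert_config,
    }
    updated = dict(current)
    for key, value in requested.items():
        # Ellipsis marks "not supplied"; None is treated as a deliberate new value.
        if value is not Ellipsis:
            updated[key] = value
    return updated
```

Under this pattern, apply_update(settings, feature_list=None) clears the feature list while leaving the threshold and alert configuration as they were.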
Attributes
alert_config
Get the alert configuration for the DataDriftDetector object.
Returns
Type | Description |
---|---|
An AlertConfiguration object. |
baseline_dataset
Get the baseline dataset associated with the DataDriftDetector object.
Returns
Type | Description |
---|---|
Dataset type of the baseline dataset. |
compute_target
Get the compute target attached to the DataDriftDetector object.
Returns
Type | Description |
---|---|
The compute target. |
drift_threshold
Get the drift threshold for the DataDriftDetector object.
Returns
Type | Description |
---|---|
The drift threshold. |
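The constructor documentation states that the threshold must be between 0 and 1 and that 0.2 is used when None is specified. A minimal local check along those lines might look like the following; the helper name, the ValueError, and treating both bounds as inclusive are assumptions.

```python
DEFAULT_DRIFT_THRESHOLD = 0.2  # documented fallback when None is specified


def resolve_drift_threshold(value=None) -> float:
    """Return the effective threshold, applying the documented default and range."""
    if value is None:
        return DEFAULT_DRIFT_THRESHOLD
    # Inclusive bounds are an assumption; the reference only says "between 0 and 1".
    if not 0 <= value <= 1:
        raise ValueError("drift_threshold must be between 0 and 1")
    return float(value)
```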
drift_type
Get the type of the DataDriftDetector. 'DatasetBased' is the only value supported for now.
Returns
Type | Description |
---|---|
The type of DataDriftDetector object. |
enabled
Get the boolean value indicating whether the DataDriftDetector object is enabled.
Returns
Type | Description |
---|---|
A boolean value; True for enabled. |
feature_list
Get the list of whitelisted features for the DataDriftDetector object.
Returns
Type | Description |
---|---|
A list of feature names. |
frequency
Get the frequency of the DataDriftDetector schedule.
Returns
Type | Description |
---|---|
A string of either "Day", "Week", or "Month". |
interval
Get the interval of the DataDriftDetector schedule.
Returns
Type | Description |
---|---|
An integer value of time unit. |
latency
Get the latency of the DataDriftDetector schedule jobs (in hours).
Returns
Type | Description |
---|---|
The number of hours representing the latency. |
name
Get the name of the DataDriftDetector object.
Returns
Type | Description |
---|---|
The DataDriftDetector name. |
schedule_start
Get the start time of the schedule.
Returns
Type | Description |
---|---|
A datetime object of schedule start time in UTC. |
state
Denotes the state of the DataDriftDetector schedule.
Returns
Type | Description |
---|---|
One of 'Disabled', 'Enabled', 'Deleted', 'Disabling', 'Enabling', 'Deleting', 'Failed', 'DisableFailed', 'EnableFailed', 'DeleteFailed'. |
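When polling a detector after enable_schedule, disable_schedule, or delete, it can help to group these state values. The grouping below into settled, in-progress, and failed states is an interpretation of the names, not something this reference defines.

```python
SETTLED_STATES = {"Disabled", "Enabled", "Deleted"}
IN_PROGRESS_STATES = {"Disabling", "Enabling", "Deleting"}
FAILED_STATES = {"Failed", "DisableFailed", "EnableFailed", "DeleteFailed"}


def classify_state(state: str) -> str:
    """Map a DataDriftDetector state string to 'settled', 'in_progress', or 'failed'."""
    if state in SETTLED_STATES:
        return "settled"
    if state in IN_PROGRESS_STATES:
        return "in_progress"
    if state in FAILED_STATES:
        return "failed"
    raise ValueError(f"Unknown state: {state!r}")
```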
target_dataset
Get the target dataset associated with the DataDriftDetector object.
Returns
Type | Description |
---|---|
The dataset type of the target dataset. |
workspace
Get the workspace of the DataDriftDetector object.
Returns
Type | Description |
---|---|
The workspace the DataDriftDetector object was created in. |