DataDriftDetector Class
Defines a data drift monitor that can be used to run data drift jobs in Azure Machine Learning.
The DataDriftDetector class enables you to identify drift between a given baseline and target dataset. A DataDriftDetector object is created in a workspace by specifying the baseline and target datasets directly. For more information, see https://aka.ms/datadrift.
DataDriftDetector constructor.
The DataDriftDetector constructor is used to retrieve a cloud representation of a DataDriftDetector object associated with the provided workspace.
- Inheritance
- builtins.object
- DataDriftDetector
Constructor
DataDriftDetector(workspace, name=None, baseline_dataset=None, target_dataset=None, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)
Parameters
- target_dataset
- TabularDataset
Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series.
- compute_target
- ComputeTarget or str
Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified.
- frequency
- str
Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month".
- feature_list
- list[str]
Optional whitelisted features to run the datadrift detection on. DataDriftDetector jobs will run on all features if feature_list is not specified. The feature list can contain characters, numbers, dashes, and whitespaces. The length of the list must be less than 200.
- alert_config
- AlertConfiguration
Optional configuration object for DataDriftDetector alerts.
- drift_threshold
- float
Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default).
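The constraints stated for frequency, feature_list, and drift_threshold can be checked locally before a detector is created. The following is a minimal sketch, not the SDK's own validation: check_detector_args is a hypothetical helper, the threshold bounds are assumed to be inclusive, and "characters" is assumed to mean word characters (letters, digits, underscore).

```python
import re

ALLOWED_FREQUENCIES = ("Day", "Week", "Month")
DEFAULT_DRIFT_THRESHOLD = 0.2  # documented value used when None is passed
MAX_FEATURES = 200             # the list length must be less than this

# Assumption: word characters, dashes, and whitespace are allowed in a feature name.
_FEATURE_NAME = re.compile(r"[\w\-\s]+")


def check_detector_args(frequency=None, feature_list=None, drift_threshold=None):
    """Check arguments against the documented constraints.

    Returns the effective drift threshold (0.2 when None is passed).
    """
    if frequency is not None and frequency not in ALLOWED_FREQUENCIES:
        raise ValueError(f"frequency must be one of {ALLOWED_FREQUENCIES}, got {frequency!r}")
    if feature_list is not None:
        if len(feature_list) >= MAX_FEATURES:
            raise ValueError("feature_list must contain fewer than 200 entries")
        for name in feature_list:
            if not _FEATURE_NAME.fullmatch(name):
                raise ValueError(f"unsupported characters in feature name {name!r}")
    if drift_threshold is None:
        return DEFAULT_DRIFT_THRESHOLD
    if not 0 <= drift_threshold <= 1:
        raise ValueError("drift_threshold must be between 0 and 1")
    return drift_threshold
```

Calling check_detector_args("Week", ["my_feature_1"], None) returns 0.2, the documented default threshold.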
Remarks
A DataDriftDetector object represents a data drift job definition that can be used to run three types of job runs:
- an adhoc run for analyzing a specific day's worth of data; see the run method.
- a scheduled run in a pipeline; see the enable_schedule method.
- a backfill run to see how data changes over time; see the backfill method.
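The three run types map onto three method calls on an existing detector. The sketch below is illustrative only: start_each_run_type is a hypothetical helper, the detector is assumed to have been created or retrieved already, and in practice you would rarely trigger all three in one place.

```python
def start_each_run_type(detector, target_date, backfill_start, backfill_end):
    """Trigger one run of each type on an existing DataDriftDetector."""
    # Adhoc run: analyze a single day's worth of target data.
    adhoc_run = detector.run(target_date)
    # Scheduled run: enable the recurring pipeline at the detector's frequency.
    detector.enable_schedule()
    # Backfill run: compute drift over a historical date range.
    backfill_run = detector.backfill(backfill_start, backfill_end)
    return adhoc_run, backfill_run


# Possible usage, assuming a workspace object `ws` and an existing monitor:
# from datetime import datetime
# from azureml.datadrift import DataDriftDetector
# detector = DataDriftDetector.get_by_name(ws, 'weather-monitor')
# start_each_run_type(detector, datetime(2019, 4, 3),
#                     datetime(2019, 1, 1), datetime(2019, 4, 1))
```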
The typical pattern for creating a DataDriftDetector is:
- To create a dataset-based DataDriftDetector object, use create_from_datasets
The following example shows how to create a dataset-based DataDriftDetector object.
from azureml.datadrift import DataDriftDetector, AlertConfiguration
# replace with your email to receive alerts from the scheduled pipeline after enabling
alert_config = AlertConfiguration(['user@contoso.com'])
monitor = DataDriftDetector.create_from_datasets(ws, 'weather-monitor', baseline, target,
                                                 compute_target='cpu-cluster',  # compute target for scheduled pipeline and backfills
                                                 frequency='Week',              # how often to analyze target data
                                                 feature_list=None,             # list of features to detect drift on
                                                 drift_threshold=None,          # threshold from 0 to 1 for email alerting
                                                 latency=0,                     # SLA in hours for target data to arrive in the dataset
                                                 alert_config=alert_config)     # email addresses to send alert
Full sample is available from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datadrift-tutorial/datadrift-tutorial.ipynb
The DataDriftDetector constructor retrieves an existing data drift object associated with the workspace.
Methods
backfill |
Run a backfill job over a specified start and end date. See https://aka.ms/datadrift for details on data drift backfill runs. NOTE: Backfill is only supported on dataset-based DataDriftDetector objects. |
create_from_datasets |
Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset. |
delete |
Delete the schedule for the DataDriftDetector object. |
disable_schedule |
Disable the schedule for the DataDriftDetector object. |
enable_schedule |
Create a schedule to run a dataset-based DataDriftDetector job. |
get_by_name |
Retrieve a unique DataDriftDetector object for a given workspace and name. |
get_output |
Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window. |
list |
Get a list of DataDriftDetector objects for the specified workspace and optional dataset. NOTE: Passing in only the workspace parameter will return all DataDriftDetector objects defined in the workspace. |
run |
Run a single point in time data drift analysis. |
show |
Show the data drift trend in a given time range. By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks. |
update |
Update the schedule associated with the DataDriftDetector object. Optional parameter values can be set to None; otherwise they default to their existing values. |
backfill
Run a backfill job over a specified start and end date.
See https://aka.ms/datadrift for details on data drift backfill runs.
NOTE: Backfill is only supported on dataset-based DataDriftDetector objects.
backfill(start_date, end_date, compute_target=None, create_compute_target=False)
Parameters
- compute_target
- ComputeTarget or str
Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if none is specified.
- create_compute_target
- bool
Indicates whether an Azure Machine Learning compute target is automatically created.
Returns
A DataDriftDetector run.
Return type
create_from_datasets
Create a new DataDriftDetector object from a baseline tabular dataset and a target time series dataset.
static create_from_datasets(workspace, name, baseline_dataset, target_dataset, compute_target=None, frequency=None, feature_list=None, alert_config=None, drift_threshold=None, latency=None)
Parameters
- target_dataset
- TabularDataset
Dataset to run either adhoc or scheduled DataDrift jobs for. Must be a time series.
- compute_target
- ComputeTarget or str
Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if one is not specified.
- frequency
- str
Optional frequency indicating how often the pipeline is run. Supports "Day", "Week", or "Month".
- feature_list
- list[str]
Optional whitelisted features to run the datadrift detection on. DataDriftDetector jobs will run on all features if feature_list is not specified. The feature list can contain characters, numbers, dashes, and whitespaces. The length of the list must be less than 200.
- alert_config
- AlertConfiguration
Optional configuration object for DataDriftDetector alerts.
- drift_threshold
- float
Optional threshold to enable DataDriftDetector alerts on. The value must be between 0 and 1. A value of 0.2 is used when None is specified (the default).
Returns
A DataDriftDetector object.
Return type
Exceptions
Remarks
Dataset-based DataDriftDetectors enable you to calculate data drift between a baseline dataset, which must be a TabularDataset, and a target dataset, which must be a time series dataset. A time series dataset is simply a TabularDataset with the fine_grain_timestamp property. The DataDriftDetector can then run adhoc or scheduled jobs to determine if the target dataset has drifted from the baseline dataset.
from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector, AlertConfiguration

# Load the workspace from a local config file and look up the registered datasets.
ws = Workspace.from_config()
baseline = Dataset.get_by_name(ws, 'my_baseline_dataset')
target = Dataset.get_by_name(ws, 'my_target_dataset')

detector = DataDriftDetector.create_from_datasets(workspace=ws,
                                                  name="my_unique_detector_name",
                                                  baseline_dataset=baseline,
                                                  target_dataset=target,
                                                  compute_target='my_compute_target',
                                                  frequency="Day",
                                                  feature_list=['my_feature_1', 'my_feature_2'],
                                                  alert_config=AlertConfiguration(email_addresses=['user@contoso.com']),
                                                  drift_threshold=0.3,
                                                  latency=1)
delete
Delete the schedule for the DataDriftDetector object.
delete(wait_for_completion=True)
Parameters
disable_schedule
Disable the schedule for the DataDriftDetector object.
disable_schedule(wait_for_completion=True)
Parameters
enable_schedule
Create a schedule to run a dataset-based DataDriftDetector job.
enable_schedule(create_compute_target=False, wait_for_completion=True)
Parameters
- create_compute_target
- bool
Indicates whether an Azure Machine Learning compute target is created automatically.
get_by_name
Retrieve a unique DataDriftDetector object for a given workspace and name.
static get_by_name(workspace, name)
Parameters
Returns
A DataDriftDetector object.
Return type
get_output
Get a tuple of the drift results and metrics for a specific DataDriftDetector over a given time window.
get_output(start_time=None, end_time=None, run_id=None)
Parameters
- start_time
- datetime, optional
The start time of the results window in UTC. If None (the default) is specified, then the most recent 10th cycle's results are used as the start time. For example, if the frequency of the data drift schedule is day, then start_time is 10 days before the current day. If frequency is week, then start_time is 10 weeks before the current day.
- end_time
- datetime, optional
The end time of the results window in UTC. If None (the default) is specified, then the current day UTC is used as the end time.
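The default window described for start_time and end_time can be reproduced with plain datetime arithmetic. This is a sketch under assumptions: default_window is a hypothetical helper, and only the "Day" and "Week" frequencies are handled because those are the only cases the parameter descriptions spell out.

```python
from datetime import datetime, timedelta, timezone

# Cycle lengths for the two frequencies the parameter description gives examples for.
_CYCLE = {"Day": timedelta(days=1), "Week": timedelta(weeks=1)}


def default_window(frequency, now=None, cycles=10):
    """Return the (start_time, end_time) pair implied when both arguments are None."""
    if frequency not in _CYCLE:
        raise ValueError(f"no documented default for frequency {frequency!r}")
    # end_time defaults to the current time in UTC.
    end_time = now if now is not None else datetime.now(timezone.utc)
    # start_time reaches back the given number of cycles (10 by default).
    start_time = end_time - cycles * _CYCLE[frequency]
    return start_time, end_time
```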
Returns
A tuple of a list of drift results and a list of individual dataset and columnar metrics.
Return type
Remarks
This method returns a tuple of drift results and metrics for either a time window or a run ID, depending on the type of run: an adhoc run, a scheduled run, or a backfill run.
To retrieve adhoc run results, there is only one way: run_id should be a valid GUID.
To retrieve scheduled run and backfill run results, there are two different ways: either assign a valid GUID to run_id, or assign a specific start_time and/or end_time (inclusive) while keeping run_id as None.
If run_id, start_time, and end_time are not None in the same method call, a parameter validation exception is raised.
NOTE: Specify either the start_time and end_time parameters or the run_id parameter, but not both.
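The either/or rule for run_id and the time window can be expressed as a small pre-check. This is a sketch rather than the SDK's validation: select_query_mode is a hypothetical helper, ValueError stands in for whatever exception the SDK actually raises, and the stricter reading of the rule is used (run_id may not be combined with either time bound).

```python
import uuid


def select_query_mode(start_time=None, end_time=None, run_id=None):
    """Return 'run_id' or 'time_window' depending on which arguments are set."""
    if run_id is not None:
        if start_time is not None or end_time is not None:
            raise ValueError("Specify either run_id or start_time/end_time, not both.")
        # uuid.UUID raises ValueError when the string is not a valid GUID.
        uuid.UUID(str(run_id))
        return "run_id"
    # With run_id left as None, the results are selected by time window;
    # missing bounds fall back to the documented defaults.
    return "time_window"
```

Under this sketch, adhoc results are reachable only through the 'run_id' mode, while scheduled and backfill results can use either mode.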
It's possible that there are multiple results for the same target date (target date means the target dataset start date for dataset-based drift). Therefore, it's necessary to identify and handle duplicate results.
For dataset-based drift, if results are for the same target date, then they are duplicated results. The get_output method will dedup any duplicated results by one rule: always pick up the latest generated results.
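The dedup rule can be pictured as "group by target date, keep the newest". The sketch below is an illustration only: dedup_latest is a hypothetical helper, and the generated_at key is an assumed field holding an ISO 8601 timestamp; it does not appear in the documented output format.

```python
def dedup_latest(results):
    """Keep one result per target start date, preferring the latest generated one."""
    latest = {}
    for result in results:
        key = result["start_date"]
        # 'generated_at' is a hypothetical ISO 8601 string, so string comparison
        # orders the entries chronologically.
        if key not in latest or result["generated_at"] > latest[key]["generated_at"]:
            latest[key] = result
    return sorted(latest.values(), key=lambda r: r["start_date"])
```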
The get_output method can be used to retrieve all outputs or partial outputs of scheduled runs in a specific time range between start_time and end_time (boundary included). You can also limit the results to an individual adhoc run by specifying the run_id.
Use the following guidelines to help interpret results returned from the get_output method:
- The principle for filtering is "overlapping": as long as there is an overlap between the actual result time (dataset-based: target dataset [start date, end date]) and the given [start_time, end_time], then the result will be picked up.
- If there are multiple outputs for one target date because the drift calculation was executed several times against that day, then only the latest output will be picked by default.
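The "overlapping" filter is the standard closed-interval overlap test. A minimal sketch, assuming both intervals are inclusive at each end as the text states; overlaps is a hypothetical helper name.

```python
from datetime import date


def overlaps(result_start, result_end, window_start, window_end):
    """Return True when [result_start, result_end] intersects [window_start, window_end]."""
    # Two closed intervals overlap exactly when each one starts
    # no later than the other one ends.
    return result_start <= window_end and window_start <= result_end
```

For example, a result covering 2019-04-03 to 2019-04-04 is picked up by a window that ends on 2019-04-03, because the boundary is included.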
Because there are multiple types of data drift instance, the contents of the result can vary.
For dataset-based results, the output will look like:
results : [{'drift_type': 'DatasetBased',
            'result': [{'has_drift': True, 'drift_threshold': 0.3,
                        'start_date': '2019-04-03', 'end_date': '2019-04-04',
                        'base_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                        'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9'}]}]
metrics : [{'drift_type': 'DatasetBased',
            'metrics': [{'schema_version': '0.1',
                         'start_date': '2019-04-03', 'end_date': '2019-04-04',
                         'baseline_dataset_id': '4ac144ef-c86d-4c81-b7e5-ea6bbcd2dc7d',
                         'target_dataset_id': '13445141-aaaa-bbbb-cccc-ea23542bcaf9',
                         'dataset_metrics': [{'name': 'datadrift_coefficient', 'value': 0.53459}],
                         'column_metrics': [{'feature1': [{'name': 'datadrift_contribution',
                                                           'value': 288.0},
                                                          {'name': 'wasserstein_distance',
                                                           'value': 4.858040000000001},
                                                          {'name': 'energy_distance',
                                                           'value': 2.7204799576545313}]}]}]}]
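Given a metrics list shaped like the sample output, the dataset-level coefficient and the per-column values can be pulled out with ordinary dictionary traversal. This is a sketch that assumes the nesting shown in the sample; drift_coefficients and column_metric are hypothetical helper names, not SDK functions.

```python
def drift_coefficients(metrics):
    """Return (start_date, datadrift_coefficient) pairs from a metrics list."""
    rows = []
    for group in metrics:
        for entry in group["metrics"]:
            by_name = {m["name"]: m["value"] for m in entry["dataset_metrics"]}
            rows.append((entry["start_date"], by_name.get("datadrift_coefficient")))
    return rows


def column_metric(entry, feature, metric_name):
    """Look up one named metric for one feature inside a single metrics entry."""
    for column in entry["column_metrics"]:
        for metric in column.get(feature, []):
            if metric["name"] == metric_name:
                return metric["value"]
    return None  # the feature or the metric is not present


# Possible usage, assuming `detector` is an existing DataDriftDetector:
# results, metrics = detector.get_output()
# print(drift_coefficients(metrics))
```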
list
Get a list of DataDriftDetector objects for the specified workspace and optional dataset.
NOTE: Passing in only the workspace parameter will return all DataDriftDetector objects defined in the workspace.
static list(workspace, baseline_dataset=None, target_dataset=None)
Parameters
Returns
A list of DataDriftDetector objects.
Return type
run
Run a single point in time data drift analysis.
run(target_date, compute_target=None, create_compute_target=False, feature_list=None, drift_threshold=None)
Parameters
- compute_target
- ComputeTarget or str
Optional Azure Machine Learning ComputeTarget or ComputeTarget name. If not specified, a compute target is created automatically.
- create_compute_target
- bool
Indicates whether an Azure Machine Learning compute target is created automatically.
- feature_list
- list[str]
Optional whitelisted features to run the datadrift detection on.
Returns
A DataDriftDetector run.
Return type
show
Show the data drift trend in a given time range.
By default, this method shows the most recent 10 cycles. For example, if frequency is day, then it will be the most recent 10 days. If frequency is week, then it will be the most recent 10 weeks.
show(start_time=None, end_time=None)
Parameters
- start_time
- datetime, optional
The start of the presentation time window in UTC. The default None means to pick up the most recent 10th cycle's results.
- end_time
- datetime, optional
The end of the presentation data time window in UTC. The default None means the current day.
Returns
A dictionary of all figures. The key is service_name.
Return type
update
Update the schedule associated with the DataDriftDetector object.
Optional parameter values can be set to None; otherwise they default to their existing values.
update(compute_target=Ellipsis, feature_list=Ellipsis, schedule_start=Ellipsis, alert_config=Ellipsis, drift_threshold=Ellipsis, wait_for_completion=True)
Parameters
- compute_target
- ComputeTarget or str
Optional Azure Machine Learning ComputeTarget or ComputeTarget name. DataDriftDetector will create a compute target if this parameter is not specified.
- feature_list
- list[str]
Whitelisted features to run the datadrift detection on.
- alert_config
- AlertConfiguration
Optional configuration object for DataDriftDetector alerts.
- wait_for_completion
- bool
Whether to wait for the enable/disable/delete operations to complete.
Returns
self
Return type
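The Ellipsis defaults in the update signature act as a sentinel that distinguishes "argument not passed, keep the existing value" from "argument explicitly set to None". The sketch below shows that pattern on a plain dictionary; merge_settings is a hypothetical helper and is not how the SDK stores its settings.

```python
def merge_settings(current, drift_threshold=Ellipsis, feature_list=Ellipsis):
    """Return a copy of current with only the explicitly passed values replaced."""
    updated = dict(current)
    # Ellipsis means the caller did not pass the argument: keep the existing value.
    if drift_threshold is not Ellipsis:
        updated["drift_threshold"] = drift_threshold
    # None is a real value here, so it overwrites the existing setting.
    if feature_list is not Ellipsis:
        updated["feature_list"] = feature_list
    return updated
```

With this pattern, passing feature_list=None clears the feature list while leaving drift_threshold untouched, which a plain None default could not express.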
Attributes
alert_config
Get the alert configuration for the DataDriftDetector object.
Returns
An AlertConfiguration object.
Return type
baseline_dataset
Get the baseline dataset associated with the DataDriftDetector object.
Returns
Dataset type of the baseline dataset.
Return type
compute_target
Get the compute target attached to the DataDriftDetector object.
Returns
The compute target.
Return type
drift_threshold
Get the drift threshold for the DataDriftDetector object.
Returns
The drift threshold.
Return type
drift_type
Get the type of the DataDriftDetector. 'DatasetBased' is the only value supported for now.
Returns
The type of DataDriftDetector object.
Return type
enabled
Get the boolean value indicating whether the DataDriftDetector object is enabled.
Returns
A boolean value; True for enabled.
Return type
feature_list
Get the list of whitelisted features for the DataDriftDetector object.
Returns
A list of feature names.
Return type
frequency
Get the frequency of the DataDriftDetector schedule.
Returns
A string of either "Day", "Week", or "Month".
Return type
interval
Get the interval of the DataDriftDetector schedule.
Returns
An integer value of time unit.
Return type
latency
Get the latency of the DataDriftDetector schedule jobs (in hours).
Returns
The number of hours representing the latency.
Return type
name
schedule_start
Get the start time of the schedule.
Returns
A datetime object of schedule start time in UTC.
Return type
state
Denotes the state of the DataDriftDetector schedule.
Returns
One of 'Disabled', 'Enabled', 'Deleted', 'Disabling', 'Enabling', 'Deleting', 'Failed', 'DisableFailed', 'EnableFailed', 'DeleteFailed'.
Return type
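When reacting to the state value, it can help to group the listed strings. The grouping below is an assumption inferred from the state names alone (settled, in progress, failed); classify_state is a hypothetical helper and the SDK does not document these categories.

```python
# Assumed grouping, based only on the names of the documented states.
SETTLED = {"Disabled", "Enabled", "Deleted"}
IN_PROGRESS = {"Disabling", "Enabling", "Deleting"}
FAILED = {"Failed", "DisableFailed", "EnableFailed", "DeleteFailed"}


def classify_state(state):
    """Map a schedule state string to 'settled', 'in_progress', or 'failed'."""
    if state in SETTLED:
        return "settled"
    if state in IN_PROGRESS:
        return "in_progress"
    if state in FAILED:
        return "failed"
    raise ValueError(f"unknown state {state!r}")
```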
target_dataset
Get the target dataset associated with the DataDriftDetector object.
Returns
The dataset type of the target dataset.
Return type
workspace
Get the workspace of the DataDriftDetector object.
Returns
The workspace the DataDriftDetector object was created in.
Return type