DatasetDefinition Class
Defines a series of steps that specify how to read and transform data in a Dataset.
Note
This class is deprecated. For more information, see https://aka.ms/dataset-deprecation.
A Dataset registered in an Azure Machine Learning workspace can have multiple definitions, each created by calling update_definition. Each definition has a unique identifier. The current definition is the latest one created.
For unregistered Datasets, only one definition exists.
Dataset definitions support all the transformations listed for the azureml.dataprep.Dataflow class; see http://aka.ms/azureml/howto/transformdata. To learn more about Dataset definitions, go to https://aka.ms/azureml/howto/versiondata.
Initialize the Dataset definition object.
Inheritance
azureml.dataprep.api.engineless_dataflow.EnginelessDataflow
DatasetDefinition
Constructor
DatasetDefinition(workspace=None, dataset_id=None, version_id=None, dataflow=None, dataflow_json=None, notes=None, etag=None, created_time=None, modified_time=None, state=None, deprecated_by_dataset_id=None, deprecated_by_definition_version=None, data_path=None, dataset=None, file_type='Unknown')
Parameters
Name | Description |
---|---|
workspace (Required) | The workspace the Dataset is registered in. |
dataset_id (Required) | The Dataset identifier. |
version_id (Required) | The definition version. |
dataflow (Required) | The Dataflow object. |
dataflow_json (Required) | The Dataflow json. |
notes (Required) | Optional information about the definition. |
etag (Required) | Etag. |
created_time (Required) | The creation time of the definition. |
modified_time (Required) | The last modified time of the definition. |
deprecated_by_dataset_id (Required) | The ID of the Dataset that deprecates this definition. |
deprecated_by_definition_version (Required) | The version of the definition that deprecates this definition. |
data_path (Required) | The data path. |
dataset (Required) | The parent Dataset object. |
Methods
Name | Description |
---|---|
archive | Archive the dataset definition. |
create_snapshot | Create a snapshot of the registered Dataset. |
deprecate | Deprecate the Dataset, with a pointer to the new Dataset. |
reactivate | Reactivate the dataset definition. Works on dataset definitions that have been deprecated or archived. |
to_pandas_dataframe | Create a Pandas dataframe by executing the transformation pipeline defined by this dataset definition. |
to_spark_dataframe | Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataflow. |
archive
Archive the dataset definition.
archive()
Returns
Type | Description |
---|---|
None. |
Remarks
After archival, any attempt to retrieve the dataset will result in an error. If archived by accident, use reactivate to activate it.
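The archive and reactivate calls can be paired as in the minimal sketch below. It assumes `definition` is a DatasetDefinition instance obtained elsewhere; the helper name `archive_and_restore` is hypothetical and not part of the SDK.

```python
def archive_and_restore(definition):
    """Archive a dataset definition, then undo the archival.

    `definition` is assumed to be a DatasetDefinition instance obtained
    elsewhere; this helper is illustrative and not part of the SDK.
    """
    # Per the remarks above, attempts to retrieve the dataset fail after this call.
    definition.archive()
    # reactivate() works on definitions that have been deprecated or archived.
    definition.reactivate()
```

Both methods take no arguments and return None, so the helper only sequences the two calls.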
create_snapshot
Create a snapshot of the registered Dataset.
create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)
Parameters
Name | Description |
---|---|
snapshot_name (Required) | The snapshot name. Snapshot names should be unique within a Dataset. |
compute_target | ComputeTarget or str. The compute target on which to create the snapshot profile. If omitted, the local compute is used. Default value: None |
create_data_snapshot | If True, a materialized copy of the data will be created. Default value: False |
target_datastore | The datastore in which to save the snapshot. If omitted, the snapshot will be created in the default storage of the workspace. Default value: None |
Returns
Type | Description |
---|---|
A DatasetSnapshot object. |
Remarks
Snapshots capture point-in-time summary statistics of the underlying data and, optionally, a copy of the data itself. To learn more about creating snapshots, go to https://aka.ms/azureml/howto/createsnapshots.
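As a rough illustration of the parameters above, the sketch below wraps create_snapshot in a hypothetical helper, `snapshot_definition`. The timestamp-based naming scheme is an assumption made here to keep snapshot names unique within a Dataset; it is not something the SDK prescribes. `definition` is assumed to be a DatasetDefinition obtained elsewhere.

```python
from datetime import datetime, timezone


def snapshot_definition(definition, prefix="snapshot", copy_data=False):
    """Create a snapshot with a timestamped name (hypothetical helper).

    `definition` is assumed to be a DatasetDefinition instance.
    """
    # Assumed naming scheme: prefix plus a UTC timestamp, for example
    # "snapshot-20240101T120000". Two calls within the same second
    # would produce the same name.
    name = f"{prefix}-{datetime.now(timezone.utc):%Y%m%dT%H%M%S}"
    return definition.create_snapshot(
        snapshot_name=name,
        compute_target=None,             # None: local compute is used
        create_data_snapshot=copy_data,  # True: also materialize a copy of the data
        target_datastore=None,           # None: default storage of the workspace
    )
```

Passing `copy_data=True` asks for a materialized copy of the data in addition to the summary statistics.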
deprecate
Deprecate the Dataset, with a pointer to the new Dataset.
deprecate(deprecate_by_dataset_id, deprecated_by_definition_version=None)
Parameters
Name | Description |
---|---|
deprecate_by_dataset_id (Required) | The ID of the dataset that is responsible for the deprecation of the current dataset. |
deprecated_by_definition_version | The version of the dataset definition that is responsible for the deprecation of the current dataset definition. Default value: None |
Returns
Type | Description |
---|---|
None. |
Remarks
Deprecated dataset definitions will log warnings when they are consumed. To completely block a dataset definition from being consumed, archive it.
If a dataset definition is deprecated by accident, use reactivate to activate it.
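The choice between deprecating and archiving described in these remarks could be expressed as in the sketch below. The helper name `retire_definition` and its `block` flag are hypothetical; only the deprecate and archive calls come from this class. `definition` is assumed to be a DatasetDefinition obtained elsewhere.

```python
def retire_definition(definition, replacement_dataset_id,
                      replacement_version=None, block=False):
    """Deprecate a definition, or archive it to block consumption entirely.

    Hypothetical helper; `definition` is assumed to be a DatasetDefinition.
    """
    if block:
        # Archiving completely blocks the definition from being consumed.
        definition.archive()
    else:
        # Deprecated definitions stay usable but log warnings when consumed.
        definition.deprecate(
            deprecate_by_dataset_id=replacement_dataset_id,
            deprecated_by_definition_version=replacement_version,
        )
```

Either outcome can be reversed with reactivate.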
reactivate
Reactivate the dataset definition.
Works on dataset definitions that have been deprecated or archived.
reactivate()
Returns
Type | Description |
---|---|
None. |
to_pandas_dataframe
Create a Pandas dataframe by executing the transformation pipeline defined by this dataset definition.
to_pandas_dataframe()
Returns
Type | Description |
---|---|
A Pandas DataFrame. |
Remarks
Returns a Pandas DataFrame that is fully materialized in memory.
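Because the result is fully materialized, the whole dataset is loaded before any inspection happens. A minimal sketch, assuming `definition` is a DatasetDefinition obtained elsewhere and using a hypothetical helper name, `preview_rows`:

```python
def preview_rows(definition, rows=5):
    """Return the first `rows` rows of the definition's data.

    Hypothetical helper; `definition` is assumed to be a DatasetDefinition.
    """
    # Runs the full transformation pipeline and loads the result into memory.
    frame = definition.to_pandas_dataframe()
    # head() only trims the already-loaded DataFrame; it does not reduce
    # the amount of data read.
    return frame.head(rows)
```

For data that does not fit in memory, this approach is unlikely to be suitable.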
to_spark_dataframe
Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataflow.
to_spark_dataframe()
Returns
Type | Description |
---|---|
A Spark DataFrame. |
Remarks
The Spark DataFrame returned is only an execution plan and does not actually contain any data, because Spark DataFrames are lazily evaluated.
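The lazy behavior can be seen in a sketch like the one below, where no data is read until a Spark action is called. It assumes `definition` is a DatasetDefinition obtained elsewhere and that a Spark environment is available; the helper name `count_rows_with_spark` is hypothetical.

```python
def count_rows_with_spark(definition):
    """Count rows by running the definition's pipeline on Spark.

    Hypothetical helper; `definition` is assumed to be a DatasetDefinition.
    """
    # Only an execution plan is built here; no data has been read yet.
    spark_df = definition.to_spark_dataframe()
    # count() is a Spark action, so the transformation pipeline executes here.
    return spark_df.count()
```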