DatasetDefinition Class
Defines a series of steps that specify how to read and transform data in a Dataset.
Note
This class is deprecated. For more information, see https://aka.ms/dataset-deprecation.
A Dataset registered in an Azure Machine Learning workspace can have multiple definitions, each created by calling update_definition. Each definition has a unique identifier. The current definition is the most recently created one.
For unregistered Datasets, only one definition exists.
Dataset definitions support all the transformations listed for the azureml.dataprep.Dataflow class; see http://aka.ms/azureml/howto/transformdata. To learn more about Dataset definitions, go to https://aka.ms/azureml/howto/versiondata.
Initialize the Dataset definition object.
- Inheritance
- azureml.dataprep.api.engineless_dataflow.EnginelessDataflow -> DatasetDefinition
Constructor
DatasetDefinition(workspace=None, dataset_id=None, version_id=None, dataflow=None, dataflow_json=None, notes=None, etag=None, created_time=None, modified_time=None, state=None, deprecated_by_dataset_id=None, deprecated_by_definition_version=None, data_path=None, dataset=None, file_type='Unknown')
Parameters
- dataflow_json
The Dataflow json.
- deprecated_by_definition_version
- str
The version of the definition that deprecates this definition.
Methods
archive | Archive the dataset definition.
create_snapshot | Create a snapshot of the registered Dataset.
deprecate | Deprecate the Dataset, with a pointer to the new Dataset.
reactivate | Reactivate the dataset definition. Works on dataset definitions that have been deprecated or archived.
to_pandas_dataframe | Create a Pandas DataFrame by executing the transformation pipeline defined by this dataset definition.
to_spark_dataframe | Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataflow.
archive
Archive the dataset definition.
archive()
Returns
None.
Return type
Remarks
After archival, any attempt to retrieve the dataset will result in an error. If archived by accident, use reactivate to activate it.
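The archive and recovery flow described above can be sketched as follows. This is a minimal illustration, assuming `definition` is a DatasetDefinition already obtained from a registered Dataset; the helper name `archive_then_restore` is hypothetical and not part of the SDK.

```python
def archive_then_restore(definition):
    """Archive a dataset definition, then undo the archival.

    `definition` is assumed to expose the archive() and reactivate()
    methods documented on this page. Illustrative helper only.
    """
    # Archive the definition; per the remarks above, retrieval
    # attempts are expected to fail after this call.
    definition.archive()

    # Undo an accidental archival so the definition is usable again.
    definition.reactivate()
```

In practice the two calls would be separated in time; they are shown together only to make the recovery path explicit.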
create_snapshot
Create a snapshot of the registered Dataset.
create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)
Parameters
- compute_target
- ComputeTarget or str
The compute target on which to create the snapshot profile. If omitted, the local compute is used.
- create_data_snapshot
- bool
If True, a materialized copy of the data will be created.
- target_datastore
- Union[AbstractAzureStorageDatastore, str]
The target datastore in which to save the snapshot. If omitted, the snapshot is created in the default storage of the workspace.
Returns
A DatasetSnapshot object.
Return type
Remarks
Snapshots capture point-in-time summary statistics of the underlying data and, optionally, a copy of the data itself. To learn more about creating snapshots, go to https://aka.ms/azureml/howto/createsnapshots.
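As a rough sketch of how the parameters above fit together, the helper below requests a snapshot that also includes a materialized copy of the data. The function name `snapshot_with_data` is hypothetical; `definition` is assumed to be a DatasetDefinition belonging to a registered Dataset, and the keyword names are taken from the signature shown above.

```python
def snapshot_with_data(definition, snapshot_name,
                       compute_target=None, target_datastore=None):
    """Create a snapshot that also stores a copy of the data.

    Illustrative helper only; it simply forwards to the
    create_snapshot method documented on this page.
    """
    return definition.create_snapshot(
        snapshot_name=snapshot_name,
        # None means the local compute is used for profile creation.
        compute_target=compute_target,
        # True asks for a materialized copy in addition to statistics.
        create_data_snapshot=True,
        # None means the workspace's default storage is used.
        target_datastore=target_datastore,
    )
```

Leaving `create_data_snapshot` at its default of False would capture only the summary statistics.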
deprecate
Deprecate the Dataset, with a pointer to the new Dataset.
deprecate(deprecate_by_dataset_id, deprecated_by_definition_version=None)
Parameters
- deprecate_by_dataset_id
- uuid
The ID of the dataset that deprecates the current dataset.
- deprecated_by_definition_version
- str
The version of the dataset definition that deprecates the current dataset definition.
Returns
None.
Return type
Remarks
Deprecated dataset definitions will log warnings when they are consumed. To completely block a dataset definition from being consumed, archive it.
If a dataset definition is deprecated by accident, use reactivate to activate it.
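A minimal sketch of pointing consumers at a replacement, assuming `definition` is the DatasetDefinition being retired. The helper name `deprecate_in_favor_of` and its argument names are hypothetical; the keyword names passed to deprecate come from the signature above.

```python
def deprecate_in_favor_of(definition, new_dataset_id,
                          new_definition_version=None):
    """Mark a definition as deprecated, pointing at its replacement.

    Illustrative helper only. Per the remarks above, deprecated
    definitions remain consumable but log warnings when consumed.
    """
    definition.deprecate(
        deprecate_by_dataset_id=new_dataset_id,
        deprecated_by_definition_version=new_definition_version,
    )
```

If consumption must be blocked entirely rather than warned about, archive is the documented alternative.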
reactivate
Reactivate the dataset definition.
Works on dataset definitions that have been deprecated or archived.
reactivate()
Returns
None.
Return type
to_pandas_dataframe
Create a Pandas DataFrame by executing the transformation pipeline defined by this dataset definition.
to_pandas_dataframe()
Returns
A Pandas DataFrame.
Return type
Remarks
The returned Pandas DataFrame is fully materialized in memory, so the entire result of the pipeline must fit in memory.
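Because the result is fully materialized, a quick preview still loads everything first. The sketch below illustrates that, assuming `definition` is a DatasetDefinition; the helper name `preview_rows` is hypothetical.

```python
def preview_rows(definition, n=5):
    """Return the first n rows of the definition's data.

    Illustrative helper only. The full pipeline runs and the whole
    result is loaded into memory before head() trims it.
    """
    # Executes the transformation pipeline and materializes the result.
    df = definition.to_pandas_dataframe()
    # Trimming happens only after materialization.
    return df.head(n)
```

For data that does not fit in memory, the Spark variant described next may be a better fit.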
to_spark_dataframe
Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataflow.
to_spark_dataframe()
Returns
A Spark DataFrame.
Return type
Remarks
The returned Spark DataFrame is only an execution plan and does not contain any data, because Spark DataFrames are lazily evaluated.
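The lazy behavior can be illustrated with a small sketch, assuming `definition` is a DatasetDefinition and that a Spark environment is available at run time. The helper name `count_rows_with_spark` is hypothetical.

```python
def count_rows_with_spark(definition):
    """Count rows by running the pipeline through Spark.

    Illustrative helper only.
    """
    # Builds the execution plan; no data is read at this point.
    sdf = definition.to_spark_dataframe()
    # count() is a Spark action, so this is where execution happens.
    return sdf.count()
```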