DatasetDefinition Class

Defines a series of steps that specify how to read and transform data in a Dataset.

Note

This class is deprecated. For more information, see https://aka.ms/dataset-deprecation.

A Dataset registered in an Azure Machine Learning workspace can have multiple definitions, each created by calling update_definition. Each definition has a unique identifier. The current definition is the latest one created.

For unregistered Datasets, only one definition exists.

Dataset definitions support all the transformations listed for the azureml.dataprep.Dataflow class; see http://aka.ms/azureml/howto/transformdata. To learn more about dataset definitions, go to https://aka.ms/azureml/howto/versiondata.

Initialize the Dataset definition object.

Inheritance
azureml.dataprep.api.engineless_dataflow.EnginelessDataflow
DatasetDefinition

Constructor

DatasetDefinition(workspace=None, dataset_id=None, version_id=None, dataflow=None, dataflow_json=None, notes=None, etag=None, created_time=None, modified_time=None, state=None, deprecated_by_dataset_id=None, deprecated_by_definition_version=None, data_path=None, dataset=None, file_type='Unknown')

Parameters

Name Description
workspace
Required
str

The workspace the Dataset is registered in.

dataset_id
Required
str

The Dataset identifier.

version_id
Required
str

The definition version.

dataflow
Required
str

The Dataflow object.

dataflow_json
Required

The Dataflow JSON.

notes
Required
str

Optional information about the definition.

etag
Required
str

Etag.

created_time
Required

The creation time of the definition.

modified_time
Required

The last modified time of the definition.

deprecated_by_dataset_id
Required
str

The ID of the Dataset that deprecates this definition.

deprecated_by_definition_version
Required
str

The version of the definition that deprecates this definition.

data_path
Required

The data path.

dataset
Required

The parent Dataset object.

Methods

archive

Archive the dataset definition.

create_snapshot

Create a snapshot of the registered Dataset.

deprecate

Deprecate the Dataset, with a pointer to the new Dataset.

reactivate

Reactivate the dataset definition.

Works on dataset definitions that have been deprecated or archived.

to_pandas_dataframe

Create a Pandas DataFrame by executing the transformation pipeline defined by this dataset definition.

to_spark_dataframe

Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataflow.

archive

Archive the dataset definition.

archive()

Returns

Type Description

None.

Remarks

After archival, any attempt to retrieve the dataset will result in an error. If a definition is archived by accident, use reactivate to restore it.
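The pairing of archive and reactivate can be sketched as a small helper that archives a definition and hands back the call that reverses it. This is only an illustration: the helper name `archive_with_undo` is hypothetical and not part of the SDK, and `definition` is assumed to be a DatasetDefinition already obtained from a registered Dataset.

```python
def archive_with_undo(definition):
    """Archive a dataset definition and return a callable that reverses it.

    `definition` is assumed to be a DatasetDefinition, or any object that
    exposes the archive() and reactivate() methods documented on this page.
    The helper itself is hypothetical and not part of the SDK.
    """
    # Per the remarks above, retrieval results in an error after this call.
    definition.archive()
    # Hand back the bound method so the caller can undo an accidental archive.
    return definition.reactivate
```

Calling the returned object, for example `undo = archive_with_undo(definition)` followed by `undo()`, reactivates the definition.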

create_snapshot

Create a snapshot of the registered Dataset.

create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)

Parameters

Name Description
snapshot_name
Required
str

The snapshot name. Snapshot names should be unique within a Dataset.

compute_target

The compute target on which to create the snapshot profile. If omitted, the local compute is used.

Default value: None
create_data_snapshot

If True, a materialized copy of the data will be created.

Default value: False
target_datastore

The datastore in which to save the snapshot. If omitted, the snapshot will be created in the default storage of the workspace.

Default value: None

Returns

Type Description

A DatasetSnapshot object.

Remarks

Snapshots capture point-in-time summary statistics of the underlying data and, optionally, a copy of the data itself. To learn more about creating snapshots, go to https://aka.ms/azureml/howto/createsnapshots.
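Because snapshot names should be unique within a Dataset, one possible approach is to derive the name from a timestamp. The following is a minimal sketch under stated assumptions: the helper `snapshot_definition` and its `prefix` and `copy_data` parameters are hypothetical, `definition` is assumed to be a DatasetDefinition, and only the create_snapshot signature shown above is used.

```python
from datetime import datetime, timezone


def snapshot_definition(definition, prefix="snapshot", copy_data=False):
    """Create a snapshot with a timestamp-based name (hypothetical helper).

    `definition` is assumed to expose create_snapshot() with the signature
    documented on this page.
    """
    # A UTC timestamp suffix keeps names distinct down to one-second
    # resolution; two calls within the same second would still collide.
    name = f"{prefix}_{datetime.now(timezone.utc):%Y%m%dT%H%M%S}"
    return definition.create_snapshot(
        snapshot_name=name,
        compute_target=None,            # None: use the local compute
        create_data_snapshot=copy_data, # True: also materialize a data copy
        target_datastore=None,          # None: workspace default storage
    )
```

With the defaults, only the summary statistics are captured; passing `copy_data=True` additionally requests a materialized copy of the data.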

deprecate

Deprecate the Dataset, with a pointer to the new Dataset.

deprecate(deprecate_by_dataset_id, deprecated_by_definition_version=None)

Parameters

Name Description
deprecate_by_dataset_id
Required

The ID of the dataset that is responsible for deprecating the current dataset.

deprecated_by_definition_version
str

The version of the dataset definition that is responsible for deprecating the current dataset definition.

Default value: None

Returns

Type Description

None.

Remarks

Deprecated dataset definitions will log warnings when they are consumed. To completely block a dataset definition from being consumed, archive it.

If a dataset definition is deprecated by accident, use reactivate to activate it.
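The difference between deprecating (consumers see a warning) and archiving (consumption is blocked) can be sketched as follows. The helper `retire_definition` and its `block` flag are hypothetical names introduced for illustration; `definition` is assumed to be a DatasetDefinition, and the ID and version values are supplied by the caller.

```python
def retire_definition(definition, new_dataset_id, new_version=None, block=False):
    """Deprecate a definition and optionally archive it (hypothetical helper).

    `definition` is assumed to expose the deprecate() and archive() methods
    documented on this page.
    """
    # Point consumers at the replacement; they will see a logged warning.
    definition.deprecate(
        deprecate_by_dataset_id=new_dataset_id,
        deprecated_by_definition_version=new_version,
    )
    if block:
        # Archiving goes further and blocks consumption entirely.
        definition.archive()
```

Either step can be undone by calling reactivate on the same definition.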

reactivate

Reactivate the dataset definition.

Works on dataset definitions that have been deprecated or archived.

reactivate()

Returns

Type Description

None.

to_pandas_dataframe

Create a Pandas DataFrame by executing the transformation pipeline defined by this dataset definition.

to_pandas_dataframe()

Returns

Type Description

A Pandas DataFrame.

Remarks

Returns a Pandas DataFrame that is fully materialized in memory.
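Since the result is held entirely in memory, it can be useful to check its footprint after loading. The sketch below assumes `definition` is a DatasetDefinition; the helper name `load_with_footprint` is hypothetical, while `memory_usage(deep=True)` is the standard pandas DataFrame method.

```python
def load_with_footprint(definition):
    """Materialize a definition and report its size in bytes (hypothetical helper).

    `definition` is assumed to expose to_pandas_dataframe() as documented
    on this page.
    """
    # Executes the transformation pipeline and loads every row into memory.
    df = definition.to_pandas_dataframe()
    # deep=True also counts the contents of object (for example, string) columns.
    n_bytes = int(df.memory_usage(deep=True).sum())
    return df, n_bytes
```

For data that does not fit in memory, the Spark path described next may be a better fit.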

to_spark_dataframe

Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataflow.

to_spark_dataframe()

Returns

Type Description

A Spark DataFrame.

Remarks

The Spark DataFrame returned is only an execution plan and does not actually contain any data, because Spark DataFrames are lazily evaluated.
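To illustrate the lazy evaluation, the sketch below separates building the plan from running it. It assumes `definition` is a DatasetDefinition used in an environment where Spark is available; the helper name `count_rows_with_spark` is hypothetical, and `count()` is a standard Spark DataFrame action.

```python
def count_rows_with_spark(definition):
    """Build the Spark plan, then trigger it with an action (hypothetical helper).

    `definition` is assumed to expose to_spark_dataframe() as documented
    on this page.
    """
    # No data is read here; this only constructs the execution plan.
    spark_df = definition.to_spark_dataframe()
    # count() is an action, so this is the point where Spark runs the plan.
    return spark_df.count()
```

Any other Spark action, such as collecting or writing the data, would trigger execution in the same way.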