AbstractDataset Class

Base class of datasets in Azure Machine Learning.

To create dataset instances, use the TabularDatasetFactory and FileDatasetFactory classes.

Class AbstractDataset constructor.

Do not invoke this constructor directly. Datasets are intended to be created with the TabularDatasetFactory and FileDatasetFactory classes.

Constructor

AbstractDataset()

Methods

add_tags	Add key value pairs to the tags dictionary of this dataset.
as_named_input	Provide a name for this dataset which will be used to retrieve the materialized dataset in the run.
get_all	Get all the registered datasets in the workspace.
get_by_id	Get a Dataset which is saved to the workspace.
get_by_name	Get a registered Dataset from workspace by its registration name.
get_partition_key_values	Return the unique key values of partition_keys. If partition_keys is None, return the unique key combinations for the full set of partition keys of this dataset.
register	Register the dataset to the provided workspace.
remove_tags	Remove the specified keys from tags dictionary of this dataset.
unregister_all_versions	Unregister all versions under the registration name of this dataset from the workspace.
update	Perform an in-place update of the dataset.

add_tags

Add key value pairs to the tags dictionary of this dataset.

add_tags(tags=None)

Parameters

Name	Description
tags Required	dict[str, str] The dictionary of tags to add.

Returns

Type	Description
Union[TabularDataset, FileDataset]	The updated dataset object.
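
As a minimal sketch of calling add_tags, the helper below attaches two tags and returns the updated dataset. The helper name tag_for_review and the tag keys owner and stage are hypothetical and chosen only for illustration; dataset is assumed to be a TabularDataset or FileDataset created elsewhere.

```python
def tag_for_review(dataset, owner, stage):
    """Attach ownership and lifecycle tags to a dataset.

    `dataset` is assumed to expose add_tags(tags=...), as documented above.
    The tag keys used here are illustrative, not required by the SDK.
    """
    new_tags = {"owner": str(owner), "stage": str(stage)}
    # add_tags returns the updated dataset object.
    return dataset.add_tags(tags=new_tags)
```

Tag values are converted to strings because the parameter is documented as dict[str, str].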

as_named_input

Provide a name for this dataset which will be used to retrieve the materialized dataset in the run.

as_named_input(name)

Parameters

Name	Description
name Required	str The name of the dataset for the run.

Returns

Type	Description
DatasetConsumptionConfig	The configuration object describing how the Dataset should be materialized in the run.

Remarks

The name here will only be applicable inside an Azure Machine Learning run. The name must only contain alphanumeric and underscore characters so it can be made available as an environment variable. You can use this name to retrieve the dataset in the context of a run using two approaches:

Environment Variable:

The name will be the environment variable name and the materialized dataset will be made available as the value of the environment variable. If the dataset is downloaded or mounted, the value will be the downloaded/mounted path. For example:


   # in your job submission notebook/script:
   dataset.as_named_input('foo').as_download('/tmp/dataset')

   # in the script that will be executed in the run
   import os
   path = os.environ['foo'] # path will be /tmp/dataset

Note

If the dataset is set to direct mode, the value will be the dataset ID. You can then retrieve the dataset object by calling Dataset.get_by_id(workspace, os.environ['foo']), where workspace is your Workspace object.

Run.input_datasets:

This is a dictionary where the key is the dataset name you specified in this method and the value is the materialized dataset. For downloaded and mounted datasets, the value is the downloaded/mounted path. For direct mode, the value is the same dataset object you specified in your job submission script.


   # in your job submission notebook/script:
   dataset.as_named_input('foo') # direct mode

   # in the script that will be executed in the run
   from azureml.core import Run
   run = Run.get_context()
   run.input_datasets['foo'] # this returns the dataset object from above.
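
Because the name is exposed as an environment variable, it can help to check it before submitting a job. The following is a rough pre-check, assuming "alphanumeric" means ASCII letters and digits; the helper is_valid_input_name is hypothetical and the SDK's own validation may differ.

```python
import re

# Assumption: ASCII letters, digits, and underscores only.
_INPUT_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_input_name(name):
    """Return True if `name` contains only alphanumeric and underscore characters."""
    return bool(_INPUT_NAME_PATTERN.fullmatch(name))
```

For example, is_valid_input_name('training_data') passes, while a name containing a hyphen or a space does not.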

get_all

Get all the registered datasets in the workspace.

static get_all(workspace)

Parameters

Name	Description
workspace Required	Workspace The existing AzureML workspace in which the Datasets were registered.

Returns

Type	Description
dict[str, Union[TabularDataset, FileDataset]]	A dictionary of TabularDataset and FileDataset objects keyed by their registration name.
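
The returned dictionary can be inspected like any other mapping. The sketch below is illustrative only: summarize_registered and list_workspace_datasets are hypothetical helpers, and Workspace.from_config() assumes a workspace configuration file is available locally.

```python
def summarize_registered(datasets):
    """Map each registration name to the class name of its dataset object.

    `datasets` is the dictionary returned by Dataset.get_all(workspace).
    Names are returned in sorted order.
    """
    return {name: type(ds).__name__ for name, ds in sorted(datasets.items())}


def list_workspace_datasets():
    """Fetch and summarize all registered datasets (requires azureml-core)."""
    # Imported lazily so summarize_registered works without the SDK installed.
    from azureml.core import Dataset, Workspace

    workspace = Workspace.from_config()  # assumes a local workspace config file
    return summarize_registered(Dataset.get_all(workspace))
```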

get_by_id

Get a Dataset which is saved to the workspace.

static get_by_id(workspace, id, **kwargs)

Parameters

Name	Description
workspace Required	Workspace The existing AzureML workspace in which the Dataset is saved.
id Required	str The ID of the dataset.

Returns

Type	Description
Union[TabularDataset, FileDataset]	The dataset object. If the dataset is registered, its registration name and version will also be returned.
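
One place get_by_id is useful is inside a run that received a direct-mode named input, where the environment variable holds the dataset ID. A minimal sketch, assuming that setup; the helper names dataset_id_from_env and load_direct_mode_input are hypothetical.

```python
import os


def dataset_id_from_env(input_name, environ=None):
    """Read the dataset ID stored under the named input's environment variable."""
    environ = os.environ if environ is None else environ
    try:
        return environ[input_name]
    except KeyError:
        raise KeyError(f"No named input {input_name!r} found in the environment") from None


def load_direct_mode_input(workspace, input_name):
    """Resolve a direct-mode named input to a dataset object (requires azureml-core)."""
    # Imported lazily so dataset_id_from_env works without the SDK installed.
    from azureml.core import Dataset

    return Dataset.get_by_id(workspace, dataset_id_from_env(input_name))
```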

get_by_name

Get a registered Dataset from workspace by its registration name.

static get_by_name(workspace, name, version='latest', **kwargs)

Parameters

Name	Description
workspace Required	Workspace The existing AzureML workspace in which the Dataset was registered.
name Required	str The registration name.
version	int The registration version. Optional; defaults to 'latest'.

Returns

Type	Description
Union[TabularDataset, FileDataset]	The registered dataset object.
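
The version argument accepts a version number, while the default in the signature is the string 'latest'. The sketch below shows one way to normalize a caller-supplied value before the lookup; normalize_version and load_dataset are hypothetical helpers, not part of the SDK.

```python
def normalize_version(version=None):
    """Return 'latest' when no version is given, otherwise an int version."""
    if version is None or version == "latest":
        return "latest"
    return int(version)


def load_dataset(workspace, name, version=None):
    """Fetch a registered dataset by name (requires azureml-core)."""
    # Imported lazily so normalize_version works without the SDK installed.
    from azureml.core import Dataset

    return Dataset.get_by_name(workspace, name, version=normalize_version(version))
```

Pinning an explicit version, rather than relying on 'latest', is one way to keep a run reproducible when new versions are registered later.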

get_partition_key_values

Return unique key values of partition_keys.

Validates that partition_keys is a subset of the full set of partition keys of this dataset, then returns the unique values of those keys. If partition_keys is None, returns the unique key combinations for the full set of partition keys.


   # get all partition key value pairs
   partitions = ds.get_partition_key_values()
   # Return [{'country': 'US', 'state': 'WA', 'partition_date': datetime('2020-1-1')}]

   partitions = ds.get_partition_key_values(['country'])
   # Return [{'country': 'US'}]

get_partition_key_values(partition_keys=None)

Parameters

Name	Description
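partition_keys	list[str] The partition keys to return unique values for. Optional; defaults to None, which uses the full set of partition keys of this dataset.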
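
To make the described behavior concrete, here is a plain-Python sketch of the same idea applied to a list of dictionaries. It is not the SDK's implementation; unique_key_values and its arguments are hypothetical and exist only to illustrate subset validation and de-duplication.

```python
def unique_key_values(rows, all_partition_keys, partition_keys=None):
    """Return the unique combinations of the requested partition keys.

    rows: list of dicts, one per partition, holding every partition key.
    all_partition_keys: the full, ordered list of partition keys.
    partition_keys: subset to project onto; None means the full set.
    """
    keys = list(all_partition_keys) if partition_keys is None else list(partition_keys)

    # Validate that the requested keys are a subset of the full set.
    unknown = [key for key in keys if key not in all_partition_keys]
    if unknown:
        raise ValueError(f"Unknown partition keys: {unknown}")

    seen = set()
    result = []
    for row in rows:
        combo = tuple(row[key] for key in keys)
        if combo not in seen:  # keep the first occurrence of each combination
            seen.add(combo)
            result.append(dict(zip(keys, combo)))
    return result
```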

register

Register the dataset to the provided workspace.

register(workspace, name, description=None, tags=None, create_new_version=False)

Parameters

Name	Description
workspace Required	Workspace The workspace in which to register the dataset.
name Required	str The name to register the dataset with.
description	str A text description of the dataset. Optional; defaults to None.
tags	dict[str, str] A dictionary of key-value tags to give the dataset. Optional; defaults to None.
create_new_version	bool Whether to register the dataset as a new version under the specified name. Optional; defaults to False.

Returns

Type	Description
Union[TabularDataset, FileDataset]	The registered dataset object.
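
A short sketch of registering a dataset as a new version under an existing name. The helper register_new_version is hypothetical; dataset and workspace are assumed to be a dataset object and a Workspace obtained elsewhere.

```python
def register_new_version(dataset, workspace, name, description=None, tags=None):
    """Register `dataset` under `name`, always requesting a new version."""
    return dataset.register(
        workspace=workspace,
        name=name,
        description=description,
        tags=tags,
        create_new_version=True,  # the documented default is False
    )
```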

remove_tags

Remove the specified keys from tags dictionary of this dataset.

remove_tags(tags=None)

Parameters

Name	Description
tags Required	list[str] The list of keys to remove.

Returns

Type	Description
Union[TabularDataset, FileDataset]	The updated dataset object.
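
As an illustration, the hypothetical helper below removes a set of tag keys in one call. It only wraps remove_tags and makes no assumption about how the SDK treats keys that are not present.

```python
def drop_tags(dataset, *keys):
    """Remove the given tag keys from `dataset` and return the updated dataset."""
    # remove_tags expects a list of keys, so the varargs are converted.
    return dataset.remove_tags(tags=list(keys))
```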

unregister_all_versions

Unregister all versions under the registration name of this dataset from the workspace.

unregister_all_versions()

Remarks

The operation does not change any source data.

update

Perform an in-place update of the dataset.

update(description=None, tags=None)

Parameters

Name	Description
description	str The new description for the dataset. It replaces the existing description. Optional; defaults to the existing description. To clear the description, pass an empty string.
tags	dict[str, str] A dictionary of tags to update the dataset with. These tags replace the existing tags for the dataset. Optional; defaults to the existing tags. To clear the tags, pass an empty dictionary.

Returns

Type	Description
Union[TabularDataset, FileDataset]	The updated dataset object.
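
Since update replaces rather than merges, omitting an argument and passing an empty value mean different things. The sketch below illustrates both cases with two hypothetical helpers, set_description_only and clear_metadata.

```python
def set_description_only(dataset, description):
    """Replace the description; tags are left as they are because tags is omitted."""
    return dataset.update(description=description)


def clear_metadata(dataset):
    """Clear both the description and the tags by passing empty values."""
    return dataset.update(description="", tags={})
```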

Attributes

data_changed_time

Return the source data changed time.

Returns

Type	Description
datetime	The time when the most recent change happened to source data.

Remarks

Data changed time is available for file-based data sources. None is returned when the data source does not support checking when a change happened.

description

Return the registration description.

Returns

Type	Description
str	Dataset description.

id

Return the identifier of the dataset.

Returns

Type	Description
str	Dataset id. If the dataset is not saved to any workspace, the id will be None.

name

Return the registration name.

Returns

Type	Description
str	Dataset name.

partition_keys

Return the partition keys.

Returns

Type	Description
list[str]	The partition keys.

tags

Return the tags.

Returns

Type	Description
str	Dataset tags.

version

Return the registration version.

Returns

Type	Description
int	Dataset version.
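
The attributes above can be read together to log what a run consumed. A minimal sketch, with the hypothetical helpers describe and is_saved; it relies only on the documented attributes and on id being None for a dataset that is not saved to any workspace.

```python
def describe(dataset):
    """Collect the registration attributes of a dataset into a plain dict."""
    return {
        "name": dataset.name,
        "version": dataset.version,
        "id": dataset.id,
        "description": dataset.description,
    }


def is_saved(dataset):
    """Return True if the dataset has been saved to a workspace (id is not None)."""
    return dataset.id is not None
```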
