AbstractDataset Class

Base class of datasets in Azure Machine Learning.

To create dataset instances, use the TabularDatasetFactory class and the FileDatasetFactory class.

Class AbstractDataset constructor.

This constructor is not meant to be invoked directly. Create datasets with the TabularDatasetFactory class or the FileDatasetFactory class instead.

Inheritance
builtins.object
AbstractDataset

Constructor

AbstractDataset()

Methods

add_tags

Add key value pairs to the tags dictionary of this dataset.

as_named_input

Provide a name for this dataset which will be used to retrieve the materialized dataset in the run.

get_all

Get all the registered datasets in the workspace.

get_by_id

Get a Dataset which is saved to the workspace.

get_by_name

Get a registered Dataset from workspace by its registration name.

get_partition_key_values

Return unique key values of partition_keys.

Validates that partition_keys is a subset of this dataset's full set of partition keys, then returns the unique key values for partition_keys. If partition_keys is None, returns the unique key combinations across the full set of partition keys of this dataset.


   # get all partition key value pairs
   partitions = ds.get_partition_key_values()
   # Return [{'country': 'US', 'state': 'WA', 'partition_date': datetime('2020-1-1')}]

   partitions = ds.get_partition_key_values(['country'])
   # Return [{'country': 'US'}]

register

Register the dataset to the provided workspace.

remove_tags

Remove the specified keys from tags dictionary of this dataset.

unregister_all_versions

Unregister all versions under the registration name of this dataset from the workspace.

update

Perform an in-place update of the dataset.

add_tags

Add key value pairs to the tags dictionary of this dataset.

add_tags(tags=None)

Parameters

tags
dict[str, str]
Required

The dictionary of tags to add.

Returns

The updated dataset object.

Return type

Union[TabularDataset, FileDataset]

as_named_input

Provide a name for this dataset which will be used to retrieve the materialized dataset in the run.

as_named_input(name)

Parameters

name
str
Required

The name of the dataset for the run.

Returns

The configuration object describing how the Dataset should be materialized in the run.

Return type

DatasetConsumptionConfig

Remarks

The name applies only inside an Azure Machine Learning run. It must contain only alphanumeric and underscore characters so that it can be exposed as an environment variable. You can use this name to retrieve the dataset within a run in two ways:

  • Environment Variable:

    The name will be the environment variable name and the materialized dataset will be made available as the value of the environment variable. If the dataset is downloaded or mounted, the value will be the downloaded/mounted path. For example:


   # in your job submission notebook/script:
   dataset.as_named_input('foo').as_download('/tmp/dataset')

   # in the script that will be executed in the run
   import os
   path = os.environ['foo'] # path will be /tmp/dataset

Note

If the dataset is set to direct mode, the value will be the dataset ID. You can then retrieve the dataset object by calling Dataset.get_by_id(workspace, os.environ['foo']), where workspace is the workspace in which the dataset is saved.

  • Run.input_datasets:

    This is a dictionary where the key is the dataset name you specified in this method and the value is the materialized dataset. For downloaded and mounted datasets, the value is the downloaded/mounted path. For direct mode, the value is the same dataset object you specified in your job submission script.


   # in your job submission notebook/script:
   dataset.as_named_input('foo') # direct mode

   # in the script that will be executed in the run
   from azureml.core import Run
   run = Run.get_context()
   run.input_datasets['foo'] # this returns the dataset object from above.
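
As a minimal sketch of the naming rule and the environment variable lookup described in these remarks, the following uses only the standard library. The helper names is_valid_input_name and resolve_named_input and the fake_env mapping are hypothetical and are not part of the Azure Machine Learning SDK. The regular expression assumes that "alphanumeric" means ASCII letters and digits.

```python
import os
import re

# Assumption: "alphanumeric and underscore" is modeled as ASCII letters, digits, and "_".
_VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_input_name(name: str) -> bool:
    """Return True if `name` uses only alphanumeric and underscore characters."""
    return bool(_VALID_NAME.fullmatch(name))


def resolve_named_input(name: str, environ=None) -> str:
    """Look up the value exposed for a named input (hypothetical helper).

    For a downloaded or mounted dataset the value is a path;
    in direct mode it is the dataset ID.
    """
    if not is_valid_input_name(name):
        raise ValueError(f"invalid input name: {name!r}")
    environ = os.environ if environ is None else environ
    return environ[name]


# Stand-in for the environment a run would see after
# dataset.as_named_input('foo').as_download('/tmp/dataset').
fake_env = {"foo": "/tmp/dataset"}
path = resolve_named_input("foo", fake_env)
```

Passing the environment as an argument keeps the lookup testable outside a run. Inside a real run you would read os.environ directly, as in the examples above.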

get_all

Get all the registered datasets in the workspace.

static get_all(workspace)

Parameters

workspace
Workspace
Required

The existing AzureML workspace in which the Datasets were registered.

Returns

A dictionary of TabularDataset and FileDataset objects keyed by their registration name.

Return type

dict[str, Union[TabularDataset, FileDataset]]

get_by_id

Get a Dataset which is saved to the workspace.

static get_by_id(workspace, id, **kwargs)

Parameters

workspace
Workspace
Required

The existing AzureML workspace in which the Dataset is saved.

id
str
Required

The ID of the dataset.

Returns

The dataset object. If the dataset is registered, its registration name and version will also be returned.

Return type

Union[TabularDataset, FileDataset]

get_by_name

Get a registered Dataset from workspace by its registration name.

static get_by_name(workspace, name, version='latest', **kwargs)

Parameters

workspace
Workspace
Required

The existing AzureML workspace in which the Dataset was registered.

name
str
Required

The registration name.

version
int
Optional

The registration version. Defaults to 'latest'.

Returns

The registered dataset object.

Return type

Union[TabularDataset, FileDataset]
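
To illustrate how a version argument that defaults to 'latest' could be resolved, here is a small sketch over a hypothetical in-memory registry. The Registry type and the resolve_version function are illustrative stand-ins, not SDK APIs, and the sketch assumes that 'latest' means the highest registered version number.

```python
from typing import Any, Dict, Union

# Hypothetical in-memory registry: registration name -> {version number: dataset}.
Registry = Dict[str, Dict[int, Any]]


def resolve_version(registry: Registry, name: str,
                    version: Union[int, str] = "latest") -> Any:
    """Return the dataset registered under `name` at the requested version."""
    versions = registry[name]  # raises KeyError if the name was never registered
    if version == "latest":
        # Assumption: the latest version is the one with the highest number.
        return versions[max(versions)]
    return versions[int(version)]


registry = {"sales": {1: "sales-v1", 2: "sales-v2"}}
latest = resolve_version(registry, "sales")
first = resolve_version(registry, "sales", version=1)
```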

get_partition_key_values

Return unique key values of partition_keys.

Validates that partition_keys is a subset of this dataset's full set of partition keys, then returns the unique key values for partition_keys. If partition_keys is None, returns the unique key combinations across the full set of partition keys of this dataset.


   # get all partition key value pairs
   partitions = ds.get_partition_key_values()
   # Return [{'country': 'US', 'state': 'WA', 'partition_date': datetime('2020-1-1')}]

   partitions = ds.get_partition_key_values(['country'])
   # Return [{'country': 'US'}]

get_partition_key_values(partition_keys=None)

Parameters

partition_keys
list[str]
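Optional

The partition keys to return unique values for. Defaults to None, which uses the full set of partition keys of this dataset.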
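
The validate-then-deduplicate behavior described for this method can be sketched in plain Python. The function unique_partition_key_values and its inputs are hypothetical and model partitions as a list of dictionaries. This is not the SDK implementation, and raising ValueError for unknown keys is an assumption.

```python
from typing import Any, Dict, List, Optional


def unique_partition_key_values(
    partitions: List[Dict[str, Any]],
    all_keys: List[str],
    partition_keys: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Return the unique value combinations for the requested partition keys."""
    # Default to the full set of partition keys when none are given.
    keys = all_keys if partition_keys is None else partition_keys
    # Validate that the requested keys are a subset of the full set.
    unknown = [k for k in keys if k not in all_keys]
    if unknown:
        raise ValueError(f"not partition keys of this dataset: {unknown}")
    seen = set()
    result = []
    for partition in partitions:
        combo = tuple(partition[k] for k in keys)
        if combo not in seen:  # keep only the first occurrence of each combination
            seen.add(combo)
            result.append(dict(zip(keys, combo)))
    return result


all_keys = ["country", "state"]
partitions = [
    {"country": "US", "state": "WA"},
    {"country": "US", "state": "CA"},
    {"country": "US", "state": "WA"},
]
by_country = unique_partition_key_values(partitions, all_keys, ["country"])
full = unique_partition_key_values(partitions, all_keys)
```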

register

Register the dataset to the provided workspace.

register(workspace, name, description=None, tags=None, create_new_version=False)

Parameters

workspace
Workspace
Required

The workspace in which to register the dataset.

name
str
Required

The name to register the dataset with.

description
str
Optional

A text description of the dataset. Defaults to None.

tags
dict[str, str]
Optional

Dictionary of key value tags to give the dataset. Defaults to None.

create_new_version
bool
Optional

Whether to register the dataset as a new version under the specified name. Defaults to False.

Returns

The registered dataset object.

Return type

Union[TabularDataset, FileDataset]

remove_tags

Remove the specified keys from tags dictionary of this dataset.

remove_tags(tags=None)

Parameters

tags
list[str]
Required

The list of keys to remove.

Returns

The updated dataset object.

Return type

Union[TabularDataset, FileDataset]

unregister_all_versions

Unregister all versions under the registration name of this dataset from the workspace.

unregister_all_versions()

Remarks

The operation does not change any source data.

update

Perform an in-place update of the dataset.

update(description=None, tags=None)

Parameters

description
str
Optional

The new description to use for the dataset. This description replaces the existing description. Defaults to the existing description. To clear the description, pass an empty string.

tags
dict[str, str]
Optional

A dictionary of tags to update the dataset with. These tags replace the existing tags for the dataset. Defaults to the existing tags. To clear the tags, pass an empty dictionary.

Returns

The updated dataset object.

Return type

Union[TabularDataset, FileDataset]
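
The parameter descriptions for update distinguish between omitting a value (keep what exists) and passing an empty value (clear it). A minimal sketch of that rule follows. The apply_update function is hypothetical and operates on plain values rather than a dataset object.

```python
from typing import Dict, Optional, Tuple


def apply_update(
    current_description: str,
    current_tags: Dict[str, str],
    description: Optional[str] = None,
    tags: Optional[Dict[str, str]] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return the description and tags that would result from an update."""
    # None means "keep the existing value"; an empty string or dict clears it.
    new_description = current_description if description is None else description
    # Tags are replaced as a whole, not merged with the existing tags.
    new_tags = dict(current_tags) if tags is None else dict(tags)
    return new_description, new_tags


unchanged = apply_update("old", {"a": "1"})
cleared_description = apply_update("old", {"a": "1"}, description="")
cleared_tags = apply_update("old", {"a": "1"}, tags={})
replaced_tags = apply_update("old", {"a": "1"}, tags={"b": "2"})
```

Testing against None rather than truthiness is what lets an empty string or an empty dictionary act as an explicit "clear" request.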

Attributes

data_changed_time

Return the source data changed time.

Returns

The time when the most recent change happened to source data.

Return type

datetime

Remarks

Data changed time is available for file-based data sources. None is returned when the data source does not support checking when a change happened.

description

Return the registration description.

Returns

Dataset description.

Return type

str

id

Return the identifier of the dataset.

Returns

Dataset id. If the dataset is not saved to any workspace, the id will be None.

Return type

str

name

Return the registration name.

Returns

Dataset name.

Return type

str

partition_keys

Return the partition keys.

Returns

The partition keys.

Return type

list[str]

tags

Return the registration tags.

Returns

Dataset tags.

Return type

dict[str, str]

version

Return the registration version.

Returns

Dataset version.

Return type

int