AbstractDataset Class

Base class of datasets in Azure Machine Learning.

To create instances of a dataset, use the TabularDatasetFactory class and the FileDatasetFactory class.

Class AbstractDataset constructor.

This constructor is not meant to be invoked directly. Datasets are intended to be created using the TabularDatasetFactory class and the FileDatasetFactory class.

Inheritance
builtins.object
AbstractDataset

Constructor

AbstractDataset()

Methods

add_tags

Add key value pairs to the tags dictionary of this dataset.

as_named_input

Provide a name for this dataset which will be used to retrieve the materialized dataset in the run.

get_all

Get all the registered datasets in the workspace.

get_by_id

Get a Dataset which is saved to the workspace.

get_by_name

Get a registered Dataset from workspace by its registration name.

get_partition_key_values

Return unique key values of partition_keys.

register

Register the dataset to the provided workspace.

remove_tags

Remove the specified keys from tags dictionary of this dataset.

unregister_all_versions

Unregister all versions under the registration name of this dataset from the workspace.

update

Perform an in-place update of the dataset.

add_tags

Add key value pairs to the tags dictionary of this dataset.

add_tags(tags=None)

Parameters

Name Description
tags
Required

The dictionary of tags to add.

Returns

Type Description

The updated dataset object.

as_named_input

Provide a name for this dataset which will be used to retrieve the materialized dataset in the run.

as_named_input(name)

Parameters

Name Description
name
Required
str

The name of the dataset for the run.

Returns

Type Description

The configuration object describing how the Dataset should be materialized in the run.

Remarks

The name here will only be applicable inside an Azure Machine Learning run. The name must only contain alphanumeric and underscore characters so it can be made available as an environment variable. You can use this name to retrieve the dataset in the context of a run using two approaches:

  • Environment Variable:

    The name will be the environment variable name and the materialized dataset will be made available as the value of the environment variable. If the dataset is downloaded or mounted, the value will be the downloaded/mounted path. For example:


   # in your job submission notebook/script:
   dataset.as_named_input('foo').as_download('/tmp/dataset')

   # in the script that will be executed in the run
   import os
   path = os.environ['foo'] # path will be /tmp/dataset

Note

If the dataset is set to direct mode, then the value will be the dataset ID. You can then retrieve the dataset object by calling Dataset.get_by_id(workspace, os.environ['foo']).

  • Run.input_datasets:

    This is a dictionary where the key is the dataset name you specified in this method and the value is the materialized dataset. For downloaded and mounted datasets, the value is the downloaded/mounted path. For direct mode, the value is the same dataset object you specified in your job submission script.


   # in your job submission notebook/script:
   dataset.as_named_input('foo') # direct mode

   # in the script that will be executed in the run
   from azureml.core import Run
   run = Run.get_context()
   run.input_datasets['foo'] # this returns the dataset object from above.
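
The naming rule and the environment-variable lookup described above can be sketched in plain Python. This is a minimal illustration, not SDK code: is_valid_input_name and resolve_named_input are hypothetical helpers, and the dictionary passed in stands in for the environment that a real run would provide.

```python
import os
import re

# Only ASCII letters, digits, and underscores, per the rule stated above.
_VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_input_name(name):
    """Hypothetical helper: True if the name is usable as an input name."""
    return bool(_VALID_NAME.fullmatch(name))


def resolve_named_input(name, environ=os.environ):
    """Hypothetical helper: return the value published under the input name.

    For a downloaded or mounted dataset the value would be a path; for
    direct mode it would be the dataset ID. Returns None if the name is unset.
    """
    if not is_valid_input_name(name):
        raise ValueError(f"invalid input name: {name!r}")
    return environ.get(name)


# Simulated environment of a run that downloaded the dataset to /tmp/dataset.
simulated_environ = {"foo": "/tmp/dataset"}
print(resolve_named_input("foo", simulated_environ))
```

Checking the name before submission, as in this sketch, surfaces a bad name such as 'my-data' locally rather than inside the run.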

get_all

Get all the registered datasets in the workspace.

static get_all(workspace)

Parameters

Name Description
workspace
Required

The existing AzureML workspace in which the Datasets were registered.

Returns

Type Description

A dictionary of TabularDataset and FileDataset objects keyed by their registration name.

get_by_id

Get a Dataset which is saved to the workspace.

static get_by_id(workspace, id, **kwargs)

Parameters

Name Description
workspace
Required

The existing AzureML workspace in which the Dataset is saved.

id
Required
str

The ID of the dataset.

Returns

Type Description

The dataset object. If the dataset is registered, its registration name and version will also be returned.

get_by_name

Get a registered Dataset from workspace by its registration name.

static get_by_name(workspace, name, version='latest', **kwargs)

Parameters

Name Description
workspace
Required

The existing AzureML workspace in which the Dataset was registered.

name
Required
str

The registration name.

version
Optional
int

The registration version. Defaults to 'latest'.

Returns

Type Description

The registered dataset object.
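
The version parameter accepts either a version number or the default string 'latest'. The sketch below shows one way such a lookup could be resolved. It is an illustration only: resolve_version is a hypothetical helper, and it assumes that 'latest' means the highest registered version number, which this page does not state explicitly.

```python
def resolve_version(registered_versions, version="latest"):
    """Hypothetical helper: pick a concrete version number.

    registered_versions is a collection of ints registered under one name.
    Assumption: 'latest' selects the highest version number.
    """
    if not registered_versions:
        raise LookupError("no versions are registered under this name")
    if version == "latest":
        return max(registered_versions)
    if version not in registered_versions:
        raise LookupError(f"version {version!r} is not registered")
    return version


print(resolve_version([1, 2, 3]))     # default: 'latest'
print(resolve_version([1, 2, 3], 2))  # explicit version number
```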

get_partition_key_values

Return unique key values of partition_keys.

This method validates that partition_keys is a subset of the dataset's full set of partition keys and returns the unique key values for those keys. If partition_keys is None, it returns the unique key combinations across the full set of partition keys of this dataset.


   # get all partition key value pairs
   partitions = ds.get_partition_key_values()
   # Returns [{'country': 'US', 'state': 'WA', 'partition_date': datetime(2020, 1, 1)}]

   # get the unique values of the 'country' partition key only
   partitions = ds.get_partition_key_values(['country'])
   # Returns [{'country': 'US'}]

get_partition_key_values(partition_keys=None)

Parameters

Name Description
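partition_keys
Optional

The partition keys to return unique values for. Must be a subset of the dataset's partition keys. Defaults to None, which uses the full set of partition keys.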
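
The subset validation and de-duplication described for this method can be illustrated without a workspace. The following is a plain-Python sketch, not the SDK implementation: unique_partition_values is a hypothetical function, and the records list stands in for the partition values the service would discover from the data paths.

```python
from datetime import datetime


def unique_partition_values(records, all_keys, partition_keys=None):
    """Hypothetical stand-in for the documented behavior.

    records: list of dicts, one per partition, keyed by partition key.
    all_keys: the full, ordered set of partition keys of the dataset.
    partition_keys: subset to report on; None means all keys.
    """
    if partition_keys is None:
        partition_keys = list(all_keys)
    unknown = [key for key in partition_keys if key not in all_keys]
    if unknown:
        raise ValueError(f"not partition keys of this dataset: {unknown}")

    seen = set()
    result = []
    for record in records:
        combo = tuple(record[key] for key in partition_keys)
        if combo not in seen:  # keep the first occurrence of each combination
            seen.add(combo)
            result.append(dict(zip(partition_keys, combo)))
    return result


# Illustrative sample data (assumed, not taken from a real dataset).
ALL_KEYS = ["country", "state", "partition_date"]
RECORDS = [
    {"country": "US", "state": "WA", "partition_date": datetime(2020, 1, 1)},
    {"country": "US", "state": "WA", "partition_date": datetime(2020, 1, 2)},
    {"country": "US", "state": "CA", "partition_date": datetime(2020, 1, 1)},
]

print(unique_partition_values(RECORDS, ALL_KEYS, ["country"]))
print(unique_partition_values(RECORDS, ALL_KEYS, ["country", "state"]))
```

Passing a key that is not in the full set raises an error in this sketch, which mirrors the validation step described above.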

register

Register the dataset to the provided workspace.

register(workspace, name, description=None, tags=None, create_new_version=False)

Parameters

Name Description
workspace
Required

The workspace in which to register the dataset.

name
Required
str

The name to register the dataset with.

description
Optional
str

A text description of the dataset. Defaults to None.

tags
Optional

Dictionary of key value tags to give the dataset. Defaults to None.

create_new_version
Optional

Boolean to register the dataset as a new version under the specified name. Defaults to False.

Returns

Type Description

The registered dataset object.

remove_tags

Remove the specified keys from tags dictionary of this dataset.

remove_tags(tags=None)

Parameters

Name Description
tags
Required

The list of keys to remove.

Returns

Type Description

The updated dataset object.

unregister_all_versions

Unregister all versions under the registration name of this dataset from the workspace.

unregister_all_versions()

Remarks

The operation does not change any source data.

update

Perform an in-place update of the dataset.

update(description=None, tags=None)

Parameters

Name Description
description
Optional
str

The new description to use for the dataset. This description replaces the existing description. Defaults to the existing description. To clear the description, pass an empty string.

tags
Optional

A dictionary of tags to update the dataset with. These tags replace the existing tags for the dataset. Defaults to the existing tags. To clear the tags, pass an empty dictionary.

Returns

Type Description

The updated dataset object.
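
The difference between omitting an argument and passing an empty value is easy to miss, so here is a small plain-Python model of the semantics described above. DatasetInfo and apply_update are hypothetical names used only for illustration; they are not part of the SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DatasetInfo:
    """Hypothetical stand-in for a dataset's registration metadata."""
    description: str = ""
    tags: Dict[str, str] = field(default_factory=dict)


def apply_update(info: DatasetInfo,
                 description: Optional[str] = None,
                 tags: Optional[Dict[str, str]] = None) -> DatasetInfo:
    """Model of the documented rules: None keeps, any other value replaces."""
    if description is not None:
        info.description = description  # "" clears the description
    if tags is not None:
        info.tags = dict(tags)          # {} clears the tags; no merging
    return info


info = DatasetInfo(description="raw sales data", tags={"owner": "team-a"})

# Only tags are passed: they replace the old tags, the description is kept.
apply_update(info, tags={"stage": "curated"})
print(info)

# Empty values clear both fields.
apply_update(info, description="", tags={})
print(info)
```

Because tags are replaced rather than merged in this model, keeping an existing tag means including it again in the dictionary passed to the call.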

Attributes

data_changed_time

Return the source data changed time.

Returns

Type Description

The time when the most recent change happened to source data.

Remarks

Data changed time is available for file-based data sources. None is returned when the data source does not support checking when a change happened.

description

Return the registration description.

Returns

Type Description
str

Dataset description.

id

Return the identifier of the dataset.

Returns

Type Description
str

Dataset id. If the dataset is not saved to any workspace, the id will be None.

name

Return the registration name.

Returns

Type Description
str

Dataset name.

partition_keys

Return the partition keys.

Returns

Type Description

The partition keys.

tags

Return the registration tags.

Returns

Type Description
dict

Dataset tags.

version

Return the registration version.

Returns

Type Description
int

Dataset version.