AbstractDataset Class
Base class of datasets in Azure Machine Learning.
To create dataset instances, see the TabularDatasetFactory class and the FileDatasetFactory class.
Do not invoke the AbstractDataset constructor directly. Create datasets with the TabularDatasetFactory class or the FileDatasetFactory class instead.
Inheritance
builtins.object → AbstractDataset
Constructor
AbstractDataset()
Methods
Method | Description |
---|---|
add_tags | Add key-value pairs to the tags dictionary of this dataset. |
as_named_input | Provide a name for this dataset which will be used to retrieve the materialized dataset in the run. |
get_all | Get all the registered datasets in the workspace. |
get_by_id | Get a Dataset which is saved to the workspace. |
get_by_name | Get a registered Dataset from the workspace by its registration name. |
get_partition_key_values | Return the unique key values of partition_keys. If partition_keys is None, return the unique key combinations for the full set of partition keys of this dataset. |
register | Register the dataset to the provided workspace. |
remove_tags | Remove the specified keys from the tags dictionary of this dataset. |
unregister_all_versions | Unregister all versions under the registration name of this dataset from the workspace. |
update | Perform an in-place update of the dataset. |
add_tags
Add key value pairs to the tags dictionary of this dataset.
add_tags(tags=None)
Parameters
Name | Description |
---|---|
tags
Required
|
The dictionary of tags to add. |
Returns
Type | Description |
---|---|
The updated dataset object. |
as_named_input
Provide a name for this dataset which will be used to retrieve the materialized dataset in the run.
as_named_input(name)
Parameters
Name | Description |
---|---|
name
Required
|
The name of the dataset for the run. |
Returns
Type | Description |
---|---|
The configuration object describing how the Dataset should be materialized in the run. |
Remarks
The name applies only inside an Azure Machine Learning run. It must contain only alphanumeric and underscore characters so that it can be made available as an environment variable. You can use this name to retrieve the dataset in the context of a run in two ways:
Environment Variable:
The name will be the environment variable name and the materialized dataset will be made available as the value of the environment variable. If the dataset is downloaded or mounted, the value will be the downloaded/mounted path. For example:
# in your job submission notebook/script:
dataset.as_named_input('foo').as_download('/tmp/dataset')
# in the script that will be executed in the run
import os
path = os.environ['foo'] # path will be /tmp/dataset
Note
If the dataset is set to direct mode, then the value will be the dataset ID. You can then
retrieve the dataset object by calling Dataset.get_by_id(workspace, os.environ['foo']), where workspace is the workspace in which the dataset is saved.
Run.input_datasets:
This is a dictionary where the key is the dataset name you specified in this method and the value is the materialized dataset. For downloaded and mounted datasets, the value is the downloaded or mounted path. For direct mode, the value is the same dataset object you specified in your job submission script.
# in your job submission notebook/script:
dataset.as_named_input('foo') # direct mode
# in the script that will be executed in the run
from azureml.core import Run
run = Run.get_context()
run.input_datasets['foo'] # this returns the dataset object from above.
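The naming rule in the remarks above (alphanumeric and underscore characters only) can be checked before a job is submitted. The following is a minimal sketch using only the standard library. `is_valid_input_name` is a hypothetical helper, not part of the Azure Machine Learning SDK, and it assumes ASCII letters and digits; the service may apply additional rules not covered here. The environment variable is set by hand only to simulate what a run would provide.

```python
import os
import re

# Hypothetical helper (not part of the SDK): accepts only names made of
# ASCII letters, digits, and underscores, as the remarks above require.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_input_name(name):
    """Return True if the name could be used as an environment variable name."""
    return bool(_NAME_PATTERN.fullmatch(name))


print(is_valid_input_name("foo"))       # valid name
print(is_valid_input_name("my-input"))  # invalid: contains a hyphen

# Simulate the environment of a run. In a real run the service sets this
# variable and the script only reads it.
os.environ["foo"] = "/tmp/dataset"
path = os.environ["foo"]
```

Checking the name locally surfaces a bad name before submission rather than inside the run.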
get_all
Get all the registered datasets in the workspace.
static get_all(workspace)
Parameters
Name | Description |
---|---|
workspace
Required
|
The existing AzureML workspace in which the Datasets were registered. |
Returns
Type | Description |
---|---|
A dictionary of TabularDataset and FileDataset objects keyed by their registration name. |
get_by_id
Get a Dataset which is saved to the workspace.
static get_by_id(workspace, id, **kwargs)
Parameters
Name | Description |
---|---|
workspace
Required
|
The existing AzureML workspace in which the Dataset is saved. |
id
Required
|
The ID of the dataset. |
Returns
Type | Description |
---|---|
The dataset object. If the dataset is registered, its registration name and version will also be returned. |
get_by_name
Get a registered Dataset from workspace by its registration name.
static get_by_name(workspace, name, version='latest', **kwargs)
Parameters
Name | Description |
---|---|
workspace
Required
|
The existing AzureML workspace in which the Dataset was registered. |
name
Required
|
The registration name. |
version
Optional
|
The registration version. Defaults to 'latest'. |
Returns
Type | Description |
---|---|
The registered dataset object. |
get_partition_key_values
Return the unique key values of partition_keys.
Validates that partition_keys is a subset of the full set of partition keys of this dataset, then returns the unique key values for those keys. If partition_keys is None, returns the unique key combinations for the full set of partition keys.
# get all partition key value pairs
partitions = ds.get_partition_key_values()
# Return [{'country': 'US', 'state': 'WA', 'partition_date': datetime('2020-1-1')}]
partitions = ds.get_partition_key_values(['country'])
# Return [{'country': 'US'}]
get_partition_key_values(partition_keys=None)
Parameters
Name | Description |
---|---|
partition_keys
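Optional
|
The partition keys to return unique values for. Defaults to None, which uses the full set of partition keys of this dataset. |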
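The behavior described for get_partition_key_values (validate the requested keys, then return the unique combinations) can be modeled on plain dictionaries. This is a sketch under stated assumptions, not the SDK implementation: `unique_partition_key_values`, its arguments, and the choice of ValueError for an invalid key are all hypothetical and exist only for illustration.

```python
def unique_partition_key_values(partitions, all_keys, partition_keys=None):
    """Model of the described behavior on plain dictionaries.

    partitions: list of dicts, one per partition, each containing every key
    in all_keys. Values must be hashable.
    """
    # Default to the full set of partition keys when none are given.
    if partition_keys is None:
        partition_keys = list(all_keys)

    # Validate that the requested keys are a subset of the full set.
    unknown = [key for key in partition_keys if key not in all_keys]
    if unknown:
        raise ValueError(f"Not partition keys of this dataset: {unknown}")

    # Collect unique key combinations, preserving first-seen order.
    seen = set()
    result = []
    for partition in partitions:
        combo = tuple(partition[key] for key in partition_keys)
        if combo not in seen:
            seen.add(combo)
            result.append(dict(zip(partition_keys, combo)))
    return result


all_keys = ["country", "state"]
partitions = [
    {"country": "US", "state": "WA"},
    {"country": "US", "state": "CA"},
]
print(unique_partition_key_values(partitions, all_keys, ["country"]))
print(unique_partition_key_values(partitions, all_keys))
```

Requesting only `country` collapses both partitions into a single entry, while omitting partition_keys keeps one entry per distinct country and state pair.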
register
Register the dataset to the provided workspace.
register(workspace, name, description=None, tags=None, create_new_version=False)
Parameters
Name | Description |
---|---|
workspace
Required
|
The workspace in which to register the dataset. |
name
Required
|
The name to register the dataset with. |
description
Optional
|
A text description of the dataset. Defaults to None. |
tags
Optional
|
Dictionary of key-value tags to give the dataset. Defaults to None. |
create_new_version
Optional
|
Boolean to register the dataset as a new version under the specified name. Defaults to False. |
Returns
Type | Description |
---|---|
The registered dataset object. |
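As an illustration of the create_new_version parameter, the following sketch wraps the register call documented above. `register_new_version` is a hypothetical convenience function, not part of the SDK; it assumes only that `dataset` exposes register() with the signature shown in this section.

```python
def register_new_version(dataset, workspace, name, description=None, tags=None):
    """Register `dataset` under `name`, always requesting a new version.

    Hypothetical wrapper: `dataset` is any object exposing the register()
    method documented above, and `workspace` is passed through unchanged.
    """
    return dataset.register(
        workspace=workspace,
        name=name,
        description=description,
        tags=tags,
        create_new_version=True,
    )
```

Passing the arguments by keyword keeps the call readable and avoids depending on parameter order.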
remove_tags
Remove the specified keys from tags dictionary of this dataset.
remove_tags(tags=None)
Parameters
Name | Description |
---|---|
tags
Required
|
The list of keys to remove. |
Returns
Type | Description |
---|---|
The updated dataset object. |
unregister_all_versions
Unregister all versions under the registration name of this dataset from the workspace.
unregister_all_versions()
Remarks
The operation does not change any source data.
update
Perform an in-place update of the dataset.
update(description=None, tags=None)
Parameters
Name | Description |
---|---|
description
Optional
|
The new description to use for the dataset. This description replaces the existing description. Defaults to the existing description. To clear the description, pass an empty string. |
tags
Optional
|
A dictionary of tags to update the dataset with. These tags replace the existing tags for the dataset. Defaults to the existing tags. To clear the tags, pass an empty dictionary. |
Returns
Type | Description |
---|---|
The updated dataset object. |
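The difference between omitting an argument and passing an empty value can be modeled in a few lines. This is a sketch of the semantics stated in the parameter descriptions above, not the SDK implementation; `resolve_update` is a hypothetical function operating on plain Python values.

```python
def resolve_update(current_description, current_tags, description=None, tags=None):
    """Return the (description, tags) pair that would result from update().

    None means "keep the existing value"; an empty string or an empty
    dictionary clears the value; anything else replaces it.
    """
    new_description = current_description if description is None else description
    new_tags = dict(current_tags) if tags is None else dict(tags)
    return new_description, new_tags


# Omitting tags keeps them; passing {} clears them.
kept = resolve_update("raw data", {"env": "dev"}, description="cleaned data")
cleared = resolve_update("raw data", {"env": "dev"}, tags={})
print(kept)
print(cleared)
```

Because the new tags replace the existing ones rather than merging with them, add_tags is the method to use when existing tags should be preserved.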
Attributes
data_changed_time
Return the source data changed time.
Returns
Type | Description |
---|---|
The time when the most recent change happened to source data. |
Remarks
Data changed time is available for file-based data sources. None is returned when the data source does not support checking when a change happened.
description
id
Return the identifier of the dataset.
Returns
Type | Description |
---|---|
Dataset id. If the dataset is not saved to any workspace, the id will be None. |