Dataset Class
Represents a resource for exploring, transforming, and managing data in Azure Machine Learning.
A Dataset is a reference to data in a Datastore or behind public web URLs.
For methods deprecated in this class, please check the AbstractDataset class for the improved APIs.
The following Datasets types are supported:
TabularDataset represents data in a tabular format created by parsing the provided file or list of files.
FileDataset references single or multiple files in datastores or from public URLs.
To get started with datasets, see the article Add & register datasets, or see the notebooks https://aka.ms/tabulardataset-samplenotebook and https://aka.ms/filedataset-samplenotebook.
Initialize the Dataset object.
To obtain a Dataset that has already been registered with the workspace, use the get method.
- Inheritance
-
builtins.object
Dataset
Constructor
Dataset(definition, workspace=None, name=None, id=None)
Parameters
- definition
- <xref:azureml.data.DatasetDefinition>
The Dataset definition.
Remarks
The Dataset class exposes two convenience class attributes (File and Tabular) that you can use to create a Dataset without working directly with the corresponding factory methods. For example, to create a dataset using these attributes:
Dataset.Tabular.from_delimited_files()
Dataset.File.from_files()
You can also create a new TabularDataset or FileDataset by directly calling the corresponding factory methods of the class defined in TabularDatasetFactory and FileDatasetFactory.
The following example shows how to create a TabularDataset pointing to a single path in a datastore.
from azureml.core import Dataset
dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train-dataset/tabular/iris.csv')])
# preview the first 3 rows of the dataset
dataset.take(3).to_pandas_dataframe()
Full sample is available from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datasets-tutorial/train-with-datasets/train-with-datasets.ipynb
Variables
- azureml.core.Dataset.File
A class attribute that provides access to the FileDatasetFactory methods for creating new FileDataset objects. Usage: Dataset.File.from_files().
- azureml.core.Dataset.Tabular
A class attribute that provides access to the TabularDatasetFactory methods for creating new TabularDataset objects. Usage: Dataset.Tabular.from_delimited_files().
Methods
archive | Archive an active or deprecated dataset. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
auto_read_files | Analyzes the file(s) at the specified path and returns a new Dataset. Note: this method is deprecated and will no longer be supported; use the Dataset.Tabular.from_* methods to read files instead. For more information, see https://aka.ms/dataset-deprecation.
compare_profiles | Compare the current Dataset's profile with another dataset profile. This shows the differences in summary statistics between two datasets. The parameter 'rhs_dataset' stands for "right-hand side" and is simply the second dataset; the first dataset (the current dataset object) is considered the "left-hand side". Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
create_snapshot | Create a snapshot of the registered Dataset. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
delete_snapshot | Delete a snapshot of the Dataset by name. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
deprecate | Deprecate an active dataset in a workspace by another dataset. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
diff | Diff the current Dataset with rhs_dataset. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
from_binary_files | Create an unregistered, in-memory Dataset from binary files. Note: this method is deprecated and will no longer be supported; use Dataset.File.from_files instead. For more information, see https://aka.ms/dataset-deprecation.
from_delimited_files | Create an unregistered, in-memory Dataset from delimited files. Note: this method is deprecated and will no longer be supported; use Dataset.Tabular.from_delimited_files instead. For more information, see https://aka.ms/dataset-deprecation.
from_excel_files | Create an unregistered, in-memory Dataset from Excel files. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
from_json_files | Create an unregistered, in-memory Dataset from JSON files. Note: this method is deprecated and will no longer be supported; use Dataset.Tabular.from_json_lines_files instead to read from JSON lines files. For more information, see https://aka.ms/dataset-deprecation.
from_pandas_dataframe | Create an unregistered, in-memory Dataset from a pandas DataFrame. Note: this method is deprecated and will no longer be supported; use Dataset.Tabular.register_pandas_dataframe instead. For more information, see https://aka.ms/dataset-deprecation.
from_parquet_files | Create an unregistered, in-memory Dataset from Parquet files. Note: this method is deprecated and will no longer be supported; use Dataset.Tabular.from_parquet_files instead. For more information, see https://aka.ms/dataset-deprecation.
from_sql_query | Create an unregistered, in-memory Dataset from a SQL query. Note: this method is deprecated and will no longer be supported; use Dataset.Tabular.from_sql_query instead. For more information, see https://aka.ms/dataset-deprecation.
generate_profile | Generate a new profile for the Dataset. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
get | Get a Dataset that already exists in the workspace by specifying either its name or ID. Note: this method is deprecated and will no longer be supported; use get_by_name and get_by_id instead. For more information, see https://aka.ms/dataset-deprecation.
get_all | Get all the registered datasets in the workspace.
get_all_snapshots | Get all snapshots of the Dataset. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
get_by_id | Get a Dataset which is saved to the workspace.
get_by_name | Get a registered Dataset from the workspace by its registration name.
get_definition | Get a specific definition of the Dataset. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
get_definitions | Get all the definitions of the Dataset. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
get_profile | Get summary statistics on the Dataset computed earlier. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
get_snapshot | Get a snapshot of the Dataset by name. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
head | Pull the specified number of records from this Dataset and return them as a DataFrame. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
list | List all the Datasets in the workspace, including ones with the is_visible property equal to False. Note: this method is deprecated and will no longer be supported; use get_all instead. For more information, see https://aka.ms/dataset-deprecation.
reactivate | Reactivate an archived or deprecated dataset. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
register | Register the Dataset in the workspace, making it available to other users of the workspace. Note: this method is deprecated and will no longer be supported; use the register method of AbstractDataset instead. For more information, see https://aka.ms/dataset-deprecation.
sample | Generate a new sample from the source Dataset, using the sampling strategy and parameters provided. Note: this method is deprecated and will no longer be supported; create a TabularDataset by calling the static methods on Dataset.Tabular and use its take_sample method instead. For more information, see https://aka.ms/dataset-deprecation.
to_pandas_dataframe | Create a pandas DataFrame by executing the transformation pipeline defined by this Dataset definition. Note: this method is deprecated and will no longer be supported; create a TabularDataset by calling the static methods on Dataset.Tabular and use its to_pandas_dataframe method instead. For more information, see https://aka.ms/dataset-deprecation.
to_spark_dataframe | Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataset definition. Note: this method is deprecated and will no longer be supported; create a TabularDataset by calling the static methods on Dataset.Tabular and use its to_spark_dataframe method instead. For more information, see https://aka.ms/dataset-deprecation.
update | Update the Dataset mutable attributes in the workspace and return the updated Dataset from the workspace. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
update_definition | Update the Dataset definition. Note: this method is deprecated and will no longer be supported. For more information, see https://aka.ms/dataset-deprecation.
archive
Archive an active or deprecated dataset.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
archive()
Returns
None.
Return type
Remarks
After archival, any attempt to consume the Dataset will result in an error. If archived by accident, reactivate will activate it.
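The archive/reactivate lifecycle described above can be sketched as a small state machine. This is a pure-Python illustration with no azureml dependency; the class and state names are hypothetical stand-ins, not SDK types:

```python
from enum import Enum

class DatasetState(Enum):
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    ARCHIVED = "archived"

class RegisteredDataset:
    """Minimal stand-in tracking the archive/reactivate lifecycle."""
    def __init__(self):
        self.state = DatasetState.ACTIVE

    def archive(self):
        self.state = DatasetState.ARCHIVED

    def reactivate(self):
        self.state = DatasetState.ACTIVE

    def consume(self):
        # Archived datasets raise on consumption; deprecated ones
        # merely log warnings in the real SDK.
        if self.state is DatasetState.ARCHIVED:
            raise RuntimeError("Dataset is archived and cannot be consumed")
        return "data"

ds = RegisteredDataset()
ds.archive()
try:
    ds.consume()
except RuntimeError as e:
    print(e)
ds.reactivate()
print(ds.consume())  # data
```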
auto_read_files
Analyzes the file(s) at the specified path and returns a new Dataset.
Note
This method is deprecated and will no longer be supported.
Use the Dataset.Tabular.from_* methods to read files instead. For more information, see https://aka.ms/dataset-deprecation.
static auto_read_files(path, include_path=False, partition_format=None)
Parameters
- path
- DataReference or str
A data path in a registered datastore, a local path, or an HTTP URL (CSV/TSV).
- include_path
- bool
Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files and want to know which file a particular record originated from, or when there is information in the file path or name that you want in a column.
- partition_format
- str
Specify the partition format in the path and create string columns from format '{x}' and a datetime column from format '{x:yyyy/MM/dd/HH/mm/ss}', where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and continue to the end of the file path. For example, given a file path '../Accounts/2019/01/01/data.csv' where data is partitioned by department name and time, we can define '/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' to create a column 'Department' of string type and 'PartitionDate' of datetime type.
Returns
Dataset object.
Return type
Remarks
Use this method when you want file formats and delimiters to be detected automatically.
After creating a Dataset, you should use get_profile to list detected column types and summary statistics for each column.
The returned Dataset is not registered with the workspace.
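The partition_format behavior works the same way in the non-deprecated Tabular factory methods. As a rough illustration of how a format like '/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' maps path segments to columns, here is a hand-translated regex over a hypothetical path (this is not the SDK's actual parser):

```python
import re
from datetime import datetime

# Hypothetical partitioned path, following the example in the docs above.
path = "somecontainer/Accounts/2019/01/01/data.csv"

# partition_format '/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv',
# translated by hand into a regex for illustration:
pattern = r"/(?P<Department>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/data\.csv$"

m = re.search(pattern, path)
department = m.group("Department")
partition_date = datetime(int(m.group("year")), int(m.group("month")), int(m.group("day")))

print(department)      # Accounts
print(partition_date)  # 2019-01-01 00:00:00
```

The string token '{Department}' becomes a string column, and the datetime token '{PartitionDate:yyyy/MM/dd}' is assembled from the year/month/day path segments into a datetime column.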
compare_profiles
Compare the current Dataset's profile with another dataset profile.
This shows the differences in summary statistics between two datasets. The parameter 'rhs_dataset' stands for "right-hand side", and is simply the second dataset. The first dataset (the current dataset object) is considered the "left-hand side".
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
compare_profiles(rhs_dataset, profile_arguments={}, include_columns=None, exclude_columns=None, histogram_compare_method=HistogramCompareMethod.WASSERSTEIN)
Parameters
- rhs_dataset
- Dataset
A second Dataset, also called the "right-hand side" Dataset, for comparison.
- histogram_compare_method
- HistogramCompareMethod
Enum describing the comparison method, for example Wasserstein or Energy.
Returns
Difference between the two dataset profiles.
Return type
Remarks
This is for registered Datasets only. Raises an exception if the current Dataset's profile does not exist. For unregistered Datasets, use the profile.compare method.
create_snapshot
Create a snapshot of the registered Dataset.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
create_snapshot(snapshot_name, compute_target=None, create_data_snapshot=False, target_datastore=None)
Parameters
- compute_target
- Union[ComputeTarget, str]
Optional compute target to perform the snapshot profile creation. If omitted, the local compute is used.
- target_datastore
- Union[AbstractAzureStorageDatastore, str]
Target datastore to save snapshot. If omitted, the snapshot will be created in the default storage of the workspace.
Returns
Dataset snapshot object.
Return type
Remarks
Snapshots capture point-in-time summary statistics of the underlying data and an optional copy of the data itself. To learn more about creating snapshots, go to https://aka.ms/azureml/howto/createsnapshots.
delete_snapshot
Delete snapshot of the Dataset by name.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
delete_snapshot(snapshot_name)
Parameters
Returns
None.
Return type
Remarks
Use this to free up storage consumed by data saved in snapshots that you no longer need.
deprecate
Deprecate an active dataset in a workspace by another dataset.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
deprecate(deprecate_by_dataset_id)
Parameters
- deprecate_by_dataset_id
- str
The Dataset ID which is the intended replacement for this Dataset.
Returns
None.
Return type
Remarks
Deprecated Datasets will log warnings when they are consumed. Deprecating a dataset deprecates all its definitions.
Deprecated Datasets can still be consumed. To completely block a Dataset from being consumed, archive it.
If deprecated by accident, reactivate will activate it.
diff
Diff the current Dataset with rhs_dataset.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
diff(rhs_dataset, compute_target=None, columns=None)
Parameters
- compute_target
- Union[ComputeTarget, str]
The compute target to perform the diff. If omitted, the local compute is used.
Returns
Dataset action run object.
Return type
from_binary_files
Create an unregistered, in-memory Dataset from binary files.
Note
This method is deprecated and will no longer be supported.
Use Dataset.File.from_files instead. For more information, see https://aka.ms/dataset-deprecation.
static from_binary_files(path)
Parameters
Returns
The Dataset object.
Return type
Remarks
Use this method to read files as streams of binary data. Returns one file stream object per file read. Use this method when you're reading images, videos, audio or other binary data.
get_profile and create_snapshot will not work as expected for a Dataset created by this method.
The returned Dataset is not registered with the workspace.
from_delimited_files
Create an unregistered, in-memory Dataset from delimited files.
Note
This method is deprecated and will no longer be supported.
Use Dataset.Tabular.from_delimited_files instead. For more information, see https://aka.ms/dataset-deprecation.
# Create a dataset from delimited files with header option as ALL_FILES_HAVE_SAME_HEADERS
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'data/crime-spring.csv'),
header='ALL_FILES_HAVE_SAME_HEADERS')
df = dataset.to_pandas_dataframe()
static from_delimited_files(path, separator=',', header=PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS, encoding=FileEncoding.UTF8, quoting=False, infer_column_types=True, skip_rows=0, skip_mode=SkipLinesBehavior.NO_ROWS, comment=None, include_path=False, archive_options=None, partition_format=None)
Parameters
- path
- DataReference or str
A data path in a registered datastore, a local path, or an HTTP URL.
- header
- PromoteHeadersBehavior
Controls how column headers are promoted when reading from files.
- quoting
- bool
Specify how to handle new line characters within quotes. The default (False) is to interpret new line characters as starting new rows, irrespective of whether the new line characters are within quotes or not. If set to True, new line characters inside quotes will not result in new rows, and file reading speed will slow down.
- comment
- str
Character used to indicate comment lines in the files being read. Lines beginning with this string will be skipped.
- include_path
- bool
Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files and want to know which file a particular record originated from, or when you want to keep useful information from the file path.
- archive_options
- <xref:azureml.dataprep.ArchiveOptions>
Options for archive files, including the archive type and entry glob pattern. Only ZIP is supported as an archive type at the moment. For example, specifying
archive_options = ArchiveOptions(archive_type = ArchiveType.ZIP, entry_glob = '*10-20.csv')
reads all files with names ending in "10-20.csv" inside the ZIP archive.
- partition_format
- str
Specify the partition format in the path and create string columns from format '{x}' and a datetime column from format '{x:yyyy/MM/dd/HH/mm/ss}', where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and continue to the end of the file path. For example, given a file path '../Accounts/2019/01/01/data.csv' where data is partitioned by department name and time, we can define '/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' to create a column 'Department' of string type and 'PartitionDate' of datetime type.
Returns
Dataset object.
Return type
Remarks
Use this method to read delimited text files when you want to control the options used.
After creating a Dataset, you should use get_profile to list detected column types and summary statistics for each column.
The returned Dataset is not registered with the workspace.
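To see why the quoting option matters, consider a delimited file with a newline inside a quoted field. Python's standard csv module is always quote-aware, which corresponds to the quoting=True behavior described above:

```python
import csv
import io

# A quoted field containing an embedded newline character.
raw = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

rows = list(csv.reader(io.StringIO(raw)))

# With quote-aware parsing (the equivalent of quoting=True), the embedded
# newline stays inside a single field instead of starting a new row.
print(len(rows))     # 3: header + two records
print(rows[1][1])    # the two-line comment, still one field
```

With quoting=False (the default above), each newline would start a new row regardless of quotes, which is faster but would split record 1 in two.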
from_excel_files
Create an unregistered, in-memory Dataset from Excel files.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
static from_excel_files(path, sheet_name=None, use_column_headers=False, skip_rows=0, include_path=False, infer_column_types=True, partition_format=None)
Parameters
- sheet_name
- str
The name of the Excel sheet to load. By default we read the first sheet from each Excel file.
- include_path
- bool
Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files and want to know which file a particular record originated from, or when you want to keep useful information from the file path.
- partition_format
- str
Specify the partition format in the path and create string columns from format '{x}' and a datetime column from format '{x:yyyy/MM/dd/HH/mm/ss}', where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and continue to the end of the file path. For example, given a file path '../Accounts/2019/01/01/data.xlsx' where data is partitioned by department name and time, we can define '/{Department}/{PartitionDate:yyyy/MM/dd}/data.xlsx' to create a column 'Department' of string type and 'PartitionDate' of datetime type.
Returns
Dataset object.
Return type
Remarks
Use this method to read Excel files in .xlsx format. Data can be read from one sheet in each Excel file. After creating a Dataset, you should use get_profile to list detected column types and summary statistics for each column. The returned Dataset is not registered with the workspace.
from_json_files
Create an unregistered, in-memory Dataset from JSON files.
Note
This method is deprecated and will no longer be supported.
Use Dataset.Tabular.from_json_lines_files instead to read from JSON lines files. For more information, see https://aka.ms/dataset-deprecation.
static from_json_files(path, encoding=FileEncoding.UTF8, flatten_nested_arrays=False, include_path=False, partition_format=None)
Parameters
- path
- DataReference or str
The path to the file(s) or folder(s) that you want to load and parse. It can be either a local path or an Azure Blob URL. Globbing is supported. For example, you can use path = "./data*" to read all files with names starting with "data".
- flatten_nested_arrays
- bool
Property controlling program's handling of nested arrays. If you choose to flatten nested JSON arrays, it could result in a much larger number of rows.
- include_path
- bool
Whether to include a column containing the path from which the data was read. This is useful when you are reading multiple files, and might want to know which file a particular record originated from, or to keep useful information in file path.
- partition_format
- str
Specify the partition format in the path and create string columns from format '{x}' and a datetime column from format '{x:yyyy/MM/dd/HH/mm/ss}', where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and continue to the end of the file path. For example, given a file path '../Accounts/2019/01/01/data.json' where data is partitioned by department name and time, we can define '/{Department}/{PartitionDate:yyyy/MM/dd}/data.json' to create a column 'Department' of string type and 'PartitionDate' of datetime type.
Returns
The local Dataset object.
Return type
from_pandas_dataframe
Create an unregistered, in-memory Dataset from a pandas dataframe.
Note
This method is deprecated and will no longer be supported.
Use Dataset.Tabular.register_pandas_dataframe instead. For more information, see https://aka.ms/dataset-deprecation.
static from_pandas_dataframe(dataframe, path=None, in_memory=False)
Parameters
Returns
A Dataset object.
Return type
Remarks
Use this method to convert a pandas DataFrame to a Dataset object. A Dataset created by this method cannot be registered, as the data is from memory.
If in_memory is False, the pandas DataFrame is converted to a CSV file locally. If path is of type DataReference, the pandas frame will be uploaded to the datastore and the Dataset will be based on the DataReference. If path is a local folder, the Dataset will be created from the local file, which cannot be deleted.
Raises an exception if the current DataReference is not a folder path.
from_parquet_files
Create an unregistered, in-memory Dataset from parquet files.
Note
This method is deprecated and will no longer be supported.
Use Dataset.Tabular.from_parquet_files instead. For more information, see https://aka.ms/dataset-deprecation.
static from_parquet_files(path, include_path=False, partition_format=None)
Parameters
- include_path
- bool
Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files and want to know which file a particular record originated from, or when you want to keep useful information from the file path.
- partition_format
- str
Specify the partition format in the path and create string columns from format '{x}' and a datetime column from format '{x:yyyy/MM/dd/HH/mm/ss}', where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and continue to the end of the file path. For example, given a file path '../Accounts/2019/01/01/data.parquet' where data is partitioned by department name and time, we can define '/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' to create a column 'Department' of string type and 'PartitionDate' of datetime type.
Returns
Dataset object.
Return type
Remarks
Use this method to read Parquet files.
After creating a Dataset, you should use get_profile to list detected column types and summary statistics for each column.
The returned Dataset is not registered with the workspace.
from_sql_query
Create an unregistered, in-memory Dataset from a SQL query.
Note
This method is deprecated and will no longer be supported.
Use Dataset.Tabular.from_sql_query instead. For more information, see https://aka.ms/dataset-deprecation.
static from_sql_query(data_source, query)
Parameters
Returns
The local Dataset object.
Return type
generate_profile
Generate a new profile for the Dataset.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
generate_profile(compute_target=None, workspace=None, arguments=None)
Parameters
- compute_target
- Union[ComputeTarget, str]
An optional compute target to perform the snapshot profile creation. If omitted, the local compute is used.
- arguments
- dict
Profile arguments. Valid arguments are:
'include_stype_counts' of type bool. Check if values look like some well-known semantic types such as email address, IP address (V4/V6), US phone number, US zip code, latitude/longitude. Enabling this impacts performance.
'number_of_histogram_bins' of type int. The number of histogram bins to use for numeric data. The default value is 10.
Returns
Dataset action run object.
Return type
Remarks
This is a synchronous call and will block until it completes. Call get_result to get the result of the action.
get
Get a Dataset that already exists in the workspace by specifying either its name or ID.
Note
This method is deprecated and will no longer be supported.
Use get_by_name and get_by_id instead. For more information, see https://aka.ms/dataset-deprecation.
static get(workspace, name=None, id=None)
Parameters
Returns
The Dataset with the specified name or ID.
Return type
Remarks
You can provide either name or id. An exception is raised if:
- both name and id are specified but don't match.
- the Dataset with the specified name or id cannot be found in the workspace.
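The resolution rules above can be sketched as a small helper. This is illustrative only; registry and the helper itself are hypothetical stand-ins for the workspace lookup, not SDK code:

```python
from types import SimpleNamespace

def get_dataset(registry, name=None, id=None):
    """Sketch of the name/ID resolution rules: either argument works,
    but mismatched name+id or a missing dataset raises."""
    if id is not None:
        ds = registry.get(id)
        if ds is None:
            raise KeyError(f"No dataset with id {id!r}")
        if name is not None and ds.name != name:
            raise ValueError("name and id were both given but do not match")
        return ds
    if name is not None:
        for ds in registry.values():
            if ds.name == name:
                return ds
        raise KeyError(f"No dataset named {name!r}")
    raise ValueError("Provide either name or id")

# Hypothetical registry mapping dataset IDs to objects with a .name attribute.
registry = {"abc123": SimpleNamespace(name="iris")}
print(get_dataset(registry, name="iris").name)  # iris
```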
get_all
Get all the registered datasets in the workspace.
get_all()
Parameters
Returns
A dictionary of TabularDataset and FileDataset objects keyed by their registration name.
Return type
get_all_snapshots
Get all snapshots of the Dataset.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
get_all_snapshots()
Returns
List of Dataset snapshots.
Return type
get_by_id
Get a Dataset which is saved to the workspace.
get_by_id(id)
Parameters
Returns
The dataset object. If the dataset is registered, its registration name and version will also be returned.
Return type
get_by_name
Get a registered Dataset from workspace by its registration name.
get_by_name(name, version='latest')
Parameters
Returns
The registered dataset object.
Return type
get_definition
Get a specific definition of the Dataset.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
get_definition(version_id=None)
Parameters
Returns
The Dataset definition.
Return type
Remarks
If version_id is provided, then Azure Machine Learning tries to get the definition corresponding to that version. If that version does not exist, an exception is thrown. If version_id is omitted, the latest version is retrieved.
get_definitions
Get all the definitions of the Dataset.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
get_definitions()
Returns
A dictionary of Dataset definitions.
Return type
Remarks
A Dataset registered in an AzureML workspace can have multiple definitions, each created by calling update_definition. Each definition has a unique identifier. The current definition is the latest one created.
For unregistered Datasets, only one definition exists.
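The definition-versioning model can be sketched in plain Python. This is an illustrative stand-in for the behavior described above, not the SDK implementation:

```python
class VersionedDataset:
    """Sketch of the multiple-definitions model: each update_definition
    appends a new definition; the current one is the latest."""

    def __init__(self, definition):
        self._definitions = {1: definition}

    def update_definition(self, definition):
        version = max(self._definitions) + 1
        self._definitions[version] = definition
        return version

    def get_definition(self, version_id=None):
        # No version_id -> latest; unknown version_id -> exception.
        if version_id is None:
            return self._definitions[max(self._definitions)]
        if version_id not in self._definitions:
            raise KeyError(f"No definition with version {version_id}")
        return self._definitions[version_id]

    def get_definitions(self):
        return dict(self._definitions)

ds = VersionedDataset("select *")
ds.update_definition("select * where year = 2019")
print(ds.get_definition())          # the latest definition
print(sorted(ds.get_definitions())) # [1, 2]
```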
get_profile
Get summary statistics on the Dataset computed earlier.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
get_profile(arguments=None, generate_if_not_exist=True, workspace=None, compute_target=None)
Parameters
Returns
DataProfile of the Dataset.
Return type
Remarks
For a Dataset registered with an Azure Machine Learning workspace, this method retrieves an existing profile that was created earlier by calling get_profile, if it is still valid. Profiles are invalidated when changed data is detected in the Dataset or when the arguments to get_profile differ from those used when the profile was generated. If the profile is not present or is invalidated, generate_if_not_exist determines whether a new profile is generated.
For a Dataset that is not registered with an Azure Machine Learning workspace, this method always runs generate_profile and returns the result.
get_snapshot
Get snapshot of the Dataset by name.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
get_snapshot(snapshot_name)
Parameters
Returns
Dataset snapshot object.
Return type
head
Pull the specified number of records from this Dataset and return them as a pandas DataFrame.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
head(count)
Parameters
Returns
A Pandas DataFrame.
Return type
list
List all the Datasets in the workspace, including ones with the is_visible property equal to False.
Note
This method is deprecated and will no longer be supported.
Use get_all instead. For more information, see https://aka.ms/dataset-deprecation.
static list(workspace)
Parameters
Returns
A list of Dataset objects.
Return type
reactivate
Reactivate an archived or deprecated dataset.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
reactivate()
Returns
None.
Return type
register
Register the Dataset in the workspace, making it available to other users of the workspace.
Note
This method is deprecated and will no longer be supported.
Use the register method of AbstractDataset instead. For more information, see https://aka.ms/dataset-deprecation.
register(workspace, name, description=None, tags=None, visible=True, exist_ok=False, update_if_exist=False)
Parameters
- visible
- bool
Indicates whether the Dataset is visible in the UI. If False, the Dataset is hidden in the UI but still available via the SDK.
- exist_ok
- bool
If True, the method returns the Dataset if it already exists in the given workspace; otherwise, it raises an error.
- update_if_exist
- bool
If exist_ok is True and update_if_exist is True, this method will update the definition and return the updated Dataset.
Returns
A registered Dataset object in the workspace.
Return type
sample
Generate a new sample from the source Dataset, using the sampling strategy and parameters provided.
Note
This method is deprecated and will no longer be supported.
Create a TabularDataset by calling the static methods on Dataset.Tabular and use the take_sample method there. For more information, see https://aka.ms/dataset-deprecation.
sample(sample_strategy, arguments)
Parameters
- sample_strategy
- str
Sample strategy to use. Accepted values are "top_n", "simple_random", or "stratified".
- arguments
- dict
A dictionary with keys from the optional arguments listed in the Remarks section, and values of the corresponding type. Only arguments for the chosen sampling method can be used. For example, for a "simple_random" sample type, you can only specify a dictionary with "probability" and "seed" keys.
Returns
Dataset object as a sample of the original dataset.
Return type
Remarks
Samples are generated by executing the transformation pipeline defined by this Dataset, and then applying the sampling strategy and parameters to the output data. Each sampling method supports the following optional arguments:
top_n
Optional arguments
- n, type integer. Select top N rows as your sample.
simple_random
Optional arguments
- probability, type float. Simple random sampling where each row has equal probability of being selected. Probability should be a number between 0 and 1.
- seed, type float. Used by the random number generator. Use for repeatability.
stratified
Optional arguments
- columns, type list[str]. List of strata columns in the data.
- seed, type float. Used by the random number generator. Use for repeatability.
- fractions, type dict[tuple, float]. Tuple: column values that define a stratum; must be in the same order as the column names. Float: weight attached to a stratum during sampling.
The following code snippets are example design patterns for different sample methods.
# sample_strategy "top_n"
top_n_sample_dataset = dataset.sample('top_n', {'n': 5})
# sample_strategy "simple_random"
simple_random_sample_dataset = dataset.sample('simple_random', {'probability': 0.3, 'seed': 10.2})
# sample_strategy "stratified"
fractions = {}
fractions[('THEFT',)] = 0.5
fractions[('DECEPTIVE PRACTICE',)] = 0.2
# take 50% of records with "Primary Type" as THEFT and 20% of records with "Primary Type" as
# DECEPTIVE PRACTICE into sample Dataset
sample_dataset = dataset.sample('stratified', {'columns': ['Primary Type'], 'fractions': fractions})
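Because arguments is a free-form dictionary, a common mistake is passing an argument that belongs to a different strategy. The rule described above can be sketched as a small validator (plain Python, not part of the SDK; the function name is hypothetical):

```python
# Allowed optional arguments per sampling strategy, as documented above.
ALLOWED_SAMPLE_ARGS = {
    "top_n": {"n"},
    "simple_random": {"probability", "seed"},
    "stratified": {"columns", "seed", "fractions"},
}

def check_sample_args(sample_strategy, arguments):
    """Return True if every key in `arguments` is valid for the strategy,
    otherwise raise ValueError. Illustrative only."""
    allowed = ALLOWED_SAMPLE_ARGS.get(sample_strategy)
    if allowed is None:
        raise ValueError(f"unknown strategy: {sample_strategy!r}")
    extra = set(arguments) - allowed
    if extra:
        raise ValueError(f"invalid arguments for {sample_strategy}: {sorted(extra)}")
    return True
```

For example, passing {'probability': 0.3} with the "top_n" strategy would be rejected, since "probability" belongs to "simple_random".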
to_pandas_dataframe
Create a Pandas dataframe by executing the transformation pipeline defined by this Dataset definition.
Note
This method is deprecated and will no longer be supported.
Create a TabularDataset by calling the static methods on Dataset.Tabular and use the to_pandas_dataframe method there. For more information, see https://aka.ms/dataset-deprecation.
to_pandas_dataframe()
Returns
A Pandas DataFrame.
Return type
Remarks
Return a Pandas DataFrame fully materialized in memory.
to_spark_dataframe
Create a Spark DataFrame that can execute the transformation pipeline defined by this Dataset definition.
Note
This method is deprecated and will no longer be supported.
Create a TabularDataset by calling the static methods on Dataset.Tabular and use the to_spark_dataframe method there. For more information, see https://aka.ms/dataset-deprecation.
to_spark_dataframe()
Returns
A Spark DataFrame.
Return type
Remarks
The Spark DataFrame returned is only an execution plan and does not actually contain any data, since Spark DataFrames are lazily evaluated.
update
Update the Dataset mutable attributes in the workspace and return the updated Dataset from the workspace.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
update(name=None, description=None, tags=None, visible=None)
Parameters
Returns
An updated Dataset object from the workspace.
Return type
update_definition
Update the Dataset definition.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
update_definition(definition, definition_update_message)
Parameters
Returns
An updated Dataset object from the workspace.
Return type
Remarks
To consume the updated Dataset, use the object returned by this method.
Attributes
definition
Return the current Dataset definition.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
Returns
The Dataset definition.
Return type
Remarks
A Dataset definition is a series of steps that specify how to read and transform data.
A Dataset registered in an AzureML workspace can have multiple definitions, each created by calling update_definition. Each definition has a unique identifier. Having multiple definitions allows you to make changes to existing Datasets without breaking models and pipelines that depend on the older definition.
For unregistered Datasets, only one definition exists.
definition_version
Return the version of the current definition of the Dataset.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
Returns
The Dataset definition version.
Return type
Remarks
A Dataset definition is a series of steps that specify how to read and transform data.
A Dataset registered in an AzureML workspace can have multiple definitions, each created by calling update_definition. Each definition has a unique identifier. The current definition is the latest one created, and its ID is returned by this property.
For unregistered Datasets, only one definition exists.
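The versioning behaviour described above can be pictured with a toy model. This is illustrative only; real definitions are managed by the Azure Machine Learning service, and the class below is not part of the SDK:

```python
class ToyDataset:
    """Toy model of definition versioning: each update_definition call
    appends a new definition, and the current definition is the latest one.
    Illustrative only; not part of the azureml SDK."""

    def __init__(self, first_definition):
        # A registered Dataset starts with a single definition.
        self._definitions = [first_definition]

    def update_definition(self, definition):
        # Older definitions are kept, so existing consumers keep working.
        self._definitions.append(definition)

    @property
    def definition(self):
        # The current definition is the most recently created one.
        return self._definitions[-1]

    @property
    def definition_version(self):
        return len(self._definitions)
```

Because older definitions are retained, models and pipelines pinned to an earlier definition are unaffected by later updates.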
description
Return the description of the Dataset.
Returns
The Dataset description.
Return type
Remarks
Specifying a description of the data in the Dataset enables users of the workspace to understand what the data represents, and how they can use it.
id
If the Dataset was registered in a workspace, return the ID of the Dataset. Otherwise, return None.
Returns
The Dataset ID.
Return type
is_visible
Control the visibility of a registered Dataset in the Azure ML workspace UI.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
Returns
The Dataset visibility.
Return type
Remarks
Values returned:
- True: the Dataset is visible in the workspace UI. Default.
- False: the Dataset is hidden in the workspace UI.
Has no effect on unregistered Datasets.
name
state
Return the state of the Dataset.
Note
This method is deprecated and will no longer be supported.
For more information, see https://aka.ms/dataset-deprecation.
Returns
The Dataset state.
Return type
Remarks
The meaning and effect of states are as follows:
- Active. All actions can be performed on an active definition.
- Deprecated. A deprecated definition can still be used, but a warning is logged every time the underlying data is accessed.
- Archived. An archived definition cannot be used to perform any action. To perform actions on an archived definition, it must be reactivated.
tags
workspace
If the Dataset was registered in a workspace, return the workspace. Otherwise, return None.
Returns
The workspace.
Return type