data Package

Contains modules supporting data representation for Datastore and Dataset in Azure Machine Learning.

This package contains core functionality supporting Datastore and Dataset classes in the core package. Datastore objects contain connection information to Azure storage services that can be easily referred to by name without the need to work directly with or hard code connection information in scripts. Datastore supports a number of different services represented by classes in this package, including AzureBlobDatastore, AzureFileDatastore, and AzureDataLakeDatastore. For a full list of supported storage services, see the Datastore class.

While a Datastore acts as a container for your data files, you can think of a Dataset as a reference or pointer to specific data that's in your datastore. The following Dataset types are supported:

  • TabularDataset represents data in a tabular format created by parsing the provided file or list of files.

  • FileDataset references single or multiple files in your datastores or public URLs.

For more information, see the article Add & register datasets.
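As a rough illustration of the relationship described above, the following sketch looks up a registered datastore by name and builds one dataset of each type that points at paths inside it. The helper name `build_datasets`, the datastore name, and the path patterns are hypothetical placeholders; the sketch assumes the azureml-core package is installed and that a workspace object is already available.

```python
def build_datasets(workspace, datastore_name="my_blob_store"):
    """Return (file_dataset, tabular_dataset) referencing data in a named datastore.

    The datastore name and the paths below are placeholders for illustration.
    """
    # Imported inside the function so this file can be loaded without the SDK.
    from azureml.core import Dataset, Datastore

    # Look up connection information by name; no secrets appear in the script.
    datastore = Datastore.get(workspace, datastore_name)

    # A FileDataset references files as they are.
    file_dataset = Dataset.File.from_files(path=(datastore, "images/**"))

    # A TabularDataset parses the referenced files into rows and columns.
    tabular_dataset = Dataset.Tabular.from_delimited_files(
        path=(datastore, "tables/*.csv")
    )
    return file_dataset, tabular_dataset
```

Neither dataset copies data at this point; both only record where the data lives.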



Contains the abstract base class for datasets in Azure Machine Learning.


Contains the base functionality for datastores that save connection information to Azure storage services.


Contains the base functionality for datastores that save connection information to Azure Data Lake Storage.


Contains the base functionality for datastores that save connection information to Azure Database for MySQL.


Contains the base functionality for datastores that save connection information to Azure Database for PostgreSQL.


Contains the base functionality for datastores that save connection information to Azure SQL Database.


Contains functionality for datastores that save connection information to Azure Blob and Azure File storage.


Constants used in the package. Internal use only.


Contains functionality to manage data context of datastores and datasets. Internal use only.


Contains functionality that defines how to create references to data in datastores.


Contains functionality for managing DatacacheStore and Datacache in Azure Machine Learning.


Internal use only.


Contains functionality for Datacache consumption configuration.


Contains objects needed for Datacache Singularity settings representation.


Contains functionality to create references to data in datastores.

This module contains the DataPath class, which represents the location of data, and the DataPathComputeBinding class, which represents how the data is made available on the compute targets.
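A minimal sketch of how the two classes might be used together, assuming the azureml-core package is installed. The helper name `describe_training_data`, the datastore name, and the folder path are hypothetical; only the `DataPath` and `DataPathComputeBinding` classes come from the `azureml.data.datapath` module.

```python
def describe_training_data(workspace, datastore_name="my_blob_store"):
    """Return a (DataPath, DataPathComputeBinding) pair for a folder in a datastore.

    The datastore name and folder are placeholders for illustration.
    """
    # Imported inside the function so this file can be loaded without the SDK.
    from azureml.core import Datastore
    from azureml.data.datapath import DataPath, DataPathComputeBinding

    datastore = Datastore.get(workspace, datastore_name)

    # Where the data lives: a folder inside the named datastore.
    data_path = DataPath(datastore=datastore, path_on_datastore="training/data")

    # How the data reaches the compute target: mounted rather than downloaded.
    binding = DataPathComputeBinding(mode="mount")
    return data_path, binding
```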


Contains functionality that manages the execution of Dataset actions.

This module provides convenience methods for creating Dataset actions and getting their results after completion.


Contains functionality for Dataset consumption configuration.


Contains functionality to manage dataset definition and its operations.


This module is deprecated.


Contains exceptions for dataset error handling in Azure Machine Learning.


Contains functionality to create datasets for Azure Machine Learning.


Class for collecting summary statistics on the data produced by a Dataflow.

Functionality in this module includes collecting information about which run produced the profile and whether the profile is stale.


Contains configuration for monitoring a dataset profile run in Azure Machine Learning.

Functionality in this module includes handling and monitoring a dataset profile run associated with an experiment object and an individual run ID.


Contains configuration to generate a statistics summary of datasets in Azure Machine Learning.

Functionality in this module includes methods for submitting a local or remote profile run and visualizing the result of the submitted profile run.


Contains functionality to manage Dataset snapshot operations.


This module is deprecated.


Contains enumeration values used with Dataset.


Internal use only.


Contains functionality for datastores that save connection information to Databricks File System (DBFS).


Contains functionality for referencing single or multiple files in datastores or public URLs.

For more information, see the article Add & register datasets.


Contains the base functionality for datastores that save connection information to an HDFS cluster.


Contains configurations that specify how outputs for a job should be uploaded and promoted to a dataset.

For more information, see the article how to specify outputs.


Contains functionality for creating references to data in datastores that save connection info to SQL databases.


Contains functionality for creating a parameter to pass to a SQL stored procedure.


Contains functionality for representing data in a tabular format by parsing the provided file or list of files.

For more information, see the article Add & register datasets.



Configures column data types for a dataset created in Azure Machine Learning.

DataType methods are used in the TabularDatasetFactory class from_* methods, which are used to create new TabularDataset objects.
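The following sketch shows one way column types might be declared when a tabular dataset is created, assuming the azureml-core package is installed. The helper name `load_typed_table`, the datastore name, the file pattern, and the column names are all hypothetical; `set_column_types` is the parameter of `from_delimited_files` that accepts `DataType` values.

```python
def load_typed_table(workspace, datastore_name="my_blob_store"):
    """Create a TabularDataset with explicit column types.

    The datastore name, file pattern, and column names are placeholders.
    """
    # Imported inside the function so this file can be loaded without the SDK.
    from azureml.core import Dataset, Datastore
    from azureml.data.dataset_factory import DataType

    datastore = Datastore.get(workspace, datastore_name)

    # Map each column name to the type it should be parsed as.
    column_types = {
        "price": DataType.to_float(),
        "quantity": DataType.to_long(),
        "order_date": DataType.to_datetime("%Y-%m-%d"),
        "note": DataType.to_string(),
    }

    return Dataset.Tabular.from_delimited_files(
        path=(datastore, "orders/*.csv"),
        set_column_types=column_types,
    )
```

Columns that are not listed keep whatever type the factory method infers for them.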



This is an experimental class, and may change at any time.

Represents a storage abstraction over an Azure Machine Learning storage account.

DatacacheStores are attached to workspaces and are used to store information related to the underlying datacache solution. Currently, only the partitioned blob solution is supported. A DatacacheStore defines the Blob datastores that can be used for caching.

Use this class to perform management operations, including register, list, get, and update datacachestores. DatacacheStores for each service are created with the register* methods of this class.

Get a datacachestore by name. This call will make a request to the datacache service.


Represents a collection of file references in datastores or public URLs to use in Azure Machine Learning.

A FileDataset defines a series of lazily-evaluated, immutable operations to load data from the data source into file streams. Data is not loaded from the source until FileDataset is asked to deliver data.
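To make "lazily-evaluated, immutable operations" concrete, here is a small pure-Python sketch of the idea. It is not the SDK's implementation: the `LazyFiles` class and the `list_files` function are hypothetical, and only the method names `skip`, `take`, and `to_path` are borrowed from FileDataset.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class LazyFiles:
    """Toy stand-in for a lazily evaluated, immutable file dataset."""

    source: Callable[[], Iterable[str]]  # produces file paths only when called
    steps: Tuple[Callable[[Iterator[str]], Iterator[str]], ...] = ()

    def _with(self, step):
        # Return a new object; the original is never modified.
        return LazyFiles(self.source, self.steps + (step,))

    def skip(self, count: int) -> "LazyFiles":
        return self._with(lambda it: (p for i, p in enumerate(it) if i >= count))

    def take(self, count: int) -> "LazyFiles":
        return self._with(lambda it: (p for _, p in zip(range(count), it)))

    def to_path(self) -> list:
        # Only here is the source actually read and the recorded steps applied.
        stream = iter(self.source())
        for step in self.steps:
            stream = step(stream)
        return list(stream)


reads = []  # records every time the source is actually read


def list_files():
    reads.append(1)
    return ["a.csv", "b.csv", "c.csv", "d.csv"]


base = LazyFiles(list_files)
subset = base.skip(1).take(2)  # nothing has been read yet
print(subset.to_path())        # ['b.csv', 'c.csv']
```

Because every operation returns a new object, `base` still describes all four files after `subset` is defined, and the source is read only when `to_path` is called.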

A FileDataset is created using the from_files method of the FileDatasetFactory class.

For more information, see the article Add & register datasets.

Initialize the FileDataset object.

This constructor is not supposed to be invoked directly. A dataset is intended to be created using the FileDatasetFactory class.


Represents how to write output to an HDFS path and promote it to a FileDataset.

Initialize an HDFSOutputDatasetConfig.



This is an experimental class, and may change at any time.

Represents how to link the output of a run and promote it to a FileDataset.

The LinkFileOutputDatasetConfig allows you to link a file dataset as an output dataset:

   from azureml.core import Experiment, ScriptRunConfig, Workspace
   from azureml.data.output_dataset_config import LinkFileOutputDatasetConfig

   workspace = Workspace.from_config()
   experiment = Experiment(workspace, 'output_example')

   output = LinkFileOutputDatasetConfig('link_output')

   script_run_config = ScriptRunConfig('.', '', arguments=[output])

   # within the submitted script:
   # from azureml.core import Run, Dataset
   # run = Run.get_context()
   # workspace = run.experiment.workspace
   # dataset = Dataset.get_by_name(workspace, name='dataset_to_link')
   # run.output_datasets['link_output'].link(dataset)

   run = experiment.submit(script_run_config)

Initialize a LinkFileOutputDatasetConfig.



This is an experimental class, and may change at any time.

Represents how to link the output of a run and promote it to a TabularDataset.

The LinkTabularOutputDatasetConfig allows you to link a tabular dataset as an output dataset:

   from azureml.core import Experiment, ScriptRunConfig, Workspace
   from azureml.data.output_dataset_config import LinkTabularOutputDatasetConfig

   workspace = Workspace.from_config()
   experiment = Experiment(workspace, 'output_example')

   output = LinkTabularOutputDatasetConfig('link_output')

   script_run_config = ScriptRunConfig('.', '', arguments=[output])

   # within the submitted script:
   # from azureml.core import Run, Dataset
   # run = Run.get_context()
   # workspace = run.experiment.workspace
   # dataset = Dataset.get_by_name(workspace, name='dataset_to_link')
   # run.output_datasets['link_output'].link(dataset)

   run = experiment.submit(script_run_config)

Initialize a LinkTabularOutputDatasetConfig.


Represents how to copy the output of a run and promote it to a FileDataset.

The OutputFileDatasetConfig allows you to specify how you want a particular local path on the compute target to be uploaded to the specified destination. If no arguments are passed to the constructor, we will automatically generate a name, a destination, and a local path.

An example of not passing any arguments:

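   from azureml.core import Experiment, ScriptRunConfig, Workspace
   from azureml.data.output_dataset_config import OutputFileDatasetConfig

   workspace = Workspace.from_config()
   experiment = Experiment(workspace, 'output_example')

   output = OutputFileDatasetConfig()

   script_run_config = ScriptRunConfig('.', '', arguments=[output])

   run = experiment.submit(script_run_config)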

An example of creating an output, promoting it to a tabular dataset, and registering it with the name foo:

   from azureml.core import Datastore, Experiment, ScriptRunConfig, Workspace
   from azureml.data.output_dataset_config import OutputFileDatasetConfig

   workspace = Workspace.from_config()
   experiment = Experiment(workspace, 'output_example')

   datastore = Datastore(workspace, 'example_adls_gen2_datastore')

   # for more information on the parameters and methods, please look for the corresponding documentation.
   output = OutputFileDatasetConfig().read_delimited_files().register_on_complete('foo')

   script_run_config = ScriptRunConfig('.', '', arguments=[output])

   run = experiment.submit(script_run_config)

Initialize an OutputFileDatasetConfig.


Represents a tabular dataset to use in Azure Machine Learning.

A TabularDataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation. Data is not loaded from the source until TabularDataset is asked to deliver data.

TabularDataset is created using methods like from_delimited_files from the TabularDatasetFactory class.
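As a hedged sketch of the workflow described above, the function below defines a tabular dataset, narrows it, and only then asks for data. The helper name `preview_and_register`, the datastore name, the file pattern, the column names, and the registered name are hypothetical; the sketch assumes the azureml-core package is installed, along with pandas for `to_pandas_dataframe`.

```python
def preview_and_register(workspace, datastore_name="my_blob_store"):
    """Preview a few rows of a TabularDataset and register its definition.

    The datastore name, paths, column names, and dataset name are placeholders.
    """
    # Imported inside the function so this file can be loaded without the SDK.
    from azureml.core import Dataset, Datastore

    datastore = Datastore.get(workspace, datastore_name)

    # Defining the dataset and its transformations does not read any data.
    dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "orders/*.csv"))
    trimmed = dataset.keep_columns(["order_id", "price"])

    # Data is loaded only when a delivery method is called.
    preview = trimmed.take(5).to_pandas_dataframe()

    # Register the definition so it can be retrieved by name later.
    registered = trimmed.register(
        workspace=workspace, name="orders_trimmed", create_new_version=True
    )
    return preview, registered
```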

For more information, see the article Add & register datasets.

Initialize a TabularDataset object.

This constructor is not supposed to be invoked directly. A dataset is intended to be created using the TabularDatasetFactory class.