Create data assets

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you learn how to create a data asset in Azure Machine Learning. An Azure Machine Learning data asset is similar to a web browser bookmark (favorites): instead of remembering long storage paths (URIs) that point to your most frequently used data, you can create a data asset and then access that asset with a friendly name.

Data asset creation also creates a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost and don't risk the integrity of your data sources. You can create data assets from Azure ML datastores, Azure Storage, public URLs, and local files.

Prerequisites

To create and work with data assets, you need:

An Azure subscription. If you don't have one, create a free account before you begin.

An Azure Machine Learning workspace.

The Azure Machine Learning CLI ml extension (v2) or the Python SDK azure-ai-ml (v2).

Do I need to create a data asset to access my data?

No. If you just want to access your data in an interactive session (for example, a notebook) or a job, you are not required to create a data asset first. You can use the storage URI to access the data.

Data assets can "bookmark" your frequently used data, to avoid the need to remember long storage URIs.
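As a minimal sketch of direct URI access, the public titanic sample from the pandas repository (the same URL used in the supported-paths examples below) can be read straight into pandas without creating a data asset first:

```python
import pandas as pd

# Read a file directly from its storage URI - no Azure ML data asset needed.
url = "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
df = pd.read_csv(url)
print(df.shape)
```

With the azureml-fsspec package installed, the same pattern also works for azureml:// datastore URIs.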

Tip

For more information about accessing your data in a notebook, see Access data from Azure cloud storage for interactive development.

For more information about accessing your data - both local and cloud storage - in a job, see Access data in a job.

Data asset types

You can create three data asset types:

File (V2 API: uri_file; V1 API: FileDataset)
Canonical scenarios: Read/write a single file - the file can have any format.
V2/V1 API difference: A type new to V2 APIs. In V1 APIs, files always mapped to a folder on the compute target filesystem; this mapping required an os.path.join. In V2 APIs, the single file itself is mapped, so you can refer to that location directly in your code.

Folder (V2 API: uri_folder; V1 API: FileDataset)
Canonical scenarios: Read/write a folder of parquet/CSV files into Pandas/Spark; deep learning with image, text, audio, or video files located in a folder.
V2/V1 API difference: In V1 APIs, FileDataset had an associated engine that could take a file sample from a folder. In V2 APIs, a folder is a simple mapping to the compute target filesystem.

Table (V2 API: mltable; V1 API: TabularDataset)
Canonical scenarios: You have a complex schema subject to frequent changes, or you need a subset of large tabular data; AutoML with tables.
V2/V1 API difference: In V1 APIs, the Azure ML back end stored the data materialization blueprint, so TabularDataset only worked if you had an Azure ML workspace. mltable stores the data materialization blueprint in your own storage, so you can use it disconnected from Azure ML - for example, locally or on-premises - and you'll find it easier to transition from local to remote jobs. Read Working with tables in Azure Machine Learning for more information.

Supported paths

When you create an Azure Machine Learning data asset, you must specify a path parameter that points to the data asset location. Supported paths include:

A path on your local computer: ./home/username/data/my_data
A path on a datastore: azureml://datastores/<data_store_name>/paths/<path>
A path on a public http(s) server: https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv
A path on Azure Blob Storage: wasbs://<containername>@<accountname>.blob.core.windows.net/<path_to_data>/
A path on Azure Data Lake Storage Gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
A path on Azure Data Lake Storage Gen1: adl://<accountname>.azuredatalakestore.net/<path_to_data>/

Note

When you create a data asset from a local path, the data is automatically uploaded to the default Azure Machine Learning cloud datastore.

Create a File asset

Create a YAML file (<file-name>.yml):

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# Supported paths include:
# local: ./<path>/<file>
# blob:  https://<account_name>.blob.core.windows.net/<container_name>/<path>/<file>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>/<file>

type: uri_file
name: <name>
description: <description>
path: <uri>

Next, execute the following command in the CLI:

az ml data create -f <file-name>.yml

Create a Folder asset

Create a YAML file (<file-name>.yml):

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# Supported paths include:
# local: ./<path>
# blob:  https://<account_name>.blob.core.windows.net/<container_name>/<path>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
type: uri_folder
name: <name_of_data>
description: <description goes here>
path: <path>

Next, use the CLI to create the data asset:

az ml data create -f <file-name>.yml

Create a Table asset

You must create a valid MLTable file before you create the asset. Read Authoring MLTable files to learn more about MLTable file and artifact creation.

Important

The path should be a folder that contains a valid MLTable file.
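As a sketch of what such a folder contains, a minimal MLTable file for a single delimited file might look like this (titanic.csv is a placeholder for a data file sitting next to the MLTable file):

```yaml
# This file must be named exactly "MLTable" and live in the folder
# referenced by the data asset's path.
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json

paths:
  - file: ./titanic.csv  # placeholder: a delimited file next to this MLTable file
transformations:
  - read_delimited:
      delimiter: ','
      header: all_files_same_headers
```

See Authoring MLTable files for the full set of supported paths and transformations.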

Create a YAML file (<file-name>.yml):

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# path must point to the **folder** containing the MLTable artifact (MLTable file + data)
# Supported paths include:
# local: ./<path>
# blob:  https://<account_name>.blob.core.windows.net/<container_name>/<path>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore: azureml://datastores/<data_store_name>/paths/<path>

type: mltable
name: <name_of_data>
description: <description goes here>
path: <path>

Next, create the data asset using the CLI:

az ml data create -f <file-name>.yml

Next steps