Create data assets

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to create a data asset in Azure Machine Learning. An Azure Machine Learning data asset is similar to web browser bookmarks (favorites). Instead of remembering long storage paths (URIs) that point to your most frequently used data, you can create a data asset, and then access that asset with a friendly name.

Data asset creation also creates a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost and don't risk the integrity of your data sources. You can create data assets from Azure Machine Learning datastores, Azure Storage, public URLs, and local files.

Prerequisites

To create and work with data assets, you need:

- An Azure subscription. If you don't have one, create a free account before you begin.
- An Azure Machine Learning workspace.
- The Azure Machine Learning CLI or the Python SDK (azure-ai-ml v2) installed.

Do I need to create a data asset to access my data?

No. You don't need to create a data asset first to access your data in an interactive session (for example, a notebook) or a job. You can use the storage URI to access the data directly.

Data assets "bookmark" your frequently used data, so you don't need to remember long storage URIs.
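
For example, in a notebook you can read a file directly from a public https URI with pandas. The following minimal sketch uses the public Titanic CSV that also appears later in the supported paths table; it assumes pandas is installed in your environment:

import pandas as pd

# Read data directly from a storage URI - no data asset needed.
# Authenticated cloud URIs (azureml://, wasbs://, abfss://) can also be used;
# see the articles linked in the tip below for the supported access patterns.
uri = "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
df = pd.read_csv(uri)
print(df.head())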

Tip

For more information about accessing your data in a notebook, see Access data from Azure cloud storage for interactive development.

For more information about accessing your data - both local and cloud storage - in a job, see Access data in a job.

Data asset types

Note

Review the canonical scenarios in the following table when you decide whether uri_file, uri_folder, or mltable best fits your use case.

You can create three data asset types:

| Type | V2 API | V1 API | V2/V1 API difference | Canonical scenarios |
|------|--------|--------|----------------------|---------------------|
| File (reference a single file) | uri_file | FileDataset | A type new to V2 APIs. In V1 APIs, files always mapped to a folder on the compute target filesystem, and this mapping required an os.path.join. In V2 APIs, the single file itself is mapped, so you can refer to that location in your code. | Read/write a single file; the file can have any format. |
| Folder (reference a single folder) | uri_folder | FileDataset | In V1 APIs, FileDataset had an associated engine that could take a file sample from a folder. In V2 APIs, a folder is a simple mapping to the compute target filesystem. | Read/write a folder of parquet/CSV files into Pandas/Spark. Deep learning with image, text, audio, or video files located in a folder. |
| Table (reference a data table) | mltable | TabularDataset | In V1 APIs, the Azure Machine Learning back-end stored the data materialization blueprint, so TabularDataset only worked if you had an Azure Machine Learning workspace. mltable stores the data materialization blueprint in your own storage, so you can use it disconnected from Azure Machine Learning (for example, locally or on-premises), and transitioning from local to remote jobs becomes easier. For more information, read Working with tables in Azure Machine Learning. | You have a complex schema subject to frequent changes, or you need a subset of large tabular data. AutoML with tables. |

Important

If you migrate your V1 datasets to V2 data assets, you must give each V2 data asset a name that's different from the original V1 dataset name.

Supported paths

When you create an Azure Machine Learning data asset, you must specify a path parameter that points to the data asset location. Supported paths include:

| Location | Example |
|----------|---------|
| A path on your local computer | ./home/username/data/my_data |
| A path on a datastore | azureml://datastores/<data_store_name>/paths/<path> |
| A path on a public http(s) server | https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv |
| A path on Azure Storage (Blob) | wasbs://<containername>@<accountname>.blob.core.windows.net/<path_to_data>/ |
| A path on Azure Storage (ADLS gen2) | abfss://<file_system>@<account_name>.dfs.core.windows.net/<path> |
| A path on Azure Storage (ADLS gen1) | adl://<accountname>.azuredatalakestore.net/<path_to_data>/ |

Important

When you work with folder data, make sure the path has the correct structure, including the trailing slash (for example, /<path_to_data>/), so the data source is captured accurately.

Note

When you create a data asset from a local path, the data is automatically uploaded to the default Azure Machine Learning cloud datastore.

Create a File asset

Create a YAML file (<file-name>.yml):

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# Supported paths include:
# local: ./<path>/<file>
# blob:  https://<account_name>.blob.core.windows.net/<container_name>/<path>/<file>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>/<file>

type: uri_file
name: <name>
description: <description>
path: <uri>

Next, execute the following command in the CLI:

az ml data create -f <file-name>.yml
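
If you prefer the Python SDK (azure-ai-ml v2) to the CLI, a minimal sketch of the equivalent call looks like the following; the workspace identifiers, name, description, and path are placeholders you replace with your own values:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Connect to the workspace (replace the placeholder values).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription_id>",
    resource_group_name="<resource_group>",
    workspace_name="<workspace>",
)

# The same paths as in the YAML example are supported (local, blob, ADLS gen2, datastore).
my_file = Data(
    path="<uri>",
    type=AssetTypes.URI_FILE,
    name="<name>",
    description="<description>",
)
ml_client.data.create_or_update(my_file)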

Create a Folder asset

Create a YAML file (<file-name>.yml):

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# Supported paths include:
# local: ./<path>
# blob:  https://<account_name>.blob.core.windows.net/<container_name>/<path>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
type: uri_folder
name: <name_of_data>
description: <description goes here>
path: <path>

Next, use the CLI to create the data asset:

az ml data create -f <file-name>.yml
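
A corresponding Python SDK (azure-ai-ml v2) sketch for a Folder asset follows; only the asset type changes compared with the File example, and the placeholders are again values you supply:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Connect to the workspace (replace the placeholder values).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription_id>",
    resource_group_name="<resource_group>",
    workspace_name="<workspace>",
)

# uri_folder: the path points at a folder rather than a single file.
my_folder = Data(
    path="<path>",
    type=AssetTypes.URI_FOLDER,
    name="<name_of_data>",
    description="<description goes here>",
)
ml_client.data.create_or_update(my_folder)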

Create a Table asset

You must create a valid MLTable file before you create the asset. Read Authoring MLTable files to learn more about MLTable file and artifact creation.

Important

The path should be a folder that contains a valid MLTable file.

Create a YAML file (<file-name>.yml):

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# The path must point to the **folder** that contains the MLTable artifact (MLTable file + data)
# Supported paths include:
# local: ./<path>
# blob:  https://<account_name>.blob.core.windows.net/<container_name>/<path>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore: azureml://datastores/<data_store_name>/paths/<path>

type: mltable
name: <name_of_data>
description: <description goes here>
path: <path>

Next, create the data asset using the CLI:

az ml data create -f <file-name>.yml
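
A Python SDK (azure-ai-ml v2) sketch for a Table asset follows; as with the YAML example, the path placeholder must point to the folder that contains the MLTable file:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Connect to the workspace (replace the placeholder values).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription_id>",
    resource_group_name="<resource_group>",
    workspace_name="<workspace>",
)

# mltable: the path must point to the folder that contains the MLTable file.
my_table = Data(
    path="<path>",
    type=AssetTypes.MLTABLE,
    name="<name_of_data>",
    description="<description goes here>",
)
ml_client.data.create_or_update(my_table)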

Next steps