Share data across workspaces with registries (preview)
Azure Machine Learning registry enables you to collaborate across workspaces within your organization. Using registries, you can share models, components, environments and data. Sharing data with registries is currently a preview feature. In this article, you learn how to:
- Create a data asset in the registry.
- Share an existing data asset from a workspace to a registry.
- Use the data asset from the registry as input to a model training job in a workspace.
This feature is currently in public preview. This preview version is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.
Key scenario addressed by data sharing using Azure Machine Learning registry
You may want to have data shared across multiple teams, projects, or workspaces in a central location. Such data doesn't have sensitive access controls and can be broadly used in the organization.
- A team wants to share a public dataset that is preprocessed and ready to use in experiments.
- Your organization has acquired a particular dataset for a project from an external vendor and wants to make it available to all teams working on a project.
- A team wants to share data assets across workspaces in different regions.
In these scenarios, you can create a data asset in a registry or share an existing data asset from a workspace to a registry. This data asset can then be used across multiple workspaces.
Scenarios NOT addressed by data sharing using Azure Machine Learning registry
- Sharing sensitive data that requires fine-grained access control. You can't create a data asset in a registry to share with a small subset of users or workspaces while the registry is accessible to many other users in the organization.
- Sharing data that is available in existing storage and must not be copied, or that is too large or too expensive to copy. Whenever a data asset is created in a registry, a copy of the data is ingested into the registry storage so that it can be replicated.
Data asset types supported by Azure Machine Learning registry
Check out the following canonical scenarios when deciding if you want to use mltable for your scenario.
You can create three data asset types:
| Type | Canonical scenarios |
|---|---|
| File: Reference a single file | Read/write a single file - the file can have any format. |
| Folder: Reference a single folder | You must read/write a directory of parquet/CSV files into Pandas/Spark. Deep learning with image, text, audio, or video files located in a directory. |
| Table: Reference a data table | You have a complex schema subject to frequent changes, or you need a subset of large tabular data. |
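As a rough illustration of the three types, the following Python sketch suggests a type for a local path. The function name `suggest_asset_type` is hypothetical and not part of any Azure library. The sketch assumes the type strings `uri_file`, `uri_folder`, and `mltable`, and assumes that a table asset is a folder containing a file named `MLTable`.

```python
import tempfile
from pathlib import Path


def suggest_asset_type(path: str) -> str:
    """Suggest a data asset type string for a local path (illustrative only)."""
    p = Path(path)
    if p.is_file():
        return "uri_file"
    if p.is_dir():
        # Assumption: a table asset is a folder that contains an "MLTable" file.
        return "mltable" if (p / "MLTable").is_file() else "uri_folder"
    raise FileNotFoundError(f"No such local path: {path}")


# Demo in a temporary directory so no real data is touched.
results = []
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("a,b\n1,2\n", encoding="utf-8")
    results.append(suggest_asset_type(str(sample)))  # a single file
    results.append(suggest_asset_type(tmp))          # a plain folder
    (Path(tmp) / "MLTable").write_text("paths: []\n", encoding="utf-8")
    results.append(suggest_asset_type(tmp))          # a folder with an MLTable file
print(results)
```

The helper only inspects the local file system. Which type fits best still depends on how the data is consumed, as described in the scenarios above.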
Paths supported by Azure Machine Learning registry
When you create a data asset, you must specify a path parameter that points to the data location. Currently, the only supported paths are to locations on your local computer.
"Local" means the local storage of the computer you are using. For example, if you're using a laptop, it's the laptop's local drive. If you're using an Azure Machine Learning compute instance, it's the "local" drive of the compute instance.
Before following the steps in this article, make sure you have the following prerequisites:
- An Azure Machine Learning registry to share data. To create a registry, see Learn how to create a registry.
- An Azure Machine Learning workspace. If you don't have one, use the steps in the Quickstart: Create workspace resources article to create one.
- The Azure region (location) where you create your workspace must be in the list of supported regions for Azure Machine Learning registry.
- The environment and component created from the How to share models, components, and environments article.
- The Azure CLI and the ml extension, or the Azure Machine Learning Python SDK v2:
To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).
The CLI examples in this article assume that you are using the Bash (or compatible) shell. For example, from a Linux system or Windows Subsystem for Linux.
The examples also assume that you have configured defaults for the Azure CLI so that you don't have to specify the parameters for your subscription, workspace, resource group, or location. To set default settings, use the following commands. Replace the following parameters with the values for your configuration:
- <subscription> with your Azure subscription ID.
- <workspace> with your Azure Machine Learning workspace name.
- <resource-group> with the Azure resource group that contains your workspace.
- <location> with the Azure region that contains your workspace.
az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>
You can see what your current defaults are by using the az configure -l command.
Clone examples repository
The code examples in this article are based on the nyc_taxi_data_regression sample in the examples repository. To use these files in your development environment, use the following commands to clone the repository and change directories to the example:
git clone https://github.com/Azure/azureml-examples
For the CLI example, change directories to cli/jobs/pipelines-with-components/nyc_taxi_data_regression in your local clone of the examples repository.
Create SDK connection
This step is only needed when using the Python SDK.
Create a client connection to both the Azure Machine Learning workspace and the registry. In the following example, replace the <...> placeholder values with the values appropriate for your configuration, such as your Azure subscription ID, workspace name, and registry name:
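from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()  # default Azure credential chain
ml_client_workspace = MLClient(credential=credential,
    subscription_id="<workspace-subscription>",
    resource_group_name="<workspace-resource-group>",
    workspace_name="<workspace-name>")
# registry_name and registry_location identify the registry
ml_client_registry = MLClient(credential=credential,
    registry_name="<registry-name>", registry_location="<registry-location>")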
Create data in registry
The data asset created in this step is used later in this article when submitting a training job.
The same CLI command, az ml data create, can be used to create data in a workspace or a registry. Running the command with --workspace-name creates the data in a workspace, whereas running it with --registry-name creates the data in the registry.
The data source is located in the examples repository that you cloned earlier. Under the local clone, go to the following directory path: cli/jobs/pipelines-with-components/nyc_taxi_data_regression. In this directory, create a YAML file named data-registry.yml and use the following YAML as the contents of the file:
description: Transformed NYC Taxi data created from local folder.
The path value points to the data_transformed subdirectory, which contains the data that is shared using the registry.
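As a hedged sketch, the following Python snippet writes a data-registry.yml file using only the fields this article mentions: the name and version used in the az ml data show example, the description, and a path to the data_transformed subdirectory. The `type: uri_folder` line is an assumption based on the asset being created from a folder, and the helper name `write_data_yaml` is hypothetical.

```python
import tempfile
from pathlib import Path

# Fields taken from this article: name, version, description, path.
# Assumption: "type: uri_folder", because the asset is created from a folder.
DATA_REGISTRY_YAML = """\
name: transformed-nyc-taxt-data
description: Transformed NYC Taxi data created from local folder.
version: 1
type: uri_folder
path: data_transformed/
"""


def write_data_yaml(directory: str, filename: str = "data-registry.yml") -> Path:
    """Write the sketch YAML into `directory` and return the file path."""
    target = Path(directory) / filename
    target.write_text(DATA_REGISTRY_YAML, encoding="utf-8")
    return target


# Demo: write to a temporary directory so nothing in the clone is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    written = write_data_yaml(tmp)
    contents = written.read_text(encoding="utf-8")
    print(contents)
```

To use the sketch for real, pass the nyc_taxi_data_regression directory of your clone instead of a temporary directory, and check the generated file against the sample in the examples repository.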
To create the data in the registry, use the az ml data create command. In the following examples, replace <registry-name> with the name of your registry.
az ml data create --file data-registry.yml --registry-name <registry-name>
If you get an error that data with this name and version already exists in the registry, you can either edit the version field in data-registry.yml or specify a different version on the CLI that overrides the version value in data-registry.yml.
# use shell epoch time as the version
version=$(date +%s)
az ml data create --file data-registry.yml --registry-name <registry-name> --set version=$version
If the version=$(date +%s) command doesn't set the $version variable in your environment, replace $version with a random number.
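If the shell substitution isn't available, the same epoch-time version can be computed in Python. This is a minimal sketch that only assembles the command shown above; the helper name `build_create_command` and the registry name `my-registry` are placeholders for illustration.

```python
import shlex
import time
from typing import List, Optional


def build_create_command(registry_name: str,
                         yaml_file: str = "data-registry.yml",
                         version: Optional[str] = None) -> List[str]:
    """Assemble the az ml data create command with a version override."""
    if version is None:
        # Same idea as `date +%s`: seconds since the Unix epoch.
        version = str(int(time.time()))
    return [
        "az", "ml", "data", "create",
        "--file", yaml_file,
        "--registry-name", registry_name,
        "--set", f"version={version}",
    ]


command = build_create_command("my-registry", version="1700000000")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where the Azure CLI and the ml extension are installed.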
Note the name and version of the data from the output of the az ml data create command, and use them with the az ml data show command to view details for the asset.
az ml data show --name transformed-nyc-taxt-data --version 1 --registry-name <registry-name>
If you used a different data name or version, replace the --name and --version parameters accordingly.
You can also use az ml data list --registry-name <registry-name> to list all data assets in the registry.
Create an environment and component in registry
To create an environment and component in the registry, use the steps in the How to share models, components, and environments article. The environment and component are used in the training job in the next section.
You can use an environment and component from the workspace instead of using ones from the registry.
Run a pipeline job in a workspace using component from registry
When running a pipeline job that uses a component and data from a registry, the compute resources are local to the workspace. In the following example, the job uses the Scikit Learn training component and the data asset created in the previous sections to train a model.
The key aspect is that this pipeline runs in a workspace using training data that isn't stored in that workspace. The data is in a registry that can be used with any workspace in your organization. You can run this training job in any workspace you have access to without having to worry about making the training data available in that workspace.
Verify that you are in the cli/jobs/pipelines-with-components/nyc_taxi_data_regression directory. In the single-job-pipeline.yml file, edit the component section under the train_job section to refer to the training component, and edit the training_data section to refer to the data asset created in the previous sections. The following example shows what single-job-pipeline.yml looks like after editing. Replace <registry_name> with the name of your registry:
description: Single job pipeline to train regression model based on nyc taxi dataset
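A pipeline refers to a registry data asset through an asset URI. The sketch below builds such a URI for the data asset created earlier. It assumes the `azureml://registries/<registry>/data/<name>/versions/<version>` format, and the helper name `registry_data_uri` and the registry name `my-registry` are illustrative placeholders.

```python
def registry_data_uri(registry_name: str, name: str, version: str) -> str:
    """Build an asset URI for a data asset stored in a registry.

    Assumption: the URI follows the
    azureml://registries/<registry>/data/<name>/versions/<version> pattern.
    """
    return f"azureml://registries/{registry_name}/data/{name}/versions/{version}"


# Name and version match the asset shown in the az ml data show example.
uri = registry_data_uri("my-registry", "transformed-nyc-taxt-data", "1")
print(uri)
```

In the pipeline YAML, a value of this shape would be the path of the training_data input. Verify the exact format against the sample file in the examples repository.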
- Before running the pipeline job, confirm that the workspace in which you run the job is in an Azure region that is supported by the registry in which you created the data.
- Confirm that the workspace has a compute cluster with the name cpu-cluster, or edit jobs.train_job.compute with the name of your compute.
Run the pipeline job with the az ml job create command.
az ml job create --file single-job-pipeline.yml
If you haven't configured the default workspace and resource group as explained in the prerequisites section, you need to specify the --workspace-name and --resource-group parameters for the az ml job create command to work.
Share data from workspace to registry
The following steps show how to share an existing data asset from a workspace to a registry.
First, create a data asset in the workspace. Make sure that you are in the cli/assets/data directory. The local-folder.yml file located in this directory is used to create a data asset in the workspace. The data specified in this file is available in the cli/assets/data/sample-data directory. The following YAML is the contents of the local-folder.yml file:
description: Dataset created from local folder.
To create the data asset in the workspace, use the following command:
az ml data create -f local-folder.yml
For more information on creating data assets in a workspace, see How to create data assets.
The data asset created in the workspace can be shared to a registry. From the registry, it can be used in multiple workspaces. The example passes the --share-with-name and --share-with-version parameters to the share command. These parameters are optional. If you don't pass them, the data is shared with the same name and version as in the workspace.
The following example demonstrates using the share command to share a data asset. Replace <registry-name> with the name of the registry that the data will be shared to.
az ml data share --name local-folder-example-titanic --version <version-in-workspace> --share-with-name <name-in-registry> --share-with-version <version-in-registry> --registry-name <registry-name>
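The optional parameters can be illustrated with a small Python sketch that assembles the share command and adds --share-with-name and --share-with-version only when they are given, following the description above. The helper name `build_share_command` and the values `my-registry` and `titanic-shared` are hypothetical.

```python
import shlex
from typing import List, Optional


def build_share_command(name: str,
                        version: str,
                        registry_name: str,
                        share_with_name: Optional[str] = None,
                        share_with_version: Optional[str] = None) -> List[str]:
    """Assemble the az ml data share command from the example above."""
    cmd = [
        "az", "ml", "data", "share",
        "--name", name,
        "--version", version,
        "--registry-name", registry_name,
    ]
    # Per this article, both parameters are optional; when they are omitted,
    # the asset keeps its workspace name and version in the registry.
    if share_with_name is not None:
        cmd += ["--share-with-name", share_with_name]
    if share_with_version is not None:
        cmd += ["--share-with-version", share_with_version]
    return cmd


default_share = build_share_command("local-folder-example-titanic", "1", "my-registry")
renamed_share = build_share_command(
    "local-folder-example-titanic", "1", "my-registry",
    share_with_name="titanic-shared", share_with_version="2",
)
print(shlex.join(default_share))
print(shlex.join(renamed_share))
```

Keeping the command as a list of arguments avoids shell quoting problems when an asset name contains spaces or special characters.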