Share data across workspaces with registries (preview)
Azure Machine Learning registries enable you to collaborate across workspaces within your organization. Using registries, you can share models, components, environments, and data. Sharing data with registries is currently a preview feature. In this article, you learn how to:
- Create a data asset in the registry.
- Share an existing data asset from a workspace to a registry.
- Use the data asset from registry as input to a model training job in a workspace.
Important
This feature is currently in public preview. This preview version is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Key scenario addressed by data sharing using Azure Machine Learning registry
You may want to share data across multiple teams, projects, or workspaces from a central location. Such data doesn't have sensitive access controls and can be broadly used in the organization.
Examples include:
- A team wants to share a public dataset that is preprocessed and ready to use in experiments.
- Your organization has acquired a particular dataset for a project from an external vendor and wants to make it available to all teams working on the project.
- A team wants to share data assets across workspaces in different regions.
In these scenarios, you can create a data asset in a registry or share an existing data asset from a workspace to a registry. This data asset can then be used across multiple workspaces.
Scenarios NOT addressed by data sharing using Azure Machine Learning registry
- Sharing sensitive data that requires fine-grained access control. You can't create a data asset in a registry to share with a small subset of users/workspaces while the registry is accessible by many other users in the org.
- Sharing data that is available in existing storage that must not be copied, or that is too large or too expensive to copy. Whenever data assets are created in a registry, a copy of the data is ingested into the registry storage so that it can be replicated.
Data asset types supported by Azure Machine Learning registry
Tip
Check out the following canonical scenarios when deciding whether to use `uri_file`, `uri_folder`, or `mltable` for your scenario.
You can create three data asset types:
| Type | V2 API | Canonical scenario |
| --- | --- | --- |
| **File**: Reference a single file | `uri_file` | Read/write a single file. The file can have any format. |
| **Folder**: Reference a single folder | `uri_folder` | You must read/write a directory of Parquet/CSV files into Pandas/Spark. Deep learning with images, text, audio, or video files located in a directory. |
| **Table**: Reference a data table | `mltable` | You have a complex schema subject to frequent changes, or you need a subset of large tabular data. |
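For illustration, here's a minimal Python SDK v2 sketch showing how the three types map to `Data` asset definitions. The asset names and local paths are hypothetical placeholders:

```python
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# File: reference a single file of any format (hypothetical path)
file_asset = Data(
    name="sample-file",
    type=AssetTypes.URI_FILE,
    path="./sample-data/titanic.csv",
)

# Folder: reference a directory of files, such as CSVs or images (hypothetical path)
folder_asset = Data(
    name="sample-folder",
    type=AssetTypes.URI_FOLDER,
    path="./sample-data/",
)

# Table: reference a folder that contains an MLTable file describing the tabular data (hypothetical path)
table_asset = Data(
    name="sample-table",
    type=AssetTypes.MLTABLE,
    path="./sample-mltable/",
)
```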
Paths supported by Azure Machine Learning registry
When you create a data asset, you must specify a path parameter that points to the data location. Currently, the only supported paths are to locations on your local computer.
Tip
"Local" means the local storage for the computer you are using. For example, if you're using a laptop, the local drive. If an Azure Machine Learning compute instance, the "local" drive of the compute instance.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
- Familiarity with Azure Machine Learning registries and Data concepts in Azure Machine Learning.
- An Azure Machine Learning registry to share data. To create a registry, see Learn how to create a registry.
- An Azure Machine Learning workspace. If you don't have one, use the steps in the Quickstart: Create workspace resources article to create one.
Important
The Azure region (location) where you create your workspace must be in the list of supported regions for Azure Machine Learning registry.
- The environment and component created from the How to share models, components, and environments article.
- The Azure CLI and the `ml` extension or the Azure Machine Learning Python SDK v2. To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or compatible) shell. For example, from a Linux system or Windows Subsystem for Linux.
The examples also assume that you have configured defaults for the Azure CLI so that you don't have to specify the parameters for your subscription, workspace, resource group, or location. To set default settings, use the following commands. Replace the following parameters with the values for your configuration:
- Replace `<subscription>` with your Azure subscription ID.
- Replace `<workspace>` with your Azure Machine Learning workspace name.
- Replace `<resource-group>` with the Azure resource group that contains your workspace.
- Replace `<location>` with the Azure region that contains your workspace.

```azurecli
az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>
```

You can see what your current defaults are by using the `az configure -l` command.
Clone examples repository
The code examples in this article are based on the `nyc_taxi_data_regression` sample in the examples repository. To use these files on your development environment, use the following commands to clone the repository and change directories to the example:
```bash
git clone https://github.com/Azure/azureml-examples
cd azureml-examples
```
For the CLI example, change directories to `cli/jobs/pipelines-with-components/nyc_taxi_data_regression` in your local clone of the examples repository.

```bash
cd cli/jobs/pipelines-with-components/nyc_taxi_data_regression
```
Create SDK connection
Tip
This step is only needed when using the Python SDK.
Create a client connection to both the Azure Machine Learning workspace and registry. In the following example, replace the `<...>` placeholder values with the values appropriate for your configuration, such as your Azure subscription ID, workspace name, and registry name:
```python
ml_client_workspace = MLClient(
    credential=credential,
    subscription_id="<workspace-subscription>",
    resource_group_name="<workspace-resource-group>",
    workspace_name="<workspace-name>",
)
print(ml_client_workspace)

ml_client_registry = MLClient(
    credential=credential,
    registry_name="<REGISTRY_NAME>",
    registry_location="<REGISTRY_REGION>",
)
print(ml_client_registry)
```
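The snippet above assumes that `MLClient` has been imported and that a `credential` object already exists. A minimal sketch for setting both up, assuming the `azure-ai-ml` and `azure-identity` packages are installed:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Authenticate with the default Azure credential chain
# (Azure CLI login, environment variables, managed identity, and so on)
credential = DefaultAzureCredential()
```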
Create data in registry
The data asset created in this step is used later in this article when submitting a training job.
Tip
The same CLI command `az ml data create` can be used to create data in a workspace or registry. Running the command with `--workspace-name` creates the data in a workspace, whereas running the command with `--registry-name` creates the data in the registry.
The data source is located in the examples repository that you cloned earlier. Under the local clone, go to the following directory path: `cli/jobs/pipelines-with-components/nyc_taxi_data_regression`. In this directory, create a YAML file named `data-registry.yml` and use the following YAML as the contents of the file:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: transformed-nyc-taxt-data
description: Transformed NYC Taxi data created from local folder.
version: 1
type: uri_folder
path: data_transformed/
```
The `path` value points to the `data_transformed` subdirectory, which contains the data that is shared using the registry.
To create the data in the registry, use the `az ml data create` command. In the following examples, replace `<registry-name>` with the name of your registry.
```azurecli
az ml data create --file data-registry.yml --registry-name <registry-name>
```
If you get an error that data with this name and version already exists in the registry, you can either edit the `version` field in `data-registry.yml` or specify a different version on the CLI that overrides the version value in `data-registry.yml`:
```azurecli
# use shell epoch time as the version
version=$(date +%s)
az ml data create --file data-registry.yml --registry-name <registry-name> --set version=$version
```
Tip
If the `version=$(date +%s)` command doesn't set the `$version` variable in your environment, replace `$version` with a random number.
Save the `name` and `version` of the data from the output of the `az ml data create` command, and use them with the `az ml data show` command to view details for the asset.
```azurecli
az ml data show --name transformed-nyc-taxt-data --version 1 --registry-name <registry-name>
```
Tip
If you used a different data name or version, replace the `--name` and `--version` parameters accordingly.
You can also use `az ml data list --registry-name <registry-name>` to list all data assets in the registry.
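If you're using the Python SDK v2 instead of the CLI, a roughly equivalent sketch for creating the data asset in the registry and viewing it, assuming the `ml_client_registry` client created earlier and the same local `data_transformed/` folder, looks like this:

```python
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Define the data asset from the local folder
data_asset = Data(
    name="transformed-nyc-taxt-data",
    version="1",
    type=AssetTypes.URI_FOLDER,
    description="Transformed NYC Taxi data created from local folder.",
    path="data_transformed/",
)

# Create the data asset in the registry, then view its details
ml_client_registry.data.create_or_update(data_asset)
print(ml_client_registry.data.get(name="transformed-nyc-taxt-data", version="1"))
```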
Create an environment and component in registry
To create an environment and component in the registry, use the steps in the How to share models, components, and environments article. The environment and component are used in the training job in the next section.
Tip
You can use an environment and component from the workspace instead of using ones from the registry.
Run a pipeline job in a workspace using component from registry
When running a pipeline job that uses a component and data from a registry, the compute resources are local to the workspace. In the following example, the job uses the Scikit Learn training component and the data asset created in the previous sections to train a model.
Note
The key aspect is that this pipeline is going to run in a workspace using training data that isn't in the specific workspace. The data is in a registry that can be used with any workspace in your organization. You can run this training job in any workspace you have access to without having to worry about making the training data available in that workspace.
Verify that you are in the `cli/jobs/pipelines-with-components/nyc_taxi_data_regression` directory. Edit the `component` line under the `train_job` section of the `single-job-pipeline.yml` file to refer to the training component, and the `path` under the `training_data` section to refer to the data asset created in the previous sections. The following example shows what `single-job-pipeline.yml` looks like after editing. Replace `<registry-name>` with the name of your registry:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: nyc_taxi_data_regression_single_job
description: Single job pipeline to train regression model based on nyc taxi dataset
jobs:
  train_job:
    type: command
    component: azureml://registries/<registry-name>/components/train_linear_regression_model/versions/1
    compute: azureml:cpu-cluster
    inputs:
      training_data:
        type: uri_folder
        path: azureml://registries/<registry-name>/data/transformed-nyc-taxt-data/versions/1
    outputs:
      model_output:
        type: mlflow_model
      test_data:
```
Warning
- Before running the pipeline job, confirm that the workspace in which you will run the job is in an Azure region that is supported by the registry in which you created the data.
- Confirm that the workspace has a compute cluster named `cpu-cluster`, or edit the `compute` field under `jobs.train_job.compute` to the name of your compute. A sketch for creating a cluster follows this warning.
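If the workspace doesn't have a `cpu-cluster` compute cluster yet, the following Python SDK v2 sketch creates one. The VM size and instance counts are assumptions; adjust them for your subscription and region:

```python
from azure.ai.ml.entities import AmlCompute

# Create a small CPU cluster named "cpu-cluster" in the workspace
cpu_cluster = AmlCompute(
    name="cpu-cluster",
    size="STANDARD_DS3_v2",  # assumed VM size; pick one available in your region
    min_instances=0,
    max_instances=2,
)
ml_client_workspace.compute.begin_create_or_update(cpu_cluster).result()
```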
Run the pipeline job with the `az ml job create` command.

```azurecli
az ml job create --file single-job-pipeline.yml
```
Tip
If you have not configured the default workspace and resource group as explained in the prerequisites section, you will need to specify the `--workspace-name` and `--resource-group` parameters for `az ml job create` to work.
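If you prefer the Python SDK v2, a rough sketch for submitting the same YAML pipeline, assuming the `ml_client_workspace` client created earlier, is:

```python
from azure.ai.ml import load_job

# Load the pipeline definition from the YAML file and submit it to the workspace
pipeline_job = load_job("single-job-pipeline.yml")
submitted_job = ml_client_workspace.jobs.create_or_update(pipeline_job)
print(submitted_job.studio_url)
```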
For more information on running jobs, see the following articles:
Share data from workspace to registry
The following steps show how to share an existing data asset from a workspace to a registry.
First, create a data asset in the workspace. Make sure that you are in the `cli/assets/data` directory. The `local-folder.yml` file located in this directory is used to create a data asset in the workspace. The data specified in this file is available in the `cli/assets/data/sample-data` directory. The following YAML is the contents of the `local-folder.yml` file:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: local-folder-example-titanic
description: Dataset created from local folder.
type: uri_folder
path: sample-data/
```
To create the data asset in the workspace, use the following command:
```azurecli
az ml data create -f local-folder.yml
```
For more information on creating data assets in a workspace, see How to create data assets.
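A roughly equivalent Python SDK v2 sketch for creating the same data asset in the workspace, assuming the `ml_client_workspace` client from earlier, is:

```python
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Create the workspace data asset from the local sample-data folder
titanic_asset = Data(
    name="local-folder-example-titanic",
    description="Dataset created from local folder.",
    type=AssetTypes.URI_FOLDER,
    path="sample-data/",
)
ml_client_workspace.data.create_or_update(titanic_asset)
```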
The data asset created in the workspace can be shared to a registry. From the registry, it can be used in multiple workspaces. The `--share-with-name` and `--share-with-version` parameters are optional; if you don't pass them, the data asset is shared with the same name and version it has in the workspace.

The following example demonstrates using the share command to share a data asset. Replace `<registry-name>` with the name of the registry that the data will be shared to.
```azurecli
az ml data share --name local-folder-example-titanic --version <version-in-workspace> --share-with-name <name-in-registry> --share-with-version <version-in-registry> --registry-name <registry-name>
```
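If you're using the Python SDK v2, recent versions of the `azure-ai-ml` package expose a corresponding share operation on the data operations client. The following is a hedged sketch; the exact parameter names and availability depend on your SDK version:

```python
# Share the workspace data asset to the registry (preview operation; parameter
# names shown are assumptions based on the CLI flags above)
ml_client_workspace.data.share(
    name="local-folder-example-titanic",
    version="<version-in-workspace>",
    share_with_name="<name-in-registry>",
    share_with_version="<version-in-registry>",
    registry_name="<registry-name>",
)
```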