Access data in a job
APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)
Learn how to read and write data for your jobs with the Azure Machine Learning Python SDK v2 and the Azure Machine Learning CLI extension v2.
Prerequisites
- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.
- An Azure Machine Learning workspace.
Supported paths
When you provide a data input or output to a job, you must specify a `path` parameter that points to the data location. The following table shows the data locations that Azure Machine Learning supports, with examples for the `path` parameter:
Location | Examples | Notes |
---|---|---|
A path on your local computer | `./home/username/data/my_data` | |
A path on a public http(s) server | `https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv` | An https path that points to a folder isn't supported, because https isn't a filesystem. Use another format (wasbs/abfss/adl) for folder-type data. |
A path on Azure Storage | `wasbs://<containername>@<accountname>.blob.core.windows.net/<path_to_data>/` <br> `abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>` <br> `adl://<accountname>.azuredatalakestore.net/<path_to_data>/` | |
A path on a Datastore | `azureml://datastores/<data_store_name>/paths/<path>` | |
A path to a Data Asset | `azureml:<my_data>:<version>` | |
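For reference, the same path formats can be supplied to a job input with the Python SDK v2. The sketch below is illustrative only; it assumes the `azure-ai-ml` package is installed, and the container, account, datastore, and asset names are placeholders:

```python
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# Each supported path format is passed as the `path` of an Input.
# All <...> values below are placeholders, not real resources.
local_input = Input(type=AssetTypes.URI_FOLDER, path="./data/my_data")

https_input = Input(
    type=AssetTypes.URI_FILE,
    path="https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv",
)

blob_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="wasbs://<containername>@<accountname>.blob.core.windows.net/<path_to_data>/",
)

datastore_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/<data_store_name>/paths/<path>",
)

data_asset_input = Input(type=AssetTypes.URI_FILE, path="azureml:<my_data>:<version>")
```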
Supported modes
When you run a job with data inputs/outputs, you can specify the mode - for example, whether you want the data to be read-only mounted or downloaded to the compute target. The following table shows the possible modes for the different type/input/output combinations:
Type | Input/Output | upload | download | ro_mount | rw_mount | direct | eval_download | eval_mount |
---|---|---|---|---|---|---|---|---|
uri_folder | Input | | ✓ | ✓ | | ✓ | | |
uri_file | Input | | ✓ | ✓ | | ✓ | | |
mltable | Input | | ✓ | ✓ | | ✓ | ✓ | ✓ |
uri_folder | Output | ✓ | | | ✓ | ✓ | | |
uri_file | Output | ✓ | | | ✓ | ✓ | | |
mltable | Output | ✓ | | | ✓ | ✓ | | |
Note
`eval_download` and `eval_mount` are unique to `mltable`. While `ro_mount` is the default mode for an MLTable, there are scenarios where an MLTable can yield files that aren't necessarily co-located with the MLTable file in storage. Alternatively, an `mltable` can subset or shuffle the data that resides in storage. That view is visible only if the engine actually evaluates the MLTable file. The `eval_download` and `eval_mount` modes provide that view of the files.
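As a sketch, the mode can also be set on an input with the Python SDK v2. The datastore path and MLTable asset name below are placeholder assumptions:

```python
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# Read-only mount of a folder on a datastore (placeholder names).
mounted_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/<data_store_name>/paths/<path>",
    mode=InputOutputModes.RO_MOUNT,
)

# eval_mount asks the engine to evaluate the MLTable file first, then mount
# the files that the evaluated table actually references.
evaluated_input = Input(
    type=AssetTypes.MLTABLE,
    path="azureml:<my_mltable>:<version>",
    mode=InputOutputModes.EVAL_MOUNT,
)
```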
Read data in a job
Create a job specification YAML file (`<file-name>.yml`). In the `inputs` section of the job, specify:

- The `type`; whether the data is a specific file (`uri_file`), a folder location (`uri_folder`), or an `mltable`.
- The `path` of where your data is located; the path can be any of those outlined in the Supported paths section.
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

# Possible Paths for Data:
# Blob: wasbs://<containername>@<accountname>.blob.core.windows.net/<folder>/<file>
# Datastore: azureml://datastores/<datastore_name>/paths/<folder>/<file>
# Data Asset: azureml:<my_data>:<version>

command: |
  ls ${{inputs.my_data}}
code: <folder where code is located>
inputs:
  my_data:
    type: <type> # uri_file, uri_folder, mltable
    path: <path>
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
```
Next, create the job using the CLI:

```azurecli
az ml job create -f <file-name>.yml
```
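Alternatively, here's a sketch of the equivalent job with the Python SDK v2. It assumes placeholder values for the subscription, resource group, workspace, data path, and compute name:

```python
from azure.ai.ml import MLClient, command, Input
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Connect to the workspace (all identifiers below are placeholders).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription_id>",
    resource_group_name="<resource_group>",
    workspace_name="<workspace>",
)

# Define a command job that lists the contents of the mounted input.
job = command(
    command="ls ${{inputs.my_data}}",
    inputs={"my_data": Input(type=AssetTypes.URI_FOLDER, path="<path>")},
    environment="azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster",  # placeholder compute cluster name
)

ml_client.jobs.create_or_update(job)
```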
Read V1 data assets
This section outlines how you can read V1 `FileDataset` and `TabularDataset` data entities in a V2 job.
Read a FileDataset
Create a job specification YAML file (`<file-name>.yml`), with the type set to `mltable` and the mode set to `eval_mount`:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

command: |
  ls ${{inputs.my_data}}
code: <folder where code is located>
inputs:
  my_data:
    type: mltable
    mode: eval_mount
    path: azureml:<filedataset_name>@latest
environment: azureml:<environment_name>@latest
compute: azureml:cpu-cluster
```
Next, create the job using the CLI:

```azurecli
az ml job create -f <file-name>.yml
```
Read a TabularDataset
Create a job specification YAML file (`<file-name>.yml`), with the type set to `mltable` and the mode set to `direct`:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

command: |
  ls ${{inputs.my_data}}
code: <folder where code is located>
inputs:
  my_data:
    type: mltable
    mode: direct
    path: azureml:<tabulardataset_name>@latest
environment: azureml:<environment_name>@latest
compute: azureml:cpu-cluster
```
Next, create the job using the CLI:

```azurecli
az ml job create -f <file-name>.yml
```
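For reference, here's a hedged Python SDK v2 sketch of the same two V1 patterns. The dataset, environment, and compute names are placeholders:

```python
from azure.ai.ml import command, Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# V1 FileDataset: read it as an mltable and let the engine evaluate it (eval_mount).
file_dataset_input = Input(
    type=AssetTypes.MLTABLE,
    path="azureml:<filedataset_name>@latest",
    mode=InputOutputModes.EVAL_MOUNT,
)

# V1 TabularDataset: read it as an mltable and pass it through directly (direct).
tabular_dataset_input = Input(
    type=AssetTypes.MLTABLE,
    path="azureml:<tabulardataset_name>@latest",
    mode=InputOutputModes.DIRECT,
)

job = command(
    command="ls ${{inputs.my_data}}",
    inputs={"my_data": file_dataset_input},
    environment="azureml:<environment_name>@latest",
    compute="cpu-cluster",  # placeholder compute cluster name
)
```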
Write data in a job
In your job, you can write data to your cloud-based storage by using outputs. As the Supported modes section showed, only job outputs can write data, because their mode can be `rw_mount` or `upload`.
Create a job specification YAML file (`<file-name>.yml`), with the `outputs` section populated with the type and path of where you want to write your data:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

# Possible Paths for Data:
# Blob: wasbs://<containername>@<accountname>.blob.core.windows.net/<folder>/<file>
# Datastore: azureml://datastores/<datastore_name>/paths/<folder>/<file>
# Data Asset: azureml:<my_data>:<version>

code: src
command: >-
  python prep.py
  --raw_data ${{inputs.raw_data}}
  --prep_data ${{outputs.prep_data}}
inputs:
  raw_data:
    type: <type> # uri_file, uri_folder, mltable
    path: <path>
outputs:
  prep_data:
    type: <type> # uri_file, uri_folder, mltable
    path: <path>
environment: azureml:<environment_name>@latest
compute: azureml:cpu-cluster
```
Next, create a job using the CLI:

```azurecli
az ml job create --file <file-name>.yml
```
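The same read-and-write pattern can be sketched with the Python SDK v2. The `src` folder, `prep.py` script, and paths mirror the YAML above and are placeholder assumptions:

```python
from azure.ai.ml import command, Input, Output
from azure.ai.ml.constants import AssetTypes

job = command(
    code="./src",  # placeholder folder containing prep.py
    command=(
        "python prep.py "
        "--raw_data ${{inputs.raw_data}} "
        "--prep_data ${{outputs.prep_data}}"
    ),
    inputs={
        "raw_data": Input(type=AssetTypes.URI_FOLDER, path="<path>"),
    },
    outputs={
        # Write the prepared data to a folder on a datastore (placeholder names).
        "prep_data": Output(
            type=AssetTypes.URI_FOLDER,
            path="azureml://datastores/<datastore_name>/paths/<folder>",
        ),
    },
    environment="azureml:<environment_name>@latest",
    compute="cpu-cluster",  # placeholder compute cluster name
)
```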
Data in pipelines
If you're working with Azure Machine Learning pipelines, you can read data into and move data between pipeline components with the Azure Machine Learning CLI v2 extension or the Python SDK v2.
Azure Machine Learning CLI v2
The following YAML file demonstrates how to use the output data from one component as the input for another component of the pipeline using the Azure Machine Learning CLI v2 extension:
APPLIES TO: Azure CLI ml extension v2 (current)
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

display_name: 3b_pipeline_with_data
description: Pipeline with 3 component jobs with data dependencies

settings:
  default_compute: azureml:cpu-cluster

outputs:
  final_pipeline_output:
    mode: rw_mount

jobs:
  component_a:
    type: command
    component: ./componentA.yml
    inputs:
      component_a_input:
        type: uri_folder
        path: ./data
    outputs:
      component_a_output:
        mode: rw_mount
  component_b:
    type: command
    component: ./componentB.yml
    inputs:
      component_b_input: ${{parent.jobs.component_a.outputs.component_a_output}}
    outputs:
      component_b_output:
        mode: rw_mount
  component_c:
    type: command
    component: ./componentC.yml
    inputs:
      component_c_input: ${{parent.jobs.component_b.outputs.component_b_output}}
    outputs:
      component_c_output: ${{parent.outputs.final_pipeline_output}}
      # mode: upload
```
Python SDK v2
The following example defines a pipeline containing three nodes and moves data between each node.
- `prepare_data_node` that loads the image and labels from the Fashion MNIST data set into `mnist_train.csv` and `mnist_test.csv`.
- `train_node` that trains a CNN model with Keras using the training data, `mnist_train.csv`.
- `score_node` that scores the model using the test data, `mnist_test.csv`.
```python
# define a pipeline containing 3 nodes: Prepare data node, train node, and score node
@pipeline(
    default_compute=cpu_compute_target,
)
def image_classification_keras_minist_convnet(pipeline_input_data):
    """E2E image classification pipeline with keras using python sdk."""
    prepare_data_node = prepare_data_component(input_data=pipeline_input_data)

    train_node = keras_train_component(
        input_data=prepare_data_node.outputs.training_data
    )
    train_node.compute = gpu_compute_target

    score_node = keras_score_component(
        input_data=prepare_data_node.outputs.test_data,
        input_model=train_node.outputs.output_model,
    )


# create a pipeline
pipeline_job = image_classification_keras_minist_convnet(pipeline_input_data=fashion_ds)
```
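After the pipeline is defined, you submit it through the workspace handle. A minimal sketch, assuming an `MLClient` named `ml_client` has already been created (for example, as in the earlier SDK snippet) and that the experiment name is an arbitrary placeholder:

```python
# Submit the pipeline job to the workspace and stream its logs until completion.
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples"
)
ml_client.jobs.stream(pipeline_job.name)
```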