Access data in a job

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Learn how to read and write data for your jobs with the Azure Machine Learning Python SDK v2 and the Azure Machine Learning CLI extension v2.

Prerequisites

Supported paths

When you provide a data input or output to a job, you need to specify a path parameter that points to the data location. The following table shows the data locations that Azure Machine Learning supports, with examples for the path parameter:

Location                            Examples
A path on your local computer       ./home/username/data/my_data
A path on a public http(s) server   https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv
A path on Azure Storage             https://<account_name>.blob.core.windows.net/<container_name>/<path>
                                    abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
A path on a Datastore               azureml://datastores/<data_store_name>/paths/<path>
A path to a Data Asset              azureml:<my_data>:<version>
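
As an illustration of the same paths with the Python SDK v2, the sketch below passes each path form to an Input; the datastore name, data asset name, and version are placeholders.

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# Path on a public http(s) server
web_file = Input(
    type=AssetTypes.URI_FILE,
    path="https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv",
)

# Path on a datastore (placeholder names)
datastore_folder = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/<data_store_name>/paths/<path>",
)

# Path to a data asset (placeholder name and version)
data_asset = Input(type=AssetTypes.URI_FOLDER, path="azureml:<my_data>:<version>")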

Supported modes

When you run a job with data inputs or outputs, you can specify the mode: for example, whether the data should be mounted read-only or downloaded to the compute target. The following table shows the modes that are supported for each type and input/output combination:

Type         Input/Output   Supported modes
uri_folder   Input          ro_mount, download
uri_file     Input          ro_mount, download
mltable      Input          ro_mount, download, direct, eval_download, eval_mount
uri_folder   Output         rw_mount, upload
uri_file     Output         rw_mount, upload
mltable      Output         rw_mount, upload

Note

eval_download and eval_mount are unique to mltable. While ro_mount is the default mode for an mltable input, there are scenarios where an MLTable can yield files that aren't necessarily co-located with the MLTable file in storage. Alternatively, an mltable can subset or shuffle the data that resides in the storage. That view is only visible if the engine actually evaluates the MLTable file. The eval_download and eval_mount modes provide that view of the files.
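
For reference, a minimal Python SDK v2 sketch of setting a mode on an input follows; the data asset name and version are placeholders.

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# mltable input that the engine evaluates and then mounts read-only
my_table = Input(
    type=AssetTypes.MLTABLE,
    path="azureml:<my_data>:<version>",
    mode=InputOutputModes.EVAL_MOUNT,
)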

Read data in a job

Create a job specification YAML file (<file-name>.yml). Specify in the inputs section of the job:

  1. The type; whether the data is a specific file (uri_file), a folder location (uri_folder), or an mltable.
  2. The path of where your data is located; this can be any of the paths outlined in the Supported paths section.

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

# Possible Paths for Data:
# Blob: https://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>
# Datastore: azureml://datastores/<data_store_name>/paths/<folder>/<file>
# Data Asset: azureml:<my_data>:<version>

command: |
  ls ${{inputs.my_data}}
code: <folder where code is located>
inputs:
  my_data:
    type: <type> # uri_file, uri_folder, mltable
    path: <path>
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster

Next, run the following command in the CLI:

az ml job create -f <file-name>.yml
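
A roughly equivalent job can be sketched with the Python SDK v2; the workspace details and <path> are placeholders you fill in.

from azure.ai.ml import MLClient, command, Input
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# connect to the workspace (placeholder identifiers)
ml_client = MLClient(
    DefaultAzureCredential(), "<subscription_id>", "<resource_group>", "<workspace_name>"
)

# list the contents of the data input, mirroring the YAML example above
job = command(
    command="ls ${{inputs.my_data}}",
    inputs={"my_data": Input(type=AssetTypes.URI_FOLDER, path="<path>")},
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster",
)

ml_client.jobs.create_or_update(job)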

Read V1 data assets

This section outlines how you can read V1 FileDataset and TabularDataset data entities in a V2 job.

Read a FileDataset

Create a job specification YAML file (<file-name>.yml), with the type set to mltable and the mode set to eval_mount:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

command: |
  ls ${{inputs.my_data}}
code: <folder where code is located>
inputs:
  my_data:
    type: mltable
    mode: eval_mount
    path: azureml:<filedataset_name>@latest
environment: azureml:<environment_name>@latest
compute: azureml:cpu-cluster

Next, run the following command in the CLI:

az ml job create -f <file-name>.yml
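
The same FileDataset read can be sketched with the Python SDK v2; the dataset and environment names are placeholders.

from azure.ai.ml import command, Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# read a V1 FileDataset as an mltable input with eval_mount
job = command(
    command="ls ${{inputs.my_data}}",
    inputs={
        "my_data": Input(
            type=AssetTypes.MLTABLE,
            path="azureml:<filedataset_name>@latest",
            mode=InputOutputModes.EVAL_MOUNT,
        )
    },
    environment="<environment_name>@latest",
    compute="cpu-cluster",
)
# submit with ml_client.jobs.create_or_update(job) as shown earlier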

Read a TabularDataset

Create a job specification YAML file (<file-name>.yml), with the type set to mltable and the mode set to direct:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

command: |
  ls ${{inputs.my_data}}
code: <folder where code is located>
inputs:
  my_data:
    type: mltable
    mode: direct
    path: azureml:<tabulardataset_name>@latest
environment: azureml:<environment_name>@latest
compute: azureml:cpu-cluster

Next, run the following command in the CLI:

az ml job create -f <file-name>.yml
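
A Python SDK v2 sketch of the same TabularDataset read follows; the dataset and environment names are placeholders.

from azure.ai.ml import command, Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# in direct mode the input resolves to a reference that your own code reads
# (for example, with the mltable package) rather than a mounted path
job = command(
    command="ls ${{inputs.my_data}}",
    inputs={
        "my_data": Input(
            type=AssetTypes.MLTABLE,
            path="azureml:<tabulardataset_name>@latest",
            mode=InputOutputModes.DIRECT,
        )
    },
    environment="<environment_name>@latest",
    compute="cpu-cluster",
)
# submit with ml_client.jobs.create_or_update(job) as shown earlier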

Write data in a job

In your job, you can write data to your cloud-based storage by using outputs. As the Supported modes section showed, only job outputs can write data, because the output mode can be either rw_mount or upload.

Create a job specification YAML file (<file-name>.yml), with the outputs section populated with the type and the path where you want to write your data:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

# Possible Paths for Data:
# Blob: https://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>
# Datastore: azureml://datastores/<data_store_name>/paths/<folder>/<file>
# Data Asset: azureml:<my_data>:<version>

code: src
command: >-
  python prep.py 
  --raw_data ${{inputs.raw_data}} 
  --prep_data ${{outputs.prep_data}}
inputs:
  raw_data: 
    type: <type> # uri_file, uri_folder, mltable
    path: <path>
outputs:
  prep_data: 
    type: <type> # uri_file, uri_folder, mltable
    path: <path>
environment: azureml:<environment_name>@latest
compute: azureml:cpu-cluster

Next, create a job using the CLI:

az ml job create --file <file-name>.yml
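
A Python SDK v2 sketch of the same read-and-write job follows; the paths, datastore name, and environment name are placeholders, and prep.py is the script referenced in the YAML above.

from azure.ai.ml import command, Input, Output
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# prep.py reads raw_data and writes prep_data back to the datastore path
job = command(
    code="./src",
    command="python prep.py --raw_data ${{inputs.raw_data}} --prep_data ${{outputs.prep_data}}",
    inputs={"raw_data": Input(type=AssetTypes.URI_FOLDER, path="<path>")},
    outputs={
        "prep_data": Output(
            type=AssetTypes.URI_FOLDER,
            path="azureml://datastores/<data_store_name>/paths/<path>",
            mode=InputOutputModes.RW_MOUNT,
        )
    },
    environment="<environment_name>@latest",
    compute="cpu-cluster",
)
# submit with ml_client.jobs.create_or_update(job) as shown earlier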

Data in pipelines

If you're working with Azure Machine Learning pipelines, you can read data into and move data between pipeline components with the Azure Machine Learning CLI v2 extension or the Python SDK v2.

Azure Machine Learning CLI v2

The following YAML file demonstrates how to use the output data from one component as the input for another component of the pipeline using the Azure Machine Learning CLI v2 extension:

APPLIES TO: Azure CLI ml extension v2 (current)

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

display_name: 3b_pipeline_with_data
description: Pipeline with 3 component jobs with data dependencies

settings:
  default_compute: azureml:cpu-cluster

outputs:
  final_pipeline_output:
    mode: rw_mount

jobs:
  component_a:
    type: command
    component: ./componentA.yml
    inputs:
      component_a_input: 
        type: uri_folder
        path: ./data

    outputs:
      component_a_output: 
        mode: rw_mount
  component_b:
    type: command
    component: ./componentB.yml
    inputs:
      component_b_input: ${{parent.jobs.component_a.outputs.component_a_output}}
    outputs:
      component_b_output: 
        mode: rw_mount
  component_c:
    type: command
    component: ./componentC.yml
    inputs:
      component_c_input: ${{parent.jobs.component_b.outputs.component_b_output}}
    outputs:
      component_c_output: ${{parent.outputs.final_pipeline_output}}
      #  mode: upload
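
As with the single jobs above, you can submit this pipeline YAML with the CLI, for example:

az ml job create -f <pipeline-file-name>.yml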

Python SDK v2

The following example defines a pipeline containing three nodes and moves data between each node.

  • prepare_data_node that loads the images and labels from the Fashion MNIST dataset into mnist_train.csv and mnist_test.csv.
  • train_node that trains a CNN model with Keras using the training data, mnist_train.csv.
  • score_node that scores the model using the test data, mnist_test.csv.
from azure.ai.ml.dsl import pipeline

# define a pipeline containing 3 nodes: prepare data node, train node, and score node
# (the components and compute targets referenced below are defined earlier in the example)
@pipeline(
    default_compute=cpu_compute_target,
)
def image_classification_keras_minist_convnet(pipeline_input_data):
    """E2E image classification pipeline with keras using python sdk."""
    prepare_data_node = prepare_data_component(input_data=pipeline_input_data)

    train_node = keras_train_component(
        input_data=prepare_data_node.outputs.training_data
    )
    train_node.compute = gpu_compute_target

    score_node = keras_score_component(
        input_data=prepare_data_node.outputs.test_data,
        input_model=train_node.outputs.output_model,
    )


# create a pipeline
pipeline_job = image_classification_keras_minist_convnet(pipeline_input_data=fashion_ds)
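
Assuming an MLClient named ml_client has already been created for the workspace (as in the earlier SDK sketches), the pipeline job can then be submitted, for example:

# submit the pipeline job to the workspace (the experiment name is illustrative)
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples"
)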

Next steps