CLI (v2) Automated ML text NER job YAML schema

APPLIES TO: Azure CLI ml extension v2 (current)

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.

Every Azure Machine Learning entity has a schematized YAML representation. You can create a new entity from a YAML configuration file with a .yml or .yaml extension.

This article provides a reference for some syntax concepts you will encounter while configuring these YAML files for NLP text NER jobs.

The source JSON schema can be found at https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLNLPTextNERJob.schema.json

YAML syntax

| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| $schema | string | The location (URL) from which to load the YAML schema. If you author the YAML file with the Azure Machine Learning VS Code extension, including $schema at the top of the file enables schema and resource completions. | | |
| type | const | Required. The type of job. | automl | automl |
| task | const | Required. The type of AutoML task. Task description for NER: each token in a sequence can take one of several possible tags, and the task is to predict the tag of every token in every sequence. An example is extracting domain-specific entities from unstructured text, such as contracts or financial documents. | text_ner | |
| name | string | Name of the job. Must be unique across all jobs in the workspace. If omitted, Azure Machine Learning autogenerates a GUID for the name. | | |
| display_name | string | Display name of the job in the studio UI. Can be non-unique within the workspace. If omitted, Azure Machine Learning autogenerates a human-readable adjective-noun identifier for the display name. | | |
| experiment_name | string | Experiment name to organize the job under. Each job's run record is organized under the corresponding experiment in the studio's "Experiments" tab. If omitted, Azure Machine Learning defaults it to the name of the working directory where the job was created. | | |
| description | string | Description of the job. | | |
| tags | object | Dictionary of tags for the job. | | |
| compute | string | Name of the compute target to execute the job on. To reference an existing compute in the workspace, use the syntax azureml:<compute_name>. | | |
| log_verbosity | number | The level of log verbosity. | not_set, debug, info, warning, error, critical | info |
| primary_metric | string | The metric that AutoML optimizes for model selection. | accuracy | accuracy |
| training_data | object | Required. The data to be used within the job. Unlike multi-class or multi-label tasks, which take .csv format datasets, named entity recognition requires CoNLL format. The file must contain exactly two columns, and in each row the token and the label are separated by a single space. See NER section for more detail. | | |
| validation_data | object | Required. The validation data to be used within the job. Format rules: (1) The file should not start with an empty line. (2) Each line must be an empty line or follow the format {token} {label}, with exactly one space between the token and the label and no white space after the label. (3) All labels must start with I- or B-, or be exactly O (case sensitive). (4) Exactly one empty line between two samples. (5) Exactly one empty line at the end of the file. See data validation section for more detail. | | |
| limits | object | Dictionary of limit configurations of the job. Parameters in this section: max_concurrent_trials, max_nodes, max_trials, timeout_minutes, trial_timeout_minutes. See limits for detail. | | |
| training_parameters | object | Dictionary containing training parameters for the job. Provide an object whose keys are listed in the supported hyperparameters section. Note: Hyperparameters set in training_parameters are fixed across all sweeping runs and thus don't need to be included in the search space. | | |
| search_space | object | Dictionary of the hyperparameter search space. The key is the name of the hyperparameter and the value is the parameter expression. Every parameter that can be fixed via training_parameters can instead be swept over here. See supported hyperparameters for more detail. There are two types of hyperparameters. Discrete hyperparameters are specified as a choice among discrete values; choice can be one or more comma-separated values, a range object, or any arbitrary list object. Advanced discrete hyperparameters can also be specified using a distribution: randint, qlognormal, qnormal, qloguniform, quniform. Continuous hyperparameters are specified as a distribution over a continuous range of values; the currently supported distributions are lognormal, normal, loguniform, uniform. See parameter expressions for the set of possible expressions to use. | | |
| outputs | object | Dictionary of output configurations of the job. The key is a name for the output within the context of the job and the value is the output configuration. | | |
| outputs.best_model | object | Dictionary of output configurations for best model. For more information, see Best model output configuration. | | |
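
As a rough illustration of the two hyperparameter types, the fragment below sketches a search_space with one discrete and one continuous entry. The list layout and the type, values, min_value, and max_value keys mirror the sweeping example later in this article; the specific values and ranges are assumptions chosen only for illustration, not recommended settings.

```yaml
# Fragment of an AutoML text NER job file (illustrative values only).
search_space:
  - learning_rate_scheduler:
      # Discrete hyperparameter: a choice among listed values.
      type: choice
      values: [linear, cosine]
    learning_rate:
      # Continuous hyperparameter: a distribution over a range.
      # The range is an assumption; it only has to stay inside (0, 1).
      type: uniform
      min_value: 0.00001
      max_value: 0.0001
```

Any hyperparameter that should stay constant across trials would go under training_parameters instead of search_space.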

Other syntax used in configurations:

Limits

| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| max_concurrent_trials | integer | The maximum number of trials (child jobs) that are executed in parallel. | | 1 |
| max_trials | integer | The maximum number of trials that an AutoML NLP job can run, each training with a different combination of hyperparameters. | | 1 |
| timeout_minutes | integer | The maximum amount of time, in minutes, that the submitted AutoML job can take to run. After this time, the job is terminated. The default timeout in AutoML NLP jobs is 7 days. | | 10080 |
| trial_timeout_minutes | integer | The maximum amount of time, in minutes, that each trial (child job) in the submitted AutoML job can take to run. After this time, the child job is terminated. | | |
| max_nodes | integer | The maximum number of nodes from the backing compute cluster to use for the job. | | 1 |
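
For orientation, a limits block that sets all five keys might look like the sketch below. Most values are taken from the sweeping example later in this article; trial_timeout_minutes does not appear in that example, so its value here is an assumption.

```yaml
# Fragment of an AutoML text NER job file (illustrative values only).
limits:
  timeout_minutes: 120         # whole job; the default is 10080 (7 days)
  trial_timeout_minutes: 60    # each trial (child job); assumed value
  max_trials: 2                # total number of trials
  max_concurrent_trials: 2     # trials executed in parallel
  max_nodes: 4                 # nodes used from the backing compute cluster
```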

Supported hyperparameters

The following table describes the hyperparameters that AutoML NLP supports.

| Parameter name | Description | Syntax |
| --- | --- | --- |
| gradient_accumulation_steps | The number of backward operations whose gradients are to be summed up before performing one step of gradient descent by calling the optimizer's step function. This is leveraged to use an effective batch size which is gradient_accumulation_steps times larger than the maximum size that fits the GPU. | Must be a positive integer. |
| learning_rate | Initial learning rate. | Must be a float in the range (0, 1). |
| learning_rate_scheduler | Type of learning rate scheduler. | Must choose from linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup. |
| model_name | Name of one of the supported models. | Must choose from bert_base_cased, bert_base_uncased, bert_base_multilingual_cased, bert_base_german_cased, bert_large_cased, bert_large_uncased, distilbert_base_cased, distilbert_base_uncased, roberta_base, roberta_large, distilroberta_base, xlm_roberta_base, xlm_roberta_large, xlnet_base_cased, xlnet_large_cased. |
| number_of_epochs | Number of training epochs. | Must be a positive integer. |
| training_batch_size | Training batch size. | Must be a positive integer. |
| validation_batch_size | Validation batch size. | Must be a positive integer. |
| warmup_ratio | Ratio of total training steps used for a linear warmup from 0 to learning_rate. | Must be a float in the range [0, 1]. |
| weight_decay | Value of weight decay when optimizer is sgd, adam, or adamw. | Must be a float in the range [0, 1]. |
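
A training_parameters block that fixes several of these hyperparameters could be sketched as follows. The key names come from the table above; every value is an assumption picked only to satisfy the stated constraints, not a tuned recommendation.

```yaml
# Fragment of an AutoML text NER job file (illustrative values only).
training_parameters:
  number_of_epochs: 3               # positive integer
  training_batch_size: 16           # positive integer
  validation_batch_size: 16         # positive integer
  gradient_accumulation_steps: 2    # effective batch size: 2 x 16 = 32
  learning_rate: 0.00005            # float in (0, 1)
  learning_rate_scheduler: linear   # one of the listed scheduler types
  warmup_ratio: 0.1                 # float in [0, 1]
  weight_decay: 0.01                # float in [0, 1]
```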

Training or validation data

| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| description | string | The detailed information that describes this input data. | | |
| path | string | The path from which the data is loaded. It can be a file path, a folder path, or a pattern for paths; a pattern allows globbing (* and **) of files and folders containing data. Supported URI types are azureml, https, wasbs, abfss, and adl. For more information on how to use the azureml:// URI format, see core yaml syntax. If the URI doesn't have a scheme (for example, http:, azureml: etc.), it's considered a local reference, and the file it points to is uploaded to the default workspace blob storage as the entity is created. | | |
| mode | string | Dataset delivery mechanism. | direct | direct |
| type | const | To generate NLP models, you need to bring training data in the form of an MLTable. For more information, see preparing data. | mltable | mltable |
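
Put together, the keys above might be used as in the sketch below. The folder names are reused from the examples later in this article, and the description strings are hypothetical.

```yaml
# Fragment of an AutoML text NER job file (illustrative values only).
training_data:
  description: CoNLL-formatted training split    # hypothetical text
  type: mltable
  # No URI scheme, so this is treated as a local reference and uploaded.
  path: ./training-mltable-folder
  mode: direct
validation_data:
  description: CoNLL-formatted validation split  # hypothetical text
  type: mltable
  path: ./validation-mltable-folder
  mode: direct
```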

Best model output configuration

| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| type | string | Required. Type of best model. AutoML allows only mlflow models. | mlflow_model | mlflow_model |
| path | string | Required. URI of the location where the model-artifact file(s) are stored. If this URI doesn't have a scheme (for example, http:, azureml: etc.), then it's considered a local reference and the file it points to is uploaded to the default workspace blob-storage as the entity is created. | | |
| storage_uri | string | The HTTP URL of the Model. Use this URL with az storage copy -s THIS_URL -d DESTINATION_PATH --recursive to download the data. | | |

Remarks

The az ml job command can be used for managing Azure Machine Learning jobs.

Examples

Examples are available in the examples GitHub repository. Examples relevant to text NER jobs are shown below.

YAML: AutoML text NER job

$schema: https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLJob.schema.json

type: automl
experiment_name: dpv2-cli-text-ner
description: A text named entity recognition job using CoNLL 2003 data

compute: azureml:gpu-cluster

task: text_ner
primary_metric: accuracy
log_verbosity: debug

limits:
  timeout_minutes: 60

training_data:
  path: "./training-mltable-folder"
  type: mltable
validation_data:
  type: mltable
  path: "./validation-mltable-folder"

# featurization:
#   dataset_language: "eng"

YAML: AutoML text NER sweeping job

$schema: https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLJob.schema.json

type: automl
experiment_name: dpv2-cli-text-ner
description: A text named entity recognition job using CoNLL 2003 data

compute: azureml:gpu-cluster

task: text_ner
primary_metric: accuracy
log_verbosity: debug

limits:
  timeout_minutes: 120
  max_nodes: 4
  max_trials: 2
  max_concurrent_trials: 2

training_data:
  path: "./training-mltable-folder"
  type: mltable
validation_data:
  type: mltable
  path: "./validation-mltable-folder"

# featurization:
#   dataset_language: "eng"

sweep:
  sampling_algorithm: random
  early_termination:
    type: bandit
    evaluation_interval: 2
    slack_amount: 0.05
    delay_evaluation: 6

search_space:
  - model_name:
      type: choice
      values: [bert-base-cased, roberta-base]
  - model_name:
      type: choice
      values: [distilroberta-base]
    weight_decay:
      type: uniform
      min_value: 0.01
      max_value: 0.1

YAML: AutoML text NER pipeline job

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

description: Pipeline using AutoML Text Ner task

display_name: pipeline-with-text-ner
experiment_name: pipeline-with-automl

settings:
  default_compute: azureml:gpu-cluster

inputs:
  text_ner_training_data:
    type: mltable
    path: ./training-mltable-folder
  text_ner_validation_data:
    type: mltable
    path: ./validation-mltable-folder

jobs:
  preprocessing_node:
    type: command
    component: file:./components/component_preprocessing.yaml
    inputs:
      train_data: ${{parent.inputs.text_ner_training_data}}
      validation_data: ${{parent.inputs.text_ner_validation_data}}
    outputs:
      preprocessed_train_data:
        type: mltable
      preprocessed_validation_data:
        type: mltable
  text_ner_node:
    type: automl
    task: text_ner
    log_verbosity: info
    primary_metric: accuracy
    limits:
      max_trials: 1
      timeout_minutes: 60
    target_column_name: label
    training_data: ${{parent.jobs.preprocessing_node.outputs.preprocessed_train_data}}
    validation_data: ${{parent.jobs.preprocessing_node.outputs.preprocessed_validation_data}}
    # currently need to specify outputs "mlflow_model" explicitly to reference it in following nodes
    outputs:
      best_model:
        type: mlflow_model
  register_model_node:
    type: command
    component: file:./components/component_register_model.yaml
    inputs:
      model_input_path: ${{parent.jobs.text_ner_node.outputs.best_model}}
      model_base_name: paper_categorization_model

Next steps