Debug jobs and monitor training progress

2024-08-28

Machine learning model training is an iterative process that requires significant experimentation. With the Azure Machine Learning interactive job experience, data scientists can use the Azure Machine Learning Python SDK, the Azure Machine Learning CLI, or Azure Machine Learning studio to access the container where their job is running. Once inside the job container, they can iterate on training scripts, monitor training progress, or debug the job remotely, as they would on their local machines. You can interact with jobs through training applications including JupyterLab, TensorBoard, and VS Code, or by connecting to the job container directly via SSH.

Interactive training is supported on Azure Machine Learning compute clusters and Azure Arc-enabled Kubernetes clusters.

Prerequisites

Review getting started with training on Azure Machine Learning.
To use VS Code, follow this guide to set up the Azure Machine Learning extension.
Make sure your job environment has the openssh-server and ipykernel ~=6.0 packages installed (all Azure Machine Learning curated training environments have these packages installed by default).
Interactive applications can't be enabled on distributed training runs where the distribution type is anything other than PyTorch, TensorFlow, or MPI. Custom distributed training setup (configuring multi-node training without using the above distribution frameworks) isn't currently supported.
To use SSH, you need an SSH key pair. You can use the ssh-keygen -f "<filepath>" command to generate a public and private key pair.
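The SSH prerequisite can be scripted. The following is a minimal sketch, not part of the Azure Machine Learning tooling: `keygen_command` and `read_public_key` are hypothetical helper names, and the sketch assumes `ssh-keygen` writes the public key next to the private key with a `.pub` suffix (its default behavior).

```python
from pathlib import Path
from typing import List


def keygen_command(private_key_path: str) -> List[str]:
    """Build the ssh-keygen invocation named in the prerequisites."""
    return ["ssh-keygen", "-f", private_key_path]


def read_public_key(private_key_path: str) -> str:
    """Return the public key text to supply to the job's SSH service."""
    # Assumption: the public key lives at "<private key path>.pub".
    public_key_path = Path(private_key_path + ".pub")
    return public_key_path.read_text(encoding="utf-8").strip()
```

You could run the command with `subprocess.run(keygen_command("<filepath>"), check=True)` and pass the result of `read_public_key("<filepath>")` wherever the job definition asks for the public key.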

Interact with your job container

By specifying interactive applications at job creation, you can connect directly to the container on the compute node where your job is running. Once you have access to the job container, you can test or debug your job in the exact same environment where it would run. You can also use VS Code to attach to the running process and debug as you would locally.

Enable during job submission

Create a new job from the left pane in the studio portal.
Choose Compute cluster or Attached compute (Kubernetes) as the compute type, choose the compute target, and specify how many nodes you need in Instance count.

Screenshot of selecting a compute location for a job.

Follow the wizard to choose the environment you want to use for the job.
In the Training script step, add your training code (and input/output data) and reference it in your command to make sure it's mounted to your job.

Screenshot of reviewing a drafted job and completing the creation.

You can put sleep <specific time> at the end of your command to specify the amount of time you want to reserve the compute resource. The format follows:

sleep 1s
sleep 1m
sleep 1h
sleep 1d

You can also use the sleep infinity command that would keep the job alive indefinitely.

Note

If you use sleep infinity, you will need to manually cancel the job to let go of the compute resource (and stop billing).
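As an illustration of the sleep suffix, here is a small sketch that appends a reservation period to a job command. The helper name `with_reservation` is hypothetical, and joining the two commands with `;` is an assumption; this article only says to put `sleep` at the end of the command.

```python
import re

# Durations listed in this article: an integer plus s, m, h, or d, or "infinity".
_DURATION = re.compile(r"^(\d+[smhd]|infinity)$")


def with_reservation(command: str, duration: str) -> str:
    """Append `sleep <duration>` so the compute stays reserved after the script ends."""
    if not _DURATION.match(duration):
        raise ValueError(f"unsupported sleep duration: {duration!r}")
    # Assumption: ';' is used so that sleep runs even if the script exits with an error.
    return f"{command}; sleep {duration}"
```

For example, `with_reservation("python main.py", "1h")` produces `python main.py; sleep 1h`. One trade-off of `;` is that the shell reports the exit status of `sleep` rather than that of the script; use `&&` instead if you only want to keep the compute when the script succeeds.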

In Compute settings, expand the option for Training applications. Select at least one training application you want to use to interact with the job. If you don't select an application, the debug feature won't be available.

Screenshot of selecting a training application for the user to use for a job.

Review and create the job.

Define the interactive services you want to use for your job. Replace the compute name with your own value. If you want to use a custom environment, follow the examples in this tutorial to create one.

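To configure interactive services via the SDK, import the job service classes from the azure.ai.ml.entities package. The ml_client object used below is assumed to be an authenticated MLClient for your workspace.

from azure.ai.ml import command
from azure.ai.ml.entities import (
    JupyterLabJobService,
    SshJobService,
    TensorBoardJobService,
    VsCodeJobService,
)

command_job = command(
    code="./src",  # local path where the code is stored
    command="python main.py",  # you can append a command like "sleep 1h" to keep the compute resource reserved after the script finishes running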
    environment="AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest",
    compute="<name-of-compute>",
    services={
      "My_jupyterlab": JupyterLabJobService(
        nodes="all" # For distributed jobs, use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node. Values are "all", or compute node index (for ex. "0", "1" etc.)
      ),
      "My_vscode": VsCodeJobService(
        nodes="all"
      ),
      "My_tensorboard": TensorBoardJobService(
        nodes="all",
        log_dir="output/tblogs"  # relative path of Tensorboard logs (same as in your training script)         
      ),
      "My_ssh": SshJobService(
        ssh_public_keys="<add-public-key>",
        nodes="all"  
      ),
    }
)

# submit the command
returned_job = ml_client.jobs.create_or_update(command_job)

The services section specifies the training applications you want to interact with.

Submit your training job. For more details on how to train with the Python SDK, check out this article.

Create a job YAML file, job.yaml, using the following sample content. Replace the compute name with your own value. If you want to use a custom environment, follow the examples in this tutorial to create one.

code: src 
command: 
  python train.py 
  # you can append a command like "sleep 1h" to keep the compute resource reserved after the script finishes running.
environment: azureml:AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu:41
compute: azureml:<your compute name>
services:
    my_vs_code:
      type: vs_code
      nodes: all # For distributed jobs, use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node. Values are "all", or compute node index (for ex. "0", "1" etc.)
    my_tensor_board:
      type: tensor_board
      log_dir: "output/tblogs" # relative path of Tensorboard logs (same as in your training script)
      nodes: all
    my_jupyter_lab:
      type: jupyter_lab
      nodes: all
    my_ssh:
      type: ssh
      ssh_public_keys: <paste the entire pub key content>
      nodes: all

The services section specifies the training applications you want to interact with.
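If you generate job definitions programmatically, the services section can be modeled as a plain mapping. This is a rough sketch under stated assumptions: `service_entry` is a hypothetical helper, the allowed types are only the four shown in the sample above, and the validation rules for `nodes` ("all" or a node index) are taken from the comment in the sample rather than from a schema.

```python
from typing import Any, Dict, Union

# Service types that appear in the sample job.yaml above.
SERVICE_TYPES = {"vs_code", "tensor_board", "jupyter_lab", "ssh"}


def service_entry(
    service_type: str, nodes: Union[str, int] = "all", **settings: Any
) -> Dict[str, Any]:
    """Build one entry of the `services` mapping, checking type and nodes."""
    if service_type not in SERVICE_TYPES:
        raise ValueError(f"unknown service type: {service_type!r}")
    # `nodes` is either "all" or a non-negative compute node index such as "0".
    if nodes != "all" and not str(nodes).isdigit():
        raise ValueError(f"nodes must be 'all' or a node index, got {nodes!r}")
    entry: Dict[str, Any] = {"type": service_type, "nodes": str(nodes)}
    entry.update(settings)  # for example log_dir or ssh_public_keys
    return entry


services = {
    "my_vs_code": service_entry("vs_code"),
    "my_tensor_board": service_entry("tensor_board", log_dir="output/tblogs"),
    "my_jupyter_lab": service_entry("jupyter_lab", nodes=0),
}
```

The resulting dictionary mirrors the structure of the `services` section and can be written out with the YAML library of your choice.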

To submit your training job, run the command az ml job create --file <path to your job yaml file> --workspace-name <your workspace name> --resource-group <your resource group name> --subscription <sub-id>. For more details on running a job via the CLI, check out this article.
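If you drive the CLI from a script, the submission command can be assembled as an argument list. A minimal sketch: `job_create_command` is a hypothetical helper, and it uses only the flags shown in the command above.

```python
import shlex
from typing import List


def job_create_command(
    job_file: str, workspace: str, resource_group: str, subscription: str
) -> List[str]:
    """Assemble the `az ml job create` invocation as an argument list."""
    return [
        "az", "ml", "job", "create",
        "--file", job_file,
        "--workspace-name", workspace,
        "--resource-group", resource_group,
        "--subscription", subscription,
    ]


# Print the command for review before running it.
print(shlex.join(job_create_command("job.yaml", "my-ws", "my-rg", "my-sub")))
```

Passing the list to `subprocess.run(..., check=True)` would execute it, assuming the Azure CLI with the ml extension is installed and you're signed in. The workspace, resource group, and subscription values in the example are placeholders.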

Connect to endpoints

To interact with your running job, select the button Debug and monitor on the job details page.

Screenshot of interactive jobs debug and monitor panel location.

Clicking the applications in the panel opens a new tab for the applications. You can access the applications only when they are in Running status and only the job owner is authorized to access the applications. If you're training on multiple nodes, you can pick the specific node you would like to interact with.

Screenshot of interactive jobs right panel information. Information content varies depending on the user's data.

It might take a few minutes to start the job and the training applications specified during job creation.

Once the job is submitted, you can use ml_client.jobs.show_services("<job name>", <compute node index>) to view the interactive service endpoints.
To connect via SSH to the container where the job is running, run the command az ml job connect-ssh --name <job-name> --node-index <compute node index> --private-key-file-path <path to private key>. To set up the Azure Machine Learning CLI, follow this guide.

You can find the reference documentation for the SDK here.

You can access the applications only when they are in Running status and only the job owner is authorized to access the applications. If you're training on multiple nodes, you can pick the specific node you would like to interact with by passing in the node index.
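Because the applications can take a few minutes to start, a script may need to wait before the endpoints are available. The sketch below polls `ml_client.jobs.show_services` with the same arguments as above. `wait_for_services` is a hypothetical helper, and it assumes the call returns an empty or `None` result until services are up, which this article doesn't confirm; adjust the readiness check to what you observe.

```python
import time
from typing import Any, Callable


def wait_for_services(
    ml_client: Any,
    job_name: str,
    node_index: int = 0,
    timeout_s: float = 600,
    poll_s: float = 15,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll show_services until it returns a non-empty result or time runs out."""
    waited = 0.0
    while True:
        # Assumption: an empty or None result means the services aren't ready yet.
        services = ml_client.jobs.show_services(job_name, node_index)
        if services:
            return services
        if waited >= timeout_s:
            raise TimeoutError(f"no services reported for {job_name!r} after {timeout_s}s")
        sleep(poll_s)
        waited += poll_s
```

The `sleep` parameter exists only so the wait can be replaced in tests; in normal use, leave it at its default.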

When the job is running, run the command az ml job show-services --name <job name> --node-index <compute node index> to get the URL to the applications. The endpoint URL shows under services in the output. For VS Code, you must copy and paste the provided URL in your browser.
To connect via SSH to the container where the job is running, run the command az ml job connect-ssh --name <job-name> --node-index <compute node index> --private-key-file-path <path to private key>.

You can find the reference documentation for these commands here.
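The two CLI commands above can also be built from a script. This is a minimal sketch with hypothetical helper names; it uses only the flags shown in this article and, like those commands, relies on your default workspace and resource group being configured for the CLI.

```python
from typing import List


def show_services_command(job_name: str, node_index: int = 0) -> List[str]:
    """Assemble `az ml job show-services` for one compute node."""
    return [
        "az", "ml", "job", "show-services",
        "--name", job_name,
        "--node-index", str(node_index),
    ]


def connect_ssh_command(
    job_name: str, private_key_path: str, node_index: int = 0
) -> List[str]:
    """Assemble `az ml job connect-ssh` for one compute node."""
    return [
        "az", "ml", "job", "connect-ssh",
        "--name", job_name,
        "--node-index", str(node_index),
        "--private-key-file-path", private_key_path,
    ]
```

Either list can be passed to `subprocess.run`; `connect-ssh` opens an interactive session, so run it from a terminal rather than capturing its output.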

Interact with the applications

When you select an endpoint to interact with your job, you're taken to the user container under your working directory, where you can access your code, inputs, outputs, and logs. If you run into any issues while connecting to the applications, you can find the interactive capability and application logs in system_logs->interactive_capability under the Outputs + logs tab.

Screenshot of interactive jobs interactive logs panel location.

You can open a terminal from JupyterLab and start interacting within the job container. You can also iterate directly on your training script with JupyterLab.
You can also interact with the job container within VS Code. To attach a debugger to a job during job submission and pause execution, see the Attach a debugger to a job section.

Note

Private link-enabled workspaces are not currently supported when interacting with the job container with VS Code.

If you have logged TensorFlow events for your job, you can use TensorBoard to monitor the metrics when your job is running.

End job

Once you're done with the interactive training, go to the job details page and cancel the job to release the compute resource. Alternatively, use az ml job cancel -n <your job name> in the CLI or ml_client.jobs.begin_cancel("<job name>") in the SDK.
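Because a forgotten interactive job keeps the compute reserved, scripted sessions benefit from cancelling in a `finally` block. The sketch below shows one way to do that with a context manager. `interactive_session` is a hypothetical helper, and `cancel` is whatever cancellation call you use, for example a wrapper around `az ml job cancel -n <your job name>`.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def interactive_session(job_name: str, cancel: Callable[[str], object]) -> Iterator[str]:
    """Yield the job name and always cancel the job on exit, even after an error."""
    try:
        yield job_name
    finally:
        # Runs on normal exit and on exceptions, so the compute is released.
        cancel(job_name)
```

Used as `with interactive_session("<job name>", my_cancel): ...`, the job is cancelled when the block ends, whether or not the work inside it raised an exception.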

Screenshot of interactive jobs cancel job option and its location for user selection

Attach a debugger to a job

To submit a job with a debugger attached and the execution paused, you can use debugpy and VS Code (debugpy must be installed in your job environment).

Note

Private link-enabled workspaces are not currently supported when attaching a debugger to a job in VS Code.

During job submission (through the UI, the CLI, or the SDK), use the debugpy command to run your Python script. For example, the following screenshot shows a sample command that uses debugpy to attach the debugger for a TensorFlow script (tfevents.py can be replaced with the name of your training script).

Screenshot of interactive jobs configuration of debugpy

Once the job has been submitted, connect to VS Code and select the built-in debugger.
Use the "Remote Attach" debug configuration to attach to the submitted job and pass in the path and port you configured in your job submission command. You can also find this information on the job details page.
Set breakpoints and walk through your job execution as you would in your local debugging workflow.

Note

If you use debugpy to start your job, your job will not execute unless you attach the debugger in VS Code and execute the script. If this is not done, the compute will be reserved until the job is cancelled.
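Since the sample command is only shown as a screenshot, here is a hedged sketch of what such a command and the matching attach configuration might look like. It uses debugpy's `--listen` and `--wait-for-client` command-line options; the port 5678, the `localhost` host, the helper names, and the `"debugpy"` configuration type (older VS Code Python extensions use `"python"`) are assumptions, not values taken from this article.

```python
from typing import Any, Dict


def debugpy_command(script: str, port: int = 5678, host: str = "localhost") -> str:
    """Build a job command that starts the script paused until a debugger attaches."""
    return f"python -m debugpy --listen {host}:{port} --wait-for-client {script}"


def remote_attach_config(
    port: int = 5678, host: str = "localhost", remote_root: str = "."
) -> Dict[str, Any]:
    """Build a VS Code launch.json entry for the Remote Attach configuration."""
    return {
        "name": "Python: Remote Attach",
        "type": "debugpy",  # assumption: newer Python Debugger extension
        "request": "attach",
        "connect": {"host": host, "port": port},
        "pathMappings": [
            {"localRoot": "${workspaceFolder}", "remoteRoot": remote_root}
        ],
    }


print(debugpy_command("tfevents.py"))
```

The port in the attach configuration must match the one passed to `--listen`. Because of `--wait-for-client`, the script doesn't start until the debugger attaches, which is the behavior the preceding note describes.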

Next steps

Learn more about how and where to deploy a model.
