Debug jobs and monitor training progress

Machine learning model training is an iterative process and requires significant experimentation. With the Azure Machine Learning interactive job experience, data scientists can use the Azure Machine Learning Python SDK, the Azure Machine Learning CLI, or the Azure Machine Learning studio to access the container where their job is running. Once the job container is accessed, users can iterate on training scripts, monitor training progress, or debug the job remotely, as they typically do on their local machines. You can interact with a job through different training applications, including JupyterLab, TensorBoard, and VS Code, or by connecting to the job container directly over SSH.

Interactive training is supported on Azure Machine Learning compute clusters and Azure Arc-enabled Kubernetes clusters.

Prerequisites

  • Review getting started with training on Azure Machine Learning.
  • To use VS Code, set up the Azure Machine Learning extension for VS Code.
  • Make sure your job environment has the openssh-server and ipykernel ~=6.0 packages installed (all Azure Machine Learning curated training environments have these packages installed by default).
  • Interactive applications can't be enabled on distributed training runs where the distribution type is anything other than PyTorch, TensorFlow, or MPI. Custom distributed training setups (configuring multi-node training without using these distribution frameworks) aren't currently supported.
  • To use SSH, you need an SSH key pair. You can use the ssh-keygen -f "<filepath>" command to generate a public and private key pair.

Interact with your job container

By specifying interactive applications at job creation, you can connect directly to the container on the compute node where your job is running. Once you have access to the job container, you can test or debug your job in the exact same environment where it would run. You can also use VS Code to attach to the running process and debug as you would locally.
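
If you prefer to configure this from the Python SDK instead of the studio, the following is a minimal sketch assuming the azure-ai-ml (SDK v2) package and its job service entities. The source folder, script, environment, compute, and service names are placeholders to adapt to your workspace, and exact class or parameter names can vary between SDK versions.

```python
# Minimal sketch: submit a command job with interactive applications enabled.
# All "<...>" values and the ./src folder are placeholders.
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import (
    JupyterLabJobService,
    SshJobService,
    TensorBoardJobService,
    VsCodeJobService,
)
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

job = command(
    code="./src",                                  # hypothetical folder containing train.py
    command="python train.py && sleep 1h",         # sleep keeps the container alive for debugging
    environment="<environment-name>@<version>",    # must include openssh-server and ipykernel~=6.0
    compute="<compute-cluster-name>",
    services={
        "my_jupyterlab": JupyterLabJobService(),
        "my_tensorboard": TensorBoardJobService(log_dir="outputs/tblogs"),  # hypothetical log dir
        "my_ssh": SshJobService(ssh_public_keys="<public key contents>"),
        "my_vscode": VsCodeJobService(),
    },
)

returned_job = ml_client.jobs.create_or_update(job)
```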

Enable during job submission

  1. Create a new job from the left navigation pane in the studio portal.

  2. Choose Compute cluster or Attached compute (Kubernetes) as the compute type, choose the compute target, and specify how many nodes you need in Instance count.

Screenshot of selecting a compute location for a job.

  3. Follow the wizard to choose the environment in which you want to start the job.

  4. In the Job settings step, add your training code (and any input/output data) and reference it in your command to make sure it's mounted to your job.

Screenshot of reviewing a drafted job and completing the creation.

You can put sleep <specific time> at the end of your command to specify the amount of time for which you want to reserve the compute resource. The format is:

  • sleep 1s
  • sleep 1m
  • sleep 1h
  • sleep 1d

You can also use sleep infinity to keep the job alive indefinitely.

Note

If you use sleep infinity, you need to manually cancel the job to release the compute resource (and stop billing).
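
For example, assuming a hypothetical train.py training script, the job command might look like one of the following:

```python
# Hypothetical command strings: the trailing sleep keeps the container alive
# for interactive debugging after the training script finishes.
job_command = "python train.py && sleep 2h"        # reserve the node for 2 hours after training
job_command = "python train.py && sleep infinity"  # keep the node until the job is cancelled
```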

  5. Select at least one training application you want to use to interact with the job. If you don't select an application, the debug feature won't be available.

Screenshot of selecting a training application for the user to use for a job.

  6. Review and create the job.

Connect to endpoints

To interact with your running job, select Debug and monitor on the job details page.

Screenshot of interactive jobs debug and monitor panel location.

Selecting an application in the panel opens it in a new tab. You can access the applications only when they're in Running status, and only the job owner is authorized to access them. If you're training on multiple nodes, you can pick the specific node you want to interact with.

Screenshot of interactive jobs right panel information. Information content varies depending on the user's data.

It might take a few minutes to start the job and the training applications specified during job creation.

Interact with the applications

When you select one of the endpoints to interact with your job, you're taken to the user container under your working directory, where you can access your code, inputs, outputs, and logs. If you run into any issues while connecting to the applications, you can find the interactive capability and application logs under system_logs->interactive_capability on the Outputs + logs tab.

Screenshot of interactive jobs interactive logs panel location.

  • You can open a terminal from JupyterLab and start interacting within the job container. You can also directly iterate on your training script with JupyterLab.

    Screenshot of interactive jobs Jupyter lab content panel.

  • You can also interact with the job container within VS Code. To attach a debugger to a job during job submission and pause execution, see Attach a debugger to a job.

    Screenshot of interactive jobs VS Code panel when first opened. This shows the sample python file that was created to print two lines.

  • If you logged TensorFlow events for your job, you can use TensorBoard to monitor the metrics while your job is running.

    Screenshot of interactive jobs tensorboard panel when first opened. This information varies depending upon customer data

End job

Once you're done with the interactive training, you can go to the job details page to cancel the job, which releases the compute resource. Alternatively, use az ml job cancel -n <your job name> in the CLI or ml_client.jobs.begin_cancel("<job name>") in the SDK.
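
If you're working from the SDK, a minimal cancellation sketch (assuming the azure-ai-ml package; the placeholder identifiers, and the exact method name across SDK versions, are yours to verify) looks like this:

```python
# Minimal sketch: cancel a running job to release the compute node.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

# begin_cancel returns a poller; result() waits until cancellation completes.
ml_client.jobs.begin_cancel("<job name>").result()
```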

Screenshot of interactive jobs cancel job option and its location for user selection

Attach a debugger to a job

To submit a job with a debugger attached and execution paused, you can use debugpy and VS Code (debugpy must be installed in your job environment).

  1. During job submission (through the UI, the CLI, or the SDK), use debugpy to run your Python script. For example, the following screenshot shows a sample command that uses debugpy to attach the debugger for a TensorFlow script (replace tfevents.py with the name of your training script). A minimal sketch of such a command is shown after these steps.

Screenshot of interactive jobs configuration of debugpy

  2. Once the job is submitted, connect to VS Code and select the built-in debugger.

    Screenshot of interactive jobs location of open debugger on the left side panel

  3. Use the "Remote Attach" debug configuration to attach to the submitted job and pass in the path and port you configured in your job submission command. You can also find this information on the job details page.

    Screenshot of interactive jobs completed jobs

    Screenshot of interactive jobs add a remote attach button

  4. Set breakpoints and walk through your job execution as you would in your local debugging workflow.

    Screenshot of location of an example breakpoint that is set in the Visual Studio Code editor
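
As a reference for step 1, here's a minimal sketch of a job command that wraps a training script with debugpy. The script name (tfevents.py) and port (5678) are illustrative, and the VS Code "Remote Attach" configuration in step 3 should target the same port.

```python
# Illustrative job command: debugpy pauses execution until a debugger attaches.
# debugpy must be installed in the job environment.
job_command = "python -m debugpy --listen 0.0.0.0:5678 --wait-for-client tfevents.py"
```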

Note

If you use debugpy to start your job, the job won't execute unless you attach the debugger in VS Code and run the script. If you don't, the compute remains reserved until the job is canceled.

Next steps