Debug jobs and monitor training progress (preview)

Important

Items marked (preview) in this article are currently in public preview. The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Machine learning model training is usually an iterative process that requires significant experimentation. With the Azure Machine Learning interactive job experience, data scientists can use the Azure Machine Learning Python SDKv2, the Azure Machine Learning CLIv2, or Azure Machine Learning studio to access the container where their job is running. Once inside the job container, users can iterate on training scripts, monitor training progress, or debug the job remotely, as they typically would on their local machines. Jobs can be accessed through training applications including JupyterLab, TensorBoard, and VS Code, or by connecting to the job container directly via SSH.

Interactive training is supported on Azure Machine Learning compute clusters and Azure Arc-enabled Kubernetes clusters.

Prerequisites

  • Review getting started with training on Azure Machine Learning.
  • To use this feature in Azure Machine Learning Studio, enable the "Debug & monitor your training jobs" flight via the preview panel.
  • To use VS Code, follow this guide to set up the Azure Machine Learning extension.
  • Make sure your job environment has the openssh-server and ipykernel ~=6.0 packages installed (all Azure Machine Learning curated training environments have these packages installed by default).
  • Interactive applications can't be enabled on distributed training runs where the distribution type is anything other than PyTorch, TensorFlow, or MPI. Custom distributed training setups (multi-node training configured without one of these distribution frameworks) aren't currently supported.
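
As a rough way to verify the package prerequisite above, the following sketch could be run inside the job environment. It is only an illustration: the helper names are hypothetical, the version check ignores pre-release tags, and looking for an `sshd` binary is just an assumed proxy for "openssh-server is installed".

```python
import shutil
from importlib import metadata


def is_compatible_with_6_0(version: str) -> bool:
    """Approximate the specifier ipykernel ~=6.0, i.e. any 6.x release.

    Simplification (assumption): pre-release suffixes are not handled.
    """
    try:
        major = int(version.split(".")[0])
    except ValueError:
        return False
    return major == 6


def check_prerequisites() -> dict:
    """Report whether ipykernel ~=6.0 and an sshd binary appear to be present."""
    try:
        ipykernel_version = metadata.version("ipykernel")
    except metadata.PackageNotFoundError:
        ipykernel_version = None

    # sshd often lives in /usr/sbin, which may not be on PATH for every user.
    sshd_path = shutil.which("sshd") or shutil.which("sshd", path="/usr/sbin")

    return {
        "ipykernel_version": ipykernel_version,
        "ipykernel_ok": ipykernel_version is not None
        and is_compatible_with_6_0(ipykernel_version),
        "sshd_found": sshd_path is not None,
    }


if __name__ == "__main__":
    print(check_prerequisites())
```

If either check reports a problem in a custom environment, installing the missing package in the environment definition is the likely fix; curated environments should already pass.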

Interact with your job container

By specifying interactive applications at job creation, you can connect directly to the container on the compute node where your job is running. Once you have access to the job container, you can test or debug your job in the exact same environment where it would run. You can also use VS Code to attach to the running process and debug as you would locally.

Enable during job submission

  1. Create a new job from the left navigation pane in the studio portal.

  2. Choose Compute cluster or Attached compute (Kubernetes) as the compute type, choose the compute target, and specify how many nodes you need in Instance count.

Screenshot of selecting a compute location for a job.

  3. Follow the wizard to choose the environment in which you want to start the job.

  4. In the Job settings step, add your training code (and input/output data) and reference it in your command to make sure it's mounted to your job.

Screenshot of reviewing a drafted job and completing the creation.

You can put sleep <specific time> at the end of your command to specify the amount of time you want to reserve the compute resource. The format follows:

  • sleep 1s
  • sleep 1m
  • sleep 1h
  • sleep 1d

You can also use sleep infinity, which keeps the job alive indefinitely.

Note

If you use sleep infinity, you will need to manually cancel the job to let go of the compute resource (and stop billing).

  5. Select the training applications you want to use to interact with the job.

Screenshot of selecting a training application for the user to use for a job.

  6. Review and create the job.

If you don't see the above options, make sure you have enabled the "Debug & monitor your training jobs" flight via the preview panel.
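
The same choices made in the wizard (a command that ends in sleep, plus the selected training applications) could also be expressed as a job specification. The sketch below only builds and prints such a specification with the standard library; it doesn't submit anything. The field names under `services` (`type`, `nodes`, `log_dir`, `ssh_public_keys`) are assumptions modeled on the CLIv2 job YAML and may differ in the preview schema, and the service names, `./src`, and `outputs/tblogs` are hypothetical placeholders.

```python
import json
import re
from typing import Optional

# Durations in the forms listed in this article: 1s, 1m, 1h, 1d, or infinity.
_SLEEP_PATTERN = re.compile(r"^(\d+[smhd]|infinity)$")


def with_sleep(command: str, duration: str) -> str:
    """Append `sleep <duration>` so the compute stays reserved after the script.

    `&&` runs sleep only if the script succeeds; use `;` instead if you want
    the container kept alive after a failure as well.
    """
    if not _SLEEP_PATTERN.match(duration):
        raise ValueError(f"unsupported sleep duration: {duration!r}")
    return f"{command} && sleep {duration}"


def build_job_spec(
    command: str,
    environment: str,
    compute: str,
    instance_count: int = 1,
    ssh_public_key: Optional[str] = None,
) -> dict:
    """Build a job spec dict with interactive services (assumed schema)."""
    services = {
        "my_jupyter_lab": {"type": "jupyter_lab", "nodes": "all"},
        "my_vs_code": {"type": "vs_code", "nodes": "all"},
        "my_tensor_board": {"type": "tensor_board", "log_dir": "outputs/tblogs"},
    }
    if ssh_public_key:
        services["my_ssh"] = {
            "type": "ssh",
            "ssh_public_keys": ssh_public_key,
            "nodes": "all",
        }
    return {
        "code": "./src",  # hypothetical folder containing the training code
        "command": command,
        "environment": environment,
        "compute": compute,
        "resources": {"instance_count": instance_count},
        "services": services,
    }


if __name__ == "__main__":
    spec = build_job_spec(
        command=with_sleep("python train.py", "1h"),
        environment="azureml:<environment-name>@latest",
        compute="azureml:<compute-cluster-name>",
    )
    print(json.dumps(spec, indent=2))
```

Because JSON is also readable by YAML parsers, the printed document could in principle be saved and used as a job file, but check it against the current job schema before relying on it.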

Connect to endpoints

To interact with your running job, select the Debug and monitor button on the job details page.

Screenshot of interactive jobs debug and monitor panel location.

Selecting an application in the panel opens it in a new tab. You can access the applications only when they are in Running status, and only the job owner is authorized to access them. If you're training on multiple nodes, you can pick the specific node you would like to interact with.

Screenshot of interactive jobs right panel information. Information content will vary depending on the user's data.

It might take a few minutes to start the job and the training applications specified during job creation. If you don't see the above options, make sure you have enabled the "Debug & monitor your training jobs" flight via the preview panel.

Interact with the applications

When you select an endpoint to interact with your job, you're taken to the user container under your working directory, where you can access your code, inputs, outputs, and logs. If you run into any issues while connecting to the applications, you can find the interactive capability and application logs under system_logs->interactive_capability on the Outputs + logs tab.

Screenshot of interactive jobs interactive logs panel location.

  • You can open a terminal from JupyterLab and start interacting within the job container. You can also iterate on your training script directly in JupyterLab.

    Screenshot of interactive jobs Jupyter lab content panel.

  • You can also interact with the job container within VS Code. To attach a debugger to a job during job submission and pause execution, see the section Attach a debugger to a job.

    Screenshot of interactive jobs VS Code panel when first opened. This shows the sample python file that was created to print two lines.

  • If you have logged TensorFlow events for your job, you can use TensorBoard to monitor the metrics while your job is running.

    Screenshot of interactive jobs tensorboard panel when first opened. This information will vary depending upon customer data

End job

Once you're done with interactive training, you can go to the job details page and cancel the job, which releases the compute resource. Alternatively, use az ml job cancel -n <your job name> in the CLI or ml_client.jobs.begin_cancel("<job name>") in the SDK (ml_client.jobs.cancel in pre-1.0 releases of the SDK).

Screenshot of interactive jobs cancel job option and its location for user selection
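
For scripted cleanup, the CLI command above could be wrapped in a small helper. This is a minimal sketch, assuming the Azure CLI with the ml extension is installed and you're already signed in. The function names are hypothetical; the optional `--resource-group` and `--workspace-name` arguments are only needed if no CLI defaults are configured.

```python
import subprocess
from typing import List, Optional


def build_cancel_command(
    job_name: str,
    resource_group: Optional[str] = None,
    workspace: Optional[str] = None,
) -> List[str]:
    """Build the argv for `az ml job cancel -n <job name>`."""
    argv = ["az", "ml", "job", "cancel", "-n", job_name]
    if resource_group:
        argv += ["--resource-group", resource_group]
    if workspace:
        argv += ["--workspace-name", workspace]
    return argv


def cancel_job(
    job_name: str,
    resource_group: Optional[str] = None,
    workspace: Optional[str] = None,
) -> None:
    """Run the cancel command; raises CalledProcessError if the CLI fails."""
    subprocess.run(
        build_cancel_command(job_name, resource_group, workspace), check=True
    )


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is cancelled by accident.
    print(" ".join(build_cancel_command("<your job name>")))
```

Calling `cancel_job` is most useful for jobs started with sleep infinity, since those keep the compute reserved until they are cancelled.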

Attach a debugger to a job

To submit a job with a debugger attached and the execution paused, you can use debugpy and VS Code (debugpy must be installed in your job environment).

  1. During job submission (through the UI, the CLIv2, or the SDKv2), use the debugpy command to run your Python script. For example, the screenshot below shows a sample command that uses debugpy to attach the debugger to a TensorFlow script (tfevents.py can be replaced with the name of your training script).

Screenshot of interactive jobs configuration of debugpy

  2. Once the job has been submitted, connect to VS Code and select the built-in debugger.

    Screenshot of interactive jobs location of open debugger on the left side panel

  3. Use the "Remote Attach" debug configuration to attach to the submitted job and pass in the path and port you configured in your job submission command. You can also find this information on the job details page.

    Screenshot of interactive jobs completed jobs

    Screenshot of interactive jobs add a remote attach button

  4. Set breakpoints and walk through your job execution as you would in your local debugging workflow.

    Screenshot of location of an example breakpoint that is set in the Visual Studio Code editor

Note

If you use debugpy to start your job, your job will not execute unless you attach the debugger in VS Code and execute the script. If this is not done, the compute will be reserved until the job is cancelled.
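
The job command described in step 1 might look like the string produced by the sketch below. It relies on debugpy's command-line options `--listen` and `--wait-for-client`; the helper name is hypothetical, and the host `localhost` and port `5678` are assumed defaults rather than values this article prescribes. Use the same port in the "Remote Attach" configuration.

```python
import shlex


def debugpy_command(script: str, host: str = "localhost", port: int = 5678) -> str:
    """Build a job command that starts `script` under debugpy and pauses.

    --wait-for-client blocks execution until a debugger attaches, which is
    why the job doesn't run until you connect from VS Code.
    """
    return " ".join(
        [
            "python",
            "-m",
            "debugpy",
            "--listen",
            f"{host}:{port}",
            "--wait-for-client",
            shlex.quote(script),  # quote in case the path contains spaces
        ]
    )


if __name__ == "__main__":
    # tfevents.py stands in for your training script, as in this article.
    print(debugpy_command("tfevents.py"))
```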

Next steps