Vulnerability management for Azure Machine Learning

08/28/2024

Vulnerability management involves detecting, assessing, mitigating, and reporting on any security vulnerabilities that exist in an organization's systems and software. Vulnerability management is a shared responsibility between you and Microsoft.

This article discusses these responsibilities and outlines the vulnerability management controls that Azure Machine Learning provides. You learn how to keep your service instance and applications up to date with the latest security updates, and how to minimize the window of opportunity for attackers.

Microsoft-managed VM images

Azure Machine Learning manages host OS virtual machine (VM) images for Azure Machine Learning compute instances, Azure Machine Learning compute clusters, and Data Science Virtual Machines. Images are updated monthly, with the following details:

- For each new VM image version, the latest updates are sourced from the original publisher of the OS. Using the latest updates helps ensure that you get all applicable OS-related patches. For Azure Machine Learning, the publisher is Canonical for all the Ubuntu images. These images are used for Azure Machine Learning compute instances, compute clusters, and Data Science Virtual Machines.
- VM images are updated monthly.
- In addition to patches that the original publisher applies, Azure Machine Learning updates system packages when updates are available.
- Azure Machine Learning checks and validates any machine learning packages that might require an upgrade. In most circumstances, new VM images contain the latest package versions.
- All VM images are built on secure subscriptions that run vulnerability scanning regularly. Azure Machine Learning flags any unaddressed vulnerabilities and fixes them within the next release.
- The frequency is a monthly interval for most images. For compute instances, the image release is aligned with the release cadence of the Azure Machine Learning SDK that's preinstalled in the environment.

In addition to the regular release cadence, Azure Machine Learning applies hotfixes if vulnerabilities surface. Microsoft rolls out hotfixes within 72 hours for Azure Machine Learning compute clusters and within a week for compute instances.

Note

The host OS is not the OS version that you might specify for an environment when you're training or deploying a model. Environments run inside Docker. Docker runs on the host OS.

Microsoft-managed container images

Base Docker images that Azure Machine Learning maintains get security patches frequently to address newly discovered vulnerabilities.

Azure Machine Learning releases updates for supported images every two weeks to address vulnerabilities. As a commitment, we aim to have no vulnerabilities older than 30 days in the latest version of supported images.

Patched images are released under a new immutable tag and an updated :latest tag. Using the :latest tag or pinning to a particular image version might be a tradeoff between security and environment reproducibility for your machine learning job.
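To make the tradeoff between a floating tag and a pinned tag concrete, the following minimal sketch classifies an image reference as floating (no tag, or `:latest`) or pinned (an explicit tag or a digest). The function name `image_tag_kind` and the image references in the usage comments are hypothetical and are not part of Azure Machine Learning.

```shell
# Classify an image reference as "floating" (no tag or :latest) or "pinned".
# image_tag_kind is a hypothetical helper, not an Azure Machine Learning tool.
image_tag_kind() {
    ref=$1
    # A digest reference (name@sha256:...) always identifies one exact image.
    case $ref in
        *@sha256:*) echo pinned; return ;;
    esac
    # Look only at the part after the last "/" so that a registry port
    # (for example, localhost:5000/image) is not mistaken for a tag.
    last=${ref##*/}
    case $last in
        *:latest) echo floating ;;
        *:*)      echo pinned ;;
        *)        echo floating ;;   # no tag resolves to :latest
    esac
}

# Usage (hypothetical image references):
#   image_tag_kind "myregistry.example.com/my-base-image:latest"       # floating
#   image_tag_kind "myregistry.example.com/my-base-image:20240101.v1"  # pinned
```

A floating reference picks up patched images automatically but can change between runs. A pinned reference is reproducible but keeps any vulnerabilities that were present when the image was built.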

Managing environments and container images

Reproducibility is a key aspect of software development and machine learning experimentation. The Azure Machine Learning environment component's primary focus is to guarantee reproducibility of the environment where the user's code is executed. To ensure reproducibility for any machine learning job, earlier built images are pulled to the compute nodes without the need for rematerialization.

Although Azure Machine Learning patches base images with each release, whether you use the latest image might be a tradeoff between reproducibility and vulnerability management. It's your responsibility to choose the environment version that you use for your jobs or model deployments.

By default, dependencies are layered on top of base images that Azure Machine Learning provides when you're building environments. You can also use your own base images when you're using environments in Azure Machine Learning. After you install more dependencies on top of the Microsoft-provided images, or bring your own base images, vulnerability management becomes your responsibility.

Associated with your Azure Machine Learning workspace is an Azure Container Registry instance that functions as a cache for container images. Any image that materializes is pushed to the container registry. The workspace uses it if experimentation or deployment is triggered for the corresponding environment.

Azure Machine Learning doesn't delete any image from your container registry. You're responsible for evaluating the need for an image over time. To monitor and maintain environment hygiene, you can use Microsoft Defender for Container Registry to help scan your images for vulnerabilities. To automate your processes based on triggers from Microsoft Defender, see Automate remediation responses.
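As a starting point for reviewing older images, the following sketch lists the tags of one repository in your workspace's container registry, oldest first, and computes the age of an image in days. It assumes the Azure CLI is installed and signed in, and that GNU `date` is available. The function names are hypothetical, and the field names in the `--query` expression might differ across CLI versions.

```shell
# Whole days between two ISO 8601 UTC timestamps (requires GNU date).
age_in_days() {
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( (end - start) / 86400 ))
}

# List the tags of one repository, oldest first, with their last update time.
# Requires the Azure CLI and a signed-in session. Defined here but not invoked.
list_tags_oldest_first() {
    registry=$1     # name of your Azure Container Registry instance
    repository=$2   # repository name inside that registry
    az acr repository show-tags \
        --name "$registry" \
        --repository "$repository" \
        --detail --orderby time_asc \
        --query "[].{tag:name, updated:lastUpdateTime}" \
        --output tsv
}

# Usage (hypothetical names):
#   list_tags_oldest_first myworkspaceacr my-environment-repo
#   age_in_days "2024-01-01T00:00:00Z" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```

The sketch only reports. Deciding whether an old image is still needed, and deleting it, remains a manual step.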

Using a private package repository

Azure Machine Learning uses Conda and Pip to install Python packages. By default, Azure Machine Learning downloads packages from public repositories. If your organization requires you to source packages only from private repositories like Azure DevOps feeds, you can override the Conda and Pip configuration as part of your base images and your environment configurations for compute instances.

The following example configuration shows how to remove the default channels and add your own private Conda and Pip feeds. Consider using compute instance setup scripts for automation.

```
# Remove the default Conda channels and add your private channel
RUN conda config --set offline false \
&& conda config --remove channels defaults || true \
&& conda config --add channels https://my.private.conda.feed/conda/feed \
&& conda config --add repodata_fns <repodata_file_on_your_server>.json

# Configure Pip private indexes
RUN pip config set global.index https://my.private.pypi.feed/repository/myfeed/pypi/ \
&& pip config set global.index-url https://my.private.pypi.feed/repository/myfeed/simple/

# Only if your feed host isn't secured through TLS/SSL: mark the host as trusted.
# trusted-host takes a host name (optionally host:port), not a URL.
RUN pip config set global.trusted-host my.private.pypi.feed
```

To learn how to specify your own base images in Azure Machine Learning, see Create an environment from a Docker build context. For more information on configuring Conda environments, see Creating an environment file manually on the Conda site.
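After you apply a private feed configuration, it can help to confirm what Pip and Conda actually use. The following sketch prints the effective settings and derives a host name from a feed URL, because Pip's `trusted-host` setting expects a host (optionally with a port) rather than a full URL. The function names are hypothetical, and the feed URL is the placeholder used in the example configuration.

```shell
# Extract the host (and port, if any) from a URL.
url_host() {
    rest=${1#*://}       # drop the scheme
    rest=${rest%%/*}     # drop the path
    echo "${rest##*@}"   # drop any user:password@ prefix
}

# Print the effective Pip and Conda settings so that you can confirm that only
# private sources remain. Defined here but not invoked.
show_package_sources() {
    pip config list
    conda config --show channels
}

# Usage:
#   url_host "https://my.private.pypi.feed/repository/myfeed/simple/"
#   # prints: my.private.pypi.feed
#   pip config set global.trusted-host "$(url_host "$INDEX_URL")"
```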

Vulnerability management on compute hosts

Managed compute nodes in Azure Machine Learning use Microsoft-managed OS VM images. When you provision a node, it pulls the latest updated VM image. This behavior applies to compute instance, compute cluster, serverless compute (preview), and managed inference compute options.

Although OS VM images are regularly patched, Azure Machine Learning doesn't actively scan compute nodes for vulnerabilities while they're in use. For an extra layer of protection, consider network isolation of your compute.

Ensuring that your environment is up to date and that compute nodes use the latest OS version is a shared responsibility between you and Microsoft. Nodes that aren't idle can't be updated to the latest VM image. Considerations are slightly different for each compute type, as listed in the following sections.

Compute instance

Compute instances get the latest VM images at the time of provisioning. Microsoft releases new VM images on a monthly basis. After you deploy a compute instance, it isn't actively updated. You can query an instance's operating system version. To keep current with the latest software updates and security patches, you can use one of these methods:

Re-create a compute instance to get the latest OS image (recommended).

If you use this method, you'll lose data and customizations (such as installed packages) that are stored on the instance's OS and temporary disks.

When you re-create your instance:
- Store notebooks in the User files directory to persist them.
- Mount data to persist files.

For more information about image releases, see Azure Machine Learning compute instance image release notes.

Regularly update OS and Python packages.
- Use Linux package management tools to update the package list with the latest versions:
```
sudo apt-get update
```
- Use Linux package management tools to upgrade packages to the latest versions. Package conflicts might occur when you use this approach.
```
sudo apt-get upgrade
```
- Use Python package management tools to check for outdated packages, and then upgrade the ones you need:
```
pip list --outdated
```
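To review the results before you upgrade anything, you can reduce the `pip list --outdated` output to one line per package. The following is a minimal sketch that assumes Pip's default column output, which starts with two header lines. The function name `outdated_packages` is hypothetical.

```shell
# Read the default column output of `pip list --outdated` on stdin and print
# "name current latest" for each package, skipping the two header lines.
outdated_packages() {
    awk 'NR > 2 && NF >= 3 { print $1, $2, $3 }'
}

# Usage:
#   pip list --outdated | outdated_packages
#   pip install --upgrade <package-name>   # upgrade one reviewed package
```

Upgrading packages one at a time, rather than all at once, makes it easier to trace any dependency conflict back to a specific package.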

You can install and run additional scanning software on the compute instance to scan for security issues:

- Use Trivy to discover OS and Python package-level vulnerabilities.
- Use ClamAV to discover malware. It comes preinstalled on compute instances.

Microsoft Defender for Servers agent installation is currently not supported.

Consider using customization scripts for automation. For an example setup script that combines Trivy and ClamAV, see Compute instance sample setup scripts.
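The following sketch shows one way the two scanners could be combined on a compute instance. It isn't the sample setup script mentioned above. It assumes that Trivy is already installed, that your Trivy version supports the `rootfs` subcommand, and that `/home/azureuser` is the directory you want to check for malware. The function names are hypothetical.

```shell
# Translate a clamscan exit status into a short message.
# clamscan exits with 0 (no virus found), 1 (virus found), or 2 (error).
describe_clamscan_status() {
    case $1 in
        0) echo "clean" ;;
        1) echo "infected files found" ;;
        *) echo "scan error" ;;
    esac
}

# Scan the root filesystem for HIGH and CRITICAL package vulnerabilities with
# Trivy, and then scan a directory for malware with ClamAV.
# Defined here but not invoked. A full scan can take a long time.
scan_compute_instance() {
    target_dir=${1:-/home/azureuser}   # assumed directory to scan
    trivy rootfs --severity HIGH,CRITICAL /
    clamscan --recursive --infected "$target_dir"
    describe_clamscan_status $?
}

# Usage:
#   scan_compute_instance /home/azureuser
```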

Compute clusters

Compute clusters automatically upgrade nodes to the latest VM image. If you configure the cluster with min nodes = 0, it automatically upgrades nodes to the latest VM image version when all jobs are completed and the cluster reduces to zero nodes.

In the following conditions, cluster nodes don't scale down, so they can't get the latest VM image:

- The cluster's minimum node count is set to a value greater than zero.
- Jobs are scheduled continuously on your cluster.

Azure Machine Learning doesn't stop any running workloads on compute nodes to issue VM updates, so you're responsible for scaling down non-idle cluster nodes to get the latest OS VM image updates. To do so, temporarily change the minimum node count to zero and allow the cluster to reduce to zero nodes.
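The scale-down step can be sketched with the Azure CLI `ml` extension as follows. This is an illustrative sketch, not an official procedure. The function names and resource names are hypothetical, and it assumes that `az ml compute update` with `--min-instances` is available in your installed CLI version.

```shell
# Succeed when a cluster can scale to zero and so pick up a new VM image:
# the minimum node count is 0 and no jobs are running or queued.
can_refresh_image() {
    min_nodes=$1
    active_jobs=$2
    [ "$min_nodes" -eq 0 ] && [ "$active_jobs" -eq 0 ]
}

# Set the minimum node count of a compute cluster.
# Requires the Azure CLI with the ml extension. Defined here but not invoked.
set_min_nodes() {
    az ml compute update \
        --name "$1" --resource-group "$2" --workspace-name "$3" \
        --min-instances "$4"
}

# Usage (hypothetical resource names):
#   set_min_nodes my-cluster my-resource-group my-workspace 0
#   # ...wait until running jobs finish and the cluster reaches zero nodes...
#   set_min_nodes my-cluster my-resource-group my-workspace 2
```

Restoring the original minimum afterward brings up new nodes, which are provisioned from the latest VM image.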

Managed online endpoints

Managed online endpoints automatically receive OS host image updates that include vulnerability fixes. The update frequency of images is at least once a month.

Compute nodes are automatically upgraded to the latest VM image version when that version is released. You don't need to take any action.

Customer-managed Kubernetes clusters

Kubernetes compute lets you configure Kubernetes clusters to train, perform inference, and manage models in Azure Machine Learning.

Because you manage the environment with Kubernetes, management of both OS VM vulnerabilities and container image vulnerabilities is your responsibility.

Azure Machine Learning frequently publishes new versions of Azure Machine Learning extension container images in Microsoft Artifact Registry. Microsoft is responsible for ensuring that new image versions are free from vulnerabilities. Each release fixes vulnerabilities.

When your clusters run jobs without interruption, running jobs might use outdated container image versions. After you upgrade the amlarc extension on a running cluster, newly submitted jobs start to use the latest image version. When you upgrade the amlarc extension to its latest version, clean up the old container image versions from the clusters as required.

To observe whether your Azure Arc cluster is running the latest version of amlarc, use the Azure portal. Under your Azure Arc resource of the type Kubernetes - Azure Arc, go to Extensions to find the version of the amlarc extension.
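If you prefer the command line to the portal, the following sketch shows the installed extension by using the Azure CLI `k8s-extension` extension and compares two version strings. The function names are hypothetical. The extension name is whatever you chose at installation time, the version to compare against is one you look up yourself, and the version field names in the JSON output might vary by CLI version. `sort -V` requires GNU coreutils.

```shell
# Succeed when version $1 is strictly older than version $2.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Show the installed extension on an Azure Arc-enabled Kubernetes cluster.
# Requires the Azure CLI with the k8s-extension extension.
# Defined here but not invoked.
show_extension() {
    az k8s-extension show \
        --name "$1" --cluster-name "$2" --resource-group "$3" \
        --cluster-type connectedClusters --output json
}

# Usage (hypothetical names and versions):
#   show_extension my-amlarc-extension my-arc-cluster my-resource-group
#   version_lt "1.1.9" "1.1.10" && echo "upgrade available"
```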

AutoML and Designer environments

For code-based training experiences, you control which Azure Machine Learning environment to use. With AutoML and the designer, the environment is encapsulated as part of the service. These types of jobs can run on computes that you configure, to allow for extra controls such as network isolation.

AutoML jobs run on environments that layer on top of Azure Machine Learning base Docker images.

Designer jobs are compartmentalized into components. Each component has its own environment that layers on top of the Azure Machine Learning base Docker images. For more information on components, see the component reference.
