Vulnerability management for Azure Machine Learning

Vulnerability management involves detecting, assessing, mitigating, and reporting on any security vulnerabilities that exist in an organization’s systems and software. Vulnerability management is a shared responsibility between you and Microsoft.

In this article, we discuss these responsibilities and outline the vulnerability management controls provided by Azure Machine Learning. You'll learn how to keep your service instance and applications up to date with the latest security updates, and how to minimize the window of opportunity for attackers.

Microsoft-managed VM images

Azure Machine Learning manages host OS VM images for Azure ML compute instance, Azure ML compute clusters, and Data Science Virtual Machines. The update frequency is monthly and includes the following:

  • For each new VM image version, the latest updates are sourced from the original publisher of the OS, ensuring that all applicable OS-related patches are picked up. For Azure Machine Learning, the publisher is Canonical for all the Ubuntu 18 images used by compute instances, compute clusters, and Data Science Virtual Machines.
  • In addition to patches applied by the original publisher, Azure Machine Learning updates system packages when updates are available.
  • Azure Machine Learning checks and validates any machine learning packages that may require an upgrade. In most circumstances, new VM images contain the latest package versions.
  • All VM images are built on secure subscriptions that run vulnerability scanning regularly. Any unaddressed vulnerabilities are flagged and fixed in the next release.
  • Most images are released on a monthly interval. For compute instance, image releases are aligned with the Azure ML SDK release cadence, since the SDK comes preinstalled in the environment.

In addition to the regular release cadence, hotfixes are applied when vulnerabilities are discovered. Hotfixes are rolled out within 72 hours for Azure ML compute clusters and within a week for compute instance.

Note

The host OS is not the OS version you might specify for an environment when training or deploying a model. Environments run inside Docker. Docker runs on the host OS.

Microsoft-managed container images

Base Docker images maintained by Azure Machine Learning get security patches frequently to address newly discovered vulnerabilities.

Azure Machine Learning releases updates for supported images every two weeks to address vulnerabilities. As a commitment, we aim to have no vulnerabilities older than 30 days in the latest version of supported images.

Patched images are released under a new immutable tag, and the :latest tag is updated as well. Whether you use the :latest tag or pin to a particular image version is a trade-off between security and environment reproducibility for your machine learning job.
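For example, in a Dockerfile-based environment you can choose between pinning and tracking the latest patched image. The repository name below is a published Azure ML base image repository, but the specific version tag is illustrative, not a real published tag:

```dockerfile
# Pin to a specific immutable tag for reproducibility (illustrative tag)
FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20230120.v1

# Or track the latest patched image, trading reproducibility for freshness:
# FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
```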

Managing environments and container images

Reproducibility is a key aspect of software development and machine learning experimentation. The primary focus of the Azure Machine Learning environment component is to guarantee reproducibility of the environment where the user's code executes. To ensure reproducibility for any machine learning job, previously built images are pulled to the compute nodes without needing to be rematerialized.

While Azure Machine Learning patches base images with each release, whether you use the latest image is a trade-off between reproducibility and vulnerability management. So, it's your responsibility to choose the environment version used for your jobs or model deployments.

By default, dependencies are layered on top of base images provided by Azure ML when building environments. You can also use your own base images when using environments in Azure Machine Learning. Once you install more dependencies on top of the Microsoft-provided images, or bring your own base images, vulnerability management becomes your responsibility.

Associated with your Azure Machine Learning workspace is an Azure Container Registry instance that's used as a cache for container images. Any image that is materialized is pushed to the container registry and used if experimentation or deployment is triggered for the corresponding environment. Azure Machine Learning doesn't delete any images from your container registry, and it's your responsibility to evaluate which images you still need over time. To monitor and maintain environment hygiene, you can use Microsoft Defender for Container Registry to scan your images for vulnerabilities. To automate your processes based on triggers from Microsoft Defender, see Automate responses to Microsoft Defender for Cloud triggers.

Using a private package repository

Azure Machine Learning uses Conda for package installations. By default, packages are downloaded from public repositories. If your organization requires packages to be sourced only from private repositories, you can override the Conda configuration as part of your base image. The following example configuration removes the default channels and adds a private Conda feed:

# Disable offline mode, drop the default channels, and add a private feed
RUN conda config --set offline false \
&& conda config --remove channels defaults || true \
&& conda config --add channels https://my.private.conda.feed/conda/feed

See Use your own Dockerfile to learn how to specify your own base images in Azure Machine Learning. For more details on configuring Conda environments, see Conda - Creating an environment file manually.

Vulnerability management on compute hosts

Managed compute nodes in Azure Machine Learning make use of Microsoft-managed OS VM images and pull the latest updated VM image at the time that a node gets provisioned. This applies to compute instance, compute cluster, and managed inference compute SKUs. While OS VM images are regularly patched, compute nodes are not actively scanned for vulnerabilities while in use. For an extra layer of protection, consider network isolation of your compute.
It's a shared responsibility between you and Microsoft to ensure that your environment is up-to-date and compute nodes use the latest OS version. Nodes that are non-idle can't get updated to the latest VM image. Considerations are slightly different for each compute type, as listed in the following sections.

Compute instance

Compute instances get the latest VM images at the time of provisioning. Microsoft releases new VM images on a monthly basis. Once a compute instance is deployed, it doesn't get actively updated. You can query an instance's operating system version. To keep current with the latest software updates and security patches, you can:

  1. Recreate a compute instance to get the latest OS image (recommended)

  2. Alternatively, regularly update OS and Python packages.

    • Use Linux package management tools to update the package list with the latest versions.

      sudo apt-get update
      
    • Use Linux package management tools to upgrade packages to the latest versions. Note that package conflicts might occur using this approach.

      sudo apt-get upgrade
      
    • Use Python package management tools to check for outdated packages.

      pip list --outdated
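The instance's operating system version mentioned above can be checked from a terminal. This is a generic Linux check, not specific to Azure Machine Learning:

```shell
# Print the OS distribution and version of the current host
cat /etc/os-release
```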
      

You may install and run additional scanning software on a compute instance to scan for security issues.

  • Trivy may be used to discover OS and Python package level vulnerabilities.
  • ClamAV may be used to discover malware and comes pre-installed on compute instance.
  • Defender for Server agent installation is currently not supported.
  • Consider using customization scripts for automation.
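As a sketch, the scanners above could be combined into a simple script run on the instance, for example from a customization script. The scan target paths here are assumptions; adjust them for your setup, and note that Trivy must be installed first:

```shell
# Hypothetical scan script for a compute instance (paths are placeholders).

# Scan the filesystem for OS and Python package vulnerabilities with Trivy.
trivy fs --scanners vuln /home/azureuser

# Scan user files for malware with the preinstalled ClamAV,
# recursing into directories and printing only infected files.
clamscan --recursive --infected /home/azureuser
```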

Compute clusters

Compute clusters automatically upgrade to the latest VM image. If the cluster is configured with min nodes = 0, it automatically upgrades nodes to the latest VM image version when all jobs are completed and the cluster reduces to zero nodes.

  • There are conditions in which cluster nodes do not scale down, and as a result are unable to get the latest VM images.

    • Cluster minimum node count may be set to a value greater than 0.
    • Jobs may be scheduled continuously on your cluster.
  • It is your responsibility to scale non-idle cluster nodes down to get the latest OS VM image updates. Azure Machine Learning does not abort any running workloads on compute nodes to issue VM updates.

    • Temporarily change the minimum nodes to zero and allow the cluster to reduce to zero nodes.
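The scale-down step can be scripted with the Azure CLI. The cluster, resource group, and workspace names below are placeholders, and the `az ml` (v2) CLI extension is assumed to be installed:

```shell
# Temporarily allow the cluster to drain to zero nodes so that newly
# provisioned nodes pick up the latest VM image (names are placeholders).
az ml compute update --name cpu-cluster \
  --resource-group my-rg --workspace-name my-ws \
  --min-instances 0
```

Once the cluster has scaled to zero, restore your original minimum node count with the same command.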

Managed online endpoints

  • Managed Online Endpoints automatically receive OS host image updates that include vulnerability fixes. The update frequency of images is at least once a month.
  • Compute nodes are automatically upgraded to the latest VM image version once it's released. No action is required on your part.

Customer managed Kubernetes clusters

Kubernetes compute lets you configure Kubernetes clusters to train models, run inference, and manage models in Azure Machine Learning.

  • Because you manage the environment with Kubernetes, both OS VM vulnerability management and container image vulnerability management are your responsibility.
  • Azure Machine Learning frequently publishes new versions of AzureML extension container images into Microsoft Container Registry. It's Microsoft’s responsibility to ensure new image versions are free from vulnerabilities. Vulnerabilities are fixed with each release.
  • When your clusters run jobs without interruption, running jobs may use outdated container image versions. Once you upgrade the AMLArc extension on a running cluster, newly submitted jobs start to use the latest image version. When upgrading the AMLArc extension to its latest version, clean up the old container image versions from the clusters as required.
  • To check whether your Azure Arc cluster is running the latest version of AMLArc, use the Azure portal. Under your Arc resource of type 'Kubernetes - Azure Arc', go to 'Extensions' to find the version of the AMLArc extension.
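Alternatively, the installed extension version can be queried with the Azure CLI. The extension, cluster, and resource group names below are placeholders, and the `k8s-extension` CLI extension is assumed to be installed:

```shell
# Show the version of the AzureML extension installed on an
# Arc-enabled Kubernetes cluster (names are placeholders).
az k8s-extension show --name azureml-extension \
  --cluster-type connectedClusters \
  --cluster-name my-arc-cluster \
  --resource-group my-rg \
  --query version
```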

Automated ML and Designer environments

For code-based training experiences, you control which Azure Machine Learning environment is used. With AutoML and Designer, the environment is encapsulated as part of the service. These types of jobs can run on computes configured by you, allowing for extra controls such as network isolation.

  • Automated ML jobs run on environments that layer on top of Azure ML base Docker images.

  • Designer jobs are compartmentalized into components. Each component has its own environment that layers on top of the Azure ML base Docker images. For more information on components, see the Component reference.

Next steps