How to Upgrade the NVIDIA Driver Version for an AKS Cluster GPU Node

Elvis Baugh 20 Reputation points
2023-12-14T05:41:37.3666667+00:00

I need to upgrade the NVIDIA driver on an AKS cluster GPU node from version 510 to version 535 so that I can run my ML app, which is compiled against CUDA 12.2. Downgrading the app to the CUDA version that matches the current driver breaks some of my TensorFlow packages. I followed the examples from this NVIDIA page but was unsuccessful. How can I successfully upgrade the NVIDIA driver version on an AKS cluster GPU node?

Azure Kubernetes Service

Accepted answer
  1. vipullag-MSFT 26,487 Reputation points Moderator
    2023-12-15T00:51:31.3733333+00:00

    Hello Elvis Baugh

    Welcome to Microsoft Q&A Platform, thanks for posting your query here.

    The nodes in AKS contain specifically versioned software for their operation and interaction with Azure. Upgrading the version of a component outside of the supported node re-image [1] may have negative or unexpected consequences and is not recommended.

    A publicly accessible change log [2] lists what is new in each release, for example Release 2023-06-04 [3].

    Based on your question, it looks like you are making unsupported modifications to the cluster. The documentation [4] you referred to describes an unsupported method.

    Please check the document "Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS)" [5], which explains how to install NVIDIA support in AKS.
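
Once NVIDIA support is installed through the supported path, it can help to confirm which driver version a GPU node actually exposes before deploying the app. The following is a minimal Python sketch, not part of the linked documentation. It assumes `nvidia-smi` is on the PATH wherever it runs (for example, inside a GPU pod); the function names are hypothetical, and 535 is simply the target version named in the question.

```python
import subprocess


def driver_major(version: str) -> int:
    """Return the major component of a driver version string such as '535.54.03'."""
    return int(version.strip().split(".")[0])


def meets_minimum(version: str, required_major: int = 535) -> bool:
    """True if the driver's major version is at least required_major.

    535 is the target from the question (an assumption here); adjust as needed.
    """
    return driver_major(version) >= required_major


def installed_driver_versions() -> list[str]:
    """Ask nvidia-smi for the driver version; it prints one line per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    try:
        for version in installed_driver_versions():
            status = "OK" if meets_minimum(version) else "older than required"
            print(f"{version}: {status}")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # nvidia-smi is missing or failed, e.g. when run on a node without a GPU
        print(f"Could not query the driver: {exc}")
```

If the reported version is lower than what the app needs, that points to the node image rather than to anything inside the pod, which is why the supported route is a node image upgrade rather than an in-place driver change.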

    Hope this helps.
