CUDA-capable device(s) is/are busy or unavailable

Minh, Nguyen Quoc 5 Reputation points
2025-03-18T11:22:31.06+00:00

I am following this document: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-azure-linux-gpu-node-pool

I created a node pool with the VM size Standard_NV36ads_A10_v5. I checked, and the GPU driver and the toolkit were installed by Azure, not by the GPU Operator.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P8             N/A /  N/A  |       1MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=======================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

But when I run the vectorAdd sample, it returns:
[Vector addition of 50000 elements] Failed to allocate device vector A (error code CUDA-capable device(s) is/are busy or unavailable)!

Azure Kubernetes Service

1 answer

  1. Anonymous
    2025-03-21T14:19:08.3233333+00:00

    Hi Minh, Nguyen Quoc,

    Based on the error messages provided, here’s a concise response:

    The log suggests missing or incompatible GPU libraries (cuFFT, cuDNN, cuBLAS) required by TensorFlow. Ensure the following steps are taken:

    Verify that all required GPU libraries are installed and compatible with CUDA 12.4 and TensorFlow 2.19.0. Follow the TensorFlow GPU Setup Guide (https://www.tensorflow.org/install/pip).

    Rebuild the TensorFlow Docker image to avoid duplicate library registrations. Ensure only necessary libraries are linked.

    Check the LD_LIBRARY_PATH environment variable to confirm it includes paths to the necessary GPU libraries (/usr/local/cuda/lib64).
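
The LD_LIBRARY_PATH check in the step above could be scripted roughly as follows. This is only a minimal sketch: the helper name `missing_library_dirs` is hypothetical, and the only directory it looks for is the `/usr/local/cuda/lib64` path mentioned in the answer. It reports whether a directory is listed in the variable, not whether the libraries inside it are usable.

```python
import os


def missing_library_dirs(required, ld_library_path=None):
    """Return the directories in `required` that are not listed in LD_LIBRARY_PATH.

    `missing_library_dirs` is a hypothetical helper written for this example.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    # LD_LIBRARY_PATH is a colon-separated list; normalise each entry so that
    # "/usr/local/cuda/lib64/" and "/usr/local/cuda/lib64" compare equal.
    present = {os.path.normpath(p) for p in ld_library_path.split(":") if p}
    return [d for d in required if os.path.normpath(d) not in present]


if __name__ == "__main__":
    # The directory below is the one named in the answer; adjust it if the
    # container image installs CUDA somewhere else.
    missing = missing_library_dirs(["/usr/local/cuda/lib64"])
    print("Missing from LD_LIBRARY_PATH:", missing if missing else "none")
```

Running it inside the pod's container (rather than on the node) matters, because the environment a workload sees is set by the image and the pod spec.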

    For additional GPU troubleshooting on AKS, refer to the Azure AKS GPU Guide: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-azure-linux-gpu-node-pool

    These steps should address the missing library issue and prevent duplicate registrations.

    If you have any further queries, please let us know; we are glad to help.

    If it was helpful, please click "Upvote" on this post to let us know.

    Thank You.

