CUDA-capable device(s) is/are busy or unavailable

Question

Minh, Nguyen Quoc 5

I am following this document: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-azure-linux-gpu-node-pool

I created a node pool of type Standard_NV36ads_A10_v5. I checked, and the GPU driver and toolkit were installed by Azure, not by the GPU Operator.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P8             N/A /  N/A  |       1MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=======================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

But when I run the vectorAdd sample, it returns:
[Vector addition of 50000 elements] Failed to allocate device vector A (error code CUDA-capable device(s) is/are busy or unavailable)!

Anonymous

2025-03-18T18:50:03.99+00:00

Hi Minh, Nguyen Quoc,

The error "CUDA-capable device(s) is/are busy or unavailable" can occur when another process is holding the GPU or when the GPU is misconfigured. Your nvidia-smi output shows no running processes, which rules out the first cause in this case.

Verify that the CUDA version reported by nvidia-smi (12.4 in your scenario) is compatible with your GPU model (NVIDIA A10-24Q) and with the software you are running. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 might help with debugging. Restarting the node may reset the GPU state. Finally, ensure the NVIDIA driver (550.144.03) is compatible with your CUDA version.
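
As a rough aid for the version check above, here is a minimal Python sketch that pulls the driver and CUDA versions out of the nvidia-smi banner so they can be logged or compared. The function name `parse_nvidia_smi_header` is hypothetical, and the pattern is only assumed to match banners laid out like the one in the question. Note that the "CUDA Version" field in nvidia-smi is the highest CUDA version the driver supports, not the toolkit version installed in a container.

```python
import re

def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version) from an nvidia-smi banner, or None."""
    # Assumes the banner layout shown in the question; other layouts return None.
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    if match is None:
        return None
    return match.group(1), match.group(2)

# Banner line copied from the nvidia-smi output in the question.
sample = (
    "| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     "
    "CUDA Version: 12.4     |"
)
print(parse_nvidia_smi_header(sample))
```

On a node, the same function could be fed the captured output of `nvidia-smi` instead of the hard-coded sample line.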

Refer to Microsoft's AKS GPU Cluster Setup Guide (https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-azure-linux-gpu-node-pool) for more details on GPU configurations.

If you have any further queries, please let us know; we are glad to help.

If it was helpful, please click "Upvote" on this post to let us know.

Thank You.

Minh, Nguyen Quoc 5

Hi Geethasri.V, the CUDA version is installed automatically by Azure, so I think it has to be compatible with the GPU model (NVIDIA A10-24Q).
Here is the output when I run the pod spec for the TensorFlow job from "Use GPUs on Azure Kubernetes Service (AKS)" on Microsoft Learn:

2025-03-19 04:11:19.055372: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2025-03-19 04:11:19.137912: E tensorflow/core/common_runtime/direct_session.cc:170] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_VALUE
Traceback (most recent call last):
  File "/app/main.py", line 212, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/app/main.py", line 185, in main
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
    train()
  File "/app/main.py", line 152, in train
    sess = tf.InteractiveSession()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1612, in __init__
    super(InteractiveSession, self).__init__(target, graph, config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 622, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Anonymous

2025-03-19T13:34:46.3733333+00:00
Hi Minh, Nguyen Quoc,

The error CUDA_ERROR_INVALID_VALUE suggests an issue with TensorFlow initializing the CUDA device. Please ensure:

The TensorFlow version is compatible with CUDA 12.4. Refer to the TensorFlow GPU Support Guide https://www.tensorflow.org/install/source#gpu

TensorFlow is configured correctly to detect GPUs. Run this command to confirm:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Environment variables like CUDA_VISIBLE_DEVICES=0 and TF_FORCE_GPU_ALLOW_GROWTH=true are set for proper GPU usage.
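
The checks above could be combined into one small script, sketched here under stated assumptions: `gpu_env_report` is a hypothetical helper name, and the only TensorFlow call used is the `tf.config.list_physical_devices('GPU')` call already quoted in this reply. An empty list printed at the end would mean TensorFlow did not register any GPU.

```python
import os

# Variables named in the reply above; unset ones are reported as None.
GPU_ENV_VARS = ("CUDA_VISIBLE_DEVICES", "TF_FORCE_GPU_ALLOW_GROWTH")

def gpu_env_report(environ=None):
    """Map each GPU-related variable to its value, or None if it is unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in GPU_ENV_VARS}

if __name__ == "__main__":
    for name, value in gpu_env_report().items():
        print(f"{name}={value if value is not None else '<unset>'}")
    try:
        import tensorflow as tf
        # Same check as the one-liner above.
        print(tf.config.list_physical_devices("GPU"))
    except ImportError:
        print("TensorFlow is not installed in this environment")
```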

For AKS-specific GPU configurations, refer to the Azure AKS GPU Cluster Guide: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-azure-linux-gpu-node-pool

If you have any further queries, please let us know; we are glad to help.

If it was helpful, please click "Upvote" on this post to let us know.

Thank You.

Minh, Nguyen Quoc 5

Here are my versions:

env | grep CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0

env | grep TF_FORCE_GPU_ALLOW_GROWTH
TF_FORCE_GPU_ALLOW_GROWTH=true
python3 --version
Python 3.10.12

pip3 list | grep tenso
tensorboard                  2.19.0
tensorboard-data-server      0.7.2
tensorflow                   2.19.0
tensorflow-io-gcs-filesystem 0.37.1

These versions match the document, although the document doesn't list an entry for CUDA 12.4.

Here is the output of the Python command:

python3 -c "import tensorflow as tf;  print(tf.config.list_physical_devices('GPU'))"
2025-03-20 02:19:18.594098: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1742437158.605671    1581 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742437158.609305    1581 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742437158.619366    1581 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742437158.619384    1581 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742437158.619387    1581 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742437158.619389    1581 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-03-20 02:19:18.622349: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
W0000 00:00:1742437159.821558    1581 gpu_device.cc:2341] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]

Anonymous

2025-03-21T08:09:01.7466667+00:00

Hi Minh, Nguyen Quoc,

Based on the error messages provided, here’s a concise response:

The log suggests missing or incompatible GPU libraries (cuFFT, cuDNN, cuBLAS) required by TensorFlow. Ensure the following steps are taken:

Verify that all required GPU libraries are installed and compatible with CUDA 12.4 and TensorFlow 2.19.0. Follow the TensorFlow GPU Setup Guide (https://www.tensorflow.org/install/pip).

Rebuild the TensorFlow Docker image to avoid duplicate library registrations. Ensure only necessary libraries are linked.

Check the LD_LIBRARY_PATH environment variable to confirm it includes paths to the necessary GPU libraries (/usr/local/cuda/lib64).
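
A minimal sketch of the LD_LIBRARY_PATH check, plus a direct attempt to load the shared libraries with `ctypes`, might look like this. The helper names are hypothetical, and the library file names in the loop (`libcuda.so.1`, `libcudart.so.12`, `libcudnn.so.9`) are assumptions about what a CUDA 12 setup would ship; adjust them to whatever TensorFlow reports as missing.

```python
import ctypes
import os

def path_list_contains(path_value, directory):
    """True if `directory` is one of the colon-separated entries in `path_value`."""
    wanted = os.path.normpath(directory)
    entries = [p for p in (path_value or "").split(":") if p]
    return any(os.path.normpath(p) == wanted for p in entries)

def can_load(library_name):
    """True if the dynamic loader can open `library_name`."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    ld_path = os.environ.get("LD_LIBRARY_PATH")
    print("cuda lib64 on LD_LIBRARY_PATH:",
          path_list_contains(ld_path, "/usr/local/cuda/lib64"))
    # Assumed library names; not taken from the thread.
    for name in ("libcuda.so.1", "libcudart.so.12", "libcudnn.so.9"):
        print(name, "loadable:", can_load(name))
```

A library that is on disk but fails to load here usually points at a path or version mismatch rather than at TensorFlow itself.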

For additional GPU troubleshooting on AKS, refer to the Azure AKS GPU Guide: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-azure-linux-gpu-node-pool

These steps should address the missing library issue and prevent duplicate registrations.

If you have any further queries, please let us know; we are glad to help.

If it was helpful, please click "Upvote" on this post to let us know.

Thank You.
Minh, Nguyen Quoc 5 Reputation points

2025-03-21T09:50:22.92+00:00

Geethasri.V, I'm using Standard_NV36ads_A10_v5, not Standard D32ds v5 or Standard E32ads v5.
Anonymous

2025-03-24T09:09:40.97+00:00

Hi Minh, Nguyen Quoc,

Just checking in to see if you have got a chance to see my response to your question in resolving the issue.

If it was helpful, please click "Upvote" on this post to let us know.

Thank You.
Anonymous

2025-03-25T05:40:58.77+00:00

Hi Minh, Nguyen Quoc,

I wanted to check if you had the opportunity to review the information which was provided in my previous posted comment.

If it was helpful, please click "Upvote" on this post to let us know.

Thank You.
Minh, Nguyen Quoc 5 Reputation points

2025-03-27T04:46:49.5033333+00:00

Hi Geethasri.V, today Azure updated the node pool image version to AKSUbuntu-2204gen2containerd-202503.13.0, and the issue was fixed automatically.
Anonymous

2025-03-27T14:21:32.2733333+00:00

Hi Minh, Nguyen Quoc,

Thank you for the update! I’m glad to hear that the issue has been resolved with the Azure update to the node pool image version AKSUbuntu-2204gen2containerd-202503.13.0. Please let me know if you encounter any further issues or need additional assistance.
Łukasz Dolegowski 0 Reputation points

2025-03-27T21:33:07.2433333+00:00

Hi Minh, Nguyen Quoc, how did you manage to solve this problem? I have also been using the Standard_NV36ads_A10_v5 for 6 months, and after restarting the machine I get the same error.
Minh, Nguyen Quoc 5 Reputation points

2025-03-28T02:49:46.0733333+00:00

Łukasz Dolegowski, I just waited for Azure to release a new node pool image version, which moved the node pool from AKSUbuntu-2204gen2containerd-202503.02.0 to AKSUbuntu-2204gen2containerd-202503.13.0. After that, the NVIDIA driver went back to the 535.xxx version, and everything was back to normal.
I think that for future deployments we need to check the node pool image's kernel version and NVIDIA driver version before upgrading.
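
A pre-upgrade check along those lines could start from something like the sketch below. It only parses strings: the helper names are hypothetical, and the node image version format is assumed from the two values quoted in this thread (a trailing `-YYYYMM.N.N` release). The driver major versions (550 on the broken image, 535 after the fix) are the ones reported above.

```python
import re

def image_release(node_image_version):
    """Extract the trailing release, e.g. '...-202503.13.0' -> (202503, 13, 0)."""
    match = re.search(r"-(\d{6})\.(\d+)\.(\d+)$", node_image_version)
    if match is None:
        raise ValueError(f"unrecognised node image version: {node_image_version!r}")
    return tuple(int(part) for part in match.groups())

def driver_branch(driver_version):
    """Return the major driver branch, e.g. '550.144.03' -> 550."""
    return int(driver_version.split(".")[0])

# Values quoted in this thread.
broken = image_release("AKSUbuntu-2204gen2containerd-202503.02.0")
fixed = image_release("AKSUbuntu-2204gen2containerd-202503.13.0")
print(broken < fixed)               # the fixed image is the newer release
print(driver_branch("550.144.03"))  # driver branch seen on the broken image
```

Recording the image release and driver branch of a known-good node pool makes it easier to spot a driver change before rolling an image upgrade out to GPU workloads.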
