Yes, setting shm_size="32g" directly in the command() call solved the issue in my case.
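For reference, a minimal sketch of that approach; the remaining command() arguments (code, command, environment, compute, and so on) are placeholders that depend on your job definition:

from azure.ai.ml import command

job = command(
    # ... code, command, environment, compute, etc. as in your job definition
    shm_size="32g",  # raise the shared-memory limit of the job's container
)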
How to fix "ERROR: insufficient shared memory (shm)"?
Hello,
I am facing the following error during a deep learning training job using PyTorch and 4 NVIDIA A100 GPUs (80 GB of GPU memory each) on a single compute node of type Standard_NC96ads_A100_v4:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/azureml-envs/azureml_39aa263cfb5a513ce373f7d1f5d6a8a2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1131, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/azureml-envs/azureml_39aa263cfb5a513ce373f7d1f5d6a8a2/lib/python3.9/queue.py", line 180, in get
self.not_empty.wait(remaining)
File "/azureml-envs/azureml_39aa263cfb5a513ce373f7d1f5d6a8a2/lib/python3.9/threading.py", line 316, in wait
gotit = waiter.acquire(True, timeout)
File "/azureml-envs/azureml_39aa263cfb5a513ce373f7d1f5d6a8a2/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 67, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 363) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The training works as expected in the beginning. However, if the job gets interrupted due to the "low priority" of the compute node, the script fails at the start of the first epoch after the training resumes on a newly assigned compute node. I had previously tested both my training script and the job-resuming process successfully using only 2 GPUs. Can you help explain this behavior and what exactly causes the out-of-memory error with respect to shared memory?

I will now try the following in my job script to increase the shared-memory limit for the run:
from azure.ai.ml import command

job = command(
    # Some arguments
    # ...
)
job.set_resources(shm_size="256g")
returned_job = ml_client.jobs.create_or_update(job)
Is this the intended way to solve this issue?
Best regards!