Can't run training in multi-gpu setting


Daniel Otero Gómez 0

I am new to multi-GPU training. My code ran perfectly on my laptop's GPU (a single RTX 3060), but it runs out of memory when using four GPUs. I think this may be due to a misconfiguration of my GPUs or a misuse of the DDP strategy in Lightning. I hope someone can help me debug the log messages NCCL is leaving. Since they are very long, I'll paste only the logs from the main rank of the process. I have seen several different errors that I think are related to memory. These are the ones I can trace back:

OSError: [Errno 28] No space left on device

RuntimeError: cuDNN error: CUDNN_STATUS_ALLOC_FAILED

torch.cuda.OutOfMemoryError: CUDA out of memory.

RuntimeError: DataLoader worker (pid 4748) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
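Two of the errors above point at storage rather than GPU memory: `No space left on device` and the shared-memory hint in the DataLoader message. A minimal sketch for checking this from inside the job, assuming a Linux container where shared memory is mounted at `/dev/shm` (the helper name `free_gib` and the list of paths are illustrative, not part of the original script):

```python
import os
import shutil


def free_gib(paths):
    """Return {path: free space in GiB} for every path that exists."""
    report = {}
    for path in paths:
        if os.path.exists(path):
            usage = shutil.disk_usage(path)
            report[path] = usage.free / 2**30
    return report


# Print the free space of the shared-memory mount and two common scratch locations.
for path, gib in free_gib(["/dev/shm", "/tmp", os.getcwd()]).items():
    print(f"{path}: {gib:.2f} GiB free")
```

Running this at the start of the training script would show whether `/dev/shm` or the working directory is nearly full before any worker is started.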

The only time I got a different error was when I manually set NCCL_IB_DISABLE=0. It gave me:


File "/mnt/azureml/cr/j/e01ca930a056451cad891d256ce58f06/exe/wd/models/ssl/monitor_metrics.py", line 44, in rankme

    S = torch.linalg.svdvals(Z)  # pylint: disable=invalid-name, not-callable

RuntimeError: cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR, when calling `cusolverDnCreate(handle)`. If you keep seeing this error, you may use `torch.backends.cuda.preferred_linalg_library()` to try linear algebra operators with other supported backends.

I was told that this may be resolved by raising the shared memory limit of the Docker container, but I do not know how to do this since Azure initializes the container under the hood. I was redirected to the following [troubleshooting page]. Does anyone know how I can set the shared memory size manually?
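One option that may apply here: as far as I know, the `command` function in the azure-ai-ml SDK v2 accepts a `shm_size` argument, written as a positive number followed by a unit (`b`, `k`, `m` or `g`). The sketch below only validates such a string and wraps it as a keyword argument. The helper name `shm_kwargs` and the value `"16g"` are assumptions for illustration, so please check the SDK reference for your installed version before relying on it.

```python
import re


def shm_kwargs(size):
    """Validate a '<number><unit>' shared-memory size and wrap it as kwargs.

    Assumed format: positive integer followed by b, k, m or g (e.g. "16g").
    """
    if not re.fullmatch(r"[1-9]\d*[bkmg]", size):
        raise ValueError(f"expected something like '16g', got {size!r}")
    return {"shm_size": size}


# Intended use (not executed here, it needs azure-ai-ml and a workspace):
#   command_job = command(..., **shm_kwargs("16g"))
print(shm_kwargs("16g"))
```

If the argument is accepted, the NCCL `SHM/direct/direct` transport and the DataLoader workers would both draw from the larger shared-memory block.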

As some additional info:

I am running the job on a cluster with four Tesla T4 GPUs, specifically a Standard_NC64as_T4_v3 cluster.

I have been using Azure Container for PyTorch (ACPT) images and installing additional dependencies as recommended. Below I pasted the Dockerfile I have been using to build my environments. I commented out the second base image to avoid posting two Dockerfiles with otherwise identical content.


FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7
# FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.1-cuda12.1
RUN pip install timm
RUN pip install scikit-learn
RUN pip install mlflow

I checked, and the environment using CUDA 12.1 ships NCCL version 2.18.3, while the one using CUDA 11.7 ships 2.17.1.

Also, I specify a distribution when launching the job with the command function. As I understand it, this tells the system to use all four GPUs. Nonetheless, I experienced the same issue when I did not specify the distribution in the command.


# Imports needed by this fragment (azure-ai-ml SDK v2)
from azure.ai.ml import MpiDistribution, command

# Create or update the component
print("Creating job...")
print(job_command)
command_job = command(
    experiment_name="testing-ssl-byol",
    description=description,
    code=str(code_dir),
    environment=enviornment,
    inputs=inputs,
    outputs=outputs,
    command=job_command,
    compute="Testing-GPU-Cluster",
    # One process per GPU on the four-GPU node
    distribution=MpiDistribution(process_count_per_instance=4),
    environment_variables={"NCCL_DEBUG": "DEBUG", "NCCL_IB_DISABLE": "0"},
    tags={"project": "ssl-research", "job-purpose": "testing"},
)
# Submit the job to the workspace
job = ml_client.jobs.create_or_update(command_job)
print(f"Job created with ID: {job.id}")
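Because the job starts four processes on one node, every process builds its own DataLoader and its own set of workers. If `num_workers` was sized for a single process, the node ends up with four times as many workers sharing the same CPU cores and shared memory. This is only a guess at one contributing factor. The sketch below divides the cores between ranks; the helper name `workers_per_rank` and the `reserve` parameter are illustrative assumptions, and 32 is the core count reported in the log.

```python
import os


def workers_per_rank(world_size, cpu_count=None, reserve=1):
    """Split CPU cores evenly between ranks, keeping `reserve` cores per rank
    free for the training process itself. Never returns a negative number."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(0, cpus // world_size - reserve)


# With 32 cores and 4 ranks this leaves 7 DataLoader workers per rank.
print(workers_per_rank(world_size=4, cpu_count=32))
```

The result could then be passed as `num_workers` when constructing the DataLoader in each rank.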

Here are the log messages:


[2024-03-05 15:37:45,277] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:118: UserWarning: onnxruntime training package info: package_name: onnxruntime-training

  warnings.warn("onnxruntime training package info: package_name: %s" % package_name)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:119: UserWarning: onnxruntime training package info: __version__: 1.17.0

  warnings.warn("onnxruntime training package info: __version__: %s" % version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:120: UserWarning: onnxruntime training package info: cuda_version: 12.2

  warnings.warn("onnxruntime training package info: cuda_version: %s" % cuda_version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:121: UserWarning: onnxruntime build info: cudart_version: 12020

  warnings.warn("onnxruntime build info: cudart_version: %s" % cudart_version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:129: UserWarning: WARNING: failed to find cudart version that matches onnxruntime build info

  warnings.warn("WARNING: failed to find cudart version that matches onnxruntime build info")

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:130: UserWarning: WARNING: found cudart versions: [12010]

  warnings.warn("WARNING: found cudart versions: %s" % local_cudart_versions)

Global seed set to 42

Using 16bit None Automatic Mixed Precision (AMP)

GPU available: True (cuda), used: True

TPU available: False, using: 0 TPU cores

IPU available: False, using: 0 IPUs

HPU available: False, using: 0 HPUs

[rank: 0] Global seed set to 42

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

[2024-03-05 15:37:54,375] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

[2024-03-05 15:37:54,398] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

[2024-03-05 15:37:54,398] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:118: UserWarning: onnxruntime training package info: package_name: onnxruntime-training

  warnings.warn("onnxruntime training package info: package_name: %s" % package_name)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:119: UserWarning: onnxruntime training package info: __version__: 1.17.0

  warnings.warn("onnxruntime training package info: __version__: %s" % version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:120: UserWarning: onnxruntime training package info: cuda_version: 12.2

  warnings.warn("onnxruntime training package info: cuda_version: %s" % cuda_version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:121: UserWarning: onnxruntime build info: cudart_version: 12020

  warnings.warn("onnxruntime build info: cudart_version: %s" % cudart_version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:129: UserWarning: WARNING: failed to find cudart version that matches onnxruntime build info

  warnings.warn("WARNING: failed to find cudart version that matches onnxruntime build info")

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:130: UserWarning: WARNING: found cudart versions: [12010]

  warnings.warn("WARNING: found cudart versions: %s" % local_cudart_versions)

[rank: 3] Global seed set to 42

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:118: UserWarning: onnxruntime training package info: package_name: onnxruntime-training

  warnings.warn("onnxruntime training package info: package_name: %s" % package_name)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:119: UserWarning: onnxruntime training package info: __version__: 1.17.0

  warnings.warn("onnxruntime training package info: __version__: %s" % version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:120: UserWarning: onnxruntime training package info: cuda_version: 12.2

  warnings.warn("onnxruntime training package info: cuda_version: %s" % cuda_version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:121: UserWarning: onnxruntime build info: cudart_version: 12020

  warnings.warn("onnxruntime build info: cudart_version: %s" % cudart_version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:129: UserWarning: WARNING: failed to find cudart version that matches onnxruntime build info

  warnings.warn("WARNING: failed to find cudart version that matches onnxruntime build info")

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:130: UserWarning: WARNING: found cudart versions: [12010]

  warnings.warn("WARNING: found cudart versions: %s" % local_cudart_versions)

[rank: 1] Global seed set to 42

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:118: UserWarning: onnxruntime training package info: package_name: onnxruntime-training

  warnings.warn("onnxruntime training package info: package_name: %s" % package_name)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:119: UserWarning: onnxruntime training package info: __version__: 1.17.0

  warnings.warn("onnxruntime training package info: __version__: %s" % version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:120: UserWarning: onnxruntime training package info: cuda_version: 12.2

  warnings.warn("onnxruntime training package info: cuda_version: %s" % cuda_version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:121: UserWarning: onnxruntime build info: cudart_version: 12020

  warnings.warn("onnxruntime build info: cudart_version: %s" % cudart_version)

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:129: UserWarning: WARNING: failed to find cudart version that matches onnxruntime build info

  warnings.warn("WARNING: failed to find cudart version that matches onnxruntime build info")

/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_validation.py:130: UserWarning: WARNING: found cudart versions: [12010]

  warnings.warn("WARNING: found cudart versions: %s" % local_cudart_versions)

[rank: 2] Global seed set to 42

[rank: 2] Global seed set to 42

Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4

[rank: 3] Global seed set to 42

Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4

[rank: 1] Global seed set to 42

Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4

----------------------------------------------------------------------------------------------------

distributed_backend=nccl

All distributed processes registered. Starting with 4 processes

----------------------------------------------------------------------------------------------------

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.5<0>

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:349 [0] NCCL INFO cudaDriverVersion 12010

NCCL version 2.18.3+cuda12.1

e2bd2729de1e4961bccb1c0d631e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Plugin Path : /opt/nccl-rdma-sharp-plugins/lib/libnccl-net.so

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P plugin IBext

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NET/IB : No device found.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Using network Socket

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO comm 0x9ffe0160 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 100000 commId 0x1e29799c32293b9b - Init START

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 0(=100000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 0(=100000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 0(=100000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 1(=200000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 1(=200000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 3 and 2. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 2(=300000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 2(=300000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 3 and 2. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 2(=300000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 3(=400000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 3(=400000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 3(=400000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between co1f3a4000001:902:2618 [1] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:9nnected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 0(=100000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 0(=100000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 0(=100000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 1(=200000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 1(=200000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 3 and 2. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 2(=300000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 2(=300000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 3 and 2. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 2(=300000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 3(=400000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 3(=400000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 3(=400000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO NVLS multicast support is not available on dev 0

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Channel 00/02 :    0   1   2   3

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Channel 01/02 :    0   1   2   3

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P Chunksize set to 131072

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting transport for rank 3

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 3(=400000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting transport for rank 3

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 3(=400000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting trans03:3893 [2] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 1(=200000)

e2bd2729de1e4961bccb1LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

   | Name               | Type                  | Params

--------------------------------------------------------------

0  | criterion          | BCEWithLogitsLoss     | 0     

1  | backbone           | ResNet                | 11.2 M

2  | classifier         | Linear                | 513   

3  | train_metrics      | ModuleDict            | 0     

4  | val_metrics        | ModuleDict            | 0     

5  | test_metrics       | ModuleDict            | 0     

6  | knn_acc_metric     | WeightedKNNClassifier | 0     

7  | momentum_backbone  | ResNet                | 11.2 M

8  | projector          | Sequential            | 1.6 M 

9  | momentum_projector | Sequential            | 1.6 M 

10 | predictor          | Sequential            | 1.1 M 

--------------------------------------------------------------

13.8 M    Trainable params

12.8 M    Non-trainable params

26.6 M    Total params

53.134    Total estimated model params size (MB)

Number of CPU cores: 32

port for rank 1

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting transport for rank 1

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Connected all rings

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting transport for rank 1

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Rank 0 selecting transport for rank 1

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 0 canConnect 0

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Transport 1 canConnect 1

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Connected all trees

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO MSCCL: No external scheduler found, using internal implementation

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO MSCCL: Internal Scheduler will use /usr/lib/x86_64-linux-gnu/msccl-algorithms as algorithm directory and /usr/lib/x86_64-linux-gnu/../share/nccl/msccl-algorithms as share algorithm directory and /usr/share/nccl/msccl-algorithms as package installed share algorithm directory 

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO Using MSCCL Algo files from /usr/share/nccl/msccl-algorithms

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO MSCCL: Initialization finished

e2bd2729de1e4961bccb1c0d6311f3a4000001:349:2588 [0] NCCL INFO comm 0x9ffe0160 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 100000 commId 0x1e29799c32293b9b - Init COMPLETE

Sanity Checking: 0it [00:00, ?it/s]

Sanity Checking:   0%|          | 0/2 [00:00

    raise e  # Re-raise the exception to handle it normally or to stop the program.

  

#####################################################################################################

Different errors are happening here in the middle

#####################################################################################################

e2bd2729de1e4961bccb1c0d6311f3a4000001:904:3903 [3] NCCL INFO [Service thread] Connection closed by localRank 3

e2bd2729de1e4961bccb1c0d6311f3a4000001:902:3902 [1] NCCL INFO [Service thread] Connection closed by localRank 1

Epoch 0:   0%|          | 0/4 [00:06

romungi-MSFT 48,906 Reputation points Microsoft Employee Moderator

2024-03-07T06:35:48.69+00:00

@Daniel Otero Gómez I'm not familiar with GPU training scenarios, but according to the available documentation for GPU training with PyTorch, here is some guidance, along with the supported VM types, in case you have not checked it already.
