Hi @Ramr-msft and thanks for your reply. We are currently using MLflow to log metrics, images, etc. Unfortunately, the logs I am referring to in my question have nothing to do with the model logs; they are the NCCL library logs, which (as far as I know) we have no control over and which are produced by the library itself. I am not sure whether there are any environment variables I should set in the new runtime so that the NCCL info and warning messages appear in any of the available log files. Would, for example, the environment variable AZUREML_CR_HT_LOG_FILTERING_POLICY be relevant here? If so, what values does it accept?
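In case a concrete example helps, this is roughly what I would try from our side to force NCCL logging (a minimal sketch against the azureml-core v1 SDK that our jobs currently use; NCCL_DEBUG and NCCL_DEBUG_FILE are standard NCCL settings, the environment/script names are placeholders, and whether the new runtime actually surfaces this output anywhere is exactly what I am unsure about):

```python
# Sketch only: explicitly request NCCL logging via environment variables on the
# training environment. ws is assumed to be an existing Workspace object, and
# "my-training-env" / "train.py" are placeholder names.
from azureml.core import Environment, ScriptRunConfig

env = Environment.get(workspace=ws, name="my-training-env")
env = env.clone("my-training-env-nccl-debug")

# Ask NCCL itself to emit INFO-level messages (this also covers the WARN lines).
env.environment_variables["NCCL_DEBUG"] = "INFO"
# Optionally redirect them to a dedicated file, one per host (%h) and process (%p),
# instead of relying on whatever the runtime does with stdout/stderr.
env.environment_variables["NCCL_DEBUG_FILE"] = "./logs/nccl.%h.%p.log"

src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    environment=env,
)
```

Is something along these lines expected to work with the new runtime, or does it filter these messages out regardless?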
A sample of the 70_driver_log_0.txt file (from the run on the old runtime) is shown below:
bash: /azureml-envs/azureml_<REDACTED>/lib/libtinfo.so.6: no version information available (required by bash)
[2022-08-06T10:14:50.915054] Entering context manager injector.
[2022-08-06T10:14:51.453543] context_manager_injector.py Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=[<REDACTED>])
This is a PyTorch job. Rank:0
Script type = None
[2022-08-06T10:14:51.457604] Entering Run History Context Manager.
[2022-08-06T10:14:52.089533] Current directory: /mnt/batch/tasks/shared/LS_root/jobs/<REDACTED>
[2022-08-06T10:14:52.089756] Preparing to call script [<REDACTED>]
[2022-08-06T10:14:52.089847] After variable expansion, calling script [<REDACTED>] with arguments:[<REDACTED>]
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:root:Device used: cuda:0
<REDACTED> [0] NCCL INFO Bootstrap : Using eth0:10.0.0.4<0>
<REDACTED> [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
<REDACTED> [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
<REDACTED> [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.4<0>
<REDACTED> [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
<REDACTED> NCCL INFO Channel 00/02 : 0 1 2 3
<REDACTED> NCCL INFO Channel 01/02 : 0 1 2 3
<REDACTED> NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
<REDACTED> NCCL INFO Setting affinity for GPU 0 to 0fff
<REDACTED> NCCL INFO Channel 00 : 0[846d00000] -> 1[aff500000] via direct shared memory
<REDACTED> NCCL INFO Channel 01 : 0[846d00000] -> 1[aff500000] via direct shared memory
<REDACTED> NCCL INFO Connected all rings
<REDACTED> NCCL INFO Connected all trees
<REDACTED> NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
<REDACTED> NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
<REDACTED> NCCL INFO comm 0x7f9358001240 rank 0 nranks 4 cudaDev 0 busId 846d00000 - Init COMPLETE
<REDACTED> NCCL INFO Launch mode Parallel
INFO:root:Epoch 1
Later on in the same file I get the out-of-memory (OOM) warnings:
<REDACTED>:200:451 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.0.0.8<REDACTED>
<REDACTED>:200:451 [0] NCCL INFO include/socket.h:445 -> 2
<REDACTED>:200:451 [0] NCCL INFO include/socket.h:457 -> 2
<REDACTED>:200:451 [0] NCCL INFO bootstrap.cc:229 -> 2
<REDACTED>:200:451 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 71)
<REDACTED>:200:451 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
<REDACTED>:200:451 [0] NCCL INFO bootstrap.cc:231 -> 1
<REDACTED>:200:451 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 70)
<REDACTED>:200:451 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
<REDACTED>:200:451 [0] NCCL INFO bootstrap.cc:231 -> 1
For comparison, the same run using the new runtime produces the logs below, which are missing the corresponding NCCL info and warning lines:
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:azureml._restclient.clientbase:Created a worker pool for first use
INFO:root:Device used: cuda:0
INFO:root:Epoch 1
I also checked the system_logs/hosttools_capability/hosttools-capability.log file, and it does not contain any of these NCCL messages either.
Any help would be greatly appreciated.