NCCL INFO and WARNING logs not present in new AzureML runtime

Kyriaki Dionysopoulou 1 Reputation point
2022-08-06T10:11:51.157+00:00

Hi all,

Recently, I have been trying to debug a very strange NCCL OOM issue, and I noticed that with the old runtime NCCL prints an OOM warning in 70_driver_log_0.txt, whereas with the new runtime all of these NCCL logs disappear. My hunch is that the warnings are somehow suppressed in the new runtime. Is there a way to enable them again? If I force the old runtime by specifying the environment variable pytorch_env.environment_variables = {"AZUREML_COMPUTE_USE_COMMON_RUNTIME": "false"} on my runs, the warning is logged. My worry is that warnings like these (i.e. OOM warnings that don't make the job crash) would be difficult to detect if they are switched off by default in the new runtime. Otherwise, is there another reason why this warning wouldn't make it into the logs? If it does appear somewhere, could you point me to the right log file? Note that my job completes successfully even with the warning message below, so I can't really troubleshoot based on job status.

:200:451 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.0.0.8<>  
:200:451 [0] NCCL INFO include/socket.h:445 -> 2  
:200:451 [0] NCCL INFO include/socket.h:457 -> 2  
:200:451 [0] NCCL INFO bootstrap.cc:229 -> 2  
  
:200:451 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 71)  
  
:200:451 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'  
:200:451 [0] NCCL INFO bootstrap.cc:231 -> 1  
  
:200:451 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 70)  
  
:200:451 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'  
:200:451 [0] NCCL INFO bootstrap.cc:231 -> 1  
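
For reference, this is roughly how I force the old runtime (AzureML SDK v1). The environment, script, and experiment names below are placeholders, and the distributed job configuration is omitted:

    from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

    ws = Workspace.from_config()

    # Placeholder environment name; cloned so the original environment is left untouched.
    pytorch_env = Environment.get(workspace=ws, name="my-pytorch-env").clone("my-pytorch-env-legacy")

    # Forcing the legacy runtime is what brings the NCCL INFO/WARN lines back
    # into 70_driver_log_0.txt.
    pytorch_env.environment_variables = {"AZUREML_COMPUTE_USE_COMMON_RUNTIME": "false"}

    src = ScriptRunConfig(source_directory=".", script="train.py", environment=pytorch_env)
    run = Experiment(ws, "nccl-oom-debug").submit(src)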

Thanks in advance for your time.

Best,

Kiki


1 answer

  1. Kyriaki Dionysopoulou 1 Reputation point
    2022-08-09T07:49:32.753+00:00

    Hi @Ramr-msft and thanks for your reply. We are currently using MLflow to log metrics, images, etc. Unfortunately, the logs I am referring to in my question have nothing to do with the model logs; they are NCCL library logs, over which (as far as I know) we have no control, since they are produced by the library itself. I am not sure whether there are any environment variables I should set in the new runtime for the NCCL INFO and WARN logs to appear in any of the available log files. Would, for example, the environment variable AZUREML_CR_HT_LOG_FILTERING_POLICY be relevant here? If so, what values does it accept?
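
    In the meantime, here is a minimal sketch of what I would try next, assuming NCCL's standard debug variables (NCCL_DEBUG, NCCL_DEBUG_SUBSYS, NCCL_DEBUG_FILE) are honoured and surfaced by the new runtime, which is exactly what I am unsure about:

    # Set NCCL's own debug variables on the same Environment object (pytorch_env)
    # as in my question. This is a sketch, not documented AzureML behaviour.
    pytorch_env.environment_variables.update({
        "NCCL_DEBUG": "INFO",        # emit NCCL INFO and WARN messages
        "NCCL_DEBUG_SUBSYS": "ALL",  # include all subsystems (INIT, NET, ALLOC, ...)
        # Optionally also write the NCCL output to a per-host/per-process file under ./outputs
        "NCCL_DEBUG_FILE": "./outputs/nccl.%h.%p.log",
    })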

    A sample of the 70_driver_log_0.txt file is shown below

    bash: /azureml-envs/azureml_<REDACTED>/lib/libtinfo.so.6: no version information available (required by bash)  
    [2022-08-06T10:14:50.915054] Entering context manager injector.  
    [2022-08-06T10:14:51.453543] context_manager_injector.py Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=[<REDACTED>])  
    This is a PyTorch job. Rank:0  
    Script type = None  
    [2022-08-06T10:14:51.457604] Entering Run History Context Manager.  
    [2022-08-06T10:14:52.089533] Current directory: /mnt/batch/tasks/shared/LS_root/jobs/<REDACTED>  
    [2022-08-06T10:14:52.089756] Preparing to call script [<REDACTED>]  
    [2022-08-06T10:14:52.089847] After variable expansion, calling script [<REDACTED>] with arguments:[<REDACTED>]  
    INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0  
    INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.  
    INFO:root:Device used: cuda:0  
    <REDACTED> [0] NCCL INFO Bootstrap : Using eth0:10.0.0.4<0>  
    <REDACTED> [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation  
    <REDACTED> [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.  
    <REDACTED> [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.4<0>  
    <REDACTED> [0] NCCL INFO Using network Socket  
    NCCL version 2.10.3+cuda10.2  
    <REDACTED> NCCL INFO Channel 00/02 :    0   1   2   3  
    <REDACTED> NCCL INFO Channel 01/02 :    0   1   2   3  
    <REDACTED> NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1  
    <REDACTED> NCCL INFO Setting affinity for GPU 0 to 0fff  
    <REDACTED> NCCL INFO Channel 00 : 0[846d00000] -> 1[aff500000] via direct shared memory  
    <REDACTED> NCCL INFO Channel 01 : 0[846d00000] -> 1[aff500000] via direct shared memory  
    <REDACTED> NCCL INFO Connected all rings  
    <REDACTED> NCCL INFO Connected all trees  
    <REDACTED> NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512  
    <REDACTED> NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer  
    <REDACTED> NCCL INFO comm 0x7f9358001240 rank 0 nranks 4 cudaDev 0 busId 846d00000 - Init COMPLETE  
    <REDACTED> NCCL INFO Launch mode Parallel  
    INFO:root:Epoch 1  
    

    and later on I get the OOM warning

     <REDACTED>:200:451 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.0.0.8<REDACTED>  
     <REDACTED>:200:451 [0] NCCL INFO include/socket.h:445 -> 2  
     <REDACTED>:200:451 [0] NCCL INFO include/socket.h:457 -> 2  
     <REDACTED>:200:451 [0] NCCL INFO bootstrap.cc:229 -> 2  
          
     <REDACTED>:200:451 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 71)  
          
     <REDACTED>:200:451 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'  
     <REDACTED>:200:451 [0] NCCL INFO bootstrap.cc:231 -> 1  
          
     <REDACTED>:200:451 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 70)  
          
     <REDACTED>:200:451 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'  
     <REDACTED>:200:451 [0] NCCL INFO bootstrap.cc:231 -> 1  
    

    For comparison, the same run using the new runtime produces the logs below (which are missing the corresponding NCCL info and warning lines).

    INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0  
    INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.  
    INFO:azureml._restclient.clientbase:Created a worker pool for first use  
    INFO:root:Device used: cuda:0  
    INFO:root:Epoch 1  
    

    I also checked the system_logs/hosttools_capability/hosttools-capability.log file and it does not contain any of them.

    Any help would be greatly appreciated.

