Hi @Ramr-msft and thanks for your reply. We are currently using MLflow to log metrics, images, etc. Unfortunately, the logs I am referring to in my question have nothing to do with the model logs; they are the NCCL library logs, which (as far as I know) we have no control over and which are produced by the library itself. I am not sure whether there are any environment variables I should set in the new runtime so that the NCCL info and warning messages appear in any of the available log files. Would, for example, the environment variable AZUREML_CR_HT_LOG_FILTERING_POLICY be relevant here? If so, what values does it accept?
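In case a concrete example helps, this is roughly what I would try from our side to force NCCL logging (a minimal sketch against the azureml-core v1 SDK that our jobs currently use; NCCL_DEBUG and NCCL_DEBUG_FILE are standard NCCL settings, the environment/script names are placeholders, and whether the new runtime actually surfaces this output anywhere is exactly what I am unsure about):

```python
# Sketch only: explicitly request NCCL logging via environment variables on the
# training environment. ws is assumed to be an existing Workspace object, and
# "my-training-env" / "train.py" are placeholder names.
from azureml.core import Environment, ScriptRunConfig

env = Environment.get(workspace=ws, name="my-training-env")
env = env.clone("my-training-env-nccl-debug")

# Ask NCCL itself to emit INFO-level messages (this also covers the WARN lines).
env.environment_variables["NCCL_DEBUG"] = "INFO"
# Optionally redirect them to a dedicated file, one per host (%h) and process (%p),
# instead of relying on whatever the runtime does with stdout/stderr.
env.environment_variables["NCCL_DEBUG_FILE"] = "./logs/nccl.%h.%p.log"

src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    environment=env,
)
```

Is something along these lines expected to work with the new runtime, or does it filter these messages out regardless?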
A sample of the 70_driver_log_0.txt file (from the run on the old runtime) is shown below:
bash: /azureml-envs/azureml_<REDACTED>/lib/libtinfo.so.6: no version information available (required by bash)
[2022-08-06T10:14:50.915054] Entering context manager injector.
[2022-08-06T10:14:51.453543] context_manager_injector.py Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'Dataset:context_managers.Datasets', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=[<REDACTED>])
This is a PyTorch job. Rank:0
Script type = None
[2022-08-06T10:14:51.457604] Entering Run History Context Manager.
[2022-08-06T10:14:52.089533] Current directory: /mnt/batch/tasks/shared/LS_root/jobs/<REDACTED>
[2022-08-06T10:14:52.089756] Preparing to call script [<REDACTED>]
[2022-08-06T10:14:52.089847] After variable expansion, calling script [<REDACTED>] with arguments:[<REDACTED>]
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:root:Device used: cuda:0
<REDACTED> [0] NCCL INFO Bootstrap : Using eth0:10.0.0.4<0>
<REDACTED> [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
<REDACTED> [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
<REDACTED> [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.4<0>
<REDACTED> [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
<REDACTED> NCCL INFO Channel 00/02 : 0 1 2 3
<REDACTED> NCCL INFO Channel 01/02 : 0 1 2 3
<REDACTED> NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
<REDACTED> NCCL INFO Setting affinity for GPU 0 to 0fff
<REDACTED> NCCL INFO Channel 00 : 0[846d00000] -> 1[aff500000] via direct shared memory
<REDACTED> NCCL INFO Channel 01 : 0[846d00000] -> 1[aff500000] via direct shared memory
<REDACTED> NCCL INFO Connected all rings
<REDACTED> NCCL INFO Connected all trees
<REDACTED> NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
<REDACTED> NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
<REDACTED> NCCL INFO comm 0x7f9358001240 rank 0 nranks 4 cudaDev 0 busId 846d00000 - Init COMPLETE
<REDACTED> NCCL INFO Launch mode Parallel
INFO:root:Epoch 1
Later on in the same file I get the out-of-memory (OOM) warnings:
<REDACTED>:200:451 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.0.0.8<REDACTED>
<REDACTED>:200:451 [0] NCCL INFO include/socket.h:445 -> 2
<REDACTED>:200:451 [0] NCCL INFO include/socket.h:457 -> 2
<REDACTED>:200:451 [0] NCCL INFO bootstrap.cc:229 -> 2
<REDACTED>:200:451 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 71)
<REDACTED>:200:451 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
<REDACTED>:200:451 [0] NCCL INFO bootstrap.cc:231 -> 1
<REDACTED>:200:451 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 70)
<REDACTED>:200:451 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
<REDACTED>:200:451 [0] NCCL INFO bootstrap.cc:231 -> 1
For comparison, the same run using the new runtime produces the logs below, which are missing the corresponding NCCL info and warning lines:
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:azureml._restclient.clientbase:Created a worker pool for first use
INFO:root:Device used: cuda:0
INFO:root:Epoch 1
I also checked the system_logs/hosttools_capability/hosttools-capability.log file, and it does not contain any of these NCCL messages either.
Any help would be greatly appreciated.