Multi-gpu training with slurm times out #20434

@nightingal3

Bug description

(Note: cross-posting from litgpt since I think this may actually be a pytorch-lightning issue.)

I was transferring some checkpoints from a cluster that doesn't use Slurm to one that does. I trained the checkpoint using multiple GPUs/nodes, and I'm able to load it and resume training in an interactive job. However, when I submit the job with sbatch, it times out after a while.

I've read this guide: https://lightning.ai/docs/fabric/2.4.0/guide/multi_node/slurm.html and added srun to my submission script. However, even though 4 devices appear to be initialized, the run still gets stuck before training starts and eventually times out.
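
For reference, the pattern from that guide that I was trying to follow looks roughly like this (script name and resource values are placeholders, not my exact setup):

```bash
#!/bin/bash
#SBATCH --nodes=1               # matches num_nodes passed to Fabric
#SBATCH --ntasks-per-node=4     # one task per GPU, matches devices=4
#SBATCH --gres=gpu:4

# srun starts one process per task; Lightning/Fabric reads the rank and
# world size from the SLURM environment variables it sets.
srun python train.py
```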

A debug log and my submission script are included below. My sbatch script is a bit different from the docs example since it runs another sh script, which does some setup and then runs `litgpt pretrain <...>`, but I'm not sure whether this would be an issue.

I also tried setting the Fabric initialization to explicitly pass the number of nodes, devices, etc., like in the example in pretrain.py, but it didn't make a difference:

fabric = L.Fabric(
    accelerator="gpu", devices=4, num_nodes=1, strategy=strategy, precision=precision, loggers=[logger]
)

Details:
My script:

#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=slurm_logs/%j.out
#SBATCH --time=2-00:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:A6000:4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --mail-user=<email>
#SBATCH --mail-type=ALL

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_DEBUG=INFO

# Check if training type is provided
if [ $# -eq 0 ]; then
    echo "Usage: $0 <sequential|mixed> [training args...]"
    exit 1
fi

# Get the training type and remove it from args
TRAIN_TYPE=$1
shift

case $TRAIN_TYPE in
    sequential)
        srun ./pretrain_then_finetune.sh "$@"
        ;;
    mixed)
        srun ./mixed_pretraining_fixed.sh "$@"
        ;;
    *)
        echo "Invalid training type. Use 'sequential' or 'mixed'"
        exit 1
        ;;
esac

Example debug log:

Node information:
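
The info below was collected with a small diagnostic snippet at the top of the launched script, roughly the following (reconstructed from memory):

```bash
echo "=== Slurm Environment ==="
echo "SLURM_NTASKS: $SLURM_NTASKS"
echo "SLURM_PROCID: $SLURM_PROCID"
echo "SLURM_LOCALID: $SLURM_LOCALID"
echo "SLURM_JOB_ID: $SLURM_JOB_ID"

echo ""
echo "=== GPU Information ==="
echo "Available GPUs:"
nvidia-smi -L        # lists each GPU with its UUID

echo ""
echo "GPU Topology:"
nvidia-smi topo -m   # prints the interconnect matrix and legend
```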

=== Slurm Environment ===
SLURM_NTASKS: 4
SLURM_PROCID: 0
SLURM_LOCALID: 0
SLURM_JOB_ID: 3178163

=== GPU Information ===
Available GPUs:
GPU 0: NVIDIA RTX A6000 (UUID: GPU-b349d8f4-c2a8-bd4b-2ed8-4678cc3093ad)
GPU 1: NVIDIA RTX A6000 (UUID: GPU-6386b3c4-ba07-b55c-a8d8-1d7e38378b83)
GPU 2: NVIDIA RTX A6000 (UUID: GPU-8a6310cf-0811-4754-64db-8c4117d4be50)
GPU 3: NVIDIA RTX A6000 (UUID: GPU-99486ac3-a03a-16fa-0e46-8bade74f121a)

GPU Topology:
	GPU0	GPU1	GPU2	GPU3	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	SYS	SYS	SYS	1-2,7-8,129-130	0		N/A
GPU1	SYS	 X 	NV4	NODE	NODE	64-65,67,69	1		N/A
GPU2	SYS	NV4	 X 	NODE	NODE	64-65,67,69	1		N/A
GPU3	SYS	NODE	NODE	 X 	NODE	64-65,67,69	1		N/A
NIC0	SYS	NODE	NODE	NODE	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

What operating system are you using?

Linux

LitGPT Version

Version: 0.4.0

What version are you seeing the problem on?

v2.3

How to reproduce the bug

submission script:

```bash
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=slurm_logs/%j.out
#SBATCH --time=2-00:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:A6000:4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --mail-user=<email>
#SBATCH --mail-type=ALL

# Get training type
TRAIN_TYPE=$1
shift

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_DEBUG=INFO

# Run training script
case $TRAIN_TYPE in
    sequential)
        srun ./pretrain_then_finetune.sh "$@"
        ;;
    mixed)
        srun ./mixed_pretraining_fixed.sh "$@"
        ;;
    *)
        echo "Invalid training type. Use 'sequential' or 'mixed'"
        exit 1
        ;;
esac
```


inside `pretrain_then_finetune.sh`:
```bash
<conda activate the env>

litgpt pretrain $model_name <...>
```

### Error messages and logs

```
[...previous stuff]
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[rank: 0] Seed set to 42
[rank: 0] Seed set to 42
[rank: 0] Seed set to 42

distributed_backend=nccl
All distributed processes registered. Starting with 4 processes

[rank: 0] Seed set to 42
babel-0-31:2324237:2324237 [0] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324237:2324237 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.4
/home/mengyan3/.local/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
warnings.warn(
babel-0-31:2324240:2324240 [1] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324240:2324240 [1] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324240:2324524 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324240:2324524 [1] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324240:2324524 [1] NCCL INFO Using network IB
babel-0-31:2324240:2324524 [1] NCCL INFO ncclCommInitRank comm 0x555d6dc227b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 81000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324240:2324524 [1] NCCL INFO Setting affinity for GPU 1 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324240:2324524 [1] NCCL INFO NVLS multicast support is not available on dev 1
babel-0-31:2324240:2324524 [1] NCCL INFO comm 0x555d6dc227b0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
babel-0-31:2324240:2324524 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
babel-0-31:2324240:2324524 [1] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Connected all rings
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Connected all trees
babel-0-31:2324240:2324524 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324240:2324524 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324240:2324524 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324240:2324524 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324240:2324524 [1] NCCL INFO ncclCommInitRank comm 0x555d6dc227b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 81000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank1]:[E1119 13:21:12.512786305 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
babel-0-31:2324239:2324239 [3] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324239:2324239 [3] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324239:2324522 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324239:2324522 [3] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324239:2324522 [3] NCCL INFO Using network IB
babel-0-31:2324239:2324522 [3] NCCL INFO ncclCommInitRank comm 0x5584021f39f0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e1000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324239:2324522 [3] NCCL INFO Setting affinity for GPU 3 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324239:2324522 [3] NCCL INFO NVLS multicast support is not available on dev 3
babel-0-31:2324239:2324522 [3] NCCL INFO comm 0x5584021f39f0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
babel-0-31:2324239:2324522 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
babel-0-31:2324239:2324522 [3] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Connected all rings
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Connected all trees
babel-0-31:2324239:2324522 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324239:2324522 [3] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324239:2324522 [3] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324239:2324522 [3] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324239:2324522 [3] NCCL INFO ncclCommInitRank comm 0x5584021f39f0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e1000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank3]:[E1119 13:21:12.512781555 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
babel-0-31:2324238:2324238 [2] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324238:2324238 [2] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324238:2324523 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324238:2324523 [2] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324238:2324523 [2] NCCL INFO Using network IB
babel-0-31:2324238:2324523 [2] NCCL INFO ncclCommInitRank comm 0x55880160e670 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId a1000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324238:2324523 [2] NCCL INFO Setting affinity for GPU 2 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324238:2324523 [2] NCCL INFO NVLS multicast support is not available on dev 2
babel-0-31:2324238:2324523 [2] NCCL INFO comm 0x55880160e670 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
babel-0-31:2324238:2324523 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
babel-0-31:2324238:2324523 [2] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Connected all rings
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Connected all trees
babel-0-31:2324238:2324523 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324238:2324523 [2] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324238:2324523 [2] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324238:2324523 [2] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324238:2324523 [2] NCCL INFO ncclCommInitRank comm 0x55880160e670 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId a1000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank2]:[E1119 13:21:12.525244336 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
[rank1]:[E1119 13:21:13.938073877 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1119 13:21:13.938095107 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1119 13:21:13.938100947 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1119 13:21:13.938104817 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1119 13:21:13.938073737 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1119 13:21:13.938094977 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1119 13:21:13.938100577 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1119 13:21:13.938104557 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1119 13:21:13.938073817 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1119 13:21:13.938094667 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1119 13:21:13.938100907 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1119 13:21:13.938104757 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1119 13:21:13.092845528 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fdf89a24446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fdf8ad37772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fdf8ad3ebb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fdf8ad4061d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7fdfd36cd5c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: + 0x89c02 (0x7fdfe2c89c02 in /lib64/libc.so.6)
frame #6: + 0x10ec40 (0x7fdfe2d0ec40 in /lib64/libc.so.6)

[rank1]:[E1119 13:21:13.092997168 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b4e1d4446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f3b4f4e7772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b4f4eebb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f3b4f4f061d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f3b97e7d5c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: + 0x89c02 (0x7f3ba7489c02 in /lib64/libc.so.6)
frame #6: + 0x10ec40 (0x7f3ba750ec40 in /lib64/libc.so.6)

[rank3]:[E1119 13:21:13.092988978 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe0c139c446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fe0c26af772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe0c26b6bb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fe0c26b861d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7fe10b0455c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: + 0x89c02 (0x7fe11a689c02 in /lib64/libc.so.6)
frame #6: + 0x10ec40 (0x7fe11a70ec40 in /lib64/libc.so.6)

/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324239 Aborted (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324238 Aborted (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324240 Aborted (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
srun: First task exited 60s ago
srun: StepId=3178163.0 task 0: running
srun: StepId=3178163.0 tasks 1-3: exited
srun: Terminating StepId=3178163.0
slurmstepd: error: *** STEP 3178163.0 ON babel-0-31 CANCELLED AT 2024-11-19T13:24:07 ***
srun: Job step aborted: Waiting up to 122 seconds for job step to finish.
slurmstepd: error: --task-epilog failed status=9
```



### Environment

<details>
  <summary>Current environment</summary>

* CUDA:
	- GPU:
		- NVIDIA RTX A6000
		- NVIDIA RTX A6000
		- NVIDIA RTX A6000
		- NVIDIA RTX A6000
	- available:         True
	- version:           12.4
* Lightning:
	- botorch:           0.10.0
	- gpytorch:          1.11
	- lightning:         2.3.0.dev20240428
	- lightning-utilities: 0.11.8
	- pytorch-lightning: 2.3.1
	- torch:             2.5.1
	- torchmetrics:      1.4.0.post0
* Packages:
	- absl-py:           2.1.0
	- accelerate:        0.32.0
	- aiohttp:           3.9.5
	- aiosignal:         1.3.1
	- annotated-types:   0.7.0
	- antlr4-python3-runtime: 4.11.0
	- anyio:             4.4.0
	- argcomplete:       3.5.1
	- asttokens:         2.4.1
	- async-timeout:     4.0.3
	- attrs:             23.2.0
	- awscrt:            0.20.11
	- beautifulsoup4:    4.12.3
	- bitsandbytes:      0.42.0
	- boto3:             1.35.63
	- botocore:          1.34.138
	- botorch:           0.10.0
	- bs4:               0.0.2
	- build:             1.2.1
	- certifi:           2024.6.2
	- chardet:           5.2.0
	- charset-normalizer: 3.3.2
	- click:             8.1.7
	- colorama:          0.4.6
	- contourpy:         1.2.1
	- cycler:            0.12.1
	- dataproperty:      1.0.1
	- datasets:          2.20.0
	- dill:              0.3.8
	- distro:            1.9.0
	- dnspython:         2.6.1
	- docker-pycreds:    0.4.0
	- docstring-parser:  0.16
	- dotwiz:            0.4.0
	- email-validator:   2.2.0
	- evaluate:          0.4.2
	- exceptiongroup:    1.2.1
	- executing:         2.0.1
	- exrex:             0.11.0
	- fastapi:           0.111.0
	- fastapi-cli:       0.0.4
	- filelock:          3.16.1
	- fonttools:         4.53.1
	- frozenlist:        1.4.1
	- fsspec:            2024.10.0
	- funcy:             2.0
	- git-filter-repo:   2.34.0
	- gitdb:             4.0.11
	- gitpython:         3.1.43
	- gpytorch:          1.11
	- grpcio:            1.64.1
	- h11:               0.14.0
	- hf-transfer:       0.1.6
	- httpcore:          1.0.5
	- httptools:         0.6.1
	- httpx:             0.27.0
	- huggingface-hub:   0.23.4
	- idna:              3.7
	- importlib-metadata: 8.0.0
	- importlib-resources: 6.4.0
	- jaxtyping:         0.2.33
	- jinja2:            3.1.4
	- jiter:             0.5.0
	- jmespath:          1.0.1
	- joblib:            1.4.2
	- jsonargparse:      4.31.0
	- jsonlines:         4.0.0
	- kiwisolver:        1.4.5
	- lightning:         2.3.0.dev20240428
	- lightning-utilities: 0.11.8
	- linear-operator:   0.5.1
	- litdata:           0.2.30
	- litgpt:            0.4.0
	- litserve:          0.1.1.dev0
	- littleutils:       0.2.4
	- lm-eval:           0.4.3
	- lxml:              5.2.2
	- magicattr:         0.1.6
	- markdown:          3.6
	- markdown-it-py:    3.0.0
	- markupsafe:        2.1.5
	- matplotlib:        3.9.1.post1
	- mbstrdecoder:      1.1.3
	- mdurl:             0.1.2
	- more-itertools:    10.3.0
	- mpmath:            1.3.0
	- multidict:         6.0.5
	- multipledispatch:  1.0.0
	- multiprocess:      0.70.16
	- networkx:          3.2.1
	- nltk:              3.8.1
	- numexpr:           2.10.1
	- numpy:             1.26.4
	- nvidia-cublas-cu12: 12.4.5.8
	- nvidia-cuda-cupti-cu12: 12.4.127
	- nvidia-cuda-nvrtc-cu12: 12.4.127
	- nvidia-cuda-runtime-cu12: 12.4.127
	- nvidia-cudnn-cu12: 9.1.0.70
	- nvidia-cufft-cu12: 11.2.1.3
	- nvidia-curand-cu12: 10.3.5.147
	- nvidia-cusolver-cu12: 11.6.1.9
	- nvidia-cusparse-cu12: 12.3.1.170
	- nvidia-nccl-cu12:  2.21.5
	- nvidia-nvjitlink-cu12: 12.4.127
	- nvidia-nvtx-cu12:  12.4.127
	- openai:            1.43.0
	- opt-einsum:        3.3.0
	- orjson:            3.10.6
	- packaging:         24.1
	- pandas:            2.2.2
	- pathvalidate:      3.2.0
	- peft:              0.11.1
	- pillow:            10.4.0
	- pip:               24.0
	- pip-tools:         7.4.1
	- platformdirs:      4.2.2
	- portalocker:       2.10.0
	- protobuf:          4.25.3
	- psutil:            6.0.0
	- pyarrow:           16.1.0
	- pyarrow-hotfix:    0.6
	- pybind11:          2.13.1
	- pydantic:          2.8.0
	- pydantic-core:     2.20.0
	- pygments:          2.18.0
	- pyheck:            0.1.5
	- pyparsing:         3.1.2
	- pyproject-hooks:   1.1.0
	- pyro-api:          0.1.2
	- pyro-ppl:          1.9.1
	- pytablewriter:     1.2.0
	- python-dateutil:   2.9.0.post0
	- python-dotenv:     1.0.1
	- python-multipart:  0.0.9
	- pytorch-lightning: 2.3.1
	- pytz:              2024.1
	- pyyaml:            6.0.1
	- regex:             2024.5.15
	- requests:          2.32.3
	- rich:              13.7.1
	- rouge-score:       0.1.2
	- s3transfer:        0.10.3
	- sacrebleu:         2.4.2
	- safetensors:       0.4.3
	- scikit-learn:      1.5.1
	- scipy:             1.13.1
	- sentencepiece:     0.2.0
	- sentry-sdk:        2.7.1
	- setproctitle:      1.3.3
	- setuptools:        69.5.1
	- shellingham:       1.5.4
	- six:               1.16.0
	- smmap:             5.0.1
	- sniffio:           1.3.1
	- sorcery:           0.2.2
	- soupsieve:         2.6
	- sqlitedict:        2.1.0
	- starlette:         0.37.2
	- sympy:             1.13.1
	- tabledata:         1.3.3
	- tabulate:          0.9.0
	- tasksource:        0.0.45
	- tcolorpy:          0.1.6
	- tensorboard:       2.17.0
	- tensorboard-data-server: 0.7.2
	- threadpoolctl:     3.5.0
	- tokenizers:        0.19.1
	- tomli:             2.0.1
	- tomlkit:           0.13.2
	- torch:             2.5.1
	- torchmetrics:      1.4.0.post0
	- tqdm:              4.66.4
	- tqdm-multiprocess: 0.0.11
	- transformers:      4.42.3
	- triton:            3.1.0
	- typeguard:         2.13.3
	- typepy:            1.3.2
	- typer:             0.12.3
	- typeshed-client:   2.5.1
	- typing-extensions: 4.12.2
	- tzdata:            2024.1
	- ujson:             5.10.0
	- urllib3:           1.26.19
	- uvicorn:           0.30.1
	- uvloop:            0.19.0
	- wandb:             0.17.4
	- watchfiles:        0.22.0
	- websockets:        12.0
	- werkzeug:          3.0.3
	- wheel:             0.43.0
	- word2number:       1.1
	- wrapt:             1.16.0
	- xmltodict:         0.14.2
	- xxhash:            3.4.1
	- yarl:              1.9.4
	- zipp:              3.19.2
	- zstandard:         0.22.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.9.0
	- release:           5.14.0-427.40.1.el9_4.x86_64
	- version:           #1 SMP PREEMPT_DYNAMIC Wed Oct 16 07:08:17 EDT 2024

</details>


### More info

_No response_

cc @justusschock @lantiga
