### Bug description
(Note: cross-posting from litgpt, since I think this may actually be a pytorch-lightning issue.)

I was transferring some checkpoints from a cluster that doesn't use SLURM to one that does. The checkpoints were trained on multiple GPUs/nodes, and I can load one and start training from it in an interactive job. However, when I submit the same job with sbatch, it gets stuck and eventually times out.

I've read the SLURM guide (https://lightning.ai/docs/fabric/2.4.0/guide/multi_node/slurm.html) and added srun to my submission script. However, even though all 4 devices appear to be initialized, the run still hangs before training starts and times out.

A debug log and my submission script are included below. My sbatch script is a bit unusual in that it runs another shell script via srun, which does some setup and then calls litgpt pretrain <...>, but I'm not sure whether that matters.

I also tried passing the number of nodes, devices, etc. explicitly to the Fabric initialization, like in the example in pretrain.py, but it made no difference:
```python
fabric = L.Fabric(
    accelerator="gpu", devices=4, num_nodes=1, strategy=strategy, precision=precision, loggers=[logger]
)
```
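For completeness, here is a minimal standalone sketch of what that explicit initialization looks like (this is not the litgpt entry point itself, and the file name is just for illustration). The broadcast at the end is only there to exercise the first collective, since the watchdog reports the timeout on a BROADCAST op; something like this could be launched with `srun python repro.py` under the same sbatch settings:

```python
# repro.py -- minimal Fabric init plus one broadcast (a sketch, not the actual litgpt code)
import lightning as L
import torch

fabric = L.Fabric(accelerator="gpu", devices=4, num_nodes=1, strategy="ddp")
fabric.launch()  # under srun, Fabric should pick up ranks/world size from the SLURM env

print(
    f"global_rank={fabric.global_rank} local_rank={fabric.local_rank} "
    f"world_size={fabric.world_size} device={fabric.device}"
)

# First collective: in my runs this is roughly where everything hangs until the
# 30-minute NCCL watchdog timeout fires.
t = torch.tensor([fabric.global_rank], device=fabric.device)
t = fabric.broadcast(t, src=0)
print(f"rank {fabric.global_rank}: broadcast result = {t.item()}")
```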
Details:

My submission script:

```bash
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=slurm_logs/%j.out
#SBATCH --time=2-00:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:A6000:4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --mail-user=<email>
#SBATCH --mail-type=ALL

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_DEBUG=INFO

# Check if training type is provided
if [ $# -eq 0 ]; then
    echo "Usage: $0 <sequential|mixed> [training args...]"
    exit 1
fi

# Get the training type and remove it from args
TRAIN_TYPE=$1
shift

case $TRAIN_TYPE in
    sequential)
        srun ./pretrain_then_finetune.sh "$@"
        ;;
    mixed)
        srun ./mixed_pretraining_fixed.sh "$@"
        ;;
    *)
        echo "Invalid training type. Use 'sequential' or 'mixed'"
        exit 1
        ;;
esac
```

Debug example error:
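(For reference, here is a rough sketch of the kind of per-task check that produces the "Node information" block below, so the output is easier to interpret; file and variable names are illustrative.)

```python
# node_info.py -- rough equivalent of the per-task debug check whose output appears below
import os
import subprocess

print("=== Slurm Environment ===")
for var in ("SLURM_NTASKS", "SLURM_PROCID", "SLURM_LOCALID", "SLURM_JOB_ID"):
    print(f"{var}: {os.environ.get(var)}")

print("=== GPU Information ===")
print("Available GPUs:")
subprocess.run(["nvidia-smi", "-L"], check=False)          # one line per GPU, with UUID
print("GPU Topology:")
subprocess.run(["nvidia-smi", "topo", "-m"], check=False)  # connectivity matrix + legend
```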
Node information:
=== Slurm Environment ===
SLURM_NTASKS: 4
SLURM_PROCID: 0
SLURM_LOCALID: 0
SLURM_JOB_ID: 3178163
=== GPU Information ===
Available GPUs:
GPU 0: NVIDIA RTX A6000 (UUID: GPU-b349d8f4-c2a8-bd4b-2ed8-4678cc3093ad)
GPU 1: NVIDIA RTX A6000 (UUID: GPU-6386b3c4-ba07-b55c-a8d8-1d7e38378b83)
GPU 2: NVIDIA RTX A6000 (UUID: GPU-8a6310cf-0811-4754-64db-8c4117d4be50)
GPU 3: NVIDIA RTX A6000 (UUID: GPU-99486ac3-a03a-16fa-0e46-8bade74f121a)
GPU Topology:
GPU0 GPU1 GPU2 GPU3 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS SYS 1-2,7-8,129-130 0 N/A
GPU1 SYS X NV4 NODE NODE 64-65,67,69 1 N/A
GPU2 SYS NV4 X NODE NODE 64-65,67,69 1 N/A
GPU3 SYS NODE NODE X NODE 64-65,67,69 1 N/A
NIC0 SYS NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0

### What operating system are you using?
Linux
### LitGPT Version
Version: 0.4.0
### What version are you seeing the problem on?
v2.3
### How to reproduce the bug
Submission script:

```bash
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=slurm_logs/%j.out
#SBATCH --time=2-00:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:A6000:4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH [email protected]
#SBATCH --mail-type=ALL

# Get training type
TRAIN_TYPE=$1
shift

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_DEBUG=INFO

# Run training script
case $TRAIN_TYPE in
    sequential)
        srun ./pretrain_then_finetune.sh "$@"
        ;;
    mixed)
        srun ./mixed_pretraining_fixed.sh "$@"
        ;;
    *)
        echo "Invalid training type. Use 'sequential' or 'mixed'"
        exit 1
        ;;
esac
```
inside `pretrain_then_finetune.sh`:
```bash
<conda activate the env>
litgpt pretrain $model_name <...>
```
### Error messages and logs
[...previous stuff]
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[rank: 0] Seed set to 42
[rank: 0] Seed set to 42
[rank: 0] Seed set to 42
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
[rank: 0] Seed set to 42
babel-0-31:2324237:2324237 [0] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324237:2324237 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.4
/home/mengyan3/.local/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
warnings.warn(
babel-0-31:2324240:2324240 [1] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324240:2324240 [1] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324240:2324524 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324240:2324524 [1] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324240:2324524 [1] NCCL INFO Using network IB
babel-0-31:2324240:2324524 [1] NCCL INFO ncclCommInitRank comm 0x555d6dc227b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 81000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324240:2324524 [1] NCCL INFO Setting affinity for GPU 1 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324240:2324524 [1] NCCL INFO NVLS multicast support is not available on dev 1
babel-0-31:2324240:2324524 [1] NCCL INFO comm 0x555d6dc227b0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
babel-0-31:2324240:2324524 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
babel-0-31:2324240:2324524 [1] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Connected all rings
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Connected all trees
babel-0-31:2324240:2324524 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324240:2324524 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324240:2324524 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324240:2324524 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324240:2324524 [1] NCCL INFO ncclCommInitRank comm 0x555d6dc227b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 81000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank1]:[E1119 13:21:12.512786305 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
babel-0-31:2324239:2324239 [3] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324239:2324239 [3] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324239:2324522 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324239:2324522 [3] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324239:2324522 [3] NCCL INFO Using network IB
babel-0-31:2324239:2324522 [3] NCCL INFO ncclCommInitRank comm 0x5584021f39f0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e1000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324239:2324522 [3] NCCL INFO Setting affinity for GPU 3 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324239:2324522 [3] NCCL INFO NVLS multicast support is not available on dev 3
babel-0-31:2324239:2324522 [3] NCCL INFO comm 0x5584021f39f0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
babel-0-31:2324239:2324522 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
babel-0-31:2324239:2324522 [3] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Connected all rings
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Connected all trees
babel-0-31:2324239:2324522 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324239:2324522 [3] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324239:2324522 [3] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324239:2324522 [3] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324239:2324522 [3] NCCL INFO ncclCommInitRank comm 0x5584021f39f0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e1000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank3]:[E1119 13:21:12.512781555 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
babel-0-31:2324238:2324238 [2] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324238:2324238 [2] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324238:2324523 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324238:2324523 [2] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324238:2324523 [2] NCCL INFO Using network IB
babel-0-31:2324238:2324523 [2] NCCL INFO ncclCommInitRank comm 0x55880160e670 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId a1000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324238:2324523 [2] NCCL INFO Setting affinity for GPU 2 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324238:2324523 [2] NCCL INFO NVLS multicast support is not available on dev 2
babel-0-31:2324238:2324523 [2] NCCL INFO comm 0x55880160e670 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
babel-0-31:2324238:2324523 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
babel-0-31:2324238:2324523 [2] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Connected all rings
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Connected all trees
babel-0-31:2324238:2324523 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324238:2324523 [2] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324238:2324523 [2] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324238:2324523 [2] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324238:2324523 [2] NCCL INFO ncclCommInitRank comm 0x55880160e670 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId a1000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank2]:[E1119 13:21:12.525244336 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
[rank1]:[E1119 13:21:13.938073877 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1119 13:21:13.938095107 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1119 13:21:13.938100947 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1119 13:21:13.938104817 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1119 13:21:13.938073737 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1119 13:21:13.938094977 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1119 13:21:13.938100577 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1119 13:21:13.938104557 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1119 13:21:13.938073817 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1119 13:21:13.938094667 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1119 13:21:13.938100907 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1119 13:21:13.938104757 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1119 13:21:13.092845528 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fdf89a24446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fdf8ad37772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fdf8ad3ebb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fdf8ad4061d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7fdfd36cd5c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: + 0x89c02 (0x7fdfe2c89c02 in /lib64/libc.so.6)
frame #6: + 0x10ec40 (0x7fdfe2d0ec40 in /lib64/libc.so.6)
[rank1]:[E1119 13:21:13.092997168 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b4e1d4446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f3b4f4e7772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b4f4eebb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f3b4f4f061d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f3b97e7d5c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: + 0x89c02 (0x7f3ba7489c02 in /lib64/libc.so.6)
frame #6: + 0x10ec40 (0x7f3ba750ec40 in /lib64/libc.so.6)
[rank3]:[E1119 13:21:13.092988978 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe0c139c446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fe0c26af772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe0c26b6bb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fe0c26b861d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7fe10b0455c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: + 0x89c02 (0x7fe11a689c02 in /lib64/libc.so.6)
frame #6: + 0x10ec40 (0x7fe11a70ec40 in /lib64/libc.so.6)
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324239 Aborted (core dumped) litgpt pretrain
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324238 Aborted (core dumped) litgpt pretrain
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324240 Aborted (core dumped) litgpt pretrain
srun: First task exited 60s ago
srun: StepId=3178163.0 task 0: running
srun: StepId=3178163.0 tasks 1-3: exited
srun: Terminating StepId=3178163.0
slurmstepd: error: *** STEP 3178163.0 ON babel-0-31 CANCELLED AT 2024-11-19T13:24:07 ***
srun: Job step aborted: Waiting up to 122 seconds for job step to finish.
slurmstepd: error: --task-epilog failed status=9
### Environment
<details>
<summary>Current environment</summary>
* CUDA:
- GPU:
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- available: True
- version: 12.4
* Lightning:
- botorch: 0.10.0
- gpytorch: 1.11
- lightning: 2.3.0.dev20240428
- lightning-utilities: 0.11.8
- pytorch-lightning: 2.3.1
- torch: 2.5.1
- torchmetrics: 1.4.0.post0
* Packages:
- absl-py: 2.1.0
- accelerate: 0.32.0
- aiohttp: 3.9.5
- aiosignal: 1.3.1
- annotated-types: 0.7.0
- antlr4-python3-runtime: 4.11.0
- anyio: 4.4.0
- argcomplete: 3.5.1
- asttokens: 2.4.1
- async-timeout: 4.0.3
- attrs: 23.2.0
- awscrt: 0.20.11
- beautifulsoup4: 4.12.3
- bitsandbytes: 0.42.0
- boto3: 1.35.63
- botocore: 1.34.138
- botorch: 0.10.0
- bs4: 0.0.2
- build: 1.2.1
- certifi: 2024.6.2
- chardet: 5.2.0
- charset-normalizer: 3.3.2
- click: 8.1.7
- colorama: 0.4.6
- contourpy: 1.2.1
- cycler: 0.12.1
- dataproperty: 1.0.1
- datasets: 2.20.0
- dill: 0.3.8
- distro: 1.9.0
- dnspython: 2.6.1
- docker-pycreds: 0.4.0
- docstring-parser: 0.16
- dotwiz: 0.4.0
- email-validator: 2.2.0
- evaluate: 0.4.2
- exceptiongroup: 1.2.1
- executing: 2.0.1
- exrex: 0.11.0
- fastapi: 0.111.0
- fastapi-cli: 0.0.4
- filelock: 3.16.1
- fonttools: 4.53.1
- frozenlist: 1.4.1
- fsspec: 2024.10.0
- funcy: 2.0
- git-filter-repo: 2.34.0
- gitdb: 4.0.11
- gitpython: 3.1.43
- gpytorch: 1.11
- grpcio: 1.64.1
- h11: 0.14.0
- hf-transfer: 0.1.6
- httpcore: 1.0.5
- httptools: 0.6.1
- httpx: 0.27.0
- huggingface-hub: 0.23.4
- idna: 3.7
- importlib-metadata: 8.0.0
- importlib-resources: 6.4.0
- jaxtyping: 0.2.33
- jinja2: 3.1.4
- jiter: 0.5.0
- jmespath: 1.0.1
- joblib: 1.4.2
- jsonargparse: 4.31.0
- jsonlines: 4.0.0
- kiwisolver: 1.4.5
- lightning: 2.3.0.dev20240428
- lightning-utilities: 0.11.8
- linear-operator: 0.5.1
- litdata: 0.2.30
- litgpt: 0.4.0
- litserve: 0.1.1.dev0
- littleutils: 0.2.4
- lm-eval: 0.4.3
- lxml: 5.2.2
- magicattr: 0.1.6
- markdown: 3.6
- markdown-it-py: 3.0.0
- markupsafe: 2.1.5
- matplotlib: 3.9.1.post1
- mbstrdecoder: 1.1.3
- mdurl: 0.1.2
- more-itertools: 10.3.0
- mpmath: 1.3.0
- multidict: 6.0.5
- multipledispatch: 1.0.0
- multiprocess: 0.70.16
- networkx: 3.2.1
- nltk: 3.8.1
- numexpr: 2.10.1
- numpy: 1.26.4
- nvidia-cublas-cu12: 12.4.5.8
- nvidia-cuda-cupti-cu12: 12.4.127
- nvidia-cuda-nvrtc-cu12: 12.4.127
- nvidia-cuda-runtime-cu12: 12.4.127
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.2.1.3
- nvidia-curand-cu12: 10.3.5.147
- nvidia-cusolver-cu12: 11.6.1.9
- nvidia-cusparse-cu12: 12.3.1.170
- nvidia-nccl-cu12: 2.21.5
- nvidia-nvjitlink-cu12: 12.4.127
- nvidia-nvtx-cu12: 12.4.127
- openai: 1.43.0
- opt-einsum: 3.3.0
- orjson: 3.10.6
- packaging: 24.1
- pandas: 2.2.2
- pathvalidate: 3.2.0
- peft: 0.11.1
- pillow: 10.4.0
- pip: 24.0
- pip-tools: 7.4.1
- platformdirs: 4.2.2
- portalocker: 2.10.0
- protobuf: 4.25.3
- psutil: 6.0.0
- pyarrow: 16.1.0
- pyarrow-hotfix: 0.6
- pybind11: 2.13.1
- pydantic: 2.8.0
- pydantic-core: 2.20.0
- pygments: 2.18.0
- pyheck: 0.1.5
- pyparsing: 3.1.2
- pyproject-hooks: 1.1.0
- pyro-api: 0.1.2
- pyro-ppl: 1.9.1
- pytablewriter: 1.2.0
- python-dateutil: 2.9.0.post0
- python-dotenv: 1.0.1
- python-multipart: 0.0.9
- pytorch-lightning: 2.3.1
- pytz: 2024.1
- pyyaml: 6.0.1
- regex: 2024.5.15
- requests: 2.32.3
- rich: 13.7.1
- rouge-score: 0.1.2
- s3transfer: 0.10.3
- sacrebleu: 2.4.2
- safetensors: 0.4.3
- scikit-learn: 1.5.1
- scipy: 1.13.1
- sentencepiece: 0.2.0
- sentry-sdk: 2.7.1
- setproctitle: 1.3.3
- setuptools: 69.5.1
- shellingham: 1.5.4
- six: 1.16.0
- smmap: 5.0.1
- sniffio: 1.3.1
- sorcery: 0.2.2
- soupsieve: 2.6
- sqlitedict: 2.1.0
- starlette: 0.37.2
- sympy: 1.13.1
- tabledata: 1.3.3
- tabulate: 0.9.0
- tasksource: 0.0.45
- tcolorpy: 0.1.6
- tensorboard: 2.17.0
- tensorboard-data-server: 0.7.2
- threadpoolctl: 3.5.0
- tokenizers: 0.19.1
- tomli: 2.0.1
- tomlkit: 0.13.2
- torch: 2.5.1
- torchmetrics: 1.4.0.post0
- tqdm: 4.66.4
- tqdm-multiprocess: 0.0.11
- transformers: 4.42.3
- triton: 3.1.0
- typeguard: 2.13.3
- typepy: 1.3.2
- typer: 0.12.3
- typeshed-client: 2.5.1
- typing-extensions: 4.12.2
- tzdata: 2024.1
- ujson: 5.10.0
- urllib3: 1.26.19
- uvicorn: 0.30.1
- uvloop: 0.19.0
- wandb: 0.17.4
- watchfiles: 0.22.0
- websockets: 12.0
- werkzeug: 3.0.3
- wheel: 0.43.0
- word2number: 1.1
- wrapt: 1.16.0
- xmltodict: 0.14.2
- xxhash: 3.4.1
- yarl: 1.9.4
- zipp: 3.19.2
- zstandard: 0.22.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.9.0
- release: 5.14.0-427.40.1.el9_4.x86_64
- version: #1 SMP PREEMPT_DYNAMIC Wed Oct 16 07:08:17 EDT 2024
</details>
### More info
_No response_
cc @justusschock @lantiga