Single-node multi-GPU training hangs, and the logs don't reveal the problem #792

@apachemycat

Description

06/26 11:07:50 - mmengine - INFO -

System environment:
sys.platform: linux
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1070453503
GPU 0,1: NVIDIA L40S
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.103
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.4.0.dev20240507+cu121
PyTorch compiling details: PyTorch built with:

  • GCC 9.3

  • C++ Version: 201703

  • Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications

  • Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)

  • OpenMP 201511 (a.k.a. OpenMP 4.5)

  • LAPACK is enabled (usually provided by MKL)

  • NNPACK is enabled

  • CPU capability usage: AVX512

  • CUDA Runtime 12.1

  • NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90

  • CuDNN 8.9.2

  • Magma 2.6.1

  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.19.0.dev20240507+cu121
OpenCV: 4.9.0
MMEngine: 0.10.4

Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 1070453503
deterministic: False
Distributed launcher: pytorch
Distributed training: True
GPU number: 2

I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] Starting elastic_operator with launch configs:
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] entrypoint : /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] min_nodes : 1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] max_nodes : 1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] nproc_per_node : 2
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] run_id : none
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_backend : static
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_endpoint : 127.0.0.1:28346
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_configs : {'rank': 0, 'timeout': 900}
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] max_restarts : 0
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] monitor_interval : 0.1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] log_dir : /tmp/torchelastic_yoyanqm0
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] metrics_cfg : {}
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188]
I0626 11:47:13.863000 140300452140160 torch/distributed/elastic/agent/server/api.py:869] [default] starting workers for entrypoint: python3
I0626 11:47:13.863000 140300452140160 torch/distributed/elastic/agent/server/api.py:702] [default] Rendezvous'ing worker group
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] [default] Rendezvous complete for workers. Result:
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] restart_count=0
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] master_addr=127.0.0.1
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] master_port=28346
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] group_rank=0
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] group_world_size=1
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] local_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] role_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] global_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] role_world_sizes=[2, 2]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] global_world_sizes=[2, 2]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:710] [default] Starting worker group
I0626 11:47:13.867000 140300452140160 torch/distributed/elastic/agent/server/local_elastic_agent.py:184] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
I0626 11:47:13.867000 140300452140160 torch/distributed/elastic/agent/server/local_elastic_agent.py:216] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
demo-ai-xtuner-pod:210:210 [0] NCCL INFO Bootstrap : Using eth0:197.166.199.168<0>
demo-ai-xtuner-pod:210:210 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
demo-ai-xtuner-pod:210:210 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
demo-ai-xtuner-pod:211:211 [1] NCCL INFO cudaDriverVersion 12040
demo-ai-xtuner-pod:211:211 [1] NCCL INFO Bootstrap : Using eth0:197.166.199.168<0>
demo-ai-xtuner-pod:211:211 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Failed to open libibverbs.so[.1]
demo-ai-xtuner-pod:210:227 [0] NCCL INFO NET/Socket : Using [0]eth0:197.166.199.168<0>
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Using non-device net plugin version 0
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Using network Socket
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Failed to open libibverbs.so[.1]
demo-ai-xtuner-pod:211:228 [1] NCCL INFO NET/Socket : Using [0]eth0:197.166.199.168<0>
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Using non-device net plugin version 0
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Using network Socket
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x97541cd9f2e1e519 - Init START
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 18000 commId 0x97541cd9f2e1e519 - Init START
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 00/04 : 0 1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 01/04 : 0 1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 02/04 : 0 1
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 03/04 : 0 1
demo-ai-xtuner-pod:211:228 [1] NCCL INFO P2P Chunksize set to 131072
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO P2P Chunksize set to 131072
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Connected all rings
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Connected all trees
demo-ai-xtuner-pod:211:228 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Connected all rings
demo-ai-xtuner-pod:211:228 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Connected all trees
demo-ai-xtuner-pod:210:227 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
demo-ai-xtuner-pod:210:227 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x97541cd9f2e1e519 - Init COMPLETE
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 18000 commId 0x97541cd9f2e1e519 - Init COMPLETE
06/26 11:47:18 - mmengine - INFO -

work_dir = '/models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work'

06/26 11:47:18 - mmengine - DEBUG - Get class Visualizer from "visualizer" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - Get class TensorboardVisBackend from "vis_backend" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - An TensorboardVisBackend instance is built from registry, and its implementation can be found in mmengine.visualization.vis_backend
06/26 11:47:18 - mmengine - DEBUG - An Visualizer instance is built from registry, and its implementation can be found in mmengine.visualization.visualizer
06/26 11:47:18 - mmengine - DEBUG - Attribute _env_initialized is not defined in <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'> or <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'>._env_initialized is False, _init_env will be called and <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'>._env_initialized will be set to True
06/26 11:47:18 - mmengine - DEBUG - Get class BaseDataPreprocessor from "model" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - An BaseDataPreprocessor instance is built from registry, and its implementation can be found in mmengine.model.base_model.data_preprocessor
quantization_config convert to <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>
06/26 11:47:18 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
low_cpu_mem_usage was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [00:11<00:00, 2.95s/it]
06/26 11:47:30 - mmengine - DEBUG - An from_pretrained instance is built from registry, and its implementation can be found in transformers.models.auto.auto_factory
06/26 11:47:30 - mmengine - DEBUG - An LoraConfig instance is built from registry, and its implementation can be found in peft.tuners.lora.config
06/26 11:47:32 - mmengine - DEBUG - An SupervisedFinetune instance is built from registry, and its implementation can be found in xtuner.model.sft
Training hangs at this point and eventually fails with the timeout error below (a possible diagnostic rerun is sketched after the traceback).

[rank1]:[E626 11:57:18.466822874 ProcessGroupNCCL.cpp:572] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank1]:[E626 11:57:18.469226529 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
demo-ai-xtuner-pod:211:231 [1] NCCL INFO [Service thread] Connection closed by localRank 1
demo-ai-xtuner-pod:211:224 [0] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 busId 1a000 - Abort COMPLETE
[rank1]:[E626 11:57:18.675112321 ProcessGroupNCCL.cpp:1632] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E626 11:57:18.675128694 ProcessGroupNCCL.cpp:586] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E626 11:57:18.675133603 ProcessGroupNCCL.cpp:592] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E626 11:57:18.675167441 ProcessGroupNCCL.cpp:1448] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc13a08f582 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc13a0964b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc13a0982bc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc13a08f582 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc13a0964b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc13a0982bc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1452 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe549a5 (0x7fc139ce99a5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0626 11:57:21.928000 140300452140160 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 210 closing signal SIGTERM
E0626 11:57:22.092000 140300452140160 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 211) of binary: /usr/bin/python3
I0626 11:57:22.096000 140300452140160 torch/distributed/elastic/multiprocessing/errors/__init__.py:360] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 1)
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py FAILED

Failures:
<NO_OTHER_FAILURES>
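
For reference, NCCL reports Init COMPLETE for both ranks over the P2P/CUMEM transport, yet the very first collective (a one-element BROADCAST) never finishes. One thing worth trying is rerunning the exact same torchrun command (arguments copied from the process list below) with NCCL peer-to-peer disabled and verbose NCCL logging, to see whether the intra-node P2P path inside this pod is what the broadcast is stuck on. This is only a diagnostic guess, not a confirmed fix; NCCL_P2P_DISABLE and NCCL_DEBUG are standard NCCL environment variables.

# Diagnostic rerun: same launch, but force NCCL off the CUDA P2P path
# (falling back to shared memory) and print verbose NCCL logs.
NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 \
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu \
    --master_addr=127.0.0.1 --master_port=25860 \
    /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py \
    --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work \
    config.py --launcher pytorch

If the broadcast completes with P2P disabled, the problem is likely the GPU-to-GPU P2P path in this container rather than xtuner itself.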

Inside the container, I confirmed that two training worker processes were indeed started:

root@demo-ai-xtuner-pod:/app# ps -efwww
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 12:03 ? 00:00:00 /bin/bash /models/meta-Llama-3-8B-xtuner-trainer/train-model.sh
root 10 1 86 12:03 ? 00:00:19 /usr/bin/python3 /usr/local/bin/xtuner train --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py
root 143 10 27 12:03 ? 00:00:05 /usr/bin/python3 /usr/local/bin/torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu --master_addr=127.0.0.1 --master_port=25860 /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch
root 209 143 99 12:03 ? 00:00:17 /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch
root 210 143 99 12:03 ? 00:00:17 /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch

The work directory only contains the log output of one process:
20240626_114717_root@demo-ai-xtuner-pod_device0_rank0.log

There is no rank1.log, and I don't know what to configure so that a rank1.log file is produced.
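
In case it is relevant, one way to at least capture per-rank stdout/stderr would be torchrun's own log-redirection options (--log-dir, --redirects and --tee, as I understand them from torch.distributed.run; please correct me if xtuner is expected to handle this itself), for example:

# Redirect and tee stdout+stderr of every rank into per-rank files under --log-dir.
# --redirects 3 / --tee 3 should cover both streams; the directory below is just
# the existing work dir plus a torchrun-logs subdirectory (my choice, not required).
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu \
    --master_addr=127.0.0.1 --master_port=25860 \
    --log-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work/torchrun-logs \
    --redirects 3 --tee 3 \
    /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py \
    --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work \
    config.py --launcher pytorch

That would only capture the workers' console output, though; whether mmengine itself can be told to write a ..._rank1.log file is exactly what I am asking about.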
