Single-node multi-GPU training hangs, and the logs don't reveal the problem #792

@apachemycat

Description

06/26 11:07:50 - mmengine - INFO -

System environment:
sys.platform: linux
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1070453503
GPU 0,1: NVIDIA L40S
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.103
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.4.0.dev20240507+cu121
PyTorch compiling details: PyTorch built with:

  • GCC 9.3

  • C++ Version: 201703

  • Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications

  • Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)

  • OpenMP 201511 (a.k.a. OpenMP 4.5)

  • LAPACK is enabled (usually provided by MKL)

  • NNPACK is enabled

  • CPU capability usage: AVX512

  • CUDA Runtime 12.1

  • NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90

  • CuDNN 8.9.2

  • Magma 2.6.1

  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.19.0.dev20240507+cu121
OpenCV: 4.9.0
MMEngine: 0.10.4

Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 1070453503
deterministic: False
Distributed launcher: pytorch
Distributed training: True
GPU number: 2

I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] Starting elastic_operator with launch configs:
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] entrypoint : /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] min_nodes : 1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] max_nodes : 1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] nproc_per_node : 2
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] run_id : none
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_backend : static
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_endpoint : 127.0.0.1:28346
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_configs : {'rank': 0, 'timeout': 900}
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] max_restarts : 0
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] monitor_interval : 0.1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] log_dir : /tmp/torchelastic_yoyanqm0
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] metrics_cfg : {}
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188]
I0626 11:47:13.863000 140300452140160 torch/distributed/elastic/agent/server/api.py:869] [default] starting workers for entrypoint: python3
I0626 11:47:13.863000 140300452140160 torch/distributed/elastic/agent/server/api.py:702] [default] Rendezvous'ing worker group
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] [default] Rendezvous complete for workers. Result:
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] restart_count=0
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] master_addr=127.0.0.1
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] master_port=28346
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] group_rank=0
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] group_world_size=1
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] local_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] role_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] global_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] role_world_sizes=[2, 2]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] global_world_sizes=[2, 2]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:710] [default] Starting worker group
I0626 11:47:13.867000 140300452140160 torch/distributed/elastic/agent/server/local_elastic_agent.py:184] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
I0626 11:47:13.867000 140300452140160 torch/distributed/elastic/agent/server/local_elastic_agent.py:216] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
demo-ai-xtuner-pod:210:210 [0] NCCL INFO Bootstrap : Using eth0:197.166.199.168<0>
demo-ai-xtuner-pod:210:210 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
demo-ai-xtuner-pod:210:210 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
demo-ai-xtuner-pod:211:211 [1] NCCL INFO cudaDriverVersion 12040
demo-ai-xtuner-pod:211:211 [1] NCCL INFO Bootstrap : Using eth0:197.166.199.168<0>
demo-ai-xtuner-pod:211:211 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Failed to open libibverbs.so[.1]
demo-ai-xtuner-pod:210:227 [0] NCCL INFO NET/Socket : Using [0]eth0:197.166.199.168<0>
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Using non-device net plugin version 0
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Using network Socket
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Failed to open libibverbs.so[.1]
demo-ai-xtuner-pod:211:228 [1] NCCL INFO NET/Socket : Using [0]eth0:197.166.199.168<0>
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Using non-device net plugin version 0
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Using network Socket
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x97541cd9f2e1e519 - Init START
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 18000 commId 0x97541cd9f2e1e519 - Init START
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 00/04 : 0 1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 01/04 : 0 1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 02/04 : 0 1
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 03/04 : 0 1
demo-ai-xtuner-pod:211:228 [1] NCCL INFO P2P Chunksize set to 131072
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO P2P Chunksize set to 131072
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Connected all rings
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Connected all trees
demo-ai-xtuner-pod:211:228 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Connected all rings
demo-ai-xtuner-pod:211:228 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Connected all trees
demo-ai-xtuner-pod:210:227 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
demo-ai-xtuner-pod:210:227 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x97541cd9f2e1e519 - Init COMPLETE
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 18000 commId 0x97541cd9f2e1e519 - Init COMPLETE
06/26 11:47:18 - mmengine - INFO -

work_dir = '/models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work'

06/26 11:47:18 - mmengine - DEBUG - Get class Visualizer from "visualizer" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - Get class TensorboardVisBackend from "vis_backend" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - An TensorboardVisBackend instance is built from registry, and its implementation can be found in mmengine.visualization.vis_backend
06/26 11:47:18 - mmengine - DEBUG - An Visualizer instance is built from registry, and its implementation can be found in mmengine.visualization.visualizer
06/26 11:47:18 - mmengine - DEBUG - Attribute _env_initialized is not defined in <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'> or <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'>._env_initialized is False, _init_env will be called and <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'>._env_initialized will be set to True
06/26 11:47:18 - mmengine - DEBUG - Get class BaseDataPreprocessor from "model" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - An BaseDataPreprocessor instance is built from registry, and its implementation can be found in mmengine.model.base_model.data_preprocessor
quantization_config convert to <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>
06/26 11:47:18 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
low_cpu_mem_usage was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [00:11<00:00, 2.95s/it]
06/26 11:47:30 - mmengine - DEBUG - An from_pretrained instance is built from registry, and its implementation can be found in transformers.models.auto.auto_factory
06/26 11:47:30 - mmengine - DEBUG - An LoraConfig instance is built from registry, and its implementation can be found in peft.tuners.lora.config
06/26 11:47:32 - mmengine - DEBUG - An SupervisedFinetune instance is built from registry, and its implementation can be found in xtuner.model.sft
Training hangs at this point and eventually fails with the timeout error below (a possible diagnostic rerun is sketched after the traceback).

[rank1]:[E626 11:57:18.466822874 ProcessGroupNCCL.cpp:572] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank1]:[E626 11:57:18.469226529 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
demo-ai-xtuner-pod:211:231 [1] NCCL INFO [Service thread] Connection closed by localRank 1
demo-ai-xtuner-pod:211:224 [0] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 busId 1a000 - Abort COMPLETE
[rank1]:[E626 11:57:18.675112321 ProcessGroupNCCL.cpp:1632] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E626 11:57:18.675128694 ProcessGroupNCCL.cpp:586] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E626 11:57:18.675133603 ProcessGroupNCCL.cpp:592] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E626 11:57:18.675167441 ProcessGroupNCCL.cpp:1448] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc13a08f582 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc13a0964b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc13a0982bc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc13a08f582 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc13a0964b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc13a0982bc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1452 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe549a5 (0x7fc139ce99a5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0626 11:57:21.928000 140300452140160 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 210 closing signal SIGTERM
E0626 11:57:22.092000 140300452140160 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 211) of binary: /usr/bin/python3
I0626 11:57:22.096000 140300452140160 torch/distributed/elastic/multiprocessing/errors/__init__.py:360] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 1)
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py FAILED

Failures:
<NO_OTHER_FAILURES>
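
For reference, NCCL reports Init COMPLETE for both ranks over the P2P/CUMEM transport, yet the very first collective (a one-element BROADCAST) never finishes. One thing worth trying is rerunning the exact same torchrun command (arguments copied from the process list below) with NCCL peer-to-peer disabled and verbose NCCL logging, to see whether the intra-node P2P path inside this pod is what the broadcast is stuck on. This is only a diagnostic guess, not a confirmed fix; NCCL_P2P_DISABLE and NCCL_DEBUG are standard NCCL environment variables.

# Diagnostic rerun: same launch, but force NCCL off the CUDA P2P path
# (falling back to shared memory) and print verbose NCCL logs.
NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 \
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu \
    --master_addr=127.0.0.1 --master_port=25860 \
    /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py \
    --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work \
    config.py --launcher pytorch

If the broadcast completes with P2P disabled, the problem is likely the GPU-to-GPU P2P path in this container rather than xtuner itself.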

Inside the container, I confirmed that two training worker processes were indeed started:

root@demo-ai-xtuner-pod:/app# ps -efwww
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 12:03 ? 00:00:00 /bin/bash /models/meta-Llama-3-8B-xtuner-trainer/train-model.sh
root 10 1 86 12:03 ? 00:00:19 /usr/bin/python3 /usr/local/bin/xtuner train --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py
root 143 10 27 12:03 ? 00:00:05 /usr/bin/python3 /usr/local/bin/torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu --master_addr=127.0.0.1 --master_port=25860 /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch
root 209 143 99 12:03 ? 00:00:17 /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch
root 210 143 99 12:03 ? 00:00:17 /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch

The work directory only contains the log output of one process:
20240626_114717_root@demo-ai-xtuner-pod_device0_rank0.log

There is no rank1.log, and I don't know what to configure so that a rank1.log file is produced.
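
In case it is relevant, one way to at least capture per-rank stdout/stderr would be torchrun's own log-redirection options (--log-dir, --redirects and --tee, as I understand them from torch.distributed.run; please correct me if xtuner is expected to handle this itself), for example:

# Redirect and tee stdout+stderr of every rank into per-rank files under --log-dir.
# --redirects 3 / --tee 3 should cover both streams; the directory below is just
# the existing work dir plus a torchrun-logs subdirectory (my choice, not required).
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu \
    --master_addr=127.0.0.1 --master_port=25860 \
    --log-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work/torchrun-logs \
    --redirects 3 --tee 3 \
    /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py \
    --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work \
    config.py --launcher pytorch

That would only capture the workers' console output, though; whether mmengine itself can be told to write a ..._rank1.log file is exactly what I am asking about.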
