Description
06/26 11:07:50 - mmengine - INFO -
System environment:
sys.platform: linux
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1070453503
GPU 0,1: NVIDIA L40S
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.103
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.4.0.dev20240507+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.19.0.dev20240507+cu121
OpenCV: 4.9.0
MMEngine: 0.10.4
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 1070453503
deterministic: False
Distributed launcher: pytorch
Distributed training: True
GPU number: 2
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] Starting elastic_operator with launch configs:
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] entrypoint : /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] min_nodes : 1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] max_nodes : 1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] nproc_per_node : 2
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] run_id : none
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_backend : static
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_endpoint : 127.0.0.1:28346
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] rdzv_configs : {'rank': 0, 'timeout': 900}
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] max_restarts : 0
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] monitor_interval : 0.1
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] log_dir : /tmp/torchelastic_yoyanqm0
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188] metrics_cfg : {}
I0626 11:47:13.862000 140300452140160 torch/distributed/launcher/api.py:188]
I0626 11:47:13.863000 140300452140160 torch/distributed/elastic/agent/server/api.py:869] [default] starting workers for entrypoint: python3
I0626 11:47:13.863000 140300452140160 torch/distributed/elastic/agent/server/api.py:702] [default] Rendezvous'ing worker group
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] [default] Rendezvous complete for workers. Result:
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] restart_count=0
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] master_addr=127.0.0.1
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] master_port=28346
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] group_rank=0
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] group_world_size=1
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] local_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] role_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] global_ranks=[0, 1]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] role_world_sizes=[2, 2]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568] global_world_sizes=[2, 2]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:568]
I0626 11:47:13.866000 140300452140160 torch/distributed/elastic/agent/server/api.py:710] [default] Starting worker group
I0626 11:47:13.867000 140300452140160 torch/distributed/elastic/agent/server/local_elastic_agent.py:184] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
I0626 11:47:13.867000 140300452140160 torch/distributed/elastic/agent/server/local_elastic_agent.py:216] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
demo-ai-xtuner-pod:210:210 [0] NCCL INFO Bootstrap : Using eth0:197.166.199.168<0>
demo-ai-xtuner-pod:210:210 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
demo-ai-xtuner-pod:210:210 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
demo-ai-xtuner-pod:211:211 [1] NCCL INFO cudaDriverVersion 12040
demo-ai-xtuner-pod:211:211 [1] NCCL INFO Bootstrap : Using eth0:197.166.199.168<0>
demo-ai-xtuner-pod:211:211 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Failed to open libibverbs.so[.1]
demo-ai-xtuner-pod:210:227 [0] NCCL INFO NET/Socket : Using [0]eth0:197.166.199.168<0>
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Using non-device net plugin version 0
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Using network Socket
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Failed to open libibverbs.so[.1]
demo-ai-xtuner-pod:211:228 [1] NCCL INFO NET/Socket : Using [0]eth0:197.166.199.168<0>
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Using non-device net plugin version 0
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Using network Socket
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x97541cd9f2e1e519 - Init START
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 18000 commId 0x97541cd9f2e1e519 - Init START
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 00/04 : 0 1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 01/04 : 0 1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 02/04 : 0 1
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 03/04 : 0 1
demo-ai-xtuner-pod:211:228 [1] NCCL INFO P2P Chunksize set to 131072
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
demo-ai-xtuner-pod:210:227 [0] NCCL INFO P2P Chunksize set to 131072
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Connected all rings
demo-ai-xtuner-pod:211:228 [1] NCCL INFO Connected all trees
demo-ai-xtuner-pod:211:228 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Connected all rings
demo-ai-xtuner-pod:211:228 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
demo-ai-xtuner-pod:210:227 [0] NCCL INFO Connected all trees
demo-ai-xtuner-pod:210:227 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
demo-ai-xtuner-pod:210:227 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
demo-ai-xtuner-pod:211:228 [1] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1a000 commId 0x97541cd9f2e1e519 - Init COMPLETE
demo-ai-xtuner-pod:210:227 [0] NCCL INFO comm 0x5604fbe23ae0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 18000 commId 0x97541cd9f2e1e519 - Init COMPLETE
06/26 11:47:18 - mmengine - INFO -
work_dir = '/models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work'
06/26 11:47:18 - mmengine - DEBUG - Get class Visualizer from "visualizer" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - Get class TensorboardVisBackend from "vis_backend" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - An TensorboardVisBackend instance is built from registry, and its implementation can be found in mmengine.visualization.vis_backend
06/26 11:47:18 - mmengine - DEBUG - An Visualizer instance is built from registry, and its implementation can be found in mmengine.visualization.visualizer
06/26 11:47:18 - mmengine - DEBUG - Attribute _env_initialized is not defined in <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'> or <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'>._env_initialized is False, _init_env will be called and <class 'mmengine.visualization.vis_backend.TensorboardVisBackend'>._env_initialized will be set to True
06/26 11:47:18 - mmengine - DEBUG - Get class BaseDataPreprocessor from "model" registry in "mmengine"
06/26 11:47:18 - mmengine - DEBUG - An BaseDataPreprocessor instance is built from registry, and its implementation can be found in mmengine.model.base_model.data_preprocessor
quantization_config convert to <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>
06/26 11:47:18 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
low_cpu_mem_usage was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [00:11<00:00, 2.95s/it]
06/26 11:47:30 - mmengine - DEBUG - An from_pretrained instance is built from registry, and its implementation can be found in transformers.models.auto.auto_factory
06/26 11:47:30 - mmengine - DEBUG - An LoraConfig instance is built from registry, and its implementation can be found in peft.tuners.lora.config
06/26 11:47:32 - mmengine - DEBUG - An SupervisedFinetune instance is built from registry, and its implementation can be found in xtuner.model.sft
Training gets stuck at this point and eventually fails with the following timeout error:
[rank1]:[E626 11:57:18.466822874 ProcessGroupNCCL.cpp:572] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank1]:[E626 11:57:18.469226529 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
demo-ai-xtuner-pod:211:231 [1] NCCL INFO [Service thread] Connection closed by localRank 1
demo-ai-xtuner-pod:211:224 [0] NCCL INFO comm 0x55763901fad0 rank 1 nranks 2 cudaDev 1 busId 1a000 - Abort COMPLETE
[rank1]:[E626 11:57:18.675112321 ProcessGroupNCCL.cpp:1632] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E626 11:57:18.675128694 ProcessGroupNCCL.cpp:586] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E626 11:57:18.675133603 ProcessGroupNCCL.cpp:592] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E626 11:57:18.675167441 ProcessGroupNCCL.cpp:1448] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc13a08f582 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc13a0964b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc13a0982bc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc13a08f582 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc13a0964b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc13a0982bc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1452 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc187779017 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe549a5 (0x7fc139ce99a5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fc186eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fc18c8fdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7fc18c98ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
W0626 11:57:21.928000 140300452140160 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 210 closing signal SIGTERM
E0626 11:57:22.092000 140300452140160 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 211) of binary: /usr/bin/python3
I0626 11:57:22.096000 140300452140160 torch/distributed/elastic/multiprocessing/errors/__init__.py:360] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 1)
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
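Since the very first collective (the SeqNum=1 BROADCAST above) is what times out, here is a minimal sketch of how the same command could be re-run with extra NCCL diagnostics enabled. Only standard NCCL / PyTorch c10d environment variables are used, nothing xtuner-specific, and NCCL_P2P_DISABLE=1 is just an assumption worth testing on this two-GPU L40S setup, not a confirmed fix:

# Sketch: same torchrun command as in the ps output below, with standard
# NCCL / torch.distributed debug variables exported first.
export NCCL_DEBUG=INFO                    # verbose NCCL logging
export NCCL_DEBUG_SUBSYS=INIT,P2P         # focus on init and P2P setup
export NCCL_P2P_DISABLE=1                 # assumption: check whether GPU P2P is the part that hangs
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1  # fail fast instead of hanging on collective errors
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu \
  --master_addr=127.0.0.1 --master_port=25860 \
  /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py \
  --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work \
  config.py --launcher pytorch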
Entering the container, I can see that two training processes were indeed started:
root@demo-ai-xtuner-pod:/app# ps -efwww
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 12:03 ? 00:00:00 /bin/bash /models/meta-Llama-3-8B-xtuner-trainer/train-model.sh
root 10 1 86 12:03 ? 00:00:19 /usr/bin/python3 /usr/local/bin/xtuner train --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py
root 143 10 27 12:03 ? 00:00:05 /usr/bin/python3 /usr/local/bin/torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu --master_addr=127.0.0.1 --master_port=25860 /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch
root 209 143 99 12:03 ? 00:00:17 /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch
root 210 143 99 12:03 ? 00:00:17 /usr/bin/python3 -u /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work config.py --launcher pytorch
However, the work directory contains the log of only one process:
20240626_114717_root@demo-ai-xtuner-pod_device0_rank0.log
There is no rank1.log.
How should I configure things so that the rank 1 log file (rank1.log) is also written?
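In case it is relevant: since xtuner train just shells out to torchrun (see the ps output above), one hypothetical workaround is to call torchrun directly and let it capture each worker's console output, so that rank 1's output is at least visible even if a per-rank mmengine log file is never created. This sketch relies only on torchrun's own --log_dir / --tee options, not on any xtuner or mmengine setting, and /tmp/xtuner_logs is an arbitrary example path:

# Sketch (assumption): tee every worker's stdout/stderr into per-rank files
# under /tmp/xtuner_logs, so rank 1's output can be inspected even without rank1.log.
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=gpu \
  --master_addr=127.0.0.1 --master_port=25860 \
  --log_dir /tmp/xtuner_logs --tee 3 \
  /usr/local/lib/python3.10/dist-packages/xtuner/tools/train.py \
  --work-dir /models/meta-Llama-3-8B-xtuner-trainer/demo-ai-xtuner-pod/train-work \
  config.py --launcher pytorch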