Description
I use 4 RTX 4090 GPUs to accelerate training. Running the script in mcmc_4gpus crashes in the first iteration, but if I disable packed mode, training runs to the end.
Here is the training command:
# train and eval at the last step (30000)
CUDA_VISIBLE_DEVICES=0,1,2,3 python simple_trainer.py mcmc --eval_steps 30000 --disable_viewer \
    --data_factor $DATA_FACTOR \
    --steps_scaler 0.25 --packed \
    --strategy.cap-max $CAP_MAX \
    --data_dir $SCENE_DIR/$SCENE/ \
    --result_dir $RESULT_DIR/$SCENE/
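Because the illegal memory access is reported asynchronously, the traceback below blames `loss.item()`, which is just the first host synchronization point rather than the failing kernel. As the error log itself suggests, rerunning with CUDA_LAUNCH_BLOCKING=1 should point at the actual faulting kernel; a minimal sketch of such a debug rerun, reusing the same placeholder variables as above:

```bash
# Debug rerun (sketch): CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel
# launches so the reported stack trace lands on the kernel that faults.
CUDA_VISIBLE_DEVICES=0,1,2,3 CUDA_LAUNCH_BLOCKING=1 python simple_trainer.py mcmc \
    --eval_steps 30000 --disable_viewer \
    --data_factor $DATA_FACTOR \
    --steps_scaler 0.25 --packed \
    --strategy.cap-max $CAP_MAX \
    --data_dir $SCENE_DIR/$SCENE/ \
    --result_dir $RESULT_DIR/$SCENE/
```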
Error log:
[rank0]:[E1223 02:36:10.552864265 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc9f2d6c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc9f2d15a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc9f31ef918 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc9a0df1556 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc9a0dfe8c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x617 (0x7fc9a0e00557 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fc9a0e016ed in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7fc9f32605c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7fca94c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fca94d25a04 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc9f2d6c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc9f2d15a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc9f31ef918 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc9a0df1556 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc9a0dfe8c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x617 (0x7fc9a0e00557 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fc9a0e016ed in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7fc9f32605c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7fca94c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fca94d25a04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc9f2d6c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe5c6fc (0x7fc9a0a5c6fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7fc9f32605c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7fca94c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7fca94d25a04 in /lib/x86_64-linux-gnu/libc.so.6)
W1223 02:36:10.641000 3152233 torch/multiprocessing/spawn.py:169] Terminating process 3152363 via signal SIGTERM
Traceback (most recent call last):
File "/usr/local/gsplat/examples/simple_trainer.py", line 1262, in
cli(main, cfg, verbose=True)
File "/usr/local/lib/gsplat/gsplat/distributed.py", line 344, in cli
process_context.join()
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 215, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/usr/local/lib/gsplat/gsplat/distributed.py", line 295, in _distributed_worker
fn(local_rank, world_rank, world_size, args)
File "/usr/local/lib/gsplat/examples/simple_trainer.py", line 1184, in main
runner.train()
File "/usr/local/lib/gsplat/examples/simple_trainer.py", line 720, in train
desc = f"loss={loss.item():.3f}| " f"sh degree={sh_degree_to_use}| "
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
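For reference, the workaround mentioned in the description (disabling packed mode) amounts to running the same command without the --packed flag; with that single change, training completes on the same 4-GPU setup. A sketch of the working command, again with the same placeholder variables:

```bash
# Workaround run (sketch): identical configuration but without --packed;
# per the description above, this trains to the end without the crash.
CUDA_VISIBLE_DEVICES=0,1,2,3 python simple_trainer.py mcmc \
    --eval_steps 30000 --disable_viewer \
    --data_factor $DATA_FACTOR \
    --steps_scaler 0.25 \
    --strategy.cap-max $CAP_MAX \
    --data_dir $SCENE_DIR/$SCENE/ \
    --result_dir $RESULT_DIR/$SCENE/
```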