
Crash running the multi-GPU benchmark mcmc_4gpus.py #845

@zjuzjgxc

Description


I am using 4 RTX 4090s to accelerate training. Running the command from the mcmc_4gpus benchmark crashes in the first iteration, but if I disable packed mode the training runs to the end.

Here is the training command:

# train and eval at the last step (30000)
CUDA_VISIBLE_DEVICES=0,1,2,3 python simple_trainer.py mcmc --eval_steps 30000 --disable_viewer --data_factor $DATA_FACTOR \
    --steps_scaler 0.25 --packed \
    --strategy.cap-max $CAP_MAX \
    --data_dir $SCENE_DIR/$SCENE/ \
    --result_dir $RESULT_DIR/$SCENE/
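For comparison, the run that completes on my setup is (as far as I can tell) the same command with only --packed removed, everything else unchanged:

# same run, packed mode disabled -- this one trains to the end
CUDA_VISIBLE_DEVICES=0,1,2,3 python simple_trainer.py mcmc --eval_steps 30000 --disable_viewer --data_factor $DATA_FACTOR \
    --steps_scaler 0.25 \
    --strategy.cap-max $CAP_MAX \
    --data_dir $SCENE_DIR/$SCENE/ \
    --result_dir $RESULT_DIR/$SCENE/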

error log:

[rank0]:[E1223 02:36:10.552864265 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc9f2d6c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc9f2d15a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc9f31ef918 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc9a0df1556 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc9a0dfe8c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x617 (0x7fc9a0e00557 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fc9a0e016ed in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7fc9f32605c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7fca94c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fca94d25a04 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc9f2d6c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc9f2d15a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc9f31ef918 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc9a0df1556 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc9a0dfe8c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x617 (0x7fc9a0e00557 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fc9a0e016ed in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7fc9f32605c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7fca94c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fca94d25a04 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc9f2d6c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe5c6fc (0x7fc9a0a5c6fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7fc9f32605c0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7fca94c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7fca94d25a04 in /lib/x86_64-linux-gnu/libc.so.6)

W1223 02:36:10.641000 3152233 torch/multiprocessing/spawn.py:169] Terminating process 3152363 via signal SIGTERM
Traceback (most recent call last):
File "/usr/local/gsplat/examples/simple_trainer.py", line 1262, in
cli(main, cfg, verbose=True)
File "/usr/local/lib/gsplat/gsplat/distributed.py", line 344, in cli
process_context.join()
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 215, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/usr/local/lib/gsplat/gsplat/distributed.py", line 295, in _distributed_worker
fn(local_rank, world_rank, world_size, args)
File "/usr/local/lib/gsplat/examples/simple_trainer.py", line 1184, in main
runner.train()
File "/usr/local/lib/gsplat/examples/simple_trainer.py", line 720, in train
desc = f"loss={loss.item():.3f}| " f"sh degree={sh_degree_to_use}| "
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
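Note that the traceback ends at loss.item(), which only forces a device synchronization, so the illegal access is presumably raised by an earlier kernel. Following the hint in the log, a rerun with CUDA_LAUNCH_BLOCKING=1 should make kernel launches synchronous so the stacktrace points closer to the kernel that actually faults. A minimal sketch of that rerun, assuming the same flags as the failing command above:

# synchronous kernel launches -> traceback should land nearer the faulting kernel
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python simple_trainer.py mcmc --eval_steps 30000 --disable_viewer --data_factor $DATA_FACTOR \
    --steps_scaler 0.25 --packed \
    --strategy.cap-max $CAP_MAX \
    --data_dir $SCENE_DIR/$SCENE/ \
    --result_dir $RESULT_DIR/$SCENE/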
