Llama3.1-405B is hitting functionality issues on both GB300 and GB200. According to the NeMo team, these issues will be addressed in the 26.02 NeMo container.
GB300 issue:
[rank3]: torch.AcceleratorError: CUDA error: operation failed due to a previous error during capture
[rank3]: Search for `cudaErrorStreamCaptureInvalidated' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
[rank3]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank3]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank3]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
GB200 issue:
[rank112]: torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3694, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.28.3
[rank112]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank112]: Last error:
[rank112]: Cuda failure 2 'out of memory'
[rank125]: Traceback (most recent call last):
[rank125]: File "/lustre/fsw/infra_rd_gsw/rsalagame/llama3.1-405B-70B-8B-GB300/test-env-25.11.01.rc3/workloads/pretrain_llama3.1/Megatron-Bridge/scripts/performance/run_script.py", line 60, in <module>
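Both tracebacks point at the standard PyTorch/NCCL debugging switches (CUDA_LAUNCH_BLOCKING and NCCL_DEBUG). A minimal sketch of how those could be enabled when reproducing the failure, assuming a Python entry point along the lines of run_script.py (the actual script contents and launch command are not shown in this issue):

```python
import os

# Report CUDA errors synchronously at the failing call site instead of
# asynchronously (suggested by the GB300 stream-capture error above).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# Ask NCCL to log initialization and error details
# (suggested by the GB200 "run with NCCL_DEBUG=INFO for details" message).
os.environ["NCCL_DEBUG"] = "INFO"

# Import torch only after the environment variables are set, so they take
# effect before CUDA/NCCL are initialized.
import torch  # noqa: E402
import torch.distributed as dist  # noqa: E402

# ... proceed with the usual pretraining launch. This is only a sketch of the
# debug-flag ordering, not the real Megatron-Bridge entry point.
```

With these set, the GB300 error should surface at the offending kernel launch with an accurate stack trace, and the NCCL logs should help confirm whether the GB200 failure is a genuine device out-of-memory condition or a secondary symptom.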