Llama3.1-405B is hitting functionality issues on both GB300 and GB200. According to the NeMo team, these issues will be addressed in the 26.02 NeMo container.
GB300 issue:
[rank3]: torch.AcceleratorError: CUDA error: operation failed due to a previous error during capture
[rank3]: Search for `cudaErrorStreamCaptureInvalidated' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
[rank3]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank3]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank3]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
GB200 issue:
[rank112]: torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3694, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.28.3
[rank112]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank112]: Last error:
[rank112]: Cuda failure 2 'out of memory'
[rank125]: Traceback (most recent call last):
[rank125]: File "/lustre/fsw/infra_rd_gsw/rsalagame/llama3.1-405B-70B-8B-GB300/test-env-25.11.01.rc3/workloads/pretrain_llama3.1/Megatron-Bridge/scripts/performance/run_script.py", line 60, in <module>
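Both tracebacks point at the standard PyTorch/NCCL debugging switches (CUDA_LAUNCH_BLOCKING and NCCL_DEBUG). A minimal sketch of how those could be enabled when reproducing the failure, assuming a Python entry point along the lines of run_script.py (the actual script contents and launch command are not shown in this issue):

```python
import os

# Report CUDA errors synchronously at the failing call site instead of
# asynchronously (suggested by the GB300 stream-capture error above).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# Ask NCCL to log initialization and error details
# (suggested by the GB200 "run with NCCL_DEBUG=INFO for details" message).
os.environ["NCCL_DEBUG"] = "INFO"

# Import torch only after the environment variables are set, so they take
# effect before CUDA/NCCL are initialized.
import torch  # noqa: E402
import torch.distributed as dist  # noqa: E402

# ... proceed with the usual pretraining launch. This is only a sketch of the
# debug-flag ordering, not the real Megatron-Bridge entry point.
```

With these set, the GB300 error should surface at the offending kernel launch with an accurate stack trace, and the NCCL logs should help confirm whether the GB200 failure is a genuine device out-of-memory condition or a secondary symptom.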