[Bug] H100 + DeepSeekV3 error- DeepEP error: timeout (dispatch CPU) #1335

@malay-nagda

Description

Describe the bug

Pre-training DeepSeek V3 on H100, with both precisions (BF16 and FP8-SC), fails with the following error:

[rank428]:         ^^^^^^^^^^^^^^^^
[rank428]:   File "/opt/venv/lib/python3.12/site-packages/deep_ep/buffer.py", line 376, in dispatch
[rank428]:     return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
[rank428]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank428]:   File "/opt/venv/lib/python3.12/site-packages/deep_ep/buffer.py", line 581, in internode_dispatch
[rank428]:     recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
[rank428]:                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank428]: RuntimeError: DeepEP error: timeout (dispatch CPU)

Steps/Code to reproduce bug

You can use the performance scripts in github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/scripts/performance to reproduce the error:

python3 scripts/performance/setup_experiment.py -a <slurm_account> -p <slurm_partition> -g h100 -i <nemo_container_image_url> -l <results_dir> -vb 1 -m deepseek -s v3 -ng 1024

Expected behavior

Per-train-step throughput of ~330 TFLOP/s/GPU.

Additional context

Megatron-Bridge commit used for the Slurm job: 74105de40239803f9c566d21bedd1ae3c0c43bc7
DeepEP Version: 1.2.1+ef73fd9

Metadata

Labels

bug: Something isn't working
performance
performance/release: Performance items related with NeMo release
