[Bug] H100 + DeepSeekV3 error- DeepEP error: timeout (dispatch CPU) #1335

@malay-nagda

Description

Describe the bug

Pre-training DeepSeek V3 on H100, with both precisions (BF16 and FP8-SC), fails with the following error:

[rank428]:         ^^^^^^^^^^^^^^^^
[rank428]:   File "/opt/venv/lib/python3.12/site-packages/deep_ep/buffer.py", line 376, in dispatch
[rank428]:     return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
[rank428]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank428]:   File "/opt/venv/lib/python3.12/site-packages/deep_ep/buffer.py", line 581, in internode_dispatch
[rank428]:     recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
[rank428]:                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank428]: RuntimeError: DeepEP error: timeout (dispatch CPU)

Steps/Code to reproduce bug

You can use the performance scripts in github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/scripts/performance to reproduce the error:

python3 scripts/performance/setup_experiment.py -a <slurm_account> -p <slurm_partition> -g h100 -i <nemo_container_image_url> -l <results_dir> -vb 1 -m deepseek -s v3 -ng 1024

Expected behavior

Per-train-step throughput of ~330 TFLOP/s/GPU.

Additional context

Megatron-Bridge commit used for the Slurm job: 74105de40239803f9c566d21bedd1ae3c0c43bc7
DeepEP Version: 1.2.1+ef73fd9

Metadata

Labels

bug: Something isn't working
performance
performance/release: Performance items related with NeMo release
