Closed
Labels: bug, performance, performance/release
Description
Describe the bug
Pre-training DeepSeek V3 on H100 fails in both precisions (BF16 and FP8-SC) with the following error:
[rank428]: ^^^^^^^^^^^^^^^^
[rank428]: File "/opt/venv/lib/python3.12/site-packages/deep_ep/buffer.py", line 376, in dispatch
[rank428]: return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
[rank428]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank428]: File "/opt/venv/lib/python3.12/site-packages/deep_ep/buffer.py", line 581, in internode_dispatch
[rank428]: recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
[rank428]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank428]: RuntimeError: DeepEP error: timeout (dispatch CPU)
Steps/Code to reproduce bug
The error can be reproduced with the performance scripts in github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/scripts/performance:
python3 scripts/performance/setup_experiment.py -a <slurm_account> -p <slurm_partition> -g h100 -i <nemo_container_image_url> -l <results_dir> -vb 1 -m deepseek -s v3 -ng 1024
Expected behavior
Per-train-step throughput of ~330 TFLOP/s per GPU.
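For reference, a per-GPU throughput figure like the one above is typically derived from model FLOPs per training step, step time, and GPU count. A minimal sketch of that arithmetic follows; all input numbers are hypothetical placeholders, not measurements from this issue.

```python
# Hedged sketch of how a TFLOP/s-per-GPU figure is usually computed.
# The input values below are made up for illustration only.
def tflops_per_sec_per_gpu(model_flops_per_step: float,
                           step_time_s: float,
                           num_gpus: int) -> float:
    """Model FLOPs per training step, divided by step time and GPU count."""
    return model_flops_per_step / step_time_s / num_gpus / 1e12

# Hypothetical example: 3.4e18 FLOPs per step, 10 s per step, 1024 GPUs
print(tflops_per_sec_per_gpu(3.4e18, 10.0, 1024))  # ~332 TFLOP/s per GPU
```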
Additional context
Megatron-Bridge commit used for the Slurm job: 74105de40239803f9c566d21bedd1ae3c0c43bc7
DeepEP Version: 1.2.1+ef73fd9