
[1/N][port from deepseek085] add custom allreduce from AITER #629

Merged
tjtanaavllm merged 1 commit into llama_fp8_03122025 from zejun/llama_fp8_03122025_custom_allreduce on Aug 12, 2025
Conversation

@zejunchen-zejun commented Aug 11, 2025

Sync the deepseek085 optimization to the rocm/vllm llama branch. The same code changes will be upstreamed to public vLLM.

The custom allreduce is controlled by VLLM_ROCM_USE_AITER_CUSTOM_ALL_REDUCE (default: True). If AITER can be imported, the custom allreduce is used by default.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to catch errors quickly. You can run additional CI tests on top of those from your fastcheck build on the Buildkite UI (linked in the PR checks section) by unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@zejunchen-zejun zejunchen-zejun changed the title [1/N] add custom allreduce from AITER to vllm [1/N][port back from deepseek085] add custom allreduce from AITER to vllm Aug 11, 2025
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/llama_fp8_03122025_custom_allreduce branch from 5ee9cab to d785596 Compare August 11, 2025 12:48
@zejunchen-zejun zejunchen-zejun changed the title [1/N][port back from deepseek085] add custom allreduce from AITER to vllm [1/N][port from deepseek085] add custom allreduce from AITER to vllm Aug 11, 2025
@zejunchen-zejun zejunchen-zejun changed the title [1/N][port from deepseek085] add custom allreduce from AITER to vllm [1/N][port from deepseek085] add custom allreduce from AITER Aug 11, 2025
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/llama_fp8_03122025_custom_allreduce branch from d785596 to 555aff3 Compare August 12, 2025 07:53
@zejunchen-zejun (Author) commented Aug 12, 2025

Here is the accuracy verification, using Qwen/Qwen3-0.6B and the gsm8k dataset.

With CUDA allreduce:
(results screenshot)

With AITER allreduce (export VLLM_ROCM_USE_AITER_CUSTOM_ALL_REDUCE=1):
(results screenshot)

No accuracy degradation is observed.

Verify command:

export VLLM_ROCM_USE_AITER_CUSTOM_ALL_REDUCE=1

#!/bin/bash
rm -rf /root/.cache/vllm
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B,tensor_parallel_size=8,max_model_len=10000,gpu_memory_utilization=0.2 --trust_remote_code --tasks gsm8k --batch_size auto 2>&1 | tee ./pr_gsm8k-Qwen_Qwen3-32B-aiter-v1-3.log

Control it via the env flag VLLM_ROCM_USE_AITER_CUSTOM_ALL_REDUCE
(default: True)

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/llama_fp8_03122025_custom_allreduce branch from 555aff3 to 37e91e2 Compare August 12, 2025 12:37
@tjtanaavllm

LGTM @zejunchen-zejun. Thank you for the feature.

@tjtanaavllm tjtanaavllm merged commit e2fa100 into llama_fp8_03122025 Aug 12, 2025
5 of 6 checks passed
@gshtras gshtras deleted the zejun/llama_fp8_03122025_custom_allreduce branch September 25, 2025 14:50