
Conversation


zhuyuhua-v commented Nov 5, 2025

Purpose

Add sequence parallelism support for ROCm platforms:

  1. Support pattern replacement of allreduce+rmsnorm with reduce_scatter+rmsnorm+all_gather (see the sketch below this list).
  2. Support pattern replacement of allreduce+rmsnorm+quant with reduce_scatter+rmsnorm+quant+all_gather.
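
For context, the first rewrite can be sketched with plain torch.distributed collectives. This is only an illustrative sketch of the pattern, not the actual vLLM fusion pass; the `rms_norm` helper, tensor shapes, and function names are assumptions, and the quant variant simply adds the quantization op after the shard-local RMSNorm.

```python
import torch
import torch.distributed as dist

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    variance = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

# Original pattern: all-reduce the partial TP output, then every rank
# runs RMSNorm over the full token dimension redundantly.
def allreduce_then_norm(partial: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    dist.all_reduce(partial)                     # [num_tokens, hidden]
    return rms_norm(partial, weight)

# Replacement pattern: reduce-scatter along the token dimension so each rank
# reduces and normalizes only its own shard, then all-gather the normalized
# shards. Assumes num_tokens is divisible by the world size.
def reduce_scatter_norm_allgather(partial: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    world = dist.get_world_size()
    shard = torch.empty(partial.shape[0] // world, partial.shape[1],
                        dtype=partial.dtype, device=partial.device)
    dist.reduce_scatter_tensor(shard, partial)   # reduce + split tokens across ranks
    shard = rms_norm(shard, weight)              # RMSNorm on 1/world of the tokens
    out = torch.empty_like(partial)
    dist.all_gather_into_tensor(out, shard)      # reassemble the full sequence
    return out
```

The two variants move the same total data, but the replacement keeps the normalization (and, in the quant pattern, the quantization) work sharded across ranks instead of duplicating it on every rank.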

Test Plan

server:

```bash
vllm serve $model_path \
    --tensor-parallel-size 8 \
    --max-num-batched-tokens 32768 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --disable-log-requests \
    --gpu_memory_utilization 0.9 \
    --port 6789 \
    --compilation-config '{"cudagraph_mode": "FULL", "pass_config": {"enable_sequence_parallelism": true}, "use_inductor_graph_partition": false, "splitting_ops":[]}' \
    --block-size 1 \
    --async-scheduling
```

accuracy:

```bash
model_path="/mnt/raid0/models/DeepSeek-R1/"
lm_eval \
    --model local-completions \
    --tasks gsm8k \
    --model_args model=${model_path},base_url=http://127.0.0.1:6789/v1/completions \
    --batch_size 100
```

Test Result

accuracy:

| Tasks | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.9477 | ± | 0.0061 |
|       |         | strict-match     |      5 | exact_match | 0.9454 | ± | 0.0063 |

zhuyuhua-v marked this pull request as ready for review November 5, 2025 05:50

wuhuikx commented Nov 5, 2025

Will you file a PR to upstream? How is the performance, especially TTFT?
