Commit fb7686c

[rollout, vllm] fix: accuracy issue in verl serve mode + vllm-ascend + dp + ep + tp scenarios (verl-project#4783)
### What does this PR do?

Fix the accuracy issue in verl serve mode + vllm-ascend + dp + ep + tp scenarios. Issue: vllm-project/vllm-ascend#5544

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

Tested GRPO on a local NPU host:

<img width="1047" height="117" alt="58274edd-d0d3-454c-8e39-3188f6f19e71" src="https://github.com/user-attachments/assets/dee7bf2f-6faf-4f44-a8b3-64670d5b1e10" />

### Design & Code Changes

Root cause analysis: the vllm-ascend version currently used by verl on Ascend NPU is [v0.11.0](https://verl.readthedocs.io/en/latest/ascend_tutorial/ascend_quick_start.html). In the vllm-ascend v0.11.0 code, the all2all backend (`flashinfer_all2allv`) is selected and written into the vllm worker environment. However, verl's `ExternalZeroMQDistributedExecutor` does not pass this environment on to the vllm worker processes the way vllm's [RayDistributedExecutor](https://github.com/vllm-project/vllm/blob/0d4044edd85de30d7d4558aeea4d1e95c7c556d6/vllm/v1/executor/ray_executor.py#L337) backend does. Because the workers therefore run with the wrong all2all backend, vllm-ascend produces inaccurate results.

Implementation:
1. In `vLLMAsyncRollout`, when initializing the vllm worker in an NPU scenario, add the environment variables required by vllm-ascend.
2. Add a vllm engine environment-variable setting to `rollout.yaml`, so that users can configure it themselves.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Co-authored-by: FightingZhen

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
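The root cause described above is an environment-propagation gap: a variable set in the engine process never reaches a worker process unless the executor forwards it explicitly. The failure mode can be sketched generically with subprocesses (this is an illustration only, not verl's executor code; the `wrong-default` fallback value is made up for the demo):

```python
import os
import subprocess
import sys

# The engine process decides on a backend and records it in its own
# environment -- but child workers only see what is explicitly forwarded.
os.environ["VLLM_ALL2ALL_BACKEND"] = "flashinfer_all2allv"

CHILD = "import os; print(os.environ.get('VLLM_ALL2ALL_BACKEND', 'wrong-default'))"

# Forwarding the full environment (what RayDistributedExecutor effectively does):
forwarded = subprocess.run(
    [sys.executable, "-c", CHILD],
    env={**os.environ},
    capture_output=True, text=True,
).stdout.strip()
print(forwarded)  # flashinfer_all2allv

# Dropping the variable (the failure mode this commit fixes):
# the worker silently falls back to the wrong backend.
dropped = subprocess.run(
    [sys.executable, "-c", CHILD],
    env={k: v for k, v in os.environ.items() if k != "VLLM_ALL2ALL_BACKEND"},
    capture_output=True, text=True,
).stdout.strip()
print(dropped)  # wrong-default
```

The fix sidesteps the gap for the affected variables by setting them directly in the worker's own process during initialization.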
1 parent 0faff75 commit fb7686c

File tree

7 files changed: +10 −6 lines changed


tests/special_npu/run_qwen2_5_05b_grpo.sh

Lines changed: 0 additions & 1 deletion

```diff
@@ -1,5 +1,4 @@
 set -x
-export VLLM_ASCEND_ENABLE_NZ=0
 
 MODEL_ID=${MODEL_ID:-Qwen/Qwen2.5-0.5B-Instruct}
 MODEL_PATH=${MODEL_PATH:-${HOME}/.cache/models/${MODEL_ID}}
```

tests/special_npu/run_qwen2_5_05b_grpo_mindspeed.sh

Lines changed: 0 additions & 1 deletion

```diff
@@ -1,5 +1,4 @@
 set -x
-export VLLM_ASCEND_ENABLE_NZ=0
 
 MODEL_ID=${MODEL_ID:-Qwen/Qwen2.5-0.5B-Instruct}
 MODEL_PATH=${MODEL_PATH:-${HOME}/.cache/models/${MODEL_ID}}
```

tests/special_npu/run_qwen2_5_vl_3b_npu.sh

Lines changed: 0 additions & 1 deletion

```diff
@@ -1,5 +1,4 @@
 set -x
-export VLLM_ASCEND_ENABLE_NZ=0
 
 ENGINE=${1:-vllm}
 
```
tests/special_npu/run_qwen3_06b_ppo.sh

Lines changed: 0 additions & 1 deletion

```diff
@@ -1,5 +1,4 @@
 set -x
-export VLLM_ASCEND_ENABLE_NZ=0
 
 MODEL_ID=${MODEL_ID:-Qwen/Qwen2.5-0.5B-Instruct} # TODO: change to Qwen3-0.6B when CI server is ready
 MODEL_PATH=${MODEL_PATH:-${HOME}/.cache/models/${MODEL_ID}}
```

tests/special_npu/run_qwen3_30b_grpo_mindspeed.sh

Lines changed: 0 additions & 1 deletion

```diff
@@ -1,7 +1,6 @@
 #!/usr/bin/env bash
 set -xeuo pipefail
 
-export VLLM_ASCEND_ENABLE_NZ=0
 
 MODEL_ID=${MODEL_ID:-Qwen/Qwen3-30B-A3B-Instruct-2507}
 MODEL_PATH=${MODEL_PATH:-${HOME}/.cache/models/${MODEL_ID}}
```

verl/experimental/one_step_off_policy/shell/grpo_qwen3_8b_gsm8k_fsdp2_8_8_npu.sh

Lines changed: 0 additions & 1 deletion

```diff
@@ -1,7 +1,6 @@
 # The script has been validated on the Ascend Atlas 800T A3.
 set -x
 
-export VLLM_ASCEND_ENABLE_NZ=0
 export HCCL_EXEC_TIMEOUT=60000
 export HCCL_CONNECT_TIMEOUT=7200
 
```
verl/workers/rollout/vllm_rollout/vllm_rollout.py

Lines changed: 10 additions & 0 deletions

```diff
@@ -73,6 +73,8 @@
 logger = logging.getLogger(__file__)
 logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))
 
+VLLM_ASCEND_REQUIRED_ENV_VARS = {"VLLM_ALL2ALL_BACKEND": "flashinfer_all2allv", "VLLM_ASCEND_ENABLE_NZ": "0"}
+
 # TODO
 # 1. support pp in vllm
 # 2. passing tokenizer is not necessary? no encoding/decoding is happending here
@@ -177,6 +179,14 @@ async def _loop_forever(self):
 
     def _init_worker(self, all_kwargs: list[dict[str, Any]]):
         """Initialize worker engine."""
+        # TODO: For ascend NPU, when the corresponding vllm-ascend version is upgraded to v0.13.0,
+        # please remove the VLLM_ASCEND_REQUIRED_ENV_VARS variable replacement action.
+        # This is only a fix for vllm version < v0.13.0.
+        if is_npu_available:
+            for k in VLLM_ASCEND_REQUIRED_ENV_VARS:
+                if k not in os.environ:
+                    os.environ[k] = VLLM_ASCEND_REQUIRED_ENV_VARS[k]
+
         if not torch.distributed.is_initialized():
             initialize_global_process_group_ray()
         all_kwargs[0]["rank"] = int(os.environ["RANK"])
```