Commit e1cd47b

[algo, rollout, sglang] feat: Support router replay with sglang (verl-project#4840)
### What does this PR do?

Support router replay with sglang.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

Maybe use with sgl-project/sglang#15751 if you want to set `chunked_prefill_size = -1`.

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
1 parent 94f4654 commit e1cd47b

File tree

4 files changed: +154 −11 lines changed

examples/router_replay/README.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -69,4 +69,4 @@ actor_rollout_ref.actor.router_replay.mode="R3"
 actor_rollout_ref.rollout.enable_rollout_routing_replay=True
 ```
 
-R3 mode requires the rollout backend to support returning router selection results. Currently, this functionality is being tested based on the vllm implementation at https://github.com/vllm-project/vllm/pull/28284.
+R3 mode requires the rollout backend to support returning router selection results. Currently, this functionality is being tested based on the vllm implementation at https://github.com/vllm-project/vllm/pull/28284 and the SGLang implementation at https://github.com/sgl-project/sglang/commit/bed301a5acaa9577c9aa706468bdf242f6a43051.
````
Lines changed: 110 additions & 0 deletions

```shell
set -x

NODES=6

# R2: enable routing replay
# R3: enable rollout routing replay
# If enabling R3, please set actor_rollout_ref.rollout.enable_rollout_routing_replay=True
# The R3 example is based on SGLang commit https://github.com/sgl-project/sglang/commit/bed301a5acaa9577c9aa706468bdf242f6a43051

ROUTING_REPLAY_MODE="R3"

DIST_CKPT_PATH=""
HF_MODEL_PATH=""
TRAIN_DATA_PATH=""
TEST_DATA_PATH=""

export CUDA_DEVICE_MAX_CONNECTIONS=1 # For Megatron communication/computation overlapping
PP=6
VPP=None
TP=1
EP=8
ETP=1
SGLANG_INFER_TP=4
offload=True
gpu_memory_utilization=0.65
bs=3
micro_bs=3
use_dynamic_bsz=False
max_prompt_length=512
max_response_length=512
ppo_mini_batch_size=3
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))

exper_name=Node${NODES}_bs${bs}_${PP}${TP}${EP}${ETP}_${SGLANG_INFER_TP}_minbs${ppo_mini_batch_size}_micro_bs${micro_bs}

python3 -m verl.trainer.main_ppo --config-path=config \
    --config-name='ppo_megatron_trainer.yaml' \
    algorithm.adv_estimator=grpo \
    data.train_files=$TRAIN_DATA_PATH \
    data.val_files=$TEST_DATA_PATH \
    data.train_batch_size=$bs \
    data.max_prompt_length=$max_prompt_length \
    data.max_response_length=$max_response_length \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.use_fused_kernels=True \
    actor_rollout_ref.model.path=$HF_MODEL_PATH \
    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
    actor_rollout_ref.actor.router_replay.mode=${ROUTING_REPLAY_MODE} \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_enable_deepep=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_token_dispatcher_type=flex \
    +actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.bias_activation_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype=fp32 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform \
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full \
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=False \
    actor_rollout_ref.actor.megatron.param_offload=${offload} \
    actor_rollout_ref.actor.megatron.optimizer_offload=${offload} \
    actor_rollout_ref.actor.megatron.grad_offload=${offload} \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=$ppo_mini_batch_size \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=$micro_bs \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=$PP \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=$TP \
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=$EP \
    actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ETP \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=$micro_bs \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$SGLANG_INFER_TP \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.enable_rollout_routing_replay=True \
    actor_rollout_ref.rollout.skip_tokenizer_init=True \
    actor_rollout_ref.rollout.mode=async \
    actor_rollout_ref.actor.megatron.use_mbridge=True \
    actor_rollout_ref.rollout.gpu_memory_utilization=$gpu_memory_utilization \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.enable_chunked_prefill=True \
    actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=$micro_bs \
    actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=$PP \
    actor_rollout_ref.ref.megatron.tensor_model_parallel_size=$TP \
    actor_rollout_ref.ref.megatron.expert_model_parallel_size=$EP \
    actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$ETP \
    actor_rollout_ref.ref.megatron.param_offload=${offload} \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console'] \
    trainer.project_name='verl_grpo_example_gsm8k_math' \
    trainer.experiment_name="$exper_name" \
    trainer.nnodes=$NODES \
    trainer.n_gpus_per_node=8 \
    trainer.save_freq=-1 \
    trainer.test_freq=10 \
    trainer.total_training_steps=50000 \
    trainer.balance_batch=False \
    trainer.val_before_train=False 2>&1
```
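The token-length budgets in the script are simple functions of the prompt and response lengths; a quick sanity check of the arithmetic with the script's defaults:

```shell
# With 512 prompt + 512 response tokens, both actor_ppo_max_token_len and
# infer_ppo_max_token_len come out to (512 + 512) * 2 = 2048.
max_prompt_length=512
max_response_length=512
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
echo "$actor_ppo_max_token_len"
```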

verl/experimental/agent_loop/agent_loop.py

Lines changed: 6 additions & 1 deletion
```diff
@@ -587,7 +587,12 @@ async def _agent_loop_postprocess(self, output, **kwargs) -> _InternalAgentLoopO
         if output.routed_experts is not None:
             total_length = input_ids.shape[1]
             length, layer_num, topk_num = output.routed_experts.shape
-            experts_tensor = torch.from_numpy(output.routed_experts)
+            if isinstance(output.routed_experts, np.ndarray):
+                experts_tensor = torch.from_numpy(output.routed_experts)
+            elif isinstance(output.routed_experts, torch.Tensor):
+                experts_tensor = output.routed_experts
+            else:
+                raise TypeError(f"Unsupported type for routed_experts: {type(output.routed_experts)}")
             routed_experts = torch.zeros(1, total_length, layer_num, topk_num, dtype=experts_tensor.dtype)
 
             # Calculate start position: left padding means original prompt starts at the end
```
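The change above normalizes `routed_experts` to a `torch.Tensor` before it is copied into the padded buffer, since the sglang path can return either a NumPy array or a tensor. A standalone sketch of the same dispatch (the helper name is hypothetical; the PR performs this inline):

```python
import numpy as np
import torch

def as_experts_tensor(routed_experts):
    """Accept routed-expert ids as either np.ndarray or torch.Tensor."""
    if isinstance(routed_experts, np.ndarray):
        # Zero-copy view over the NumPy buffer.
        return torch.from_numpy(routed_experts)
    if isinstance(routed_experts, torch.Tensor):
        # Already a tensor (e.g. from the sglang backend): pass through.
        return routed_experts
    raise TypeError(f"Unsupported type for routed_experts: {type(routed_experts)}")

# Both input types yield a tensor with the same (length, layer_num, topk_num) shape.
from_numpy = as_experts_tensor(np.zeros((5, 4, 2), dtype=np.int64))
passthrough = as_experts_tensor(torch.zeros(5, 4, 2, dtype=torch.int64))
print(from_numpy.shape, passthrough.shape)
```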

verl/workers/rollout/sglang_rollout/async_sglang_server.py

Lines changed: 37 additions & 9 deletions
```diff
@@ -200,6 +200,9 @@ async def launch_server(self, master_address: str = None, master_port: int = Non
         enable_weights_cpu_backup = True if self.rollout_mode == RolloutMode.COLOCATED else False
         args["enable_weights_cpu_backup"] = enable_weights_cpu_backup
 
+        if self.config.enable_rollout_routing_replay:
+            args.update({"enable_return_routed_experts": True})
+
         # NOTE: We can't directly call SGLang's launch_server since it's not an async function.
         # https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/http_server.py
         sglang.srt.entrypoints.engine._set_envs_and_config = _set_envs_and_config
@@ -297,16 +300,22 @@ async def generate(
         sampling_params["max_new_tokens"] = max_new_tokens
         return_logprob = sampling_params.pop("logprobs", False)
 
-        request = GenerateReqInput(
-            rid=request_id,
-            input_ids=prompt_ids,
-            sampling_params=sampling_params,
-            return_logprob=return_logprob,
-            image_data=image_data,
+        request = {
+            "rid": request_id,
+            "input_ids": prompt_ids,
+            "sampling_params": sampling_params,
+            "return_logprob": return_logprob,
+            "image_data": image_data,
             # TODO: support video input for sglang
             # video_data=video_data,
-        )
-        output = await self.tokenizer_manager.generate_request(request, None).__anext__()
+        }
+
+        if self.config.enable_rollout_routing_replay:
+            request.update({"return_routed_experts": True})
+
+        generate_request = GenerateReqInput(**request)
+
+        output = await self.tokenizer_manager.generate_request(generate_request, None).__anext__()
         if return_logprob:
             output_token_logprobs = output["meta_info"]["output_token_logprobs"]
             log_probs, token_ids = zip(
@@ -315,7 +324,26 @@ async def generate(
         else:
             token_ids = output["output_ids"]
             log_probs = None
-        return TokenOutput(token_ids=token_ids, log_probs=log_probs)
+
+        routed_experts = None
+        if self.config.enable_rollout_routing_replay:
+            if self.config.skip_tokenizer_init:
+                routed_experts = output.get("meta_info", {}).get("routed_experts", None)
+            else:
+                from sglang.srt.layers.moe.routed_experts_capturer import extract_routed_experts_from_meta_info
+
+                hf_config = self.model_config.hf_config
+                if not hasattr(hf_config, "num_hidden_layers") or not hasattr(hf_config, "num_experts_per_tok"):
+                    raise AttributeError(
+                        "enable_rollout_routing_replay is set, but hf_config is missing "
+                        "'num_hidden_layers' or 'num_experts_per_tok'. This feature requires an MoE model "
+                        "configuration that defines these attributes."
+                    )
+                routed_experts = extract_routed_experts_from_meta_info(output).reshape(
+                    -1, hf_config.num_hidden_layers, hf_config.num_experts_per_tok
+                )
+
+        return TokenOutput(token_ids=token_ids, log_probs=log_probs, routed_experts=routed_experts)
 
 
 _rollout_worker_actor_cls = ray.remote(ServerAdapter)
```
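In the non-`skip_tokenizer_init` path above, the flat routed-experts payload is reshaped using the model's layer count and top-k width. A minimal sketch of that reshape, with stand-in values for `hf_config.num_hidden_layers` and `hf_config.num_experts_per_tok` (the real values come from the MoE model's config):

```python
import torch

# Stand-ins for hf_config.num_hidden_layers / hf_config.num_experts_per_tok
# (assumed small values for illustration).
num_hidden_layers = 4
num_experts_per_tok = 2
num_tokens = 3

# The server returns one expert id per (token, layer, topk) triple, flattened
# into a single sequence.
flat = torch.arange(num_tokens * num_hidden_layers * num_experts_per_tok)

# reshape(-1, layers, topk) recovers the per-token layout that the replay
# path consumes; the leading -1 infers the token count from the flat length.
routed_experts = flat.reshape(-1, num_hidden_layers, num_experts_per_tok)
print(routed_experts.shape)  # torch.Size([3, 4, 2])
```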
