TransferQueue
diff --git a/docs/advance/fully_async.md b/docs/advance/fully_async.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/advance/mtp.md b/docs/advance/mtp.md
Lines changed: 84 additions & 0 deletions
diff --git a/docs/index.rst b/docs/index.rst
Lines changed: 1 addition & 0 deletions
diff --git a/examples/mtp_trainer/runtime_env.yaml b/examples/mtp_trainer/runtime_env.yaml
Lines changed: 16 additions & 0 deletions
diff --git a/examples/mtp_trainer/test_dapo_mimo_7b_with_mtp_math_megatron.sh b/examples/mtp_trainer/test_dapo_mimo_7b_with_mtp_math_megatron.sh
Lines changed: 144 additions & 0 deletions
diff --git a/verl/experimental/fully_async_policy/README.md b/verl/experimental/fully_async_policy/README.md
Lines changed: 3 additions & 3 deletions
diff --git a/verl/experimental/fully_async_policy/README_zh.md b/verl/experimental/fully_async_policy/README_zh.md
Lines changed: 3 additions & 3 deletions
diff --git a/verl/models/mcore/model_forward.py b/verl/models/mcore/model_forward.py
Lines changed: 13 additions & 1 deletion
@@ -63,7 +63,7 @@ Currently, the supported usage mode is Megatron/FSDP+vLLM/SGLang. vLLM/SGLang mu
 The overall architecture of fully_async_policy is shown in the figure below. fully_async_policy mainly consists of four
 parts: Rollouter, MessageQueue, Trainer, and ParameterSynchronizer.
 
-![fully_async_policy_structure](https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_structure.svg?raw=true)
+![fully_async_policy_structure](https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_structure.svg?raw=true)
 
 1. Rollouter generates sequences sample by sample and puts the generated samples into the MessageQueue, with the
    production speed controlled by freshness.
@@ -79,7 +79,7 @@ After we perform resource isolation, the time for rollout and train may be longe
 are used),
 but the overlap in their time consumption reduces the end-to-end time consumption.
 
-![fully_async_policy_revenue](https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_revenue.svg?raw=true)
+![fully_async_policy_revenue](https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_revenue.svg?raw=true)
 
 ## Usage
 
@@ -246,7 +246,7 @@ but the overlap in their time consumption reduces the end-to-end time consumptio
       generated after synchronization. This reduces the time to wait for active tasks to finish.
    3. As shown in figure d;
 
-![fully_async_policy_mode](https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_mode.svg?raw=true)
+![fully_async_policy_mode](https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_mode.svg?raw=true)
 
 ### Key Metrics
 
 
@@ -0,0 +1,84 @@
+# Guide to Using MTP in RL Training and Inference
+
+**Author**: `https://github.com/meituan-search`
+
+Last updated: 01/16/2026
+
+# 1. Scope of Support
+
+RL training is currently supported for MTP-based models: MiMo-7B-RL, Qwen-next, and the DeepSeek series. Training and inference engine support is as follows:
+
+- **Training Engine**: Only the `mbridge + megatron` combination is supported; other training engines are not yet compatible;
+
+- **Inference Engine**: All engines are compatible, provided the model is on the corresponding engine's compatibility list;
+
+- **Dependency Versions**:
+
+    - mbridge: Use the `feature/verl_mtp` branch: [https://github.com/ArronHZG/mbridge/tree/feature/verl_mtp](https://github.com/ArronHZG/mbridge/tree/feature/verl_mtp) (to be merged into the main branch later);
+
+    - megatron: Use the dev version at commit [23e092f41ec8bc659020e401ddac9576c1cfed7e](https://github.com/NVIDIA/Megatron-LM/tree/23e092f41ec8bc659020e401ddac9576c1cfed7e), which supports MTP + CP training.
+
+# 2. MTP Training Configuration (Core Parameters)
+
+MTP training is controlled through the following parameters, all of which sit under the `actor_rollout_ref.model.mtp` prefix:
+
+| Configuration Scenario | Core Parameters | Description |
+|------------------------|-----------------|-------------|
+| Load MTP Parameters Only | `enable=True` | VRAM usage increases, but the exported parameters include the MTP module and can be used directly for online deployment |
+| Full-Parameter MTP Training | `enable=True`<br>`enable_train=True`<br>`mtp_loss_scaling_factor=0.1` | MTP Loss applies to all model parameters |
+| MTP Parameter-Only Training | `enable=True`<br>`enable_train=True`<br>`detach_encoder=True` | The Encoder layer is frozen and only MTP module parameters are updated; MTP Loss applies only to MTP parameters |
+| MTP Accelerated Rollout | 1. vLLM configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`method="mtp"`<br>`num_speculative_tokens=1`<br>2. SGLang configuration:<br>`enable=True`<br>`enable_rollout=True`<br>`speculative_algorithm="EAGLE"`<br>`speculative_num_steps=2`<br>`speculative_eagle_topk=2`<br>`speculative_num_draft_tokens=4` | MTP-based inference acceleration during the Rollout phase |
+
+# 3. Experimental Results
+
+Experimental setup:
+
+* model = mimo-7B-math
+* max_response_length = 8k
+
+Experiment chart:
+
+![mimo-7b-mtp](
+https://github.com/ArronHZG/verl-community/blob/main/docs/mimo-7b-mtp.png?raw=true)
+
+**Scenarios with No Significant Effect**
+
+The following configurations will not have a noticeable impact on training results:
+
+1. The base model does not carry MTP parameters;
+
+2. The base model carries MTP parameters, but the MTP module is not trained;
+
+3. The base model carries MTP parameters and trains MTP, with `mtp_loss_scaling_factor=0`;
+
+4. The base model carries MTP parameters, trains MTP and detaches the encoder, with `mtp_loss_scaling_factor=0.1`.
+
+**Scenarios with Significant Effect**
+
+Only the following configuration will have a noticeable impact on training results:
+
+- The base model carries MTP parameters, MTP Loss applies to all model parameters, and `mtp_loss_scaling_factor=0.1`.
+
+**Recommended Training Method**
+
+Train MTP with `detach_encoder=True`, so that MTP Loss updates only the MTP module parameters and does not noticeably affect training results.
+
+# 4. Performance Notes for MTP in Rollout Inference
+
+The effectiveness of MTP-accelerated Rollout is significantly affected by **model size** and **inference hardware**. Key reference information is as follows:
+
+**Hardware Tensor Core Performance**
+
+| Hardware Model | FP16 Performance (TFLOPS) |
+|----------------|---------------------------|
+| H20  | 148            |
+| H800 | 1,671          |
+| H200 | 1,979          |
+
+**Measured Performance and Recommendations**
+
+For example, with the mimo-7B model deployed separately on H20 hardware using SGLang, enabling MTP speculative decoding reduces Rollout throughput by approximately 50%.
+
+- Current recommendation: Do not enable MTP acceleration during the inference phase for now;
+
+- Future plan: The speculative decoding logic in the Rollout phase will be further optimized to improve throughput.
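
The scenarios in the configuration table of `docs/advance/mtp.md` can be written as command-line overrides under the `actor_rollout_ref.model.mtp` prefix, in the same style as the `common_params` array of the example script. The following is a minimal Python sketch of that mapping. The names `MTP_PREFIX`, `SCENARIOS`, and `mtp_overrides` are hypothetical helpers introduced only for illustration and are not part of verl; only the vLLM variant of the rollout scenario is shown.

```python
# Hypothetical helper, not part of verl: builds override strings for the
# MTP scenarios listed in the configuration table.
MTP_PREFIX = "actor_rollout_ref.model.mtp"

# Parameter values are copied from the table; scenario keys are made up here.
SCENARIOS = {
    "load_only": {"enable": True},
    "full_param_training": {
        "enable": True,
        "enable_train": True,
        "mtp_loss_scaling_factor": 0.1,
    },
    "mtp_only_training": {
        "enable": True,
        "enable_train": True,
        "detach_encoder": True,
    },
    "vllm_rollout": {
        "enable": True,
        "enable_rollout": True,
        "method": "mtp",
        "num_speculative_tokens": 1,
    },
}


def mtp_overrides(scenario: str) -> list[str]:
    """Return 'prefix.key=value' strings for one scenario."""
    return [f"{MTP_PREFIX}.{key}={value}" for key, value in SCENARIOS[scenario].items()]


if __name__ == "__main__":
    for line in mtp_overrides("mtp_only_training"):
        print(line)
```

The resulting strings can be appended to the training command in the same way the script expands `"${common_params[@]}"`.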
@@ -134,6 +134,7 @@ verl is fast with:
    advance/grafana_prometheus.md
    advance/fp8.md
    advance/async-on-policy-distill
+   advance/mtp.md
 
 .. toctree::
    :maxdepth: 1
 
@@ -0,0 +1,16 @@
+working_dir: ./
+
+excludes:
+  - ".git/"
+
+env_vars:
+  VLLM_USE_V1: "1"
+  HYDRA_FULL_ERROR: "1"
+  NCCL_NVLS_ENABLE: "0"
+  NCCL_SOCKET_IFNAME: "eth0"
+  TMPDIR: "/tmp"
+  CUDA_HOME: "/usr/local/cuda"
+  CUDA_TMPDIR: "/tmp"
+  CUDA_CACHE_PATH: "/tmp/cuda_cache"
+  HF_HOME: "/tmp/hf_home_mimo"
+  PYTHONPATH: "/tmp/hf_home_mimo/modules/"
@@ -0,0 +1,144 @@
+#!/usr/bin/env bash
+
+set -xeuo pipefail
+
+project_name='DAPO'
+exp_name='DAPO-mimo-7b-rl-megatron'
+
+adv_estimator=grpo
+
+use_kl_in_reward=False
+kl_coef=0.0
+use_kl_loss=False
+kl_loss_coef=0.0
+
+clip_ratio_low=0.2
+clip_ratio_high=0.28
+
+max_prompt_length=$((1024 * 2))
+max_response_length=$((1024 * 8))
+enable_overlong_buffer=True
+overlong_buffer_len=$((1024 * 4))
+overlong_penalty_factor=1.0
+
+loss_agg_mode="token-mean"
+
+train_prompt_bsz=128
+n_resp_per_prompt=16
+train_prompt_mini_bsz=32
+
+# Ray
+# RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"}
+# WORKING_DIR=${WORKING_DIR:-"${PWD}"}
+# RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/examples/mtp_trainer/runtime_env.yaml"}
+NNODES=${NNODES:-16}
+NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
+# Paths
+RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+# IMPORTANT: after downloading the model from Hugging Face, set max_position_embeddings to 32768 in its config.json
+MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/MiMo-7B-RL"}
+CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
+TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
+
+# Algorithm
+temperature=1.0
+top_p=1.0
+top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
+val_top_p=0.7
+
+# Performance-related parameters
+use_dynamic_bsz=True
+actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 2))
+infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 3))
+offload=True
+gen_tp=4
+train_tp=2
+train_pp=2
+train_cp=2
+
+common_params=(
+actor_rollout_ref.model.mtp.enable=True
+actor_rollout_ref.model.mtp.enable_train=True
+actor_rollout_ref.model.mtp.mtp_loss_scaling_factor=0.1
+actor_rollout_ref.model.mtp.detach_encoder=True
+)
+
+python -m verl.trainer.main_ppo \
+    --config-path=config \
+    --config-name='ppo_megatron_trainer.yaml' \
+    data.train_files="${TRAIN_FILE}" \
+    data.val_files="${TEST_FILE}" \
+    data.prompt_key=prompt \
+    data.truncation='left' \
+    data.max_prompt_length=${max_prompt_length} \
+    data.max_response_length=${max_response_length} \
+    data.train_batch_size=${train_prompt_bsz} \
+    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+    algorithm.adv_estimator=${adv_estimator} \
+    algorithm.use_kl_in_reward=${use_kl_in_reward} \
+    algorithm.kl_ctrl.kl_coef=${kl_coef} \
+    actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
+    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
+    actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
+    actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
+    actor_rollout_ref.actor.clip_ratio_c=10.0 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
+    actor_rollout_ref.model.path="${MODEL_PATH}" \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
+    actor_rollout_ref.actor.optim.weight_decay=0.1 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
+    actor_rollout_ref.actor.megatron.param_offload=${offload} \
+    actor_rollout_ref.actor.megatron.optimizer_offload=${offload} \
+    actor_rollout_ref.actor.megatron.grad_offload=${offload} \
+    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} \
+    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} \
+    actor_rollout_ref.actor.megatron.context_parallel_size=${train_cp} \
+    actor_rollout_ref.actor.entropy_coeff=0 \
+    actor_rollout_ref.actor.optim.clip_grad=1.0 \
+    actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+    actor_rollout_ref.rollout.enable_chunked_prefill=True \
+    actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
+    actor_rollout_ref.rollout.temperature=${temperature} \
+    actor_rollout_ref.rollout.top_p=${top_p} \
+    actor_rollout_ref.rollout.top_k=${top_k} \
+    actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
+    actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
+    actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
+    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+    actor_rollout_ref.rollout.val_kwargs.n=1 \
+    actor_rollout_ref.rollout.name=sglang \
+    actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${train_pp} \
+    actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${train_tp} \
+    actor_rollout_ref.ref.megatron.context_parallel_size=${train_cp} \
+    actor_rollout_ref.ref.megatron.param_offload=${offload} \
+    reward_model.reward_manager=dapo \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+    +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
+    +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
+    trainer.logger='["console","tensorboard"]' \
+    trainer.project_name="${project_name}" \
+    trainer.experiment_name="${exp_name}" \
+    trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
+    trainer.nnodes="${NNODES}" \
+    trainer.val_before_train=False \
+    trainer.test_freq=10 \
+    trainer.save_freq=-1 \
+    trainer.total_epochs=10 \
+    trainer.resume_mode=auto \
+    trainer.log_val_generations=10 \
+    actor_rollout_ref.rollout.disable_log_stats=False \
+    actor_rollout_ref.rollout.prometheus.enable=True \
+    actor_rollout_ref.rollout.prometheus.port=44398 \
+    actor_rollout_ref.model.trust_remote_code=True \
+    data.trust_remote_code=True \
+    trainer.total_training_steps=400 \
+    actor_rollout_ref.actor.megatron.use_mbridge=True \
+    "${common_params[@]}"
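
The sizes derived in the script above are easier to check when evaluated explicitly. The sketch below repeats the script's shell arithmetic in Python. Two points are assumptions rather than something the diff states: that the data-parallel size follows the usual Megatron relation `world_size / (tp * pp * cp)` with the actor spanning all GPUs, and that `ppo_mini_batch_size` is counted in prompts.

```python
# Values copied from the example script.
max_prompt_length = 1024 * 2
max_response_length = 1024 * 8
train_prompt_bsz = 128
n_resp_per_prompt = 16
train_prompt_mini_bsz = 32
nnodes, ngpus_per_node = 16, 8
train_tp, train_pp, train_cp = 2, 2, 2

# Token budgets, mirroring the $(( ... )) expressions in the script.
seq_len = max_prompt_length + max_response_length  # 10240
actor_ppo_max_token_len = seq_len * 2  # 20480
infer_ppo_max_token_len = seq_len * 3  # 30720

# Responses generated per training step (prompts x samples per prompt).
responses_per_step = train_prompt_bsz * n_resp_per_prompt  # 2048

# Assumption: mini-batch size is counted in prompts.
mini_batches_per_step = train_prompt_bsz // train_prompt_mini_bsz  # 4

# Assumption: the actor uses every GPU and follows the Megatron relation
# data_parallel = world_size / (tp * pp * cp).
world_size = nnodes * ngpus_per_node  # 128
model_parallel_size = train_tp * train_pp * train_cp  # 8
data_parallel_size = world_size // model_parallel_size  # 16

if __name__ == "__main__":
    print(seq_len, actor_ppo_max_token_len, infer_ppo_max_token_len)
    print(responses_per_step, mini_batches_per_step, data_parallel_size)
```

Note that the script defines `use_dynamic_bsz`, `actor_ppo_max_token_len`, and `infer_ppo_max_token_len` but does not pass them to the `python -m verl.trainer.main_ppo` command as shown.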
@@ -65,7 +65,7 @@ The overall architecture of fully_async_policy is shown in the figure below. ful
 parts: Rollouter, MessageQueue, Trainer, and ParameterSynchronizer.
 
 ![fully_async_policy_structure](
-https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_structure.svg?raw=true)
+https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_structure.svg?raw=true)
 
 1. Rollouter generates sequences sample by sample and puts the generated samples into the MessageQueue, with the
    production speed controlled by freshness.
@@ -82,7 +82,7 @@ are used),
 but the overlap in their time consumption reduces the end-to-end time consumption.
 
 ![fully_async_policy_revenue](
-https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_revenue.svg?raw=true)
+https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_revenue.svg?raw=true)
 
 ## Usage
 
@@ -248,7 +248,7 @@ https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_a
     3. As shown in figure d;
 
 ![fully_async_policy_mode](
-https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_mode.svg?raw=true)
+https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_mode.svg?raw=true)
 
 ### Key Metrics
 
 
@@ -46,7 +46,7 @@ rollout的训练， 通过合理设置资源分配情况、参数同步频率等
 fully_async_policy的整体架构如下图所示，fully_async_policy主要由Rollouter、MessageQueue、Trainer、ParameterSynchronizer四部分组成。
 
 ![fully_async_policy_structure](
-https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_structure.svg?raw=true)
+https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_structure.svg?raw=true)
 
 1. Rollouter逐样本生成序列，并将生成的sample放入MessageQueue中，生产的速度受新鲜度控制。
 2. MessageQueue用于暂存Rollouter生成的sample。
@@ -59,7 +59,7 @@ https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_a
 但是相互之间的耗时overlap，端到端的耗时反而有所缩减。
 
 ![fully_async_policy_revenue](
-https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_revenue.svg?raw=true)
+https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_revenue.svg?raw=true)
 
 ## 使用方式
 
@@ -199,7 +199,7 @@ https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_a
     3. 如图d所示；
 
 ![fully_async_policy_mode](
-https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_async_policy_mode.svg?raw=true)
+https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_mode.svg?raw=true)
 
 ### 关键指标
 
 
@@ -17,6 +17,7 @@
 import torch
 
 from verl.utils.megatron_utils import unwrap_model
+from verl.workers.config import MtpConfig
 
 from .util import (
     postprocess_bshd,
@@ -41,6 +42,7 @@ def model_forward(
         logits_processor_args: dict = None,
         value_model=False,
         data_format: str = "thd",
+        mtp_config: MtpConfig = None,
     ):
         """Forward pass for models with sequence packing."""
         assert data_format in ["thd", "bshd"], "data_format must be 'thd' or 'bshd'"
@@ -65,10 +67,19 @@ def model_forward(
         batch_size, seq_len = attention_mask.shape[:2]
         if data_format == "thd":
             input_ids_rmpad, packed_seq_params = preprocess_packed_seqs(
-                input_ids, attention_mask, pre_process=pre_process, use_fp8_padding=use_fp8_padding
+                input_ids, attention_mask, pre_process=pre_process or post_process, use_fp8_padding=use_fp8_padding
             )
             input_ids_rmpad = input_ids_rmpad.contiguous()
 
+            # when MTP training is enabled, pass the packed labels and loss_mask to the model on the post_process stage
+            if mtp_config and mtp_config.enable_train and post_process:
+                args = {
+                    k: preprocess_packed_seqs(v, attention_mask, pre_process=True, use_fp8_padding=use_fp8_padding)[0]
+                    for k, v in logits_processor_args.items()
+                }
+                model_kwargs["labels"] = args["label"].contiguous()
+                model_kwargs["loss_mask"] = args["label_mask"].contiguous()
+
             input_args = dict(
                 input_ids=input_ids_rmpad,
                 attention_mask=None,
@@ -86,6 +97,7 @@ def model_forward(
                 input_args["attention_mask"] = attention_mask
 
             output_orig = model(**input_args)
+
             if post_process and logits_processor is not None:
                 args = {
                     k: preprocess_packed_seqs(v, attention_mask, pre_process=True, use_fp8_padding=use_fp8_padding)[0]