
Commit 187fde9

feat(robo2vlm_sft): support qwen3 vl model sft in rlinf (RLinf#781)
Signed-off-by: FxxxxU <fu18801374388@163.com>
Parent commit: 08330be

File tree

13 files changed: +354 −55 lines


.github/workflows/sft-e2e-tests.yml

Lines changed: 1 addition & 0 deletions

@@ -48,6 +48,7 @@ jobs:
           export UV_PYTHON_INSTALL_DIR=/workspace/dataset/.uv_python
           export MEGATRON_PATH=/workspace/dataset/Megatron-LM
           bash requirements/install.sh agentic
+          uv pip install transformers==4.57.1

       - name: SFT Robo2vlm train test
         timeout-minutes: 20
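The pin above likely exists because the Qwen3-VL model classes only ship in recent transformers releases. As a rough sketch (the ">= 4.57.0" floor is an assumption for illustration, not stated in this commit), such a requirement could be checked like this:

```python
# Sketch of a version gate; the (4, 57, 0) floor for Qwen3-VL support
# is an assumption for illustration, not taken from this commit.
def version_tuple(version: str) -> tuple:
    """Parse a dotted version string like '4.57.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split(".")[:3])

def supports_qwen3_vl(transformers_version: str) -> bool:
    """Return True if the given transformers version meets the assumed floor."""
    return version_tuple(transformers_version) >= (4, 57, 0)
```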

docs/source-en/rst_source/examples/embodied/sft_vlm.rst

Lines changed: 9 additions & 9 deletions

@@ -1,16 +1,16 @@
-VLM Supervised Fine-Tuning (SFT)
+VLM Supervised Fine-Tuning
 ================================

 This document explains how to run **full-parameter supervised fine-tuning (Full-parameter SFT)** for VLM models in RLinf.

 This tutorial mainly focuses on two files:

 - Launch script: ``examples/sft/run_vlm_sft.sh``
-- Training config: ``examples/sft/config/custom_sft_vlm.yaml``
+- Training config: ``examples/sft/config/qwen2_5_sft_vlm.yaml``

 Launch Script: ``examples/sft/run_vlm_sft.sh``

-- The script uses ``examples/sft/config/custom_sft_vlm.yaml`` by default.
+- The script uses ``examples/sft/config/qwen2_5_sft_vlm.yaml`` by default.
 - Logs are redirected to: ``<repo>/logs/<timestamp>/``
 - Actual command:

@@ -21,7 +21,7 @@ Launch Script: ``examples/sft/run_vlm_sft.sh``
       --config-name <your_config_name> \
       runner.logger.log_path=<auto_generated_log_dir>

-Config Template: ``examples/sft/config/custom_sft_vlm.yaml``
+Config Template: ``examples/sft/config/qwen2_5_sft_vlm.yaml``

 The VLM config structure is similar to other RLinf training configs.
 You mainly need to adapt ``data`` and ``actor.model`` for your VLM use case.

@@ -35,11 +35,11 @@ Preparation Before Running
    ``https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct``.
 3. Prepare Robo2VLM dataset:
    ``https://huggingface.co/datasets/keplerccc/Robo2VLM-1``.
-4. Edit ``examples/sft/config/custom_sft_vlm.yaml`` and run
+4. Edit ``examples/sft/config/qwen2_5_sft_vlm.yaml`` and run
    ``examples/sft/run_vlm_sft.sh``.

-Example YAML
-------------
+Example of Qwen2_5_VL_4B SFT
+----------------------------

 Important note: after downloading Robo2VLM, train and eval parquet files are mixed in one directory
 (e.g., ``train-00000-of-00262.parquet`` and ``test-0000X-of-00003.parquet``).

@@ -153,7 +153,7 @@ Run from repository root:

 Notes:

-- If no argument is provided, the script uses ``custom_sft_vlm`` by default.
+- If no argument is provided, the script uses ``qwen2_5_sft_vlm`` by default.
 - If your config name is different (e.g., ``my_vlm_config.yaml``), pass it as an argument:

 .. code:: bash

@@ -230,7 +230,7 @@ Update these fields first:
 - ``convertor.ckpt_path``: path to ``full_weights.pt``
 - ``convertor.save_path``: output HF model directory
 - ``model.model_path``: base model path
-- ``model.model_type``: model type (e.g., ``qwen2.5_vl``)
+- ``model.model_type``: model type (e.g., ``qwen2.5_vl``, ``qwen3_vl``, or ``qwen3_vl_moe``)

 Run:
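The "important note" in the doc above says the downloaded Robo2VLM directory mixes train and eval parquet shards. A minimal sketch of splitting them, assuming shards are distinguishable purely by their ``train-``/``test-`` filename prefixes (the helper name and target directories are hypothetical):

```python
import shutil
from pathlib import Path

def split_parquet_shards(src_dir: str, train_dir: str, eval_dir: str) -> None:
    """Move train-*.parquet and test-*.parquet shards into separate folders."""
    src, train, eval_ = Path(src_dir), Path(train_dir), Path(eval_dir)
    train.mkdir(parents=True, exist_ok=True)
    eval_.mkdir(parents=True, exist_ok=True)
    for shard in src.glob("*.parquet"):
        if shard.name.startswith("train-"):
            shutil.move(str(shard), str(train / shard.name))
        elif shard.name.startswith("test-"):
            shutil.move(str(shard), str(eval_ / shard.name))
```

After splitting, point ``data.train_data_paths`` and ``data.eval_data_paths`` at the two separate folders.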

docs/source-zh/rst_source/examples/embodied/sft_vlm.rst

Lines changed: 8 additions & 7 deletions

@@ -6,13 +6,13 @@ VLM supervised fine-tuning
 This tutorial focuses on two files:

 - Launch script: ``examples/sft/run_vlm_sft.sh``
-- Training config: ``examples/sft/config/custom_sft_vlm.yaml``
+- Training config: ``examples/sft/config/qwen2_5_sft_vlm.yaml``

 ----------------------

 Launch script: ``examples/sft/run_vlm_sft.sh``

-- The script currently uses the config yaml file ``examples/sft/config/custom_sft_vlm.yaml`` by default
+- The script currently uses the config yaml file ``examples/sft/config/qwen2_5_sft_vlm.yaml`` by default
 - Redirected output goes to: ``<repo>/logs/<timestamp>/``
 - Actual command executed:

@@ -23,7 +23,7 @@ VLM supervised fine-tuning
       --config-name <your_config_name> \
       runner.logger.log_path=<auto_generated_log_dir>

-Config template: ``examples/sft/config/custom_sft_vlm.yaml``
+Config template: ``examples/sft/config/qwen2_5_sft_vlm.yaml``

 The VLM config has basically the same structure as other RL training configs in RLinf; set the values under ``data`` and ``actor.model`` to your VLM scenario.

@@ -33,9 +33,10 @@ VLM supervised fine-tuning
 1. Prepare the environment and download the official RLinf image ``rlinf/rlinf:math-rlinf0.1-torch2.6.0-sglang0.4.6.post5-vllm0.8.5-megatron0.13.0-te2.1``
 2. Prepare the model weights directory; download from ``https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct``
 3. Prepare the Robo2VLM dataset directory: ``https://huggingface.co/datasets/keplerccc/Robo2VLM-1``
-4. Edit ``examples/sft/config/custom_sft_vlm.yaml`` and run the script ``examples/sft/run_vlm_sft.sh``
+4. Edit ``examples/sft/config/qwen2_5_sft_vlm.yaml`` and run the script ``examples/sft/run_vlm_sft.sh``

-Below is an example yaml file
+Example of Qwen2.5-VL-4B SFT
+--------------------------------

 Note that after downloading the Robo2VLM dataset, the train and evaluate data are placed together, named ``train-00000-of-00262.parquet`` and ``test-0000X-of-00003.parquet``; they must be separated into different folders, otherwise RLinf will read the entire dataset directly.

@@ -148,7 +149,7 @@ VLM supervised fine-tuning
 Notes:

-- If no argument is passed, the script defaults to ``custom_sft_vlm``
+- If no argument is passed, the script defaults to ``qwen2_5_sft_vlm``
 - If your file name is different, e.g. ``my_vlm_config.yaml``, pass it as an argument:

 .. code:: bash

@@ -228,7 +229,7 @@ loss curve:
 - ``convertor.ckpt_path``: points to ``full_weights.pt``
 - ``convertor.save_path``: output HF weights directory
 - ``model.model_path``: path to the original base model
-- ``model.model_type``: the corresponding model type (e.g., qwen2.5_vl)
+- ``model.model_type``: the corresponding model type (e.g., ``qwen2.5_vl``, ``qwen3_vl``, or ``qwen3_vl_moe``)

 Run command:
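The converter fields listed in the hunk above can be sanity-checked before launching a conversion. A minimal sketch, assuming the supported type strings are exactly the three named in the doc (the helper name and flat-dict shape are hypothetical, not RLinf's actual API):

```python
# Hypothetical pre-flight check for the checkpoint-conversion settings.
SUPPORTED_MODEL_TYPES = {"qwen2.5_vl", "qwen3_vl", "qwen3_vl_moe"}

def validate_convertor_cfg(cfg: dict) -> None:
    """Raise ValueError if a required conversion field is missing or invalid."""
    for key in ("ckpt_path", "save_path", "model_path", "model_type"):
        if not cfg.get(key):
            raise ValueError(f"missing convertor field: {key}")
    if cfg["model_type"] not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"unsupported model_type: {cfg['model_type']}")
```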

examples/sft/config/custom_sft_openpi.yaml

Lines changed: 1 addition & 1 deletion

@@ -63,7 +63,7 @@ actor:
   # Override the default values in training_backend/fsdp
   fsdp_config:
     strategy: "fsdp"
-    sharding_strategy: "no_shard"
+    sharding_strategy: "full_shard"
     use_orig_params: False
     gradient_checkpointing: False  # for openpi, gradient checkpointing is not supported, please do not change this value
   mixed_precision:
Lines changed: 2 additions & 3 deletions

@@ -75,12 +75,11 @@ actor:
     total_training_steps: ${runner.max_epochs}
     lr_warmup_steps: 200

-  # Override the default values in training_backend/fsdp
   fsdp_config:
     strategy: "fsdp"
-    sharding_strategy: "no_shard"
+    sharding_strategy: "full_shard"
     use_orig_params: False
-    gradient_checkpointing: False  # for openpi, gradient checkpointing is not supported, please do not change this value
+    gradient_checkpointing: False
   mixed_precision:
     param_dtype: bf16
     reduce_dtype: fp32
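The ``no_shard`` → ``full_shard`` change switches FSDP from replicating parameters on every data-parallel rank (DDP-like) to partitioning them evenly. A rough sketch of the per-rank parameter-memory difference (this simplified model is an illustration, not RLinf or PyTorch code):

```python
# Simplified per-rank parameter memory under the two FSDP sharding modes.
# "no_shard" keeps a full replica per rank; "full_shard" divides parameters
# evenly across ranks (ceil to cover uneven splits).
def fsdp_param_bytes_per_rank(total_param_bytes: int, world_size: int,
                              sharding_strategy: str) -> int:
    if sharding_strategy == "no_shard":
        return total_param_bytes
    if sharding_strategy == "full_shard":
        return -(-total_param_bytes // world_size)  # ceil division
    raise ValueError(f"unknown sharding_strategy: {sharding_strategy}")
```

For example, an 8 GB parameter set on 8 ranks costs 8 GB per rank under ``no_shard`` but roughly 1 GB per rank under ``full_shard``, at the price of all-gathering shards during compute.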
Lines changed: 92 additions & 0 deletions

@@ -0,0 +1,92 @@
+defaults:
+  - override hydra/job_logging: stdout
+
+hydra:
+  run:
+    dir: .
+  output_subdir: null
+
+cluster:
+  num_nodes: 1
+  component_placement:
+    actor: all
+
+runner:
+  task_type: sft
+  logger:
+    log_path: ${runner.output_dir}/${runner.experiment_name}
+    project_name: rlinf
+    experiment_name: ${runner.experiment_name}
+    logger_backends: ["tensorboard"] # wandb, swanlab
+
+  # the sft runner uses len(dataset) as one epoch
+  max_epochs: 6000
+  max_steps: -1
+  # interval at which to eval the model
+  val_check_interval: 1000
+  # interval at which to save the model
+  save_interval: 1000
+  experiment_name: qwen3_vl_sft
+  output_dir: ../results
+  resume_dir: null
+
+data:
+  type: vlm
+  dataset_name: "robo2vlmsft"
+  apply_chat_template: True
+  use_chat_template: True
+  # if train_data_paths is None, the sft code will just eval the model
+  train_data_paths: "/path/to/Robo2VLM-1"
+  eval_data_paths: "/path/to/Robo2VLM-1"
+  prompt_key: "question"
+  choice_key: "choices"
+  answer_key: "correct_answer"
+  image_keys: ["image"]
+  max_prompt_length: 1024
+  lazy_loading: false
+  num_workers: 16
+  answer_separator: ""
+
+algorithm:
+  adv_type: gae
+
+actor:
+  group_name: "ActorGroup"
+  training_backend: "fsdp"
+  micro_batch_size: 4
+  eval_batch_size: 4
+  global_batch_size: 256
+  seed: 42
+
+  model:
+    model_type: "qwen3_vl"
+    precision: fp32
+    model_path: "/path/to/Qwen3-VL-4B-Instruct"
+    is_lora: False
+
+  optim:
+    lr: 1e-5
+    adam_beta1: 0.9
+    adam_beta2: 0.999
+    adam_eps: 1.0e-08
+    weight_decay: 0.01
+    clip_grad: 1.0
+    lr_scheduler: "cosine"
+    total_training_steps: ${runner.max_epochs}
+    lr_warmup_steps: 200
+
+  fsdp_config:
+    strategy: "fsdp"
+    sharding_strategy: "full_shard"
+    use_orig_params: False
+    gradient_checkpointing: False
+    mixed_precision:
+      param_dtype: bf16
+      reduce_dtype: fp32
+      buffer_dtype: bf16
+
+reward:
+  use_reward_model: False
+
+critic:
+  use_critic_model: False
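The batch fields in the new config above (``micro_batch_size: 4``, ``global_batch_size: 256``) imply a gradient-accumulation factor that depends on the data-parallel world size. A sketch of the arithmetic (the exact formula RLinf uses internally is an assumption here):

```python
# Assumed relation: global batch = micro batch * dp ranks * accumulation steps.
def grad_accum_steps(global_batch_size: int, micro_batch_size: int,
                     dp_world_size: int) -> int:
    """Number of micro-batches accumulated before each optimizer step."""
    per_step = micro_batch_size * dp_world_size
    if global_batch_size % per_step != 0:
        raise ValueError("global batch must be divisible by micro_batch * dp ranks")
    return global_batch_size // per_step
```

With the config above on 8 data-parallel GPUs, each rank would accumulate 256 / (4 × 8) = 8 micro-batches per optimizer step.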

examples/sft/config/robotwin_sft_openpi.yaml

Lines changed: 1 addition & 1 deletion

@@ -72,7 +72,7 @@ actor:
   # Override the default values in training_backend/fsdp
   fsdp_config:
     strategy: "fsdp"
-    sharding_strategy: "no_shard"
+    sharding_strategy: "full_shard"
     use_orig_params: False
     gradient_checkpointing: False  # for openpi, gradient checkpointing is not supported, please do not change this value
   mixed_precision:

examples/sft/run_vlm_sft.sh

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@ export SRC_FILE="${VLM_PATH}/train_vlm_sft.py"
 export PYTHONPATH=${REPO_PATH}:${LIBERO_REPO_PATH}:$PYTHONPATH

 if [ -z "$1" ]; then
-  CONFIG_NAME="custom_sft_vlm"
+  CONFIG_NAME="qwen2_5_sft_vlm"
 else
   CONFIG_NAME=$1
 fi
```
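The default-argument branch above is equivalent to POSIX ``${1:-default}`` parameter expansion; a sketch of the same selection logic in one line:

```shell
# Use the first positional argument as the config name, falling back to
# the default config when no argument is given.
CONFIG_NAME="${1:-qwen2_5_sft_vlm}"
echo "config: ${CONFIG_NAME}"
```

Run with no arguments this prints `config: qwen2_5_sft_vlm`; run as `./run_vlm_sft.sh my_vlm_config` it prints the passed name instead.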

rlinf/hybrid_engines/fsdp/fsdp_model_manager.py

Lines changed: 22 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -193,28 +193,32 @@ def _optimize_with_liger_kernel(self, model: torch.nn.Module) -> None:
193193
from liger_kernel.transformers import (
194194
apply_liger_kernel_to_qwen2,
195195
apply_liger_kernel_to_qwen2_5_vl,
196+
apply_liger_kernel_to_qwen3_moe,
197+
apply_liger_kernel_to_qwen3_vl,
198+
apply_liger_kernel_to_qwen3_vl_moe,
196199
)
197200

201+
LIGER_COMMON_KWARGS = {
202+
"rope": True,
203+
"rms_norm": True,
204+
"swiglu": True,
205+
"fused_linear_cross_entropy": True,
206+
}
207+
208+
_liger_func_by_model = {
209+
SupportedModel.QWEN2_5: apply_liger_kernel_to_qwen2,
210+
SupportedModel.QWEN2_5_VL: apply_liger_kernel_to_qwen2_5_vl,
211+
SupportedModel.QWEN2_5_VL_SFT: apply_liger_kernel_to_qwen2_5_vl,
212+
SupportedModel.QWEN3_VL_SFT: apply_liger_kernel_to_qwen3_vl,
213+
SupportedModel.QWEN3_MOE: apply_liger_kernel_to_qwen3_moe,
214+
SupportedModel.QWEN3_VL_MOE_SFT: apply_liger_kernel_to_qwen3_vl_moe,
215+
}
216+
198217
MODEL_LIGER_KERNEL_APPLY_FUNC = {
199-
SupportedModel.QWEN2_5: (
200-
apply_liger_kernel_to_qwen2,
201-
{
202-
"rope": True,
203-
"rms_norm": True,
204-
"swiglu": True,
205-
"fused_linear_cross_entropy": True,
206-
},
207-
),
208-
SupportedModel.QWEN2_5_VL: (
209-
apply_liger_kernel_to_qwen2_5_vl,
210-
{
211-
"rope": True,
212-
"rms_norm": True,
213-
"swiglu": True,
214-
"fused_linear_cross_entropy": True,
215-
},
216-
),
218+
model_type: (apply_fn, dict(LIGER_COMMON_KWARGS))
219+
for model_type, apply_fn in _liger_func_by_model.items()
217220
}
221+
218222
model_type = get_supported_model(
219223
self._cfg.model.get("model_type", "").lower()
220224
)
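The refactor above deduplicates the repeated kwargs dict and builds each registry entry with ``dict(LIGER_COMMON_KWARGS)`` rather than sharing a single dict object. A self-contained sketch of why the per-entry copy matters, using stand-in names instead of the real liger-kernel apply functions:

```python
COMMON_KWARGS = {"rope": True, "rms_norm": True,
                 "swiglu": True, "fused_linear_cross_entropy": True}

# Stand-ins for the real apply_liger_kernel_to_* functions.
_apply_funcs = {"qwen2_5_vl": lambda **kw: kw, "qwen3_vl": lambda **kw: kw}

# Each entry gets its own kwargs copy, so mutating one model's kwargs
# cannot leak into the other entries.
REGISTRY = {name: (fn, dict(COMMON_KWARGS)) for name, fn in _apply_funcs.items()}

REGISTRY["qwen3_vl"][1]["swiglu"] = False  # hypothetical per-model tweak
```

Had the comprehension reused ``COMMON_KWARGS`` directly, the tweak would have flipped ``swiglu`` for every model in the registry.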
