[VLM, FSDP] Update Experiment Readme (#1079)

nanjiangwill · web-flow · commit b25776098b4c · 2025-12-10T17:52:00.000+08:00
diff --git a/examples/true_on_policy_vlm/README.md b/examples/true_on_policy_vlm/README.md
@@ -2,8 +2,21 @@
 
 This example demonstrates true on-policy training with the Qwen3-VL dense model on FSDP. The core concepts and expected observations are the same as in [true_on_policy](../true_on_policy/README.md).
 
+<p align="center">
+  <img src="diff.png" alt="Training Inference Log Prob Diff" width="800">
+</p>
 ## Usage
 
 ```bash
-python examples/true_on_policy_vlm/run_simple.py
+SLIME_SCRIPT_NUM_GPUS=8 python examples/true_on_policy_vlm/run_simple.py
 ```
+
+## How it is Implemented
+
+For the text backbone, please refer to [true_on_policy for the text-only model](../true_on_policy/README.md).
+
+For the VLM, the only additional requirement is that the image encoder behaves the same in both systems; see [SGLang#14636](https://github.com/sgl-project/sglang/pull/14636). The numerical details of each operation must be aligned so that the ViT forward pass produces matching results in SGLang and transformers.
+
+## Notes
+
+The true-on-policy version is expected to run slower than the regular (non-true-on-policy) setup.
diff --git a/examples/true_on_policy_vlm/diff.png b/examples/true_on_policy_vlm/diff.png
diff --git a/slime/backends/fsdp_utils/actor.py b/slime/backends/fsdp_utils/actor.py
@@ -149,9 +149,9 @@ def init(self, args: Namespace, role: str, with_ref: bool = False) -> int:  # ty
     def get_model_cls(self):
         # Vision models have `vision_config` in the config
         if hasattr(self.hf_config, "vision_config"):
-            from transformers import AutoModelForVision2Seq
+            from transformers import AutoModelForImageTextToText
 
-            return AutoModelForVision2Seq
+            return AutoModelForImageTextToText
         else:
             from transformers import AutoModelForCausalLM