[Feature Request] Qwen3-VL GRPO and SFT training

**Additional context**

Our customer would like to apply RL methods (GRPO, GSPO, and SPO) to MoE-based VLMs such as Qwen3-VL.

Would it be possible to extend the current VLM support to Qwen3-VL?

(cc @terrykong, @snowmanwwg)