
Commit c27dd1a

[template] update qwen3_vl grounding dataset format (#6178)
1 parent a472a9e commit c27dd1a

File tree

5 files changed: +27 −9 lines

docs/source/Customization/自定义数据集.md

Lines changed: 7 additions & 3 deletions
@@ -204,8 +204,12 @@ alpaca format:
 ```
 Note the following when using this type of data:
 - Different models use different special characters and dataset formats for grounding tasks.
-- Models differ in whether bbox coordinates are normalized. For example, qwen2.5-vl uses absolute coordinates, while qwen2-vl and internvl2.5 require bbox coordinates normalized to the thousandth scale.
+- Models differ in whether bbox coordinates are normalized. For example, qwen2.5-vl uses absolute coordinates, while qwen2/3-vl and internvl2.5 require bbox coordinates normalized to the thousandth scale.
 - Note: Qwen2.5-VL uses absolute coordinates, so be careful with every image resize. If you use the Option 1 dataset format, you need to resize images in advance (H and W must be multiples of 28) and scale the coordinate points accordingly. If you use the Option 2 dataset format, ms-swift handles image resizing for you; you can still use `MAX_PIXELS` or `--max_pixels` to resize images (training only; for inference you still need to handle image resizing yourself).
+- For Qwen2.5-VL/Qwen3-VL, you can set the environment variable `QWENVL_BBOX_FORMAT='new'` (default: 'legacy') for compatibility with the [official cookbook](https://github.com/QwenLM/Qwen3-VL/blob/main/cookbooks/2d_grounding.ipynb) format, and define the dataset as follows:
+```jsonl
+{"messages": [{"role": "user", "content": "<image>Locate the <ref-object> in the image"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["sheep", "sheep", "sheep"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
+```

 2. Use ms-swift's grounding data format:

@@ -215,8 +219,8 @@ alpaca format:
 {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "<image>Open Google Chrome for me"}, {"role": "assistant", "content": "Action: click(start_box='<bbox>')"}], "images": ["/xxx/x.jpg"], "objects": {"ref": [], "bbox": [[615, 226]]}}
 ```
 This format automatically converts the dataset to the corresponding model's grounding task format and selects that model's bbox normalization method. Compared with the general format, it adds an objects field with the following subfields:
-- ref: used to replace `<ref-object>`.
-- bbox: used to replace `<bbox>`. If each box in bbox has length 2, it represents x and y coordinates; if length 4, it represents the x and y coordinates of two points.
+- ref: used to replace `<ref-object>`. The length of ref must match the number of `<ref-object>` occurrences.
+- bbox: used to replace `<bbox>`. If each box in bbox has length 2, it represents x and y coordinates; if length 4, it represents the x and y coordinates of two points. The length of bbox must match the number of `<bbox>` occurrences.
 - Note: `<ref-object>` and `<bbox>` have no pairing relationship; ref and bbox each replace their own placeholders.
 - bbox_type: options are 'real' and 'norm1'. The default is 'real', meaning bbox holds real values; 'norm1' means the bbox is already normalized to 0~1.
 - image_id: effective only when bbox_type is 'real'. Indicates which image the bbox belongs to, used for scaling the bbox. The index starts at 0 and defaults to 0 for all.
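As a concrete illustration of the thousandth-scale normalization mentioned above, the sketch below maps an absolute-pixel bbox into the 0–1000 range. The helper name and rounding choice are hypothetical, not ms-swift's actual implementation:

```python
def norm_bbox_to_thousandth(bbox, width, height):
    """Map an absolute-pixel bbox [x1, y1, x2, y2] into the 0-1000 integer
    range used by models such as qwen2/3-vl and internvl2.5.

    Hypothetical helper for illustration; rounding strategy is an assumption.
    """
    x1, y1, x2, y2 = bbox
    return [
        int(round(x1 / width * 1000)),
        int(round(y1 / height * 1000)),
        int(round(x2 / width * 1000)),
        int(round(y2 / height * 1000)),
    ]

# Example using the bbox values from the dataset sample above, assuming a 640x480 image.
print(norm_bbox_to_thousandth([90.9, 160.8, 135, 212.8], 640, 480))
```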

docs/source/Instruction/命令行参数.md

Lines changed: 2 additions & 0 deletions
@@ -715,6 +715,8 @@ App parameters inherit from [deployment parameters](#部署参数), [Web-UI parameters](#Web-UI参数)
 - FPS: default 2.0.
 - FPS_MIN_FRAMES: default 4. Minimum number of frames extracted from a video.
 - 🔥FPS_MAX_FRAMES: default 768. Maximum number of frames extracted from a video.
+- QWENVL_BBOX_FORMAT: whether the grounding format is 'legacy' or 'new'. The 'legacy' format is: `<|object_ref_start|>a dog<|object_ref_end|><|box_start|>(432,991),(1111,2077)<|box_end|>`. For the 'new' format, see the [Qwen3-VL cookbook](https://github.com/QwenLM/Qwen3-VL/blob/main/cookbooks/2d_grounding.ipynb) and the [grounding dataset format documentation](../Customization/自定义数据集.md#grounding). Default: 'legacy'.
+- Note: this environment variable applies to the Qwen2/2.5/3-VL and Qwen2.5/3-Omni series models.

 ### qwen2_audio
 - SAMPLING_RATE: default 16000.

docs/source_en/Customization/Custom-dataset.md

Lines changed: 7 additions & 3 deletions
@@ -216,8 +216,12 @@ For grounding (object detection) tasks, ms-swift supports two methods:
 When using this type of data, please note:

 - Different models have different special characters and data formats for the grounding task.
-- The handling of bounding box normalization varies across different models: for example, qwen2.5-vl uses absolute coordinates, while qwen2-vl and internvl2.5 require bounding box coordinates to be normalized to the thousandth scale.
+- The handling of bounding box normalization varies across different models: for example, qwen2.5-vl uses absolute coordinates, while qwen2/3-vl and internvl2.5 require bounding box coordinates to be normalized to the thousandth scale.
 - Note: Qwen2.5-VL uses absolute coordinates, so you need to be careful with image resizing each time. If you use the dataset format from Option 1, you need to resize the images in advance (height and width must be multiples of 28) and scale the coordinates accordingly. If you use the dataset format from Option 2, ms-swift will handle image resizing for you. You can still use `MAX_PIXELS` or `--max_pixels` for image resizing (training only; for inference, you still need to handle image resizing yourself).
+- For Qwen2.5-VL/Qwen3-VL, you can set the environment variable `QWENVL_BBOX_FORMAT='new'` (default is `'legacy'`) to be compatible with the [official cookbook](https://github.com/QwenLM/Qwen3-VL/blob/main/cookbooks/2d_grounding.ipynb) format. Define your dataset in the following format:
+```jsonl
+{"messages": [{"role": "user", "content": "<image>Locate the <ref-object> in the image"}, {"role": "assistant", "content": "[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]"}], "images": ["cat.png"], "objects": {"ref": ["sheep", "sheep", "sheep"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
+```

 2. Use ms-swift's grounding data format:

@@ -229,8 +233,8 @@ When using this type of data, please note:

 The format will automatically convert the dataset format to the corresponding model's grounding task format and select the appropriate model's bbox normalization method. Compared to the general format, this format includes an additional "objects" field, which contains the following subfields:

-- ref: Used to replace `<ref-object>`.
-- bbox: Used to replace `<bbox>`. If the length of each box in the bbox is 2, it represents the x and y coordinates. If the box length is 4, it represents the x and y coordinates of two points.
+- ref: Used to replace `<ref-object>`. The length of `ref` should match the number of `<ref-object>` instances.
+- bbox: Used to replace `<bbox>`. If the length of each box in the bbox is 2, it represents the x and y coordinates. If the box length is 4, it represents the x and y coordinates of two points. The length of `bbox` should match the number of `<bbox>` instances.
 - Note: `<ref-object>` and `<bbox>` do not have a corresponding relationship; references and bounding boxes replace their own placeholders separately.
 - bbox_type: Optional values are 'real' and 'norm1'. The default is 'real', meaning the bbox represents the actual bounding box value. If set to 'norm1', the bbox is normalized to the range 0~1.
 - image_id: This parameter is only effective when bbox_type is 'real'. It indicates the index of the image corresponding to the bbox, used for scaling the bbox. The index starts from 0, and the default is 0 for all.
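To make the placeholder mechanics concrete, the sketch below expands `<ref-object>`/`<bbox>` markers in order from the `objects` field, following the rule above that refs and bboxes each fill their own placeholders independently. The `expand_new_format` helper is hypothetical; ms-swift performs this substitution internally:

```python
import json


def expand_new_format(content, refs, bboxes):
    """Replace each <ref-object> and <bbox> placeholder in order with the
    corresponding entry from the objects field. The two placeholder kinds
    are independent: refs never pair with bboxes."""
    for ref in refs:
        content = content.replace('<ref-object>', ref, 1)
    for bbox in bboxes:
        content = content.replace('<bbox>', json.dumps(bbox), 1)
    return content


# Expand the assistant message from the dataset sample above.
assistant = ("[\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}"
             "\n\t{\"bbox_2d\": <bbox>, \"label\": \"<ref-object>\"}\n]")
print(expand_new_format(
    assistant,
    ['sheep', 'sheep'],
    [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]],
))
```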

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 2 additions & 1 deletion
@@ -735,7 +735,8 @@ These parameters have the same meaning as in `qwen_vl_utils<0.0.12` or the `qwen
 - FPS: Default is 2.0.
 - FPS_MIN_FRAMES: Default is 4. Minimum number of frames extracted from a video clip.
 - 🔥FPS_MAX_FRAMES: Default is 768. Maximum number of frames extracted from a video clip.
-
+- QWENVL_BBOX_FORMAT: Specifies whether to use the `'legacy'` or `'new'` format for grounding. The `'legacy'` format is: `<|object_ref_start|>a dog<|object_ref_end|><|box_start|>(432,991),(1111,2077)<|box_end|>`. For the `'new'` format, see the [Qwen3-VL Cookbook](https://github.com/QwenLM/Qwen3-VL/blob/main/cookbooks/2d_grounding.ipynb); for dataset formatting, see the [Grounding Dataset Format Documentation](../Customization/Custom-dataset.md#grounding). Default: `'legacy'`.
+- Note: This environment variable applies to Qwen2/2.5/3-VL and Qwen2.5/3-Omni series models.

 ### qwen2_audio
 - SAMPLING_RATE: Default is 16000.
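For example, switching a Qwen2.5-VL/Qwen3-VL run to the cookbook-style grounding format only requires exporting the variable before launching ms-swift (the launch command itself is omitted here):

```shell
# 'legacy' is the default used when the variable is unset;
# set 'new' to emit the cookbook-style JSON grounding format instead.
export QWENVL_BBOX_FORMAT=new
echo "grounding format: $QWENVL_BBOX_FORMAT"
```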

swift/llm/template/template/qwen.py

Lines changed: 9 additions & 2 deletions
@@ -264,6 +264,7 @@ class Qwen2VLTemplate(Template):
     def init_env_args(self):
         super().init_env_args()
         self.transformers_version = version.parse(transformers.__version__)
+        self.bbox_format = get_env_args('QWENVL_BBOX_FORMAT', str, 'legacy')

     def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index: int,
                     inputs: StdTemplateInputs) -> List[Context]:
@@ -296,10 +297,16 @@ def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index: int
         return tokens

     def replace_ref(self, ref: str, index: int, inputs: StdTemplateInputs) -> List[Context]:
-        return [f'<|object_ref_start|>{ref}<|object_ref_end|>']
+        if self.bbox_format == 'legacy':
+            return [f'<|object_ref_start|>{ref}<|object_ref_end|>']
+        else:
+            return [ref]

     def replace_bbox(self, bbox: List[int], index: int, inputs: StdTemplateInputs) -> List[Context]:
-        return [f'<|box_start|>{self._get_bbox_str(bbox)}<|box_end|>']
+        if self.bbox_format == 'legacy':
+            return [f'<|box_start|>{self._get_bbox_str(bbox)}<|box_end|>']
+        else:
+            return [str(bbox)]

     def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
         encoded = super()._encode(inputs)
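To see what the two branches above produce, here is a standalone mimic with no ms-swift imports; `_get_bbox_str` is approximated by the `(x1,y1),(x2,y2)` form shown in the parameter docs, which is an assumption about its exact output:

```python
from typing import List


# Standalone mimic of Qwen2VLTemplate.replace_ref / replace_bbox from the diff
# above; self.bbox_format is passed explicitly instead of read from the env.
def replace_ref(ref: str, bbox_format: str) -> List[str]:
    if bbox_format == 'legacy':
        return [f'<|object_ref_start|>{ref}<|object_ref_end|>']
    return [ref]


def replace_bbox(bbox: List[int], bbox_format: str) -> List[str]:
    if bbox_format == 'legacy':
        # Approximation of _get_bbox_str: "(x1,y1),(x2,y2)".
        x1, y1, x2, y2 = bbox
        return [f'<|box_start|>({x1},{y1}),({x2},{y2})<|box_end|>']
    return [str(bbox)]


print(replace_ref('a dog', 'legacy'))
print(replace_bbox([432, 991, 1111, 2077], 'legacy'))
print(replace_bbox([432, 991, 1111, 2077], 'new'))
```

The `'new'` branch emits the plain label and the plain Python-style list, matching the cookbook-format assistant message built in the dataset docs.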
