
Commit 247733b: fix grounding
Parent: a7d2158

4 files changed: +39, -7 lines


docs/source/Customization/自定义数据集.md

Lines changed: 18 additions & 2 deletions
````diff
@@ -163,7 +163,7 @@ alpaca format:
 
 #### grounding
 
-For grounding (object detection) tasks, SWIFT supports two methods:
+For grounding (object detection) tasks, ms-swift supports two methods:
 1. Directly use the data format of the grounding task corresponding to the model. For example, the format for qwen2-vl is as follows:
 
 ```jsonl
@@ -176,7 +176,7 @@ alpaca format:
 - Different models handle bbox normalization differently. For example: qwen2.5-vl uses absolute coordinates, while qwen2-vl and internvl2.5 require bbox coordinates to be normalized to the thousandth scale.
 - Note: Qwen2.5-VL uses absolute coordinates, so be careful with image resizing. If you use the dataset format from option 1, you need to resize the images in advance (H and W must be multiples of 28) and scale the coordinates accordingly. If you use the dataset format from option 2, ms-swift handles image resizing for you; you can still use `MAX_PIXELS` or `--max_pixels` for image scaling (training only; for inference, you still need to handle image resizing yourself).
 
-2. Use SWIFT's grounding data format
+2. Use ms-swift's grounding data format
 
 ```jsonl
 {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "<image>描述图像"}, {"role": "assistant", "content": "<ref-object><bbox>和<ref-object><bbox>正在沙滩上玩耍"}], "images": ["/xxx/x.jpg"], "objects": {"ref": ["一只狗", "一个女人"], "bbox": [[331.5, 761.4, 853.5, 1594.8], [676.5, 685.8, 1099.5, 1427.4]]}}
@@ -190,6 +190,22 @@ alpaca format:
 - bbox_type: either 'real' or 'norm1'. Defaults to 'real', i.e. the bbox holds real coordinate values. With 'norm1', the bbox has already been normalized to 0~1.
 - image_id: only takes effect when bbox_type is 'real'. It indicates which image the bbox belongs to and is used for scaling the bbox. Indexing starts at 0; by default all bboxes refer to image 0.
 
+Testing the final format of the grounding data in ms-swift format:
+```python
+import os
+os.environ["MAX_PIXELS"] = "1003520"
+from swift.llm import get_model_tokenizer, get_template
+
+_, tokenizer = get_model_tokenizer('Qwen/Qwen2.5-VL-7B-Instruct', load_model=False)
+template = get_template(tokenizer.model_meta.template, tokenizer)
+data = {...}
+template.set_mode('train')
+encoded = template.encode(data, return_template_inputs=True)
+print(f'[INPUT_IDS] {template.safe_decode(encoded["input_ids"])}\n')
+print(f'[LABELS] {template.safe_decode(encoded["labels"])}')
+print(f'images: {encoded["template_inputs"].images}')
+```
+
 ### Text-to-Image Format
 
 ```jsonl
````
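The test script added above leaves `data = {...}` elided. A minimal sketch of a dict that could be dropped in, reusing the ms-swift grounding jsonl sample from this same file (the image path is the placeholder from that sample, not a real file):

```python
# Hypothetical `data` dict for the test script above; all field values are
# copied from the ms-swift grounding jsonl example earlier in this file.
data = {
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': '<image>描述图像'},
        {'role': 'assistant', 'content': '<ref-object><bbox>和<ref-object><bbox>正在沙滩上玩耍'},
    ],
    # placeholder path from the doc's example; point this at a real image
    'images': ['/xxx/x.jpg'],
    'objects': {
        'ref': ['一只狗', '一个女人'],
        'bbox': [[331.5, 761.4, 853.5, 1594.8], [676.5, 685.8, 1099.5, 1427.4]],
    },
}
```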

docs/source_en/Customization/Custom-dataset.md

Lines changed: 18 additions & 2 deletions
````diff
@@ -172,7 +172,7 @@ The data format for RLHF and sequence classification of multimodal models can re
 
 #### Grounding
 
-For grounding (object detection) tasks, SWIFT supports two methods:
+For grounding (object detection) tasks, ms-swift supports two methods:
 
 1. Directly use the data format of the grounding task corresponding to the model. For example, the format for qwen2-vl is as follows:
 
@@ -188,7 +188,7 @@ When using this type of data, please note:
 - The handling of bounding box normalization varies across different models: for example, qwen2.5-vl uses absolute coordinates, while qwen2-vl and internvl2.5 require bounding box coordinates to be normalized to the thousandth scale.
 - Note: Qwen2.5-VL uses absolute coordinates, so you need to be careful with image resizing each time. If you use the dataset format from Option 1, you need to resize the images in advance (height and width must be multiples of 28) and scale the coordinates accordingly. If you use the dataset format from Option 2, ms-swift will handle image resizing for you. You can still use `MAX_PIXELS` or `--max_pixels` for image resizing (training only; for inference, you still need to handle image resizing yourself).
 
-1. Use SWIFT's grounding data format:
+2. Use ms-swift's grounding data format:
 
 ```
 {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "<image>Describe the image."}, {"role": "assistant", "content": "<ref-object><bbox> and <ref-object><bbox> are playing on the beach"}], "images": ["/xxx/x.jpg"], "objects": {"ref": ["a dog", "a woman"], "bbox": [[331.5, 761.4, 853.5, 1594.8], [676.5, 685.8, 1099.5, 1427.4]]}}
@@ -204,6 +204,22 @@ The format will automatically convert the dataset format to the corresponding mo
 - bbox_type: Optional values are 'real' and 'norm1'. The default is 'real', meaning the bbox represents the actual bounding box value. If set to 'norm1', the bbox is normalized to the range 0~1.
 - image_id: This parameter is only effective when bbox_type is 'real'. It indicates the index of the image corresponding to the bbox, used for scaling the bbox. The index starts from 0, and the default is 0 for all.
 
+Testing the final format of the grounding data in ms-swift format:
+```python
+import os
+os.environ["MAX_PIXELS"] = "1003520"
+from swift.llm import get_model_tokenizer, get_template
+
+_, tokenizer = get_model_tokenizer('Qwen/Qwen2.5-VL-7B-Instruct', load_model=False)
+template = get_template(tokenizer.model_meta.template, tokenizer)
+data = {...}
+template.set_mode('train')
+encoded = template.encode(data, return_template_inputs=True)
+print(f'[INPUT_IDS] {template.safe_decode(encoded["input_ids"])}\n')
+print(f'[LABELS] {template.safe_decode(encoded["labels"])}')
+print(f'images: {encoded["template_inputs"].images}')
+```
+
 ### Text-to-Image Format
 
 ```jsonl
````
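The bbox_type and image_id parameters described in this hunk have no sample record in the diff. A hedged sketch of a 'norm1' sample, assuming bbox_type sits inside `objects` alongside ref and bbox (the normalized coordinate values are illustrative, not from the source):

```python
# Hypothetical 'norm1' sample: bbox values already normalized to 0~1.
# Assumes bbox_type is a key inside `objects` next to ref/bbox; image_id is
# omitted because, per the doc, it only applies when bbox_type is 'real'.
data_norm1 = {
    'messages': [
        {'role': 'user', 'content': '<image>Describe the image.'},
        {'role': 'assistant', 'content': '<ref-object><bbox> is playing on the beach'},
    ],
    'images': ['/xxx/x.jpg'],  # placeholder path from the doc's example
    'objects': {
        'ref': ['a dog'],
        'bbox': [[0.22, 0.42, 0.57, 0.89]],  # illustrative 0~1 values
        'bbox_type': 'norm1',
    },
}
```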

requirements/framework.txt

Lines changed: 1 addition & 1 deletion
```diff
@@ -19,7 +19,7 @@ numpy
 openai
 oss2
 pandas
-peft>=0.11,<0.17
+peft>=0.11,<0.18
 pillow
 PyYAML>=5.4
 requests
```

swift/llm/template/grounding.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -5,6 +5,7 @@
 from typing import Any, List, Literal
 
 import requests
+from modelscope.hub.file_download import model_file_download
 from modelscope.hub.utils.utils import get_cache_dir
 from PIL import Image, ImageDraw, ImageFont
 
@@ -62,7 +63,6 @@ def draw_bbox(image: Image.Image,
               bbox: List[List[int]],
               norm_bbox: Literal['norm1000', 'none'] = 'norm1000'):
     bbox = deepcopy(bbox)
-    font_path = 'https://modelscope.cn/models/Qwen/Qwen-VL-Chat/resolve/master/SimSun.ttf'
     # norm bbox
     for i, box in enumerate(bbox):
         for i in range(len(box)):
@@ -82,7 +82,7 @@ def draw_bbox(image: Image.Image,
         color = color_mapping[box_ref]
         draw.rectangle([(left, top), (right, bottom)], outline=color, width=3)
     # draw text
-    file_path = download_file(font_path)
+    file_path = model_file_download('Qwen/Qwen-VL-Chat', 'SimSun.ttf')
     font = ImageFont.truetype(file_path, 20)
     for (left, top, _, _), box_ref in zip(bbox, ref):
         brightness = _calculate_brightness(
```
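This change swaps a raw URL download for a ModelScope hub download. A minimal standalone sketch of the new call path (model_file_download is the real modelscope API imported in the first hunk; it returns the local path of the fetched file):

```python
from modelscope.hub.file_download import model_file_download
from PIL import ImageFont

# Fetch SimSun.ttf from the Qwen/Qwen-VL-Chat model repo on ModelScope; the
# file is cached locally, so later calls avoid re-downloading the font.
file_path = model_file_download('Qwen/Qwen-VL-Chat', 'SimSun.ttf')
font = ImageFont.truetype(file_path, 20)  # same 20pt size draw_bbox uses
print(file_path)
```

Presumably the motivation is to route the font fetch through ModelScope's hub client and its cache, rather than hand-rolling an HTTP download of the same repo file.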
