
Commit 247733b: fix grounding
Parent: a7d2158

4 files changed: +39, -7 lines


docs/source/Customization/自定义数据集.md

Lines changed: 18 additions & 2 deletions
````diff
@@ -163,7 +163,7 @@ alpaca format:
 
 #### grounding
 
-For grounding (object detection) tasks, SWIFT supports two methods:
+For grounding (object detection) tasks, ms-swift supports two methods:
 1. Directly use the data format of the grounding task corresponding to the model. For example, the format for qwen2-vl is as follows:
 
 ```jsonl
@@ -176,7 +176,7 @@ alpaca format:
 - Different models handle bbox normalization differently. For example: qwen2.5-vl uses absolute coordinates, while qwen2-vl and internvl2.5 require bbox coordinates to be normalized to the thousandth scale.
 - Note: Qwen2.5-VL uses absolute coordinates, so be careful with image resizing. If you use the dataset format from option 1, you need to resize the images in advance (H and W must be multiples of 28) and scale the coordinates accordingly. If you use the dataset format from option 2, ms-swift handles image resizing for you; you can still use `MAX_PIXELS` or `--max_pixels` for image scaling (training only; for inference, you still need to handle image resizing yourself).
 
-2. Use SWIFT's grounding data format
+2. Use ms-swift's grounding data format
 
 ```jsonl
 {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "<image>描述图像"}, {"role": "assistant", "content": "<ref-object><bbox>和<ref-object><bbox>正在沙滩上玩耍"}], "images": ["/xxx/x.jpg"], "objects": {"ref": ["一只狗", "一个女人"], "bbox": [[331.5, 761.4, 853.5, 1594.8], [676.5, 685.8, 1099.5, 1427.4]]}}
@@ -190,6 +190,22 @@ alpaca format:
 - bbox_type: either 'real' or 'norm1'. Defaults to 'real', i.e. the bbox holds real coordinate values. With 'norm1', the bbox has already been normalized to 0~1.
 - image_id: only takes effect when bbox_type is 'real'. It indicates which image the bbox belongs to and is used for scaling the bbox. Indexing starts at 0; by default all bboxes refer to image 0.
 
+Testing the final format of the grounding data in ms-swift format:
+```python
+import os
+os.environ["MAX_PIXELS"] = "1003520"
+from swift.llm import get_model_tokenizer, get_template
+
+_, tokenizer = get_model_tokenizer('Qwen/Qwen2.5-VL-7B-Instruct', load_model=False)
+template = get_template(tokenizer.model_meta.template, tokenizer)
+data = {...}
+template.set_mode('train')
+encoded = template.encode(data, return_template_inputs=True)
+print(f'[INPUT_IDS] {template.safe_decode(encoded["input_ids"])}\n')
+print(f'[LABELS] {template.safe_decode(encoded["labels"])}')
+print(f'images: {encoded["template_inputs"].images}')
+```
+
 ### Text-to-Image Format
 
 ```jsonl
````
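The test script added above leaves `data = {...}` elided. A minimal sketch of a dict that could be dropped in, reusing the ms-swift grounding jsonl sample from this same file (the image path is the placeholder from that sample, not a real file):

```python
# Hypothetical `data` dict for the test script above; all field values are
# copied from the ms-swift grounding jsonl example earlier in this file.
data = {
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': '<image>描述图像'},
        {'role': 'assistant', 'content': '<ref-object><bbox>和<ref-object><bbox>正在沙滩上玩耍'},
    ],
    # placeholder path from the doc's example; point this at a real image
    'images': ['/xxx/x.jpg'],
    'objects': {
        'ref': ['一只狗', '一个女人'],
        'bbox': [[331.5, 761.4, 853.5, 1594.8], [676.5, 685.8, 1099.5, 1427.4]],
    },
}
```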

docs/source_en/Customization/Custom-dataset.md

Lines changed: 18 additions & 2 deletions
````diff
@@ -172,7 +172,7 @@ The data format for RLHF and sequence classification of multimodal models can re
 
 #### Grounding
 
-For grounding (object detection) tasks, SWIFT supports two methods:
+For grounding (object detection) tasks, ms-swift supports two methods:
 
 1. Directly use the data format of the grounding task corresponding to the model. For example, the format for qwen2-vl is as follows:
 
@@ -188,7 +188,7 @@ When using this type of data, please note:
 - The handling of bounding box normalization varies across different models: for example, qwen2.5-vl uses absolute coordinates, while qwen2-vl and internvl2.5 require bounding box coordinates to be normalized to the thousandth scale.
 - Note: Qwen2.5-VL uses absolute coordinates, so you need to be careful with image resizing each time. If you use the dataset format from Option 1, you need to resize the images in advance (height and width must be multiples of 28) and scale the coordinates accordingly. If you use the dataset format from Option 2, ms-swift will handle image resizing for you. You can still use `MAX_PIXELS` or `--max_pixels` for image resizing (training only; for inference, you still need to handle image resizing yourself).
 
-1. Use SWIFT's grounding data format:
+2. Use ms-swift's grounding data format:
 
 ```
 {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "<image>Describe the image."}, {"role": "assistant", "content": "<ref-object><bbox> and <ref-object><bbox> are playing on the beach"}], "images": ["/xxx/x.jpg"], "objects": {"ref": ["a dog", "a woman"], "bbox": [[331.5, 761.4, 853.5, 1594.8], [676.5, 685.8, 1099.5, 1427.4]]}}
@@ -204,6 +204,22 @@ The format will automatically convert the dataset format to the corresponding mo
 - bbox_type: Optional values are 'real' and 'norm1'. The default is 'real', meaning the bbox represents the actual bounding box value. If set to 'norm1', the bbox is normalized to the range 0~1.
 - image_id: This parameter is only effective when bbox_type is 'real'. It indicates the index of the image corresponding to the bbox, used for scaling the bbox. The index starts from 0, and the default is 0 for all.
 
+Testing the final format of the grounding data in ms-swift format:
+```python
+import os
+os.environ["MAX_PIXELS"] = "1003520"
+from swift.llm import get_model_tokenizer, get_template
+
+_, tokenizer = get_model_tokenizer('Qwen/Qwen2.5-VL-7B-Instruct', load_model=False)
+template = get_template(tokenizer.model_meta.template, tokenizer)
+data = {...}
+template.set_mode('train')
+encoded = template.encode(data, return_template_inputs=True)
+print(f'[INPUT_IDS] {template.safe_decode(encoded["input_ids"])}\n')
+print(f'[LABELS] {template.safe_decode(encoded["labels"])}')
+print(f'images: {encoded["template_inputs"].images}')
+```
+
 ### Text-to-Image Format
 
 ```jsonl
````
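The bbox_type and image_id parameters described in this hunk have no sample record in the diff. A hedged sketch of a 'norm1' sample, assuming bbox_type sits inside `objects` alongside ref and bbox (the normalized coordinate values are illustrative, not from the source):

```python
# Hypothetical 'norm1' sample: bbox values already normalized to 0~1.
# Assumes bbox_type is a key inside `objects` next to ref/bbox; image_id is
# omitted because, per the doc, it only applies when bbox_type is 'real'.
data_norm1 = {
    'messages': [
        {'role': 'user', 'content': '<image>Describe the image.'},
        {'role': 'assistant', 'content': '<ref-object><bbox> is playing on the beach'},
    ],
    'images': ['/xxx/x.jpg'],  # placeholder path from the doc's example
    'objects': {
        'ref': ['a dog'],
        'bbox': [[0.22, 0.42, 0.57, 0.89]],  # illustrative 0~1 values
        'bbox_type': 'norm1',
    },
}
```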

requirements/framework.txt

Lines changed: 1 addition & 1 deletion
```diff
@@ -19,7 +19,7 @@ numpy
 openai
 oss2
 pandas
-peft>=0.11,<0.17
+peft>=0.11,<0.18
 pillow
 PyYAML>=5.4
 requests
```

swift/llm/template/grounding.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -5,6 +5,7 @@
 from typing import Any, List, Literal
 
 import requests
+from modelscope.hub.file_download import model_file_download
 from modelscope.hub.utils.utils import get_cache_dir
 from PIL import Image, ImageDraw, ImageFont
 
@@ -62,7 +63,6 @@ def draw_bbox(image: Image.Image,
               bbox: List[List[int]],
               norm_bbox: Literal['norm1000', 'none'] = 'norm1000'):
     bbox = deepcopy(bbox)
-    font_path = 'https://modelscope.cn/models/Qwen/Qwen-VL-Chat/resolve/master/SimSun.ttf'
     # norm bbox
     for i, box in enumerate(bbox):
         for i in range(len(box)):
@@ -82,7 +82,7 @@ def draw_bbox(image: Image.Image,
         color = color_mapping[box_ref]
         draw.rectangle([(left, top), (right, bottom)], outline=color, width=3)
     # draw text
-    file_path = download_file(font_path)
+    file_path = model_file_download('Qwen/Qwen-VL-Chat', 'SimSun.ttf')
     font = ImageFont.truetype(file_path, 20)
     for (left, top, _, _), box_ref in zip(bbox, ref):
         brightness = _calculate_brightness(
```
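This change swaps a raw URL download for a ModelScope hub download. A minimal standalone sketch of the new call path (model_file_download is the real modelscope API imported in the first hunk; it returns the local path of the fetched file):

```python
from modelscope.hub.file_download import model_file_download
from PIL import ImageFont

# Fetch SimSun.ttf from the Qwen/Qwen-VL-Chat model repo on ModelScope; the
# file is cached locally, so later calls avoid re-downloading the font.
file_path = model_file_download('Qwen/Qwen-VL-Chat', 'SimSun.ttf')
font = ImageFont.truetype(file_path, 20)  # same 20pt size draw_bbox uses
print(file_path)
```

Presumably the motivation is to route the font fetch through ModelScope's hub client and its cache, rather than hand-rolling an HTTP download of the same repo file.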
