docs/source_en/Customization/Custom-dataset.md (18 additions & 2 deletions)
@@ -172,7 +172,7 @@ The data format for RLHF and sequence classification of multimodal models can re
#### Grounding
- For grounding (object detection) tasks, SWIFT supports two methods:
+ For grounding (object detection) tasks, ms-swift supports two methods:
1. Directly use the data format of the grounding task corresponding to the model. For example, the format for qwen2-vl is as follows:
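The concrete qwen2-vl sample sits in unchanged lines that this diff does not show. Purely as a hedged illustration of Option 1, assuming Qwen2-VL's published grounding-token convention (`<|object_ref_start|>`/`<|object_ref_end|>` around the referred phrase and `<|box_start|>(x1,y1),(x2,y2)<|box_end|>` with coordinates normalized to 0~1000), such a line might look like:

```
{"messages": [{"role": "user", "content": "<image>Find the dog."}, {"role": "assistant", "content": "<|object_ref_start|>a dog<|object_ref_end|><|box_start|>(331,381),(853,797)<|box_end|>"}], "images": ["/xxx/x.jpg"]}
```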
@@ -188,7 +188,7 @@ When using this type of data, please note:
- The handling of bounding box normalization varies across different models: for example, qwen2.5-vl uses absolute coordinates, while qwen2-vl and internvl2.5 require bounding box coordinates to be normalized to the thousandth scale.
- Note: Qwen2.5-VL uses absolute coordinates, so you need to be careful with image resizing each time. If you use the dataset format from Option 1, you need to resize the images in advance (height and width must be multiples of 28) and scale the coordinates accordingly. If you use the dataset format from Option 2, ms-swift will handle image resizing for you. You can still use `MAX_PIXELS` or `--max_pixels` for image resizing (training only; for inference, you still need to handle image resizing yourself).
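For the Option 1 case described in the note above, a minimal sketch of that pre-resizing step (round height and width to multiples of 28, then scale the absolute box coordinates by the same factors). The helper below is illustrative, not an ms-swift API:

```python
from PIL import Image


def resize_for_absolute_coords(path, bboxes, max_pixels=1003520):
    """Resize an image so height/width are multiples of 28 and scale absolute xyxy bboxes to match."""
    img = Image.open(path)
    w, h = img.size
    # Optionally cap the total pixel count (mirrors MAX_PIXELS), then snap to multiples of 28.
    scale = min(1.0, (max_pixels / (w * h)) ** 0.5)
    new_w = max(28, round(w * scale / 28) * 28)
    new_h = max(28, round(h * scale / 28) * 28)
    sx, sy = new_w / w, new_h / h
    resized = img.resize((new_w, new_h))
    scaled = [[x1 * sx, y1 * sy, x2 * sx, y2 * sy] for x1, y1, x2, y2 in bboxes]
    return resized, scaled
```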
- 1. Use SWIFT's grounding data format:
+ 2. Use ms-swift's grounding data format:
```
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "<image>Describe the image."}, {"role": "assistant", "content": "<ref-object><bbox> and <ref-object><bbox> are playing on the beach"}], "images": ["/xxx/x.jpg"], "objects": {"ref": ["a dog", "a woman"], "bbox": [[331.5, 761.4, 853.5, 1594.8], [676.5, 685.8, 1099.5, 1427.4]]}}
@@ -204,6 +204,22 @@ The format will automatically convert the dataset format to the corresponding mo
- bbox_type: Optional values are 'real' and 'norm1'. The default is 'real', meaning the bbox represents the actual bounding box value. If set to 'norm1', the bbox is normalized to the range 0~1.
- image_id: This parameter is only effective when bbox_type is 'real'. It indicates the index of the image corresponding to the bbox, used for scaling the bbox. The index starts from 0, and the default is 0 for all.
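As an illustration of these two options, a hypothetical sample with boxes normalized to 0~1 could look like the line below; it assumes `bbox_type` is supplied as an extra key inside the `objects` field alongside `ref`/`bbox`, mirroring the example above:

```
{"messages": [{"role": "user", "content": "<image>Find the dogs."}, {"role": "assistant", "content": "<ref-object><bbox> and <ref-object><bbox>"}], "images": ["/xxx/x.jpg"], "objects": {"ref": ["a dog", "a dog"], "bbox": [[0.16, 0.30, 0.42, 0.63], [0.55, 0.27, 0.83, 0.56]], "bbox_type": "norm1"}}
```

With `bbox_type` left at its default of 'real' and more than one image, `image_id` would map each box to its image, e.g. `"image_id": [0, 1]`.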
+ To test the final format of grounding data given in the ms-swift format:

```python
import os
os.environ["MAX_PIXELS"] = "1003520"
from swift.llm import get_model_tokenizer, get_template
```
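The diff is cut off at this point. Purely as a hedged sketch of how such a check might continue (not the file's actual content): the model id, `model.model_meta.template`, `template.encode`, and the decode step are assumptions drawn from other ms-swift examples and may differ between versions.

```python
# Hypothetical continuation; names and signatures below are assumptions, not the file's own code.
model, tokenizer = get_model_tokenizer('Qwen/Qwen2.5-VL-7B-Instruct')
template = get_template(model.model_meta.template, tokenizer)

sample = {
    'messages': [{'role': 'user', 'content': '<image>Describe the image.'},
                 {'role': 'assistant', 'content': '<ref-object><bbox> is playing on the beach'}],
    'images': ['/xxx/x.jpg'],  # replace with a real local image before running
    'objects': {'ref': ['a dog'], 'bbox': [[331.5, 761.4, 853.5, 1594.8]]},
}
encoded = template.encode(sample)
# Decoding shows how <ref-object>/<bbox> were rewritten into the model's own grounding tokens.
print(tokenizer.decode(encoded['input_ids']))
```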