
Commit 2fed8f7

fix pixtral-12b (#5007)
1 parent 71f3ba4 commit 2fed8f7

File tree

3 files changed: +3, -1 lines changed


docs/source/Customization/自定义数据集.md

Lines changed: 1 addition & 0 deletions
@@ -166,6 +166,7 @@ alpaca format:
 When using this type of data, please note:
 - Different models use different special characters and dataset formats for the grounding task.
 - Models differ in whether bbox coordinates are normalized. For example, qwen2.5-vl uses absolute coordinates, while qwen2-vl and internvl2.5 require bbox coordinates to be normalized to the thousandth scale.
+- Note: Qwen2.5-VL uses absolute coordinates, so be careful whenever the image is rescaled. If you use the Option 1 dataset format, resize the image in advance (H and W must be multiples of 28) and scale the coordinate points to that size. If you use the Option 2 dataset format, ms-swift handles the image rescaling for you, and you can still use `MAX_PIXELS` or `--max_pixels` to rescale the image (training only; for inference you still need to handle image rescaling yourself).
 
 2. Use SWIFT's grounding data format:

docs/source_en/Customization/Custom-dataset.md

Lines changed: 1 addition & 0 deletions
@@ -179,6 +179,7 @@ When using this type of data, please note:
 
 - Different models have different special characters and data format for the grounding task.
 - The handling of bounding box normalization varies across different models: for example, qwen2.5-vl uses absolute coordinates, while qwen2-vl and internvl2.5 require bounding box coordinates to be normalized to the thousandth scale.
+- Note: Qwen2.5-VL uses absolute coordinates, so you need to be careful with image resizing each time. If you use the dataset format from Option 1, you need to resize the images in advance (height and width must be multiples of 28) and scale the coordinates accordingly. If you use the dataset format from Option 2, ms-swift will handle image resizing for you. You can still use `MAX_PIXELS` or `--max_pixels` for image resizing (training only; for inference, you still need to handle image resizing yourself).
 
 1. Use SWIFT's grounding data format:
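
The resizing rule above is easy to get wrong, so here is a minimal Python sketch of the Option 1 workflow: pre-resize an image so its height and width are multiples of 28 and rescale the absolute bbox coordinates to match, plus the thousandth-scale normalization mentioned for qwen2-vl/internvl2.5. The helper names and the rounding strategy are illustrative assumptions, not ms-swift APIs.

```python
# Minimal sketch (not part of ms-swift): prepare grounding data for Qwen2.5-VL
# with the Option 1 dataset format by resizing the image so H and W are
# multiples of 28 and scaling absolute bbox coordinates to the new size.
# Helper names and the rounding strategy are illustrative assumptions.
from PIL import Image


def resize_for_qwen2_5_vl(image, bboxes, factor=28):
    """Resize `image` so both sides are multiples of `factor`; rescale bboxes.

    `bboxes` holds absolute [x1, y1, x2, y2] coordinates in the original image.
    """
    w, h = image.size
    new_w = max(factor, round(w / factor) * factor)
    new_h = max(factor, round(h / factor) * factor)
    resized = image.resize((new_w, new_h))
    sx, sy = new_w / w, new_h / h
    scaled = [[x1 * sx, y1 * sy, x2 * sx, y2 * sy] for x1, y1, x2, y2 in bboxes]
    return resized, scaled


def normalize_bbox_thousandth(bbox, width, height):
    """Normalize an absolute bbox to the 0-1000 scale (qwen2-vl, internvl2.5)."""
    x1, y1, x2, y2 = bbox
    return [round(x1 / width * 1000), round(y1 / height * 1000),
            round(x2 / width * 1000), round(y2 / height * 1000)]


if __name__ == '__main__':
    img, boxes = resize_for_qwen2_5_vl(Image.new('RGB', (640, 480)), [[10, 20, 110, 220]])
    print(img.size, boxes)
    print(normalize_bbox_thousandth([10, 20, 110, 220], 640, 480))
```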

swift/llm/template/template/pixtral.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
         idx_list = findall(input_ids, 10)
         if idx_list:
             image_inputs = processor.image_processor(images, patch_size=processor.patch_size, return_tensors='pt')
-            encoded['pixel_values'] = image_inputs['pixel_values']
+            encoded['pixel_values'] = image_inputs['pixel_values'].to(dtype=self.model_info.torch_dtype)
             encoded['image_sizes'] = image_sizes = image_inputs['image_sizes']
 
             def _get_new_tokens(i):
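
For context on the pixtral.py change: Hugging Face image processors typically return float32 pixel values, so casting them to the model's dtype avoids a dtype mismatch when the Pixtral model runs in bf16/fp16. Below is a minimal, hedged sketch of the same cast outside the template; the dtype value and tensor shape are assumptions for illustration.

```python
# Minimal sketch: align image tensors with the model's compute dtype so a
# bf16/fp16 Pixtral model does not receive float32 pixel values.
import torch

torch_dtype = torch.bfloat16               # assumed model dtype, for illustration
pixel_values = torch.rand(1, 3, 448, 448)  # image processors usually return float32
pixel_values = pixel_values.to(dtype=torch_dtype)
print(pixel_values.dtype)                  # torch.bfloat16
```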
