
Commit 2fed8f7

fix pixtral-12b (#5007)
1 parent 71f3ba4 commit 2fed8f7

File tree

3 files changed: +3, -1 lines changed


docs/source/Customization/自定义数据集.md

Lines changed: 1 addition & 0 deletions
@@ -166,6 +166,7 @@ alpaca format:
 When using this type of data, please note:
 - Different models use different special characters and dataset formats for the grounding task.
 - Models differ in whether bbox coordinates are normalized. For example, qwen2.5-vl uses absolute coordinates, while qwen2-vl and internvl2.5 require bbox coordinates to be normalized to the thousandth scale.
+- Note: Qwen2.5-VL uses absolute coordinates, so be careful whenever the image is rescaled. If you use the Option 1 dataset format, resize the image in advance (H and W must be multiples of 28) and scale the coordinate points to that size. If you use the Option 2 dataset format, ms-swift handles the image rescaling for you, and you can still use `MAX_PIXELS` or `--max_pixels` to rescale the image (training only; for inference you still need to handle image rescaling yourself).
 
 2. Use SWIFT's grounding data format:

docs/source_en/Customization/Custom-dataset.md

Lines changed: 1 addition & 0 deletions
@@ -179,6 +179,7 @@ When using this type of data, please note:
 
 - Different models have different special characters and data format for the grounding task.
 - The handling of bounding box normalization varies across different models: for example, qwen2.5-vl uses absolute coordinates, while qwen2-vl and internvl2.5 require bounding box coordinates to be normalized to the thousandth scale.
+- Note: Qwen2.5-VL uses absolute coordinates, so you need to be careful with image resizing each time. If you use the dataset format from Option 1, you need to resize the images in advance (height and width must be multiples of 28) and scale the coordinates accordingly. If you use the dataset format from Option 2, ms-swift will handle image resizing for you. You can still use `MAX_PIXELS` or `--max_pixels` for image resizing (training only; for inference, you still need to handle image resizing yourself).
 
 1. Use SWIFT's grounding data format:
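
The resizing rule above is easy to get wrong, so here is a minimal Python sketch of the Option 1 workflow: pre-resize an image so its height and width are multiples of 28 and rescale the absolute bbox coordinates to match, plus the thousandth-scale normalization mentioned for qwen2-vl/internvl2.5. The helper names and the rounding strategy are illustrative assumptions, not ms-swift APIs.

```python
# Minimal sketch (not part of ms-swift): prepare grounding data for Qwen2.5-VL
# with the Option 1 dataset format by resizing the image so H and W are
# multiples of 28 and scaling absolute bbox coordinates to the new size.
# Helper names and the rounding strategy are illustrative assumptions.
from PIL import Image


def resize_for_qwen2_5_vl(image, bboxes, factor=28):
    """Resize `image` so both sides are multiples of `factor`; rescale bboxes.

    `bboxes` holds absolute [x1, y1, x2, y2] coordinates in the original image.
    """
    w, h = image.size
    new_w = max(factor, round(w / factor) * factor)
    new_h = max(factor, round(h / factor) * factor)
    resized = image.resize((new_w, new_h))
    sx, sy = new_w / w, new_h / h
    scaled = [[x1 * sx, y1 * sy, x2 * sx, y2 * sy] for x1, y1, x2, y2 in bboxes]
    return resized, scaled


def normalize_bbox_thousandth(bbox, width, height):
    """Normalize an absolute bbox to the 0-1000 scale (qwen2-vl, internvl2.5)."""
    x1, y1, x2, y2 = bbox
    return [round(x1 / width * 1000), round(y1 / height * 1000),
            round(x2 / width * 1000), round(y2 / height * 1000)]


if __name__ == '__main__':
    img, boxes = resize_for_qwen2_5_vl(Image.new('RGB', (640, 480)), [[10, 20, 110, 220]])
    print(img.size, boxes)
    print(normalize_bbox_thousandth([10, 20, 110, 220], 640, 480))
```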

swift/llm/template/template/pixtral.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
         idx_list = findall(input_ids, 10)
         if idx_list:
             image_inputs = processor.image_processor(images, patch_size=processor.patch_size, return_tensors='pt')
-            encoded['pixel_values'] = image_inputs['pixel_values']
+            encoded['pixel_values'] = image_inputs['pixel_values'].to(dtype=self.model_info.torch_dtype)
             encoded['image_sizes'] = image_sizes = image_inputs['image_sizes']
 
             def _get_new_tokens(i):
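
For context on the pixtral.py change: Hugging Face image processors typically return float32 pixel values, so casting them to the model's dtype avoids a dtype mismatch when the Pixtral model runs in bf16/fp16. Below is a minimal, hedged sketch of the same cast outside the template; the dtype value and tensor shape are assumptions for illustration.

```python
# Minimal sketch: align image tensors with the model's compute dtype so a
# bf16/fp16 Pixtral model does not receive float32 pixel values.
import torch

torch_dtype = torch.bfloat16               # assumed model dtype, for illustration
pixel_values = torch.rand(1, 3, 448, 448)  # image processors usually return float32
pixel_values = pixel_values.to(dtype=torch_dtype)
print(pixel_values.dtype)                  # torch.bfloat16
```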
