
qwen3-vl DPO training: loss drops to 0, but the trained model's inference output is strange #9757

@Xinyu-lab

Description


Reminder

  • I have read the above rules and searched the existing issues.

System Info

I trained for 3 epochs; toward the end of training the loss drops all the way to 0, yet SFT fine-tuning on the same data works fine.
The training config file is as follows:

```yaml
### model
model_name_or_path:
image_max_pixels: 1894400  # ours: 18503232
video_max_pixels: 16384
trust_remote_code: true

### method
stage: dpo
do_train: true
do_predict: false
freeze_vision_tower: false
freeze_multi_modal_projector: false
finetuning_type: lora  # lora, freeze, full
lora_rank: 32
lora_alpha: 64
lora_target: all
enable_liger_kernel: true

### dataset
dataset: good_with_super_severe_dpo
template: qwen3_vl
cutoff_len: 10000
max_samples: 10000000
overwrite_cache: true
preprocessing_num_workers: 64
dataloader_num_workers: 16

### output
output_dir:
logging_steps: 10  # 10
save_steps: 2000  # 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 16
num_train_epochs: 3.0
lr_scheduler_type: cosine
learning_rate: 1.0e-4
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
```
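For context on why the loss can reach exactly 0: standard sigmoid DPO minimizes -log σ(β · margin), where the margin is the policy's log-probability advantage of the chosen over the rejected response relative to the reference model. Once that margin grows large (easy with `learning_rate: 1.0e-4` on a LoRA over all modules), the loss saturates at numerically 0 even while generation quality degrades. A minimal sketch with hypothetical log-prob values (assuming the common default β = 0.1; not LLaMA-Factory's actual implementation):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Sigmoid DPO loss: -log sigmoid(beta * margin)."""
    # Margin: how much more the policy prefers chosen over rejected,
    # measured relative to the frozen reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) \
           - (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# A modest margin gives a healthy, nonzero loss:
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0), 4))  # ≈ 0.5981

# A huge margin (the policy has pushed rejected responses to near-zero
# probability, typical of an over-aggressive learning rate) drives the
# loss to ~0 while the model may no longer generate well:
print(round(dpo_loss(-5.0, -60.0, -11.0, -11.0), 4))   # ≈ 0.0041
```

So a loss of 0 here signals margin saturation (overfitting the preference pairs), not convergence to a good generator; lowering the learning rate or β is the usual first remedy.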

A sample of the training data:
```json
"conversations": [
  {
    "from": "human",
    "value": "text"
  }
],
"chosen": {
  "from": "gpt",
  "value": "{\"passed\": \"是\", \"level\": 0, \"class\": 5}"
},
"rejected": {
  "from": "gpt",
  "value": "{\"passed\": \"否\", \"level\": 3, \"class\": 5}"
},
"images": [
  "path"
]
```

But inference after training produces results like the following, where `pred` is the model's output; the text is not even fully generated:

```
"pred": "\n\n\n\n{\"passed\": \"是\", 0",
"label": "\n\n\n\n{\"passed\": \"是\", \"level\": 0, \"class\": 4}\n"
```

Does anyone know how to fix this?

Reproduction

Put your message here.

Others

No response

Metadata

Labels: duplicate (this issue or pull request already exists)