How can I guarantee that the two models (the policy and the reference model) are identical? When I train a custom LLM with DPO, the loss does not converge. Could the reason be that the two models are different?
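
One way to start the two models from identical weights is to load both from the same checkpoint and freeze the reference so that only the policy is updated. A minimal sketch, assuming Hugging Face transformers; the checkpoint path is a hypothetical placeholder:

```python
import torch
from transformers import AutoModelForCausalLM

CHECKPOINT = "my-sft-checkpoint"  # hypothetical path to the SFT model

# Load the policy and the reference from the same checkpoint so they start identical.
policy_model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)
ref_model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)

# Freeze the reference model; DPO only trains the policy.
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad_(False)

# Sanity check: confirm the two models begin with identical weights.
same = all(
    torch.equal(p1, p2)
    for p1, p2 in zip(policy_model.state_dict().values(),
                      ref_model.state_dict().values())
)
print("weights identical at init:", same)
```

If the two start identical, the implicit reward margin is zero at step 0, so the DPO loss should begin near log 2 ≈ 0.693; a very different starting value may indicate the two models do not match.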