Thanks for releasing InternVL and the MPO-related work!
I applied MPO to our model, and math-reasoning benchmarks such as MathVista and MathVision improved. However, in our experiments the model starts to produce repetitive outputs. The paper mentions that this behavior is expected when applying DPO; is a similar degradation also expected with MPO, and if so, how can it be mitigated? As context for the question, I describe a decoding-time workaround I've been considering below.
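As a stopgap, I've been looking at suppressing repetition at decoding time rather than retraining. Here is a minimal sketch of what I have in mind, assuming a standard Hugging Face-style setup (the checkpoint path and question are placeholders for our own model, and I'm assuming the generation-config fields are forwarded to `generate` by `model.chat`):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder path; substitute the MPO-trained checkpoint.
path = "OpenGVLab/InternVL2-8B"
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Repetition-suppressing decoding parameters (standard Hugging Face
# GenerationConfig fields; assumed to be passed through to `generate`).
generation_config = dict(
    max_new_tokens=1024,
    do_sample=False,
    repetition_penalty=1.1,    # >1.0 penalizes already-generated tokens
    no_repeat_ngram_size=10,   # hard-blocks exact long n-gram repeats
)

# Text-only query for illustration; pixel_values=None skips the image.
question = "What is the value of x if 2x + 3 = 11?"
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```

This only masks the symptom at inference time, though, so I'd still like to know whether there is a recommended training-side mitigation.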
Thank you.