Inconsistent results between two evaluation runs when using a reasoning model for classification

When I used the DeepSeek-R1 reasoning model for classification evaluation, the results were inconsistent from one run to the next. The DeepSeek-Chat model did not have this problem. What is the reason for this?
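
To make the inconsistency concrete, the two runs can be compared sample by sample. The following is only a minimal sketch: the helper names `agreement_rate` and `disagreements` and the sample labels are hypothetical, not taken from the actual evaluation, and it assumes both runs classified the same inputs in the same order.

```python
def agreement_rate(run_a, run_b):
    """Return the fraction of samples that received the same label in both runs."""
    if len(run_a) != len(run_b):
        raise ValueError("both runs must cover the same samples")
    if not run_a:
        raise ValueError("runs must not be empty")
    matches = sum(a == b for a, b in zip(run_a, run_b))
    return matches / len(run_a)


def disagreements(run_a, run_b):
    """Return (index, label_a, label_b) for every sample labeled differently."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(run_a, run_b)) if a != b]


# Hypothetical labels from two evaluation runs over the same four inputs.
first_run = ["positive", "negative", "neutral", "positive"]
second_run = ["positive", "neutral", "neutral", "positive"]

print(agreement_rate(first_run, second_run))
print(disagreements(first_run, second_run))
```

Running the same comparison on the DeepSeek-Chat outputs would show whether that model really agrees with itself on every sample or only differs less often.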