<h2 class="title is-3">Comparison with existing RM</h2>
<div class="content has-text-justified">
  <p>
    Existing top-ranked reward models on Reward Bench can perform quite poorly for best-of-N sampling in the coding scenario, and sometimes even underperform the greedy decoding results. In contrast, our AceCodeRM-7B consistently outperforms them, with an average improvement of <b>6.9</b> points.
  </p>
  <div class="box m-5">
    <div class="content has-text-centered">
      <img src="static/images/ac_table4.png" alt="Comparison with other RM" class="center" width="80%"/>
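The best-of-N evaluation above can be sketched in a few lines. This is a minimal illustration, not the actual AceCodeRM API: `DummyPolicy` and `DummyRM` are hypothetical stand-ins for a code generator and a reward model, and the real setup scores sampled programs with the trained RM rather than toy integers.

```python
import random

class DummyPolicy:
    """Hypothetical stand-in for a code generator: emits random integer 'programs'."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def generate(self, prompt):
        return self.rng.randint(0, 100)

class DummyRM:
    """Hypothetical stand-in reward model: scores a candidate by its value."""
    def score(self, prompt, candidate):
        return candidate

def best_of_n(prompt, policy, rm, n=16):
    """Sample n candidates from the policy and return the one the RM scores highest."""
    candidates = [policy.generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: rm.score(prompt, c))
```

Greedy decoding corresponds to n=1; a good RM should make the selected candidate improve monotonically-ish as n grows, which is exactly where a poorly calibrated RM can fall below the greedy baseline.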
<h2 class="title is-3">Test case filtering matters</h2>
<div class="content has-text-justified">
  <p>
    We also conduct experiments to investigate how filtering the test cases with a proxy model affects the results. As shown in the table, training the RM on the filtered data improves performance significantly, especially on hard coding questions such as MBPP-Plus and BigCodeBench-Hard (C/I). We believe this is because test case filtering ensures the remaining test cases are consistent with one another and thus point to the same implicit program, which improves the quality of the rewards.
  </p>
  <div class="box m-5">
    <div class="content has-text-centered">
      <img src="static/images/ac_table5.png" alt="Test case filtering matters" class="center" width="80%"/>
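The consistency-filtering idea can be sketched as follows. This is a simplified illustration under one assumption: that filtering keeps a test case only if the proxy model's solution passes it, so the surviving test cases all agree with a single program. The `run_test` harness and the tuple-based test format are hypothetical simplifications, not the paper's actual pipeline.

```python
def run_test(solution, test_case):
    """Stand-in harness: True if solution(input) matches the expected output."""
    inp, expected = test_case
    try:
        return solution(inp) == expected
    except Exception:
        return False

def filter_test_cases(proxy_solution, test_cases):
    """Keep only test cases the proxy solution passes, so the survivors
    are mutually consistent and point to the same implicit program."""
    return [t for t in test_cases if run_test(proxy_solution, t)]

# Example: a proxy solution for "double the input"; the last test case
# contradicts the first two and is dropped by the filter.
proxy = lambda x: x * 2
tests = [(1, 2), (3, 6), (4, 9)]
kept = filter_test_cases(proxy, tests)  # [(1, 2), (3, 6)]
```

Without this filter, an inconsistent test case like `(4, 9)` would penalize correct programs and reward wrong ones, adding noise to the RM's training signal.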
<p>We show that Qwen2.5-Coder is a better backbone for the reward model than Llama-3.1-8B. This is because the Qwen2.5-Coder models have been pre-trained on substantially more code-related data than the Llama-3.1 models, and are thus more knowledgeable when tuned into a reward model.</p>