Commit dae1d19

committed
more description
1 parent 8cc12f6 commit dae1d19

1 file changed: 15 additions, 6 deletions

evalperf.html

Lines changed: 15 additions & 6 deletions
@@ -62,24 +62,33 @@ <h1 class="text-nowrap mt-5" style="font-size: xx-large;">
 
 <div class="container d-flex flex-column align-items-center">
 <div>
-🚀 Code Efficiency Evaluation requires:
+<p>🚀 LLM-oriented code efficiency evaluation requires:</p>
 <ul>
 <li><strong>Performance-exercising tasks & inputs --</strong> "all complexities are equal when N is small"
 </li>
 <li><strong>Meaningful compound metric --</strong> avg. speedup does not fit multi-task evaluation
 </li>
 </ul>
-<p>Based on <strong>Differential Performance Evaluation</strong>, the EvalPerf dataset (current
-version 20240328) includes:</p>
+<p>🛍️ Based on <a href="https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf">our methodology</a>,
+the
+EvalPerf dataset (current version 20240328) includes:</p>
 <ul>
 <li>118 performance-exercising tasks</li>
 <li>Each task is equipped with a <i>computationally challenging test input</i> generated by the SaS
 generator</li>
-<li>Differential Performance Score (DPS): <i>"DPS=80"</i> means <i>"submissions can outperform 80%
-of LLM solutions..."</i></li>
-<li>Pairwise comparison of LLMs' code efficiency over common passing tasks to ablate correctness impact
+<li>Differential Performance Score (DPS): <i>"DPS=80"</i> means <i>"submissions can outperform 80% LLM
+solutions"</i></li>
+</ul>
+
+<p>🦾 The reliability of EvalPerf comes from:</p>
+<ul>
+<li><b>Correctness ablation:</b> Pairwise comparison of LLMs' code efficiency over common passing tasks</li>
+<li><b>Anti-flakiness:</b> (1) long computation -> low runtime variation (Paper Fig. 6); (2) #instructions
+as primitive metric; & (3) DPS compares the given solution with reference solutions on the same test
+bed. -- These leads to low cross-platform variation (Paper Tab. 2)
 </li>
 </ul>
+
 Check out our <a href="https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf">COLM'24 poster</a> for
 a more detailed overview!
 </div>
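
The page text in this diff describes two ideas that a small sketch may make more concrete: a percentile-style score ("DPS=80" meaning a submission outperforms 80% of LLM solutions) and a pairwise comparison restricted to tasks that both models pass. The Python below is only an illustration under stated assumptions, not the EvalPerf implementation: the function names `percentile_score` and `compare_on_common_tasks`, the use of a single cost number per solution (for example an instruction count), and all sample values are hypothetical, and the real DPS computation may differ in detail.

```python
from typing import Dict, List, Tuple


def percentile_score(candidate_cost: float, reference_costs: List[float]) -> float:
    """Percentage of reference solutions that the candidate strictly beats.

    Lower cost is better. This is a simplified reading of "DPS=80 means the
    submission outperforms 80% of LLM solutions"; it is an assumption, not
    the benchmark's actual formula.
    """
    if not reference_costs:
        raise ValueError("need at least one reference solution")
    beaten = sum(1 for cost in reference_costs if candidate_cost < cost)
    return 100.0 * beaten / len(reference_costs)


def compare_on_common_tasks(
    scores_a: Dict[str, float], scores_b: Dict[str, float]
) -> Tuple[float, float, List[str]]:
    """Average two models' per-task scores over only the tasks both pass.

    Each dict maps a task id to a score and contains only passing tasks, so
    the key intersection is the set of commonly passing tasks. Restricting
    to it keeps correctness differences out of the efficiency comparison.
    """
    common = sorted(scores_a.keys() & scores_b.keys())
    if not common:
        raise ValueError("no commonly passing tasks")
    mean_a = sum(scores_a[t] for t in common) / len(common)
    mean_b = sum(scores_b[t] for t in common) / len(common)
    return mean_a, mean_b, common


# Hypothetical costs of five reference solutions on one task.
reference_costs = [100.0, 200.0, 300.0, 400.0, 500.0]
score = percentile_score(150.0, reference_costs)  # beats 4 of 5 references

# Hypothetical per-task scores; model A fails t4 and model B fails t1.
model_a = {"t1": 80.0, "t2": 60.0, "t3": 40.0}
model_b = {"t2": 50.0, "t3": 100.0, "t4": 90.0}
mean_a, mean_b, common = compare_on_common_tasks(model_a, model_b)
```

Averaging over the intersection means a model is neither rewarded nor penalized for tasks that only one of the two models solves correctly, which is one way to read the "correctness ablation" bullet.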
