@@ -62,24 +62,33 @@ <h1 class="text-nowrap mt-5" style="font-size: xx-large;">
 
 <div class="container d-flex flex-column align-items-center">
 <div>
-🚀 Code Efficiency Evaluation requires:
+<p>🚀 LLM-oriented code efficiency evaluation requires:</p>
 <ul>
 <li><strong>Performance-exercising tasks & inputs --</strong> "all complexities are equal when N is small"
 </li>
 <li><strong>Meaningful compound metric --</strong> avg. speedup does not fit multi-task evaluation
 </li>
 </ul>
-<p>Based on <strong>Differential Performance Evaluation</strong>, the EvalPerf dataset (current
-version 20240328) includes:</p>
+<p>🛍️ Based on <a href="https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf">our methodology</a>,
+the
+EvalPerf dataset (current version 20240328) includes:</p>
 <ul>
 <li>118 performance-exercising tasks</li>
 <li>Each task is equipped with a <i>computationally challenging test input</i> generated by the SaS
 generator</li>
-<li>Differential Performance Score (DPS): <i>"DPS=80"</i> means <i>"submissions can outperform 80%
-of LLM solutions..."</i></li>
-<li>Pairwise comparison of LLMs' code efficiency over common passing tasks to ablate correctness impact
+<li>Differential Performance Score (DPS): <i>"DPS=80"</i> means <i>"submissions can outperform 80% of LLM
+solutions"</i></li>
+</ul>
+
+<p>🦾 The reliability of EvalPerf comes from:</p>
+<ul>
+<li><b>Correctness ablation:</b> Pairwise comparison of LLMs' code efficiency over common passing tasks</li>
+<li><b>Anti-flakiness:</b> (1) long computation -> low runtime variation (Paper Fig. 6); (2) #instructions
+as primitive metric; & (3) DPS compares the given solution with reference solutions on the same test
+bed. -- These lead to low cross-platform variation (Paper Tab. 2)
 </li>
 </ul>
+
 Check out our <a href="https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf">COLM'24 poster</a> for
 a more detailed overview!
 </div>
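
For readers of this change, the two ideas in the new text (a percentile-style DPS and a comparison restricted to commonly passed tasks) can be illustrated with a minimal sketch. This is not EvalPerf's implementation: the function names `dps` and `compare_on_common_tasks`, the use of a single cost number per solution (for example an instruction count, lower is better), and the plain percentile rule are all assumptions made for illustration. The actual scoring method is described in the poster linked above and may differ.

```python
from statistics import mean


def dps(candidate_cost: float, reference_costs: list[float]) -> float:
    """Hypothetical, simplified DPS: the percentage of reference solutions
    that the candidate strictly outperforms (lower cost is better)."""
    if not reference_costs:
        raise ValueError("need at least one reference solution")
    beaten = sum(1 for ref in reference_costs if candidate_cost < ref)
    return 100.0 * beaten / len(reference_costs)


def compare_on_common_tasks(
    scores_a: dict[str, float], scores_b: dict[str, float]
) -> tuple[float, float]:
    """Average each model's per-task score over only the tasks that both
    models passed, so a correctness gap does not skew the comparison.
    Each dict maps a task id to a score for a task that model passed."""
    common = sorted(scores_a.keys() & scores_b.keys())
    if not common:
        raise ValueError("no commonly passed tasks")
    avg_a = mean(scores_a[task] for task in common)
    avg_b = mean(scores_b[task] for task in common)
    return avg_a, avg_b


if __name__ == "__main__":
    # Made-up costs for five reference solutions to one task.
    references = [10.0, 40.0, 60.0, 80.0, 100.0]
    print(dps(50.0, references))

    # Made-up per-task scores; only t2 and t3 were passed by both models.
    model_a = {"t1": 80.0, "t2": 60.0, "t3": 100.0}
    model_b = {"t2": 90.0, "t3": 50.0, "t4": 50.0}
    print(compare_on_common_tasks(model_a, model_b))
```

In this sketch, a task that only one model passed (such as `t1` or `t4`) is left out of both averages, which is the sense in which the comparison ablates correctness.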