more description

ganler · ganler · commit dae1d191389c · 2024-10-27T17:11:18.000-05:00
diff --git a/evalperf.html b/evalperf.html
@@ -62,24 +62,33 @@ <h1 class="text-nowrap mt-5" style="font-size: xx-large;">
 
       <div class="container d-flex flex-column align-items-center">
         <div>
-          🚀 Code Efficiency Evaluation requires:
+          <p>🚀 LLM-oriented code efficiency evaluation requires:</p>
           <ul>
             <li><strong>Performance-exercising tasks & inputs --</strong> "all complexities are equal when N is small"
             </li>
             <li><strong>Meaningful compound metric --</strong> avg. speedup does not fit multi-task evaluation
             </li>
           </ul>
-          <p>Based on <strong>Differential Performance Evaluation</strong>, the EvalPerf dataset (current
-            version 20240328) includes:</p>
+          <p>🛍️ Based on <a href="https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf">our methodology</a>,
+            the
+            EvalPerf dataset (current version 20240328) includes:</p>
           <ul>
             <li>118 performance-exercising tasks</li>
             <li>Each task is equipped with a <i>computationally challenging test input</i> generated by the SaS
               generator</li>
-            <li>Differential Performance Score (DPS): <i>"DPS=80"</i> means <i>"submissions can outperform 80%
-                of LLM solutions..."</i></li>
-            <li>Pairwise comparison of LLMs' code efficiency over common passing tasks to ablate correctness impact
+            <li>Differential Performance Score (DPS): <i>"DPS=80"</i> means <i>"submissions can outperform 80% of
+                LLM solutions"</i></li>
+          </ul>
+
+          <p>🦾 The reliability of EvalPerf comes from:</p>
+          <ul>
+            <li><b>Correctness ablation:</b> Pairwise comparison of LLMs' code efficiency over common passing tasks</li>
+            <li><b>Anti-flakiness:</b> (1) long computation -> low runtime variation (Paper Fig. 6); (2) #instructions
+              as the primitive metric; and (3) DPS compares the given solution with reference solutions on the same
+              test bed. -- Together, these lead to low cross-platform variation (Paper Tab. 2)
             </li>
           </ul>
+
           Check out our <a href="https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf">COLM'24 poster</a> for
           a more detailed overview!
         </div>