Commit ee95ef8

Update index.html
1 parent 54c610b commit ee95ef8

File tree: 1 file changed (+1, -1 lines)

index.html

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@ <h2 class="title is-3" style="text-align: center;">
 </p>

 <p>
-Furthermore, we demonstrate that <strong style="color: #4b6cb7;">sparse attention</strong> unlocks new scaling opportunities by mitigating KV memory overhead, enabling <em style="color: #27ae60;">longer generations</em> and <em style="color: #27ae60;">more parallel reasoning trials</em> within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over <strong style="color: #27ae60;">60 points</strong> gains in low-cost regimes and over <strong style="color: #27ae60;">5 points</strong> gains in high-cost regimes for problem-solving accuracy on <em style="color: #27ae60;">AIME</em>, encompassing evaluations on state-of-the-art MoEs. <a href="#figure2" style="color: #2c6eab; text-decoration: underline;">(See Figure 2)</a>. <em>These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training—where parameter scaling saturates, test-time accuracy continues to improve through increased generation.</em>
+Furthermore, we demonstrate that <strong style="color: #4b6cb7;">sparse attention</strong> unlocks new scaling opportunities by mitigating KV memory overhead, enabling <em style="color: #27ae60;">longer generations</em> and <em style="color: #27ae60;">more parallel reasoning trials</em> within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over <strong style="color: #27ae60;">60 points</strong> gains in low-cost regimes and over <strong style="color: #27ae60;">5 points</strong> gains in high-cost regimes for problem-solving accuracy on <em style="color: #27ae60;">AIME</em>, encompassing evaluations on state-of-the-art MoEs. <a href="#figure2" style="color: #2c6eab; text-decoration: underline;">(See Figure 2)</a>. While sparsity has traditionally been employed either for regularization in small models or to reduce computation in over-parameterized networks, <em>our work introduces a fundamentally different perspective: sparsity as a central enabler of efficient and scalable test-time compute. In contrast to pretraining, where scaling exhibits diminishing returns (Sutskever, 2024), TTS continues to benefit from increased token generation and more reasoning trials.</em>
 </p>
 </div>
 </div>
