Commit e5b8f66

Update index.html
1 parent 72d3bd5 commit e5b8f66


index.html

Lines changed: 6 additions & 1 deletion
@@ -238,7 +238,12 @@ <h2 class="title is-3" style="text-align: center;">
 </p>
 
 <p>
-Furthermore, we demonstrate that <strong style="color: #4b6cb7;">sparse attention</strong> unlocks new scaling opportunities by mitigating KV memory overhead, enabling <em style="color: #27ae60;">longer generations</em> and <em style="color: #27ae60;">more parallel reasoning trials</em> within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over <strong style="color: #27ae60;">60 points</strong> gains in low-cost regimes and over <strong style="color: #27ae60;">5 points</strong> gains in high-cost regimes for problem-solving accuracy on <em style="color: #27ae60;">AIME</em>, encompassing evaluations on state-of-the-art MoEs. <a href="#figure2" style="color: #2c6eab; text-decoration: underline;">(See Figure 2)</a>. While sparsity has traditionally been employed either for regularization in small models, or to reduce computation in over-parameterized networks, our work introduces a fundamentally different perspective: sparsity as a central enabler of efficient and scalable test-time compute. In contrast to pretraining, where scaling is exhibiting diminishing returns, TTS continues to benefit from increased token generation and more reasoning trials.
+Furthermore, we demonstrate that <strong style="color: #4b6cb7;">sparse attention</strong> unlocks new scaling opportunities by mitigating KV memory overhead, enabling <em style="color: #27ae60;">longer generations</em> and <em style="color: #27ae60;">more parallel reasoning trials</em> within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over <strong style="color: #27ae60;">60 points</strong> gains in low-cost regimes and over <strong style="color: #27ae60;">5 points</strong> gains in high-cost regimes for problem-solving accuracy on <em style="color: #27ae60;">AIME</em>, encompassing evaluations on state-of-the-art MoEs. <a href="#figure2" style="color: #2c6eab; text-decoration: underline;">(See Figure 2)</a>.
+</p>
+
+<p>
+While sparsity has traditionally been employed either for regularization in small models, or to reduce computation in over-parameterized networks, our work introduces a fundamentally different perspective: sparsity as a central enabler of efficient and scalable test-time compute. In contrast to pretraining, where scaling is exhibiting diminishing returns, TTS continues to benefit from increased token generation and more reasoning trials.
+</p>
 </p>
 </div>
 </div>
