Commit e5b8f66

Update index.html
1 parent 72d3bd5 commit e5b8f66


index.html

Lines changed: 6 additions & 1 deletion
@@ -238,7 +238,12 @@ <h2 class="title is-3" style="text-align: center;">
 </p>
 
 <p>
-Furthermore, we demonstrate that <strong style="color: #4b6cb7;">sparse attention</strong> unlocks new scaling opportunities by mitigating KV memory overhead, enabling <em style="color: #27ae60;">longer generations</em> and <em style="color: #27ae60;">more parallel reasoning trials</em> within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over <strong style="color: #27ae60;">60 points</strong> gains in low-cost regimes and over <strong style="color: #27ae60;">5 points</strong> gains in high-cost regimes for problem-solving accuracy on <em style="color: #27ae60;">AIME</em>, encompassing evaluations on state-of-the-art MoEs. <a href="#figure2" style="color: #2c6eab; text-decoration: underline;">(See Figure 2)</a>. While sparsity has traditionally been employed either for regularization in small models, or to reduce computation in over-parameterized networks, our work introduces a fundamentally different perspective: sparsity as a central enabler of efficient and scalable test-time compute. In contrast to pretraining, where scaling is exhibiting diminishing returns, TTS continues to benefit from increased token generation and more reasoning trials.
+Furthermore, we demonstrate that <strong style="color: #4b6cb7;">sparse attention</strong> unlocks new scaling opportunities by mitigating KV memory overhead, enabling <em style="color: #27ae60;">longer generations</em> and <em style="color: #27ae60;">more parallel reasoning trials</em> within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over <strong style="color: #27ae60;">60 points</strong> gains in low-cost regimes and over <strong style="color: #27ae60;">5 points</strong> gains in high-cost regimes for problem-solving accuracy on <em style="color: #27ae60;">AIME</em>, encompassing evaluations on state-of-the-art MoEs. <a href="#figure2" style="color: #2c6eab; text-decoration: underline;">(See Figure 2)</a>.
+</p>
+
+<p>
+While sparsity has traditionally been employed either for regularization in small models, or to reduce computation in over-parameterized networks, our work introduces a fundamentally different perspective: sparsity as a central enabler of efficient and scalable test-time compute. In contrast to pretraining, where scaling is exhibiting diminishing returns, TTS continues to benefit from increased token generation and more reasoning trials.
+</p>
 </p>
 </div>
 </div>
