Commit ee95ef8

Update index.html
1 parent 54c610b commit ee95ef8

File tree: 1 file changed (+1, -1 lines)

index.html

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@ <h2 class="title is-3" style="text-align: center;">
 </p>

 <p>
-Furthermore, we demonstrate that <strong style="color: #4b6cb7;">sparse attention</strong> unlocks new scaling opportunities by mitigating KV memory overhead, enabling <em style="color: #27ae60;">longer generations</em> and <em style="color: #27ae60;">more parallel reasoning trials</em> within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over <strong style="color: #27ae60;">60 points</strong> gains in low-cost regimes and over <strong style="color: #27ae60;">5 points</strong> gains in high-cost regimes for problem-solving accuracy on <em style="color: #27ae60;">AIME</em>, encompassing evaluations on state-of-the-art MoEs. <a href="#figure2" style="color: #2c6eab; text-decoration: underline;">(See Figure 2)</a>. <em>These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training—where parameter scaling saturates, test-time accuracy continues to improve through increased generation.</em>
+Furthermore, we demonstrate that <strong style="color: #4b6cb7;">sparse attention</strong> unlocks new scaling opportunities by mitigating KV memory overhead, enabling <em style="color: #27ae60;">longer generations</em> and <em style="color: #27ae60;">more parallel reasoning trials</em> within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over <strong style="color: #27ae60;">60 points</strong> gains in low-cost regimes and over <strong style="color: #27ae60;">5 points</strong> gains in high-cost regimes for problem-solving accuracy on <em style="color: #27ae60;">AIME</em>, encompassing evaluations on state-of-the-art MoEs. <a href="#figure2" style="color: #2c6eab; text-decoration: underline;">(See Figure 2)</a>. While sparsity has traditionally been employed either for regularization in small models or to reduce computation in over-parameterized networks, <em>our work introduces a fundamentally different perspective: sparsity as a central enabler of efficient and scalable test-time compute. In contrast to pretraining, where scaling exhibits diminishing returns (Sutskever, 2024), TTS continues to benefit from increased token generation and more reasoning trials.</em>
 </p>
 </div>
 </div>
