Commit d51c79f

Add news, update numbers for local reproduction
1 parent 7ccb560 commit d51c79f

2 files changed, +2 -1 lines changed

docs/index.template.html

Lines changed: 1 addition & 0 deletions
@@ -289,6 +289,7 @@
 <p>News</p>
 </div>
 <div class="message-body">
+<p><strong><time>2025-09-13</time></strong> <a href="https://logicstar.ai">LogicStar</a> claims the first place on SWT-Verified, achieving almost 80% accuracy. Meanwhile, we release a new version of SWT-Bench, resolving various issues in evaluation grading. This results generally in increasing previously reported scores between 2-3%. Special thanks to all contributors!</p>
 <p><strong><time>2025-08-22</time></strong> The 1st and 3rd place on SWT-Verified are reclaimed by the latest release of <a href="https://all-hands.dev">OpenHands</a>, equipped with the newly released <a href="https://openai.com/index/introducing-gpt-5/">GPT-5</a> and <a href="https://openai.com/index/introducing-gpt-5/">GPT-5-mini</a>, respectively.</p>
 <p><strong><time>2025-08-11</time></strong> <a href="https://arxiv.org/abs/2508.06365">e-Otter++</a> claims the first position on the leaderboard with 50.7% and 60.7% on Lite and Verified respectively. They improve upon prior <a href="https://arxiv.org/abs/2502.05368v2">Otter</a> by more deeply integrating execution feedback and heterogeneous prompts in the generation loop.</p>
 <p><strong><time>2025-07-28</time></strong> <a href="https://github.com/uw-swag/AssertFlip">AssertFlip</a> demonstrates a method to generate test cases by flipping the semantics of generated passing tests, achieving superior performance with a success rate of 35.1% on SWT-Bench Lite and 43.4% on Verified.</p>

docs/runs.csv

Lines changed: 1 addition & 1 deletion
@@ -27,4 +27,4 @@ verified,,Otter,GPT-4o,31.6,37.6,2025-03-10,unittest
 verified,,OpenHands,Cl. Sonnet 3.5,27.7,52.9,2025-02-28,unittest
 verified,,LIBRO,GPT-4o,17.8,38.0,2025-02-28,unittest
 verified,,Zero-Shot Plus,GPT-4o + BM25,14.3,34.0,2025-02-28,unittest
-verified,new,LogicStar AI,L*Agent v1, 79.9, 66.5,2025-09-12,unittest
+verified,new,LogicStar AI,L*Agent v1, 79.9, 66.5,2025-09-13,unittest
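
The updated row has a space after some commas (`, 79.9, 66.5`), unlike the rows around it. The sketch below shows one way rows like these could be read so that the stray spaces do not leak into the parsed fields. The header row of `docs/runs.csv` is not part of this diff, so the column names in `FIELDS` are hypothetical placeholders guessed from the values, and `load_runs` is an illustrative helper, not code from the repository.

```python
import csv
import io

# Two rows copied from the docs/runs.csv hunk in this commit.
ROWS = """\
verified,,OpenHands,Cl. Sonnet 3.5,27.7,52.9,2025-02-28,unittest
verified,new,LogicStar AI,L*Agent v1, 79.9, 66.5,2025-09-13,unittest
"""

# Hypothetical column names: the real header is not shown in the diff.
FIELDS = ["split", "flag", "method", "model", "score_a", "score_b", "date", "mode"]


def load_runs(text):
    """Parse CSV rows into dicts, converting the two numeric columns to float."""
    # skipinitialspace=True drops whitespace that directly follows a comma,
    # so " 79.9" is read as "79.9" while "Cl. Sonnet 3.5" is left untouched.
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, skipinitialspace=True)
    runs = []
    for row in reader:
        row["score_a"] = float(row["score_a"])
        row["score_b"] = float(row["score_b"])
        runs.append(row)
    return runs


runs = load_runs(ROWS)
best = max(runs, key=lambda r: r["score_a"])
print(best["method"], best["score_a"], best["date"])
```

`float()` already tolerates surrounding whitespace, so the numeric conversion would work either way; `skipinitialspace` mainly keeps the string fields clean if a similar space ever appears before a text column.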
