Commit 443e03b

Merge pull request #36 from logic-star-ai/feat/updated_eval
Update run scores for latest release
2 parents bc8cc81 + d51c79f commit 443e03b

5 files changed, +22 -13 lines changed

docs/approaches.csv

Lines changed: 2 additions & 1 deletion
@@ -12,4 +12,5 @@ Aider,https://aider.chat,aider,https://github.com/logic-star-ai/swt-bench?tab=re
 AutoCodeRover,https://autocoderover.dev,autocoderover,https://github.com/logic-star-ai/swt-bench?tab=readme-ov-file#evaluation-results
 LIBRO,https://arxiv.org/abs/2209.11515,kaist,https://github.com/logic-star-ai/swt-bench?tab=readme-ov-file#evaluation-results
 Otter++,https://arxiv.org/abs/2502.05368v1,ibm,https://files.sri.inf.ethz.ch/swt-bench/otter/
-Otter,https://arxiv.org/abs/2502.05368v1,ibm,https://files.sri.inf.ethz.ch/swt-bench/otter/
+Otter,https://arxiv.org/abs/2502.05368v1,ibm,https://files.sri.inf.ethz.ch/swt-bench/otter/
+LogicStar AI,https://logicstar.ai/,logicstar,https://logicstar.ai/blog/logicstar-on-test-generation-benchmark-swt

docs/index.template.html

Lines changed: 1 addition & 0 deletions
@@ -289,6 +289,7 @@
 <p>News</p>
 </div>
 <div class="message-body">
+<p><strong><time>2025-09-13</time></strong> <a href="https://logicstar.ai">LogicStar</a> claims the first place on SWT-Verified, achieving almost 80% accuracy. Meanwhile, we release a new version of SWT-Bench, resolving various issues in evaluation grading. This results generally in increasing previously reported scores between 2-3%. Special thanks to all contributors!</p>
 <p><strong><time>2025-08-22</time></strong> The 1st and 3rd place on SWT-Verified are reclaimed by the latest release of <a href="https://all-hands.dev">OpenHands</a>, equipped with the newly released <a href="https://openai.com/index/introducing-gpt-5/">GPT-5</a> and <a href="https://openai.com/index/introducing-gpt-5/">GPT-5-mini</a>, respectively.</p>
 <p><strong><time>2025-08-11</time></strong> <a href="https://arxiv.org/abs/2508.06365">e-Otter++</a> claims the first position on the leaderboard with 50.7% and 60.7% on Lite and Verified respectively. They improve upon prior <a href="https://arxiv.org/abs/2502.05368v2">Otter</a> by more deeply integrating execution feedback and heterogeneous prompts in the generation loop.</p>
 <p><strong><time>2025-07-28</time></strong> <a href="https://github.com/uw-swag/AssertFlip">AssertFlip</a> demonstrates a method to generate test cases by flipping the semantics of generated passing tests, achieving superior performance with a success rate of 35.1% on SWT-Bench Lite and 43.4% on Verified.</p>

docs/orgs.csv

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ ibm,IBM,https://www.ibm.com/,./static/images/logos/IBM.svg
 aws,Amazon Web Services,https://aws.amazon.com/q/,./static/images/logos/Amazon_Web_Services_Logo.svg
 uw,University of Waterloo (SWAG Lab),https://github.com/uw-swag/AssertFlip,./static/images/logos/uw.svg
 allhands,All Hands AI,https://all-hands.dev/,./static/images/logos/allhands.svg
-logicstar,LogicStar,https://logicstar.ai/,./static/images/logos/logicstar.png
+logicstar,LogicStar,https://logicstar.ai/,./static/images/logos/logicstar_symbol_navy.svg
 swe-agent,SWE-agent,https://swe-agent.com/,./static/images/logos/swe-agent.svg
 aider,Aider,https://aider.chat/,./static/images/logos/aider.png
 autocoderover,AutoCodeRover,https://autocoderover.net,./static/images/logos/autocoderover.svg
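
The rows above suggest that the third column of docs/approaches.csv holds an organization key that matches the first column of docs/orgs.csv (for example `logicstar` appears in both). The following is a minimal sketch of that lookup under exactly that assumption. The column positions are inferred from the visible rows, not from file headers, and the function name `logos_by_approach` is hypothetical.

```python
import csv
import io

# Rows copied from the post-commit side of the two diffs above.
APPROACHES = """\
AutoCodeRover,https://autocoderover.dev,autocoderover,https://github.com/logic-star-ai/swt-bench?tab=readme-ov-file#evaluation-results
LogicStar AI,https://logicstar.ai/,logicstar,https://logicstar.ai/blog/logicstar-on-test-generation-benchmark-swt
"""

ORGS = """\
logicstar,LogicStar,https://logicstar.ai/,./static/images/logos/logicstar_symbol_navy.svg
autocoderover,AutoCodeRover,https://autocoderover.net,./static/images/logos/autocoderover.svg
"""


def logos_by_approach(approaches_text, orgs_text):
    """Map each approach name to the logo path of its organization."""
    # Assumption: orgs.csv column 0 is the org key, column 3 the logo path.
    logos = {row[0]: row[3] for row in csv.reader(io.StringIO(orgs_text))}
    # Assumption: approaches.csv column 0 is the name, column 2 the org key.
    # Approaches whose key has no matching org row map to None.
    return {
        row[0]: logos.get(row[2])
        for row in csv.reader(io.StringIO(approaches_text))
    }


if __name__ == "__main__":
    for name, logo in logos_by_approach(APPROACHES, ORGS).items():
        print(f"{name}: {logo}")
```

With the rows shown, the LogicStar AI entry added in this commit resolves to the new `logicstar_symbol_navy.svg` logo path.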

docs/runs.csv

Lines changed: 12 additions & 11 deletions
@@ -1,8 +1,8 @@
 table_type,emojis,model_name,model_details,success_rate,coverage_increase,date,data_mode
 lite,,AEGIS,,47.8,26.0,2025-02-17,reproduction
-lite,new,e-Otter++,Claude 3.7 Sonnet,50.7,56.4,2025-08-11,unittest
-lite,,Amazon Q Developer Agent,v20250405-dev,37.7,52.7,2025-04-10,unittest
-lite,new,AssertFlip,GPT-4o,35.1,44.2,2025-07-28,unittest
+lite,,e-Otter++,Claude 3.7 Sonnet,52.5,56.4,2025-08-11,unittest
+lite,,Amazon Q Developer Agent,v20250405-dev,39.9,52.7,2025-04-10,unittest
+lite,,AssertFlip,GPT-4o,38.0,44.2,2025-07-28,unittest
 lite,,OpenHands,"Cl. Sonnet 3.5, CI setup",28.3,52.4,2025-02-18,unittest
 lite,,OpenHands,"Cl. Sonnet 3.5, vanilla",22.8,43.6,2025-02-18,unittest
 lite,,SWE-Agent+,GPT-4,18.5,27.6,2024-05-22,unittest
@@ -17,13 +17,14 @@ lite,,AutoCodeRover,GPT-4,9.1,17.9,2024-05-22,unittest
 lite,,LIBRO,GPT-4,14.1,23.8,2024-05-22,unittest
 lite,,Zero-Shot Plus,GPT-4 + BM25,9.4,21.5,2024-05-22,unittest
 lite,,Zero-Shot Base,GPT-4 + BM25,3.6,7.6,2024-05-22,unittest
-verified,new,OpenHands,GPT-5,75.8,66.3,2025-08-22,unittest
-verified,new,e-Otter++,Claude 3.7 Sonnet,60.7,62.3,2025-08-11,unittest
-verified,new,OpenHands,GPT-5-mini,56.8,60.4,2025-08-22,unittest
-verified,,Amazon Q Developer Agent,v20250405-dev,49.0,57.4,2025-04-10,unittest
-verified,new,AssertFlip,GPT-4o,43.4,47.4,2025-07-28,unittest
-verified,,Otter++,GPT-4o,37.0,42.8,2025-03-10,unittest
-verified,,Otter,GPT-4o,31.4,37.6,2025-03-10,unittest
+verified,new,OpenHands,GPT-5,79.8,66.3,2025-08-22,unittest
+verified,,e-Otter++,Claude 3.7 Sonnet,62.1,62.3,2025-08-11,unittest
+verified,new,OpenHands,GPT-5-mini,62.4,60.6,2025-08-22,unittest
+verified,,Amazon Q Developer Agent,v20250405-dev,51.0,57.4,2025-04-10,unittest
+verified,,AssertFlip,GPT-4o,45.5,47.4,2025-07-28,unittest
+verified,,Otter++,GPT-4o,37.4,42.8,2025-03-10,unittest
+verified,,Otter,GPT-4o,31.6,37.6,2025-03-10,unittest
 verified,,OpenHands,Cl. Sonnet 3.5,27.7,52.9,2025-02-28,unittest
 verified,,LIBRO,GPT-4o,17.8,38.0,2025-02-28,unittest
-verified,,Zero-Shot Plus,GPT-4o + BM25,14.3,34.0,2025-02-28,unittest
+verified,,Zero-Shot Plus,GPT-4o + BM25,14.3,34.0,2025-02-28,unittest
+verified,new,LogicStar AI,L*Agent v1, 79.9, 66.5,2025-09-13,unittest
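
The per-row effect of the re-grading on `success_rate` can be checked directly from the removed and added rows of docs/runs.csv. The sketch below is illustrative only: the row data is copied from the diff above, the field names come from the CSV header in the first hunk, and the helper names `load` and `success_rate_deltas` are hypothetical.

```python
import csv
import io

# Field names taken from the header row of docs/runs.csv shown in the diff.
FIELDS = ["table_type", "emojis", "model_name", "model_details",
          "success_rate", "coverage_increase", "date", "data_mode"]

# Removed ("-") rows whose success_rate changed in this commit.
OLD_ROWS = """\
lite,new,e-Otter++,Claude 3.7 Sonnet,50.7,56.4,2025-08-11,unittest
lite,,Amazon Q Developer Agent,v20250405-dev,37.7,52.7,2025-04-10,unittest
lite,new,AssertFlip,GPT-4o,35.1,44.2,2025-07-28,unittest
verified,new,OpenHands,GPT-5,75.8,66.3,2025-08-22,unittest
verified,new,e-Otter++,Claude 3.7 Sonnet,60.7,62.3,2025-08-11,unittest
verified,new,OpenHands,GPT-5-mini,56.8,60.4,2025-08-22,unittest
verified,,Amazon Q Developer Agent,v20250405-dev,49.0,57.4,2025-04-10,unittest
verified,new,AssertFlip,GPT-4o,43.4,47.4,2025-07-28,unittest
verified,,Otter++,GPT-4o,37.0,42.8,2025-03-10,unittest
verified,,Otter,GPT-4o,31.4,37.6,2025-03-10,unittest
"""

# Matching added ("+") rows, plus the newly introduced LogicStar AI row.
NEW_ROWS = """\
lite,,e-Otter++,Claude 3.7 Sonnet,52.5,56.4,2025-08-11,unittest
lite,,Amazon Q Developer Agent,v20250405-dev,39.9,52.7,2025-04-10,unittest
lite,,AssertFlip,GPT-4o,38.0,44.2,2025-07-28,unittest
verified,new,OpenHands,GPT-5,79.8,66.3,2025-08-22,unittest
verified,,e-Otter++,Claude 3.7 Sonnet,62.1,62.3,2025-08-11,unittest
verified,new,OpenHands,GPT-5-mini,62.4,60.6,2025-08-22,unittest
verified,,Amazon Q Developer Agent,v20250405-dev,51.0,57.4,2025-04-10,unittest
verified,,AssertFlip,GPT-4o,45.5,47.4,2025-07-28,unittest
verified,,Otter++,GPT-4o,37.4,42.8,2025-03-10,unittest
verified,,Otter,GPT-4o,31.6,37.6,2025-03-10,unittest
verified,new,LogicStar AI,L*Agent v1, 79.9, 66.5,2025-09-13,unittest
"""


def load(text):
    """Return {(table_type, model_name, model_details): success_rate}."""
    # skipinitialspace tolerates the leading spaces in the LogicStar AI row.
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS,
                            skipinitialspace=True)
    return {
        (r["table_type"], r["model_name"], r["model_details"]):
            float(r["success_rate"])
        for r in reader
    }


def success_rate_deltas(old_text, new_text):
    """Success-rate change in percentage points for rows present in both."""
    old, new = load(old_text), load(new_text)
    return {k: round(new[k] - old[k], 1) for k in old if k in new}


if __name__ == "__main__":
    for key, delta in success_rate_deltas(OLD_ROWS, NEW_ROWS).items():
        print(f"{key}: {delta:+.1f}")
```

On these rows the changes range from +0.2 (Otter, verified) to +5.6 (OpenHands with GPT-5-mini, verified). The LogicStar AI row is skipped because it has no earlier counterpart.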
Lines changed: 6 additions & 0 deletions