Commit ad5405d

docs: update papers
1 parent 829fb70 commit ad5405d

22 files changed: +1452 -1335 lines changed

README.md

Lines changed: 12 additions & 0 deletions
@@ -103,7 +103,9 @@ Based on a systematic review of **196 papers and online resources**, this survey
 
 *Benchmarks for evaluating issue resolution systems*
 
+- `(2026-03)` **BeyondSWE**: BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2603.03194) [![Website](https://img.shields.io/badge/Website-paper-5B9BD5?logo=googlechrome&logoColor=white)](https://aweai-team.github.io/BeyondSWE/) [![GitHub](https://img.shields.io/badge/GitHub-repo-24292F?logo=github&logoColor=white)](https://github.com/AweAI-Team/BeyondSWE) [![HuggingFace](https://img.shields.io/badge/HuggingFace-dataset-ff7e21?logo=huggingface&logoColor=white)](https://huggingface.co/datasets/AweAI-Team/BeyondSWE)
 - `(2026-02)` **SWE Context Bench**: SWE Context Bench: A Benchmark for Context Learning in Coding [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/pdf/2602.08316)
+- `(2026-02)` **SWE-ABS**: SWE-ABS: Adversarial Benchmark Strengthening Exposes Inflated Success Rates on Test-based Benchmark [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2603.00520)
 - `(2025-12)` **SWE-InfraBench**: SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code [![OpenReview](https://img.shields.io/badge/OpenReview-paper-8C1B13?logo=openreview&logoColor=white)](https://openreview.net/forum?id=XX0ciUwfXa)
 - `(2025-12)` **SWE-EVO**: SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2512.18470)
 - `(2025-11)` **SWE-Sharp-Bench**: SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2511.02352)
@@ -130,6 +132,9 @@ Based on a systematic review of **196 papers and online resources**, this survey
 *Datasets for training issue resolution agents*
 
 - `(2026-02)` **SWE-Universe**: SWE-Universe: Scale Real-World Verifiable Environments to Millions [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://www.arxiv.org/abs/2602.02361)
+- `(2026-02)` **SWE-rebench V2**: SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.23866)
+- `(2026-02)` **Scale-SWE**: Immersion in the GitHub Universe: Scaling Coding Agents to Mastery [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.09892) [![GitHub](https://img.shields.io/badge/GitHub-repo-24292F?logo=github&logoColor=white)](https://github.com/AweAI-Team/ScaleSWE) [![HuggingFace](https://img.shields.io/badge/HuggingFace-dataset-ff7e21?logo=huggingface&logoColor=white)](https://huggingface.co/collections/AweAI-Team/scale-swe)
+- `(2026-01)` **daVinci-Dev**: daVinci-Dev: Agent-native Mid-training for Software Engineering [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2601.18418) [![GitHub](https://img.shields.io/badge/GitHub-repo-24292F?logo=github&logoColor=white)](https://github.com/GAIR-NLP/daVinci-Dev) [![HuggingFace](https://img.shields.io/badge/HuggingFace-dataset-ff7e21?logo=huggingface&logoColor=white)](https://huggingface.co/datasets/GAIR/daVinci-Dev)
 - `(2025-06)` **Skywork-SWE**: Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2506.19290)
 - `(2025-05)` **SWELoc**: SweRank: Software Issue Localization with Code Ranking [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2505.07849)
 - `(2025-04)` **Multi-SWE-RL**: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2504.02605v1) [![OpenReview](https://img.shields.io/badge/OpenReview-paper-8C1B13?logo=openreview&logoColor=white)](https://openreview.net/forum?id=MhBZzkz4h9)
@@ -157,6 +162,7 @@ Based on a systematic review of **196 papers and online resources**, this survey
 
 *Collaborative multi-agent frameworks*
 
+- `(2026-03)` **SWE-Adept**: SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2603.01327)
 - `(2025-08)` **Meta-RAG**: Meta-RAG on Large Codebases Using Code Summarization [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2508.02611)
 - `(2025-07)` **SWE-Debate**: SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2507.23348v1)
 - `(2025-06)` **AgentScope**: SWE-Bench - AgentScope [![Website](https://img.shields.io/badge/Website-paper-5B9BD5?logo=googlechrome&logoColor=white)](https://doc.agentscope.io/v0/en/tutorial/swe.html)
@@ -187,6 +193,7 @@ Based on a systematic review of **196 papers and online resources**, this survey
 
 *Methods leveraging external tools*
 
+- `(2026-03)` **SWE-Adept**: SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2603.01327)
 - `(2026-02)` **Closing the Loop**: Closing the Loop: Universal Repository Representation with RPG-Encoder [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.02084) [![Website](https://img.shields.io/badge/Website-paper-5B9BD5?logo=googlechrome&logoColor=white)](https://ayanami2003.github.io/RPG-Encoder/) [![GitHub](https://img.shields.io/badge/GitHub-repo-24292F?logo=github&logoColor=white)](https://github.com/microsoft/RPG-ZeroRepo)
 - `(2026-01)` **SWE-Tester**: SWE-Tester: Training Open-Source LLMs for Issue Reproduction in Real-World Repositories [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2601.13713)
 - `(2025-12)` **GraphLocator**: GraphLocator: Graph-guided Causal Reasoning for Issue Localization [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2512.22469)
@@ -235,6 +242,7 @@ Based on a systematic review of **196 papers and online resources**, this survey
 
 *Models trained via supervised learning*
 
+- `(2026-02)` **Scale-SWE**: Immersion in the GitHub Universe: Scaling Coding Agents to Mastery [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.09892) [![GitHub](https://img.shields.io/badge/GitHub-repo-24292F?logo=github&logoColor=white)](https://github.com/AweAI-Team/ScaleSWE) [![HuggingFace](https://img.shields.io/badge/HuggingFace-dataset-ff7e21?logo=huggingface&logoColor=white)](https://huggingface.co/collections/AweAI-Team/scale-swe)
 - `(2026-01)` **SWE-Lego**: SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2601.01426)
 - `(2026-01)` **SWE-Replay**: SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2601.22129)
 - `(2025-12)` **SWE-Compressor**: Context as a Tool: Context Management for Long-Horizon SWE-Agents [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2512.22087)
@@ -258,6 +266,7 @@ Based on a systematic review of **196 papers and online resources**, this survey
 - `(2026-02)` **SWE-Protégé**: SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.22124)
 - `(2026-02)` **SWE-MiniSandbox**: SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.11210v1) [![GitHub](https://img.shields.io/badge/GitHub-repo-24292F?logo=github&logoColor=white)](http://github.com/lblankl/SWE-MiniSandbox)
 - `(2026-01)` **MiMo-V2-Flash**: MiMo-V2-Flash Technical Report [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2601.02780)
+- `(2026-01)` **SWE-Manager**: SWE-Manager: Selecting and Synthesizing Golden Proposals Before Coding [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2601.22956) [![GitHub](https://img.shields.io/badge/GitHub-repo-24292F?logo=github&logoColor=white)](https://github.com/shuaijiumei/SWE-Manager)
 - `(2025-12)` **Self-play SWE-RL**: Toward Training Superintelligent Software Agents through Self-Play SWE-RL [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2512.18552)
 - `(2025-12)` **SWE-Playground**: Training Versatile Coding Agents in Synthetic Environments [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2512.12216)
 - `(2025-12)` **SWE-RM**: SWE-RM: Execution-free Feedback For Software Engineering Agents [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2512.21919)
@@ -308,6 +317,8 @@ Based on a systematic review of **196 papers and online resources**, this survey
 *Techniques for collecting training data*
 
 - `(2026-02)` **DockSmith**: DockSmith: Scaling Reliable Coding Environments via an Agentic Docker Builder [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.00592) [![HuggingFace](https://img.shields.io/badge/HuggingFace-dataset-ff7e21?logo=huggingface&logoColor=white)](https://huggingface.co/collections/8sj7df9k8m5x8/docksmith)
+- `(2026-02)` **SWE-rebench V2**: SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.23866)
+- `(2026-02)` **Scale-SWE**: Immersion in the GitHub Universe: Scaling Coding Agents to Mastery [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.09892) [![GitHub](https://img.shields.io/badge/GitHub-repo-24292F?logo=github&logoColor=white)](https://github.com/AweAI-Team/ScaleSWE) [![HuggingFace](https://img.shields.io/badge/HuggingFace-dataset-ff7e21?logo=huggingface&logoColor=white)](https://huggingface.co/collections/AweAI-Team/scale-swe)
 - `(2026-01)` **MEnvAgent**: MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2601.22859) [![GitHub](https://img.shields.io/badge/GitHub-repo-24292F?logo=github&logoColor=white)](https://github.com/ernie-research/MEnvAgent)
 - `(2025-12)` **Multi-Docker-Eval**: Multi-Docker-Eval: A `Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2512.06915)
 - `(2025-08)` **RepoForge**: RepoForge: Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2508.01550)
@@ -321,6 +332,7 @@ Based on a systematic review of **196 papers and online resources**, this survey
 *Approaches for synthetic data generation*
 
 - `(2026-02)` **SWE-World**: SWE-World: Building Software Engineering Agents in Docker-Free Environments [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.03419) [![GitHub](https://img.shields.io/badge/GitHub-repo-24292F?logo=github&logoColor=white)](https://github.com/RUCAIBox/SWE-World)
+- `(2026-02)` **SWE-Hub**: SWE-Hub: A Unified Production System for Scalable, Executable Software Engineering Tasks [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2603.00575)
 - `(2025-09)` **SWE-Mirror**: SWE-Mirror: Scaling Issue-Resolving Datasets by Mirroring Issues Across Repositories [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2509.08724)
 - `(2025-06)` **SWE-Flow**: Synthesizing Software Engineering Data in a Test-Driven Manner [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2506.09003v2) [![OpenReview](https://img.shields.io/badge/OpenReview-paper-8C1B13?logo=openreview&logoColor=white)](https://openreview.net/forum?id=P9DQ2IExgS)
 - `(2025-04)` **R2E-Gym**: R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents [![arXiv](https://img.shields.io/badge/arXiv-paper-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2504.07164) [![OpenReview](https://img.shields.io/badge/OpenReview-paper-8C1B13?logo=openreview&logoColor=white)](https://openreview.net/forum?id=7evvwwdo3z)

data/papers_data_analysis.yaml

Lines changed: 37 additions & 37 deletions
@@ -1,19 +1,30 @@
-- short_name: SWE-bench Verified
-  title: Introducing SWE-bench Verified | OpenAI
-  authors: OpenAI
-  year: '2024'
-  venue: '-'
-  month: 2024-08
+- short_name: Data contamination
+  title: Does SWE-Bench-Verified Test Agent Ability or Model Memory?
+  authors: Thanosan Prathifkumar, Noble Saji Mathews, Meiyappan Nagappan
+  year: '2025'
+  venue: arXiv preprint arXiv:2512.10218
+  month: 2025-12
   links:
-    website: https://openai.com/index/introducing-swe-bench-verified/
-- short_name: Patch Correctness
-  title: Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study
-  authors: You Wang, Michael Pradel, Zhongxin Liu
+    arxiv: https://arxiv.org/abs/2512.10218
+- short_name: Rigorous agentic benchmarks
+  title: Establishing Best Practices for Building Rigorous Agentic Benchmarks
+  authors: Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha
+    Cui, Sayash Kapoor et al.
   year: '2025'
-  venue: arXiv preprint arXiv:2503.15223
-  month: 2025-03
+  venue: arXiv preprint arXiv:2507.02825
+  month: 2025-07
   links:
-    arxiv: https://arxiv.org/abs/2503.15223
+    arxiv: https://arxiv.org/abs/2507.02825
+- short_name: SPICE
+  title: "SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity,\n \
+    \ Test Coverage, and Effort Estimation"
+  authors: Gustavo A. Oliva, Gopi Krishnan Rajbahadur, Aaditya Bhatia, Haoxiang Zhang,
+    Yihao Chen, Zhilong Chen, Arthur Leung et al.
+  year: '2025'
+  venue: ASE 2025
+  month: 2025-07
+  links:
+    arxiv: https://arxiv.org/abs/2507.09108v5
 - short_name: UTBoost
   title: 'UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench'
   authors: Boxi Yu, Yuxuan Zhu, Pinjia He, Daniel Kang
@@ -30,15 +41,6 @@
   month: 2025-06
   links:
     arxiv: https://arxiv.org/abs/2506.17812
-- short_name: Rigorous agentic benchmarks
-  title: Establishing Best Practices for Building Rigorous Agentic Benchmarks
-  authors: Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha
-    Cui, Sayash Kapoor et al.
-  year: '2025'
-  venue: arXiv preprint arXiv:2507.02825
-  month: 2025-07
-  links:
-    arxiv: https://arxiv.org/abs/2507.02825
 - short_name: The SWE-Bench Illusion
   title: 'The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason'
   authors: Shanchao Liang, Spandan Garg, Roshanak Zilouchian Moghaddam
@@ -57,21 +59,19 @@
   month: 2025-04
   links:
     doi: http://dx.doi.org/10.1109/ICSE-Companion66252.2025.00075
-- short_name: SPICE
-  title: "SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity,\n \
-    \ Test Coverage, and Effort Estimation"
-  authors: Gustavo A. Oliva, Gopi Krishnan Rajbahadur, Aaditya Bhatia, Haoxiang Zhang,
-    Yihao Chen, Zhilong Chen, Arthur Leung et al.
+- short_name: Patch Correctness
+  title: Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study
+  authors: You Wang, Michael Pradel, Zhongxin Liu
   year: '2025'
-  venue: ASE 2025
-  month: 2025-07
+  venue: arXiv preprint arXiv:2503.15223
+  month: 2025-03
   links:
-    arxiv: https://arxiv.org/abs/2507.09108v5
-- short_name: Data contamination
-  title: Does SWE-Bench-Verified Test Agent Ability or Model Memory?
-  authors: Thanosan Prathifkumar, Noble Saji Mathews, Meiyappan Nagappan
-  year: '2025'
-  venue: arXiv preprint arXiv:2512.10218
-  month: 2025-12
+    arxiv: https://arxiv.org/abs/2503.15223
+- short_name: SWE-bench Verified
+  title: Introducing SWE-bench Verified | OpenAI
+  authors: OpenAI
+  year: '2024'
+  venue: '-'
+  month: 2024-08
   links:
-    arxiv: https://arxiv.org/abs/2512.10218
+    website: https://openai.com/index/introducing-swe-bench-verified/
