RepoGym turns real Git history into a reinforcement learning environment for AI coding agents.
RepoGym is a scalable research environment for training and evaluating AI coding agents on verified, real-world bug-fixing tasks extracted automatically from Git history. It mines bug-fixing commits from open source repositories, runs agent patches in isolated Docker sandboxes, and returns an exact reward signal based on the test suite delta, enabling online RL training without human annotation. No synthetic problems. No toy benchmarks. Real code, real tests, real feedback.
Most coding benchmarks measure whether a model can solve toy problems.
Real software engineering is different:
- large codebases
- multi-file reasoning
- real test suites
- regressions and edge cases
RepoGym converts real bug-fixing commits from open source repositories into reproducible environments where AI agents can practice fixing real code.
| Feature | HumanEval | SWE-bench | RepoGym |
|---|---|---|---|
| Scale | Fixed (164 tasks) | Fixed (2,294 tasks) | Unlimited (Live Mining) |
| Source | Synthetic | Curated Issues | Automated History Mining |
| Signal | Pass/Fail | Pass/Fail | Continuous reward delta (e.g. +0.008) |
| Primary Use | Evaluation | Evaluation | Iterative RL Training |
RepoGym is not just a benchmark; it is a training ground. By providing a continuous reward signal instead of a binary result, it lets models learn from partial progress, which is exactly what RL-based fine-tuning needs.
Every merged bugfix commit is a labeled training example: the broken state before the fix, the corrected state after, and a test suite that distinguishes them. RepoGym automates the extraction of these pairs and turns them into a structured environment any agent can interact with over HTTP.
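The extraction heuristic is not specified here, but a minimal commit filter might look like the sketch below. The keyword list and the file-path check are illustrative assumptions, not RepoGym's actual rules:

```python
# Hedged sketch of a bugfix-commit filter: a commit becomes a candidate task
# only if its message reads like a fix AND it touches both source and test
# files, so the test suite can distinguish the broken and fixed states.
BUGFIX_WORDS = ("fix", "bug", "regression")  # assumed keyword heuristic

def is_candidate(message: str, files: list[str]) -> bool:
    msg = message.lower()
    touches_tests = any("test" in f for f in files)
    touches_source = any("test" not in f for f in files)
    return any(w in msg for w in BUGFIX_WORDS) and touches_tests and touches_source

print(is_candidate("Fix models.py encoding fallback",
                   ["requests/models.py", "tests/test_requests.py"]))  # True
print(is_candidate("Update README", ["README.md"]))                    # False
```

A real miner would also verify the pair dynamically (the tests must fail at the parent commit and pass at the fix commit), which is what makes the tasks "verified."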
graph TD
A[Git History] --> B(Extract Bugfix Commits)
B --> C{RepoGym Environment}
C -->|Fetch Task| D[Autonomous Agent]
D -->|Generate Patch| E[Docker Sandbox]
E -->|Run Tests| F[Reward Signal]
F -->|RL Feedback| D
F -->|Log History| G[Live Dashboard]
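The sandbox step in the loop above can be sketched as the Docker invocation the server might construct. The mount layout and isolation flags are assumptions about the implementation; only the base image (`python:3.11-slim`) appears elsewhere in this document:

```python
# Hedged sketch: build the docker command for one sandboxed test run.
# Mount point and flags are illustrative assumptions.
import shlex

def sandbox_cmd(repo_dir: str, test_command: str,
                image: str = "python:3.11-slim") -> list[str]:
    return [
        "docker", "run", "--rm",
        "--network", "none",            # no network access inside the sandbox
        "-v", f"{repo_dir}:/work",      # mount the patched checkout
        "-w", "/work",
        image, "sh", "-c", test_command,
    ]

print(shlex.join(sandbox_cmd("./repos/requests", "pytest tests/")))
```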
The reward is simple and verifiable:
reward = score(patched) − score(broken)
= tests_passed_after / total − tests_passed_before / total
A perfect patch scores +1.0. A patch that breaks things scores negative. No LLM judge, no rubric, no ambiguity.
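The same formula in Python (the function name and signature are illustrative, not RepoGym's API):

```python
# Reward = pass rate after the patch minus pass rate before it,
# matching the formula above.
def reward(passed_before: int, passed_after: int, total: int) -> float:
    return passed_after / total - passed_before / total

print(reward(0, 206, 206))    # 1.0 (perfect patch)
print(reward(200, 195, 206))  # negative: the patch caused regressions
```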
# 1. build and start the server
cargo build --release
cargo run -- serve --port 8080
# 2. get a task
curl http://localhost:8080/task
# 3. submit a patch
curl -X POST http://localhost:8080/submit \
-H "Content-Type: application/json" \
-d '{"task_id": 1, "agent_id": "my-agent", "patch": "<unified diff>"}'
# 4. view the dashboard
open http://localhost:8080

The dashboard is the scoreboard for AI training: think of it as a gym where models "lift weights" (fix code) and get a score for how well they did.
- 20,376 Verified Tasks: Our library of real bugs from Django, Flask, and Requests. These are not synthetic problems; they are real mistakes human developers fixed in the past.
- Average Reward (0.000): the current baseline across all agents running right now. As you train your model, you will see this number turn green and move positive.
A log of every single attempt an AI makes to fix a bug:
- Score (e.g., 1.000): How many tests passed after the AI applied its fix. 1.000 is a perfect fix.
- Delta (e.g., +0.008): This is the Reward Signal. It tells the AI: "You made the code 0.8% more correct than it was before."
- Green (+): The AI fixed a test that was previously failing. This is how it learns.
- Red (-): The AI's "fix" accidentally broke something that was previously working (Regression).
Ranks AIs based on their overall success. Currently, asbestos-local is at the baseline, ready for reinforcement learning to drive its score up.
To train an AI using Reinforcement Learning (RL), the agent needs a "Reward" to know if it's doing a good job. Usually, this requires a human to check the work.
RepoGym does it automatically. By using Docker to run real tests and returning the reward signal instantly, it allows an AI to "practice" coding 24/7 without human supervision.
Try this: Click the VIEW button on any of those rows. It will show you exactly what the AI was trying to fix and the "cheat sheet" (the actual human fix) for that bug.
# pull the sandbox image
docker pull python:3.11-slim
# build the CLI
cargo build --release

RepoGym is designed to ingest from any Git repository. More repos means more tasks.
# ingest repos
cargo run -- ingest https://github.com/psf/requests --dest ./repos/requests
cargo run -- ingest https://github.com/pallets/flask --dest ./repos/flask
cargo run -- ingest https://github.com/fastapi/fastapi --dest ./repos/fastapi
cargo run -- ingest https://github.com/encode/httpx --dest ./repos/httpx
cargo run -- ingest https://github.com/pytest-dev/pytest --dest ./repos/pytest
# extract and save tasks
cargo run -- save-tasks ./repos/requests requests
cargo run -- save-tasks ./repos/flask flask
cargo run -- save-tasks ./repos/fastapi fastapi

Current dataset: 20,376 verified tasks across psf/requests, pallets/flask, and django/django.
| Method | Route | Description |
|---|---|---|
| GET | /task | Returns a random task. Add ?repo=requests to filter by repo. |
| POST | /submit | Accepts a patch, runs the sandbox, returns the reward signal. |
| GET | /leaderboard | Agent rankings by average score. |
| GET | / | Live dashboard. |
Example task (GET /task):

{
  "id": 4401,
  "repo": "requests",
  "description": "fix models.py encoding fallback",
  "base_commit": "af52bf79",
  "fix_commit": "6d3d5e59",
  "files_modified": ["requests/models.py", "tests/test_requests.py"],
  "test_command": "pytest tests/"
}

Example submission (POST /submit):

{
  "task_id": 4401,
  "agent_id": "claude-opus-4-5",
  "patch": "--- a/requests/models.py\n+++ b/requests/models.py\n..."
}

Example response:

{
  "score": 1.000,
  "delta": 0.005,
  "tests_passed": 206,
  "tests_failed": 0
}

RepoGym exposes a clean loop any agent can run:
fetch task → read broken code → generate patch → submit → log reward → repeat
Example Python agent using the Anthropic API:
import httpx
import anthropic
REPOGYM = "http://localhost:8080"
client = anthropic.Anthropic()
def run_agent(n=10):
    for _ in range(n):
        task = httpx.get(f"{REPOGYM}/task?repo=requests").json()
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Fix this bug as a unified diff patch.
Repo: {task['repo']}
Description: {task['description']}
Files: {', '.join(task['files_modified'])}
Base commit: {task['base_commit']}
Output only the patch."""
            }]
        )
        result = httpx.post(f"{REPOGYM}/submit", json={
            "task_id": task["id"],
            "agent_id": "claude-opus-4-5",
            "patch": response.content[0].text
        }).json()
        print(f"{task['description']}")
        print(f"score {result['score']:.3f} delta {result['delta']:+.3f}")

src/
core/ git ingestion, commit filtering, task generation
sandbox/ docker execution, patch application, test running
db/ sqlite persistence, task and submission storage
api/ axum REST server, dashboard, leaderboard
- Asbestos — on-device autonomous agent with tool execution. Supports low-latency local training using Qwen 3.5 0.8B (Quantized GGUF).
- SWE-bench — related benchmark, hand-curated, ~2,300 tasks
- HumanEval — synthetic coding benchmark, 164 problems
MIT
@software{asante2026repogym,
author = {Asante, Jeffrey},
title = {RepoGym: An Automated RL Environment for AI Coding Agents},
year = {2026},
url = {https://github.com/jeffasante/repogym}
}