RepoGym turns real Git history into a reinforcement learning environment for AI coding agents.
RepoGym is a scalable research environment for training and evaluating AI coding agents on verified, real-world bug-fixing tasks extracted automatically from Git history. It mines bug-fixing commits from open source repositories, runs agent patches in isolated Docker sandboxes, and returns an exact reward signal based on the test suite delta, enabling online RL training without human annotation. No synthetic problems. No toy benchmarks. Real code, real tests, real feedback.
Most coding benchmarks measure whether a model can solve toy problems.
Real software engineering is different:
- large codebases
- multi-file reasoning
- real test suites
- regressions and edge cases
RepoGym converts real bug-fixing commits from open source repositories into reproducible environments where AI agents can practice fixing real code.
| Feature | HumanEval | SWE-bench | RepoGym |
|---|---|---|---|
| Scale | Fixed (164 tasks) | Fixed (2,294 tasks) | Unlimited (Live Mining) |
| Source | Synthetic | Curated Issues | Automated History Mining |
| Signal | Pass/Fail | Pass/Fail | Continuous reward delta (e.g. +0.008) |
| Primary Use | Evaluation | Evaluation | Iterative RL Training |
RepoGym is not just a benchmark; it is a training ground. By providing a continuous reward signal instead of a binary result, it lets models learn from partial progress, which is exactly what RL-based fine-tuning needs.
Every merged bugfix commit is a labeled training example: the broken state before the fix, the corrected state after, and a test suite that distinguishes them. RepoGym automates the extraction of these pairs and turns them into a structured environment any agent can interact with over HTTP.
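The extraction heuristic is not specified here, but a minimal commit filter might look like the sketch below. The keyword list and the file-path check are illustrative assumptions, not RepoGym's actual rules:

```python
# Hedged sketch of a bugfix-commit filter: a commit becomes a candidate task
# only if its message reads like a fix AND it touches both source and test
# files, so the test suite can distinguish the broken and fixed states.
BUGFIX_WORDS = ("fix", "bug", "regression")  # assumed keyword heuristic

def is_candidate(message: str, files: list[str]) -> bool:
    msg = message.lower()
    touches_tests = any("test" in f for f in files)
    touches_source = any("test" not in f for f in files)
    return any(w in msg for w in BUGFIX_WORDS) and touches_tests and touches_source

print(is_candidate("Fix models.py encoding fallback",
                   ["requests/models.py", "tests/test_requests.py"]))  # True
print(is_candidate("Update README", ["README.md"]))                    # False
```

A real miner would also verify the pair dynamically (the tests must fail at the parent commit and pass at the fix commit), which is what makes the tasks "verified."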
graph TD
A[Git History] --> B(Extract Bugfix Commits)
B --> C{RepoGym Environment}
C -->|Fetch Task| D[Autonomous Agent]
D -->|Generate Patch| E[Docker Sandbox]
E -->|Run Tests| F[Reward Signal]
F -->|RL Feedback| D
F -->|Log History| G[Live Dashboard]
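The sandbox step in the loop above can be sketched as the Docker invocation the server might construct. The mount layout and isolation flags are assumptions about the implementation; only the base image (`python:3.11-slim`) appears elsewhere in this document:

```python
# Hedged sketch: build the docker command for one sandboxed test run.
# Mount point and flags are illustrative assumptions.
import shlex

def sandbox_cmd(repo_dir: str, test_command: str,
                image: str = "python:3.11-slim") -> list[str]:
    return [
        "docker", "run", "--rm",
        "--network", "none",            # no network access inside the sandbox
        "-v", f"{repo_dir}:/work",      # mount the patched checkout
        "-w", "/work",
        image, "sh", "-c", test_command,
    ]

print(shlex.join(sandbox_cmd("./repos/requests", "pytest tests/")))
```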
The reward is simple and verifiable:
reward = score(patched) − score(broken)
= tests_passed_after / total − tests_passed_before / total
A perfect patch scores +1.0. A patch that breaks things scores negative. No LLM judge, no rubric, no ambiguity.
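The same formula in Python (the function name and signature are illustrative, not RepoGym's API):

```python
# Reward = pass rate after the patch minus pass rate before it,
# matching the formula above.
def reward(passed_before: int, passed_after: int, total: int) -> float:
    return passed_after / total - passed_before / total

print(reward(0, 206, 206))    # 1.0 (perfect patch)
print(reward(200, 195, 206))  # negative: the patch caused regressions
```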
# 1. build and start the server
cargo build --release
cargo run -- serve --port 8080
# 2. get a task
curl http://localhost:8080/task
# 3. submit a patch
curl -X POST http://localhost:8080/submit \
-H "Content-Type: application/json" \
-d '{"task_id": 1, "agent_id": "my-agent", "patch": "<unified diff>"}'
# 4. view the dashboard
open http://localhost:8080

The dashboard is the scoreboard for AI training: think of it as a gym where models "lift weights" (fix code) and get a score for how well they did.
- 20,376 Verified Tasks: Our library of real bugs from Django, Flask, and Requests. These are not synthetic problems; they are real mistakes human developers fixed in the past.
- Average Reward (0.000): the current baseline across all agents running right now. As you train your model, you will see this number turn green and move positive.
A log of every single attempt an AI makes to fix a bug:
- Score (e.g., 1.000): How many tests passed after the AI applied its fix. 1.000 is a perfect fix.
- Delta (e.g., +0.008): This is the Reward Signal. It tells the AI: "You made the code 0.8% more correct than it was before."
- Green (+): The AI fixed a test that was previously failing. This is how it learns.
- Red (-): The AI's "fix" accidentally broke something that was previously working (Regression).
Ranks AIs based on their overall success. Currently, asbestos-local is at the baseline, ready for reinforcement learning to drive its score up.
To train an AI using Reinforcement Learning (RL), the agent needs a "Reward" to know if it's doing a good job. Usually, this requires a human to check the work.
RepoGym does it automatically. By using Docker to run real tests and returning the reward signal instantly, it allows an AI to "practice" coding 24/7 without human supervision.
Try this: Click the VIEW button on any of those rows. It will show you exactly what the AI was trying to fix and the "cheat sheet" (the actual human fix) for that bug.
# pull the sandbox image
docker pull python:3.11-slim
# build the CLI
cargo build --release

RepoGym is designed to ingest from any Git repository. More repos means more tasks.
# ingest repos
cargo run -- ingest https://github.com/psf/requests --dest ./repos/requests
cargo run -- ingest https://github.com/pallets/flask --dest ./repos/flask
cargo run -- ingest https://github.com/fastapi/fastapi --dest ./repos/fastapi
cargo run -- ingest https://github.com/encode/httpx --dest ./repos/httpx
cargo run -- ingest https://github.com/pytest-dev/pytest --dest ./repos/pytest
# extract and save tasks
cargo run -- save-tasks ./repos/requests requests
cargo run -- save-tasks ./repos/flask flask
cargo run -- save-tasks ./repos/fastapi fastapi

Current dataset: 20,376 verified tasks across psf/requests, pallets/flask, and django/django.
| Method | Route | Description |
|---|---|---|
| GET | /task | Returns a random task. Add ?repo=requests to filter by repo. |
| POST | /submit | Accepts a patch, runs the sandbox, returns the reward signal. |
| GET | /leaderboard | Agent rankings by average score. |
| GET | / | Live dashboard. |
Example task (GET /task):

{
  "id": 4401,
  "repo": "requests",
  "description": "fix models.py encoding fallback",
  "base_commit": "af52bf79",
  "fix_commit": "6d3d5e59",
  "files_modified": ["requests/models.py", "tests/test_requests.py"],
  "test_command": "pytest tests/"
}

Example submission (POST /submit):

{
  "task_id": 4401,
  "agent_id": "claude-opus-4-5",
  "patch": "--- a/requests/models.py\n+++ b/requests/models.py\n..."
}

Example response:

{
  "score": 1.000,
  "delta": 0.005,
  "tests_passed": 206,
  "tests_failed": 0
}

RepoGym exposes a clean loop any agent can run:
fetch task → read broken code → generate patch → submit → log reward → repeat
Example Python agent using the Anthropic API:
import httpx
import anthropic
REPOGYM = "http://localhost:8080"
client = anthropic.Anthropic()
def run_agent(n=10):
    for _ in range(n):
        task = httpx.get(f"{REPOGYM}/task?repo=requests").json()
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Fix this bug as a unified diff patch.
Repo: {task['repo']}
Description: {task['description']}
Files: {', '.join(task['files_modified'])}
Base commit: {task['base_commit']}
Output only the patch."""
            }]
        )
        result = httpx.post(f"{REPOGYM}/submit", json={
            "task_id": task["id"],
            "agent_id": "claude-opus-4-5",
            "patch": response.content[0].text
        }).json()
        print(f"{task['description']}")
        print(f"score {result['score']:.3f} delta {result['delta']:+.3f}")

src/
core/ git ingestion, commit filtering, task generation
sandbox/ docker execution, patch application, test running
db/ sqlite persistence, task and submission storage
api/ axum REST server, dashboard, leaderboard
- Asbestos — on-device autonomous agent with tool execution. Supports low-latency local training using Qwen 3.5 0.8B (Quantized GGUF).
- SWE-bench — related benchmark, hand-curated, ~2,300 tasks
- HumanEval — synthetic coding benchmark, 164 problems
MIT
@software{asante2026repogym,
author = {Asante, Jeffrey},
title = {RepoGym: An Automated RL Environment for AI Coding Agents},
year = {2026},
url = {https://github.com/jeffasante/repogym}
}