
RepoGym

RepoGym turns GitHub history into a reinforcement learning environment for AI coding agents.

RepoGym is a scalable research environment for training and evaluating AI coding agents on verified real-world bug-fixing tasks extracted automatically from Git history. Unlike static benchmarks, it provides a live execution environment: it mines verified bug-fixing commits from open source repositories, runs agent patches in isolated Docker sandboxes, and returns an exact reward signal based on the test-suite delta, enabling online RL training without human annotation.

No synthetic problems. No toy benchmarks. Real code, real tests, real feedback.

Why RepoGym exists

Most coding benchmarks measure whether a model can solve toy problems.

Real software engineering is different:

  • large codebases
  • multi-file reasoning
  • real test suites
  • regressions and edge cases

RepoGym converts real bug-fixing commits from open source repositories into reproducible environments where AI agents can practice fixing real code.


Comparison with existing benchmarks

| Feature | HumanEval | SWE-bench | RepoGym |
|---|---|---|---|
| Scale | Fixed (164 tasks) | Fixed (2,294 tasks) | Unlimited (live mining) |
| Source | Synthetic | Curated issues | Automated history mining |
| Signal | Pass/fail | Pass/fail | Reward delta (+0.00x) |
| Primary use | Evaluation | Evaluation | Iterative RL training |

RepoGym is not just a benchmark; it is a training ground. By providing a continuous reward signal instead of a binary result, it allows models to learn from partial progress—perfect for RL-based fine-tuning.


How it works

Every merged bugfix commit is a labeled training example: the broken state before the fix, the corrected state after, and a test suite that distinguishes them. RepoGym automates the extraction of these pairs and turns them into a structured environment any agent can interact with over HTTP.
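
The mining step can be sketched in a few lines of Python (illustrative only; RepoGym's actual extractor is written in Rust, and the keyword list and `is_candidate` helper here are assumptions, not its real filter): a commit is a candidate task if its message reads like a bug fix and its diff touches test files, so the accompanying tests can distinguish the broken state from the patched one.

```python
import subprocess

# Assumed heuristic keywords, not RepoGym's real commit filter.
FIX_KEYWORDS = ("fix", "bug", "regression")

def is_candidate(subject: str, files: list[str]) -> bool:
    """True if a commit message looks like a bug fix and its diff touches tests."""
    looks_like_fix = any(kw in subject.lower() for kw in FIX_KEYWORDS)
    touches_tests = any("test" in f.lower() for f in files)
    return looks_like_fix and touches_tests

def candidate_commits(repo_path: str) -> list[str]:
    """Scan git history and return SHAs of candidate bug-fix commits."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%H\t%s"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    shas = []
    for line in log:
        sha, _, subject = line.partition("\t")
        # List the files this commit modified.
        files = subprocess.run(
            ["git", "-C", repo_path, "show", "--name-only", "--pretty=", sha],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        if is_candidate(subject, files):
            shas.append(sha)
    return shas
```

A commit that passes this filter yields a (base commit, fix commit, test command) triple like the task objects shown in the API section below.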

graph TD
    A[Git History] --> B(Extract Bugfix Commits)
    B --> C{RepoGym Environment}
    C -->|Fetch Task| D[Autonomous Agent]
    D -->|Generate Patch| E[Docker Sandbox]
    E -->|Run Tests| F[Reward Signal]
    F -->|RL Feedback| D
    F -->|Log History| G[Live Dashboard]

The reward is simple and verifiable:

reward = score(patched) − score(broken)
       = tests_passed_after / total − tests_passed_before / total

A perfect patch scores +1.0. A patch that breaks things scores negative. No LLM judge, no rubric, no ambiguity.
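
That delta takes only a few lines to compute (a minimal sketch; the function names are illustrative, not RepoGym's internal API):

```python
def pass_fraction(passed: int, total: int) -> float:
    """Score of a repo state: the fraction of the test suite that passes."""
    return passed / total if total else 0.0

def reward(passed_before: int, passed_after: int, total: int) -> float:
    """Test-suite delta: positive for progress, negative for regressions."""
    return pass_fraction(passed_after, total) - pass_fraction(passed_before, total)
```

For example, if 200 of 206 tests passed before the patch and all 206 pass after, the reward is +6/206, roughly +0.029; a patch that drops the passing count to 190 scores about -0.049.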


Quick start

# 1. build and start the server
cargo build --release
cargo run -- serve --port 8080

# 2. get a task
curl http://localhost:8080/task

# 3. submit a patch
curl -X POST http://localhost:8080/submit \
  -H "Content-Type: application/json" \
  -d '{"task_id": 1, "agent_id": "my-agent", "patch": "<unified diff>"}'

# 4. view the dashboard
open http://localhost:8080

Dashboard

RepoGym Dashboard

The dashboard is the scoreboard for training. Think of it as a gym for AI models: they "lift weights" (fix code) and get scored on how well they did.

1. The Core Metrics

  • 20,376 Verified Tasks: our library of real bugs from Django, Flask, and Requests. These are not synthetic problems; they are real mistakes that human developers fixed in the past.
  • Average Reward (0.000): the current baseline across all active agents. As you train your model, this number should turn green and move positive.

2. Live Activity Feed (The "Plays")

A log of every single attempt an AI makes to fix a bug:

  • Score (e.g., 1.000): the fraction of the test suite passing after the agent's patch. 1.000 is a perfect fix.
  • Delta (e.g., +0.008): the reward signal. It tells the agent: "you made the code 0.8% more correct than it was before."
  • Green (+): the patch fixed a test that was previously failing. This is how the agent learns.
  • Red (-): the "fix" broke something that was previously working (a regression).

3. Leaderboard

Ranks AIs based on their overall success. Currently, asbestos-local is at the baseline, ready for reinforcement learning to drive its score up.


Why this matters

To train an AI using Reinforcement Learning (RL), the agent needs a "Reward" to know if it's doing a good job. Usually, this requires a human to check the work.

RepoGym does it automatically. By using Docker to run real tests and returning the reward signal instantly, it allows an AI to "practice" coding 24/7 without human supervision.
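
The sandbox step can be pictured like this (an illustrative sketch using the Docker CLI and the `python:3.11-slim` image from the setup section; RepoGym's actual sandbox is implemented in Rust, and a real run would also install dependencies and enforce resource limits):

```python
import re
import subprocess

def parse_pytest_summary(output: str) -> tuple[int, int]:
    """Extract (passed, failed) counts from a pytest summary line."""
    passed = sum(int(n) for n in re.findall(r"(\d+) passed", output))
    failed = sum(int(n) for n in re.findall(r"(\d+) failed", output))
    return passed, failed

def run_tests_sandboxed(repo_dir: str, test_command: str = "pytest tests/") -> tuple[int, int]:
    """Run the suite inside an isolated container and return (passed, failed)."""
    proc = subprocess.run(
        ["docker", "run", "--rm", "--network=none",   # no network access
         "-v", f"{repo_dir}:/work", "-w", "/work",    # mount the checkout
         "python:3.11-slim", "sh", "-c", test_command],
        capture_output=True, text=True, timeout=600,
    )
    return parse_pytest_summary(proc.stdout)
```

Running this once against the broken checkout and once against the patched one gives the two scores whose difference is the reward.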

Try this: Click the VIEW button on any row in the activity feed. It shows exactly what the agent was trying to fix and the "cheat sheet" (the actual human fix) for that bug.


Setup

# pull the sandbox image
docker pull python:3.11-slim

# build the CLI
cargo build --release

Scaling the dataset

RepoGym is designed to ingest from any Git repository. More repos means more tasks.

# ingest repos
cargo run -- ingest https://github.com/psf/requests --dest ./repos/requests
cargo run -- ingest https://github.com/pallets/flask --dest ./repos/flask
cargo run -- ingest https://github.com/fastapi/fastapi --dest ./repos/fastapi
cargo run -- ingest https://github.com/encode/httpx --dest ./repos/httpx
cargo run -- ingest https://github.com/pytest-dev/pytest --dest ./repos/pytest

# extract and save tasks
cargo run -- save-tasks ./repos/requests requests
cargo run -- save-tasks ./repos/flask flask
cargo run -- save-tasks ./repos/fastapi fastapi

Current dataset: 20,376 verified tasks across psf/requests, pallets/flask, and django/django.


API

| Method | Route | Description |
|---|---|---|
| GET | /task | Returns a random task. Add ?repo=requests to filter by repo. |
| POST | /submit | Accepts a patch, runs the sandbox, returns the reward signal. |
| GET | /leaderboard | Agent rankings by average score. |
| GET | / | Live dashboard. |

Task response

{
  "id": 4401,
  "repo": "requests",
  "description": "fix models.py encoding fallback",
  "base_commit": "af52bf79",
  "fix_commit": "6d3d5e59",
  "files_modified": ["requests/models.py", "tests/test_requests.py"],
  "test_command": "pytest tests/"
}

Submit request

{
  "task_id": 4401,
  "agent_id": "claude-opus-4-5",
  "patch": "--- a/requests/models.py\n+++ b/requests/models.py\n..."
}

Reward response

{
  "score": 1.000,
  "delta": 0.005,
  "tests_passed": 206,
  "tests_failed": 0
}

Agent integration

RepoGym exposes a clean loop any agent can run:

fetch task → read broken code → generate patch → submit → log reward → repeat

Example Python agent using the Anthropic API:

import httpx
import anthropic

REPOGYM = "http://localhost:8080"
client = anthropic.Anthropic()

def run_agent(n=10):
    for _ in range(n):
        task = httpx.get(f"{REPOGYM}/task?repo=requests").json()

        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Fix this bug as a unified diff patch.

Repo: {task['repo']}
Description: {task['description']}
Files: {', '.join(task['files_modified'])}
Base commit: {task['base_commit']}

Output only the patch."""
            }]
        )

        result = httpx.post(f"{REPOGYM}/submit", json={
            "task_id": task["id"],
            "agent_id": "claude-opus-4-5",
            "patch": response.content[0].text
        }).json()

        print(f"{task['description']}")
        print(f"score {result['score']:.3f}  delta {result['delta']:+.3f}")

if __name__ == "__main__":
    run_agent()

Project structure

src/
  core/       git ingestion, commit filtering, task generation
  sandbox/    docker execution, patch application, test running
  db/         sqlite persistence, task and submission storage
  api/        axum REST server, dashboard, leaderboard


License

MIT


Citation

@software{asante2026repogym,
  author = {Asante, Jeffrey},
  title = {RepoGym: An Automated RL Environment for AI Coding Agents},
  year = {2026},
  url = {https://github.com/jeffasante/repogym}
}