A benchmark for evaluating AI coding agents on concurrency bug tasks.
There are two ways to run Spaghetti Bench:
Pull the pre-built Docker image:
docker pull vasumv/spaghetti-bench:latest
Or build from source using Nix:
nix build .#dockerImage
docker load < result
The Docker image includes:
- Python 3.13 with all dependencies
- OpenJDK 21
- Fray - JVM concurrency testing tool
- Standard Unix utilities (bash, grep, find, sed, awk, git, tmux)
Start the container:
docker run -it --rm \
-e LLM_API_KEY="your-api-key" \
vasumv/spaghetti-bench:latest \
bash
To run locally without Docker, the prerequisites are:
- Python 3.11+
- Java 21+
- Fray installed and available in PATH
Install Fray:
# Clone and build Fray
git clone https://github.com/cmu-pasta/fray.git
cd fray
./gradlew build
# Add to PATH (add to ~/.bashrc or ~/.zshrc)
export PATH=$PATH:/path/to/fray/build/install/fray/bin
Install Python dependencies:
# Using uv (recommended)
uv sync
# Or using pip
pip install -e .
Build JUnit runner:
cd helpers/junit-runner
./gradlew build
cd ../..
Set API key:
export LLM_API_KEY="your-api-key"
Clone the repository:
git clone https://github.com/cmu-pasta/spaghetti-bench.git
cd spaghetti-bench
Set your API key:
export LLM_API_KEY="your-api-key"
Run all tasks from a benchmark:
python src/concurrency_bench/run_agent.py \
--tasks-file src/concurrency_bench/sctbench.jsonl \
--task-type fix_bug \
--model-id bedrock/global.anthropic.claude-sonnet-4-5-20250929-v1:0
Run on real-world Kafka bugs:
python src/concurrency_bench/run_agent.py \
--tasks-file src/concurrency_bench/kafka.jsonl \
--task-type fix_bug \
--model-id bedrock/global.anthropic.claude-sonnet-4-5-20250929-v1:0
Run a single task:
python src/concurrency_bench/run_agent.py \
--tasks-file src/concurrency_bench/sctbench.jsonl \
--task-type fix_bug \
--model-id bedrock/global.anthropic.claude-sonnet-4-5-20250929-v1:0 \
--instance-id Reorder3Bad
Available task files:
- src/concurrency_bench/sctbench.jsonl - SCTBench synthetic bugs (28 tasks)
- src/concurrency_bench/kafka.jsonl - Apache Kafka bugs (11 tasks)
- src/concurrency_bench/all.jsonl - All tasks combined (39 tasks)
| Option | Required | Description |
|---|---|---|
| --tasks-file | Yes | Path to JSONL file containing tasks |
| --task-type | Yes | Task type: fix_bug |
| --model-id | Yes | Model ID (must be LiteLLM compatible) |
| --instance-id | No | Run only the specified task |
| --results-dir | No | Directory to save results (default: results/) |
| --enable-fray-tools | No | Give agent access to Fray rerun tool |
| --keep-result | No | Keep temporary workspace after completion |
| --repetition | No | Repetition/experiment ID for results path |
| --timeout | No | Agent execution timeout in seconds (default: 1200 = 20 minutes) |
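When scripting many runs, it can help to assemble the command line programmatically. The following is a minimal sketch and not part of Spaghetti Bench itself: `build_command` is a hypothetical helper, and it assumes that `--enable-fray-tools` is a boolean switch that takes no value.

```python
import shlex
import sys


def build_command(tasks_file, model_id, instance_id=None,
                  enable_fray_tools=False, timeout=None, repetition=None):
    """Assemble an argv list for run_agent.py.

    Hypothetical helper for illustration; it only uses flags listed in the
    options table above.
    """
    cmd = [
        sys.executable, "src/concurrency_bench/run_agent.py",
        "--tasks-file", str(tasks_file),
        "--task-type", "fix_bug",
        "--model-id", model_id,
    ]
    if instance_id is not None:
        cmd += ["--instance-id", instance_id]
    if enable_fray_tools:
        # Assumption: this option is a switch with no argument.
        cmd.append("--enable-fray-tools")
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if repetition is not None:
        cmd += ["--repetition", str(repetition)]
    return cmd


if __name__ == "__main__":
    # Print the command for a single task instead of executing it.
    print(shlex.join(build_command(
        "src/concurrency_bench/sctbench.jsonl",
        "bedrock/global.anthropic.claude-sonnet-4-5-20250929-v1:0",
        instance_id="Reorder3Bad",
    )))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.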
================================================================================
Running task: Reorder3Bad
Description: Memory ordering bug with concurrent reads and writes
Task type: fix_bug
Model: bedrock/global.anthropic.claude-sonnet-4-5-20250929-v1:0
================================================================================
Created workdir: /tmp/concurrency_bench_Reorder3Bad_abc123/
Copied benchmarks/SCTBench/cs/origin/Reorder3Bad.java
Starting agent...
...
Agent finished!
Verifying results...
Success: True
Saved conversation to: results/fix_bug/sctbench/Reorder3Bad.json
Results are saved in a structured format:
results/
└── {model_id}/
    └── {with_fray|without_fray}/
        └── {rep_id}/
            └── {task_type}/
                └── {benchmark_category}/
                    ├── {instance_id}.json
                    └── {instance_id}.patch
Each JSON file contains:
- Task metadata (instance_id, description, category)
- Model information
- Success/failure status
- Setup and verification output
- Full conversation event stream (messages, tool calls, responses)
Each .patch file contains a git diff of the changes made by the agent.
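To post-process a results directory, one option is to walk it and pair each JSON file with its patch. This is a minimal sketch: `collect_results` is a hypothetical helper, and it relies only on the file naming described above (a `.json` file with a sibling `.patch` file of the same stem). It searches recursively, so it does not depend on how many directory levels a model ID expands to.

```python
from pathlib import Path


def collect_results(results_dir="results"):
    """Yield (json_path, patch_path) pairs for every saved task result.

    patch_path is None when no sibling .patch file exists.
    """
    root = Path(results_dir)
    for json_path in sorted(root.rglob("*.json")):
        patch_path = json_path.with_suffix(".patch")
        yield json_path, (patch_path if patch_path.exists() else None)


if __name__ == "__main__":
    for json_path, patch_path in collect_results():
        status = "patch found" if patch_path else "no patch"
        print(f"{json_path}: {status}")
```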
View your local agent conversations interactively:
cd viz
python3 serve_traces.py
Then open http://localhost:8001 in your browser.
SCTBench is a suite of concurrency bugs translated to Java, located in benchmarks/SCTBench/:
- cs/origin/ - Original bugs (races, atomicity violations, deadlocks)
Real-world bugs from open-source projects:
- Apache Kafka - 11 concurrency bugs from the Kafka Streams library
  - The full repository is cloned at the bug-triggering commit
  - Tests run with Fray to systematically explore thread interleavings
- Tasks (src/concurrency_bench/tasks/)
  - FixBugTask: Identify and fix concurrency bugs
  - TriggerBugTask (WIP): Write test cases that reproduce bugs
  - Task loaders handle building and running benchmarks
- Agents (src/concurrency_bench/agents/)
  - FixBugAgent: Specialized in fixing concurrency issues
  - TriggerBugAgent (WIP): Specialized in creating reproducible test cases
  - Built on OpenHands Agent SDK
- Runner (src/concurrency_bench/run_agent.py)
  - Loads tasks from JSONL
  - Creates isolated workspace per task
  - Runs agent and verifies results
  - Saves full conversation data
Load Task → Create Workspace → Copy Files → Run Agent → Verify → Save Results → Cleanup
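The stages above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the project's actual runner: `run_task`, `run_agent`, and `verify` are hypothetical names, the task is assumed to be already loaded as a dict, and "Save Results" is reduced to returning a summary record.

```python
import shutil
import tempfile
from pathlib import Path


def run_task(task, source_files, run_agent, verify, keep_workspace=False):
    """Run one already-loaded task through the remaining pipeline stages.

    `run_agent` and `verify` are hypothetical callables that receive the
    workspace path; they stand in for the real agent and verification steps.
    """
    # Create Workspace: a fresh temporary directory per task.
    prefix = f"concurrency_bench_{task['instance_id']}_"
    workdir = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        # Copy Files: place the benchmark sources in the workspace.
        for src in source_files:
            shutil.copy(src, workdir)
        # Run Agent, then Verify.
        run_agent(workdir)
        success = bool(verify(workdir))
        # Save Results: this sketch only returns a summary record.
        return {"instance_id": task["instance_id"], "success": success}
    finally:
        # Cleanup: remove the workspace unless asked to keep it.
        if not keep_workspace:
            shutil.rmtree(workdir, ignore_errors=True)
```

Putting cleanup in a `finally` block means the workspace is removed even if the agent or the verification step raises.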
Tasks are defined in JSONL format (one JSON object per line):
SCTBench example:
{"instance_id": "Reorder3Bad", "path": "benchmarks/SCTBench/cs/origin/Reorder3Bad.java", "description": "Memory ordering bug", "benchmark_category": "sctbench", "subcategory": "cs/origin", "loader": "SCTBenchLoader"}
Real-world example:
{"instance_id": "Kafka_KAFKA-18418", "repo_url": "https://github.com/apache/kafka.git", "commit": "3.8.0", "test_class": "org.apache.kafka.streams.KafkaStreamsTest", "test_method": "shouldReturnFalseOnCloseWhenThreadsHaventTerminated", "description": "Race condition in shutdown", "benchmark_category": "real-world", "subcategory": "kafka", "loader": "KafkaLoader"}
Required fields:
- instance_id: Unique task identifier
- loader: Class name that handles build/run (e.g., SCTBenchLoader)
- benchmark_category: Category (e.g., sctbench, real-world)
- description: Human-readable description
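Before running a custom tasks file, it may be worth checking that every line parses as JSON and carries the required fields. The following is a minimal sketch: `validate_tasks` is a hypothetical helper that is not part of the repository, and it checks only for the four required field names listed above.

```python
import json

REQUIRED_FIELDS = ("instance_id", "loader", "benchmark_category", "description")


def validate_tasks(lines):
    """Return (line_number, problem) tuples for invalid task lines."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(task, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in task]
        if missing:
            problems.append((lineno, "missing fields: " + ", ".join(missing)))
    return problems
```

For example, `with open("your_tasks.jsonl") as fh: print(validate_tasks(fh))` prints an empty list when every task line is well-formed.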
- Add benchmark files to benchmarks/
- Create a task loader in src/concurrency_bench/tasks/loaders/. See kafka_loader.py as an example.
- Add task entries to a JSONL file
- Run with --tasks-file your_tasks.jsonl
Note: you must first verify that Fray is consistently able to find the bug for each new task. Otherwise, the setup step will fail and the task will not execute.
The Docker image is built from flake.nix. After making changes:
nix build .#dockerImage
docker load < result
@misc{spaghettibench2025,
title={Spaghetti Bench: Evaluating AI Agents on Concurrency Bug Fixes},
author={Vikram, Vasudev and Li, Ao and Padhye, Rohan},
year={2025},
url={https://github.com/cmu-pasta/spaghetti-bench}
}
This project is licensed under the MIT License.
Contributions are welcome! Please feel free to open an issue or submit a pull request.