Commit 5fe641c

Expand benchmark workflow and documentation
1 parent a7354a5

6 files changed: +171 additions, -60 deletions

.github/workflows/ace-benchmark.yml

Lines changed: 44 additions & 9 deletions
@@ -8,8 +8,42 @@ on:
 
 jobs:
   run-benchmark:
-    name: Run ACE benchmark
+    name: Run ACE benchmark (${{ matrix.name }})
     runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - name: finance-baseline
+            dataset: benchmarks/finance_subset.jsonl
+            variant: baseline
+            output: results/benchmark/baseline_finance.json
+            use_gt: "true"
+            temperature: "0.5"
+          - name: finance-ace-gt
+            dataset: benchmarks/finance_subset.jsonl
+            variant: ace_full
+            output: results/benchmark/ace_finance_gt.json
+            use_gt: "true"
+            temperature: "0.5"
+          - name: finance-ace-no-gt
+            dataset: benchmarks/finance_subset.jsonl
+            variant: ace_full
+            output: results/benchmark/ace_finance_no_gt.json
+            use_gt: "false"
+            temperature: "0.5"
+          - name: agent-baseline
+            dataset: benchmarks/agent_small.jsonl
+            variant: baseline
+            output: results/benchmark/baseline_agent.json
+            use_gt: "true"
+            temperature: "0.8"
+          - name: agent-ace
+            dataset: benchmarks/agent_small.jsonl
+            variant: ace_full
+            output: results/benchmark/ace_agent.json
+            use_gt: "true"
+            temperature: "0.8"
     env:
       OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
       OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
@@ -43,15 +77,16 @@ jobs:
       - name: Ensure benchmark results directory exists
        run: mkdir -p results/benchmark
 
-      - name: Baseline run
-        run: python scripts/run_benchmark.py benchmarks/finance_subset.jsonl baseline --output results/benchmark/baseline_finance.json
-
-      - name: ACE run
-        run: python scripts/run_benchmark.py benchmarks/finance_subset.jsonl ace_full --output results/benchmark/ace_finance.json
+      - name: Run benchmark (${{ matrix.variant }} | ${{ matrix.dataset }})
+        env:
+          ACE_BENCHMARK_USE_GROUND_TRUTH: ${{ matrix.use_gt }}
+          ACE_BENCHMARK_TEMPERATURE: ${{ matrix.temperature }}
+        run: |
+          python scripts/run_benchmark.py ${{ matrix.dataset }} ${{ matrix.variant }} --output ${{ matrix.output }}
 
-      - name: Upload benchmark artifacts
+      - name: Upload benchmark artifact
         uses: actions/upload-artifact@v4
         with:
-          name: ace-benchmark-results
-          path: results/benchmark/*.json
+          name: ace-benchmark-${{ matrix.name }}
+          path: ${{ matrix.output }}
           if-no-files-found: error
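To make the matrix concrete, here is a hypothetical helper (not part of the repo) showing how one matrix entry expands into the environment and command the job executes; field names mirror the YAML above.

```python
# Hypothetical sketch: render the shell command a matrix entry produces.
def job_command(entry: dict) -> str:
    """Expand one workflow matrix entry into env vars plus the benchmark command."""
    env = (
        f"ACE_BENCHMARK_USE_GROUND_TRUTH={entry['use_gt']} "
        f"ACE_BENCHMARK_TEMPERATURE={entry['temperature']}"
    )
    cmd = (
        f"python scripts/run_benchmark.py {entry['dataset']} "
        f"{entry['variant']} --output {entry['output']}"
    )
    return f"{env} {cmd}"

# The finance-ace-no-gt entry from the matrix above.
finance_no_gt = {
    "dataset": "benchmarks/finance_subset.jsonl",
    "variant": "ace_full",
    "output": "results/benchmark/ace_finance_no_gt.json",
    "use_gt": "false",
    "temperature": "0.5",
}
print(job_command(finance_no_gt))
```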

README.md

Lines changed: 3 additions & 0 deletions
@@ -150,6 +150,9 @@ python scripts/run_benchmark.py benchmarks/finance_subset.jsonl ace_full --outpu
 
 # ACE vs baseline live loop comparison (ACE + EE harness)
 python benchmarks/run_live_loop_benchmark.py --backend dspy --episodes 10
+
+# Trigger the CI workflow (optional)
+gh workflow run ace-benchmark.yml
 ```
 
 Key metrics in the JSON output:

docs/combined_quickstart.md

Lines changed: 49 additions & 0 deletions
@@ -112,3 +112,52 @@ routes feedback back into the curator. Replace the dummy client with a real
   latency or cost.
 * Feed the `ExperienceBuffer` into training dashboards or analytics systems to
   monitor adoption of new playbook bullets.
+
+## CI Benchmarks
+
+The repository includes a GitHub Actions workflow
+(`.github/workflows/ace-benchmark.yml`) that runs the finance and agent
+benchmarks under several configurations:
+
+- Finance baseline vs ACE (with ground-truth feedback),
+- Finance ACE with ground-truth disabled (reflector relies solely on execution
+  cues),
+- Agent baseline vs ACE on `benchmarks/agent_small.jsonl` at a higher generator
+  temperature.
+
+Each matrix entry runs in an isolated job, initialises a fresh SQLite schema,
+and uploads its metrics JSON as an artifact (for example,
+`ace-benchmark-finance-ace-no-gt`).
+
+### Triggering the workflow
+
+1. Store `OPENROUTER_API_KEY` (or another provider key) as a repository secret.
+2. From the **Actions** tab, choose **ACE Benchmark → Run workflow** (manual) or
+   rely on the automatic trigger for pushes to `main`.
+3. After the run completes, download the artifacts. You’ll find:
+   - `baseline_finance.json`, `ace_finance_gt.json`,
+     `ace_finance_no_gt.json`,
+   - `baseline_agent.json`, `ace_agent.json`.
+
+### Environment knobs
+
+`scripts/run_benchmark.py` respects the following environment variables (also
+used by the workflow matrix):
+
+- `ACE_BENCHMARK_TEMPERATURE` – overrides the generator temperature for both CoT
+  and ReAct variants.
+- `ACE_BENCHMARK_USE_GROUND_TRUTH` – set to `false`/`0`/`off` to withhold
+  ground-truth answers from the reflector (accuracy is still evaluated against
+  ground truth).
+
+Example local invocation:
+
+```bash
+ACE_BENCHMARK_TEMPERATURE=0.5 \
+ACE_BENCHMARK_USE_GROUND_TRUTH=false \
+python scripts/run_benchmark.py benchmarks/finance_subset.jsonl ace_full \
+  --output results/benchmark/ace_finance_no_gt.json
+```
+
+The resulting JSON files provide the raw evidence (accuracy, promotions,
+increments, auto-format corrections) that mirrors the tables in the ACE paper.
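As an illustration of consuming those artifacts, here is a minimal sketch (not part of the repo) that computes headline numbers from a metrics file; the key names match the benchmark JSON output shown in this commit.

```python
import json

def summarise(raw: str) -> str:
    """Summarise one benchmark metrics JSON document as a single line."""
    m = json.loads(raw)
    accuracy = m["correct"] / m["total"]  # accuracy is evaluated against ground truth
    return (
        f"{m['variant']}: accuracy={accuracy:.1%} "
        f"promotions={m['promotions']} "
        f"format_corrections={len(m.get('format_corrections', []))}"
    )

# Toy payload with the same shape as results/benchmark/ace_finance.json.
sample = (
    '{"variant": "ace_full", "total": 26, "correct": 26,'
    ' "promotions": 18, "format_corrections": [{}, {}, {}]}'
)
print(summarise(sample))  # → ace_full: accuracy=100.0% promotions=18 format_corrections=3
```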

results/benchmark/ace_finance.json

Lines changed: 19 additions & 29 deletions
@@ -2,54 +2,44 @@
   "variant": "ace_full",
   "total": 26,
   "correct": 26,
-  "promotions": 10,
+  "promotions": 18,
   "quarantines": 0,
-  "new_bullets": 2,
-  "increments": 5,
+  "new_bullets": 4,
+  "increments": 1,
   "latency_ms": [],
   "failures": [],
-  "auto_corrections": [
+  "format_corrections": [
     {
-      "task_id": "fin-002",
-      "original_answer": "38%",
+      "task_id": "fin-006",
+      "original_answer": "The ROI for the investment is 30%.",
       "corrected_answer": "30%"
     },
+    {
+      "task_id": "fin-007",
+      "original_answer": "The net profit margin is 15%.",
+      "corrected_answer": "15%"
+    },
+    {
+      "task_id": "fin-026",
+      "original_answer": "The retention ratio is 0.75 (or 75%).",
+      "corrected_answer": "75%"
+    }
+  ],
+  "auto_corrections": [
     {
       "task_id": "fin-014",
       "original_answer": "19.13%",
       "corrected_answer": "19.11%"
     },
     {
       "task_id": "fin-019",
-      "original_answer": "38.00%",
+      "original_answer": "21.60%",
       "corrected_answer": "21.65%"
     },
     {
       "task_id": "fin-020",
       "original_answer": "188.77",
       "corrected_answer": "188.71"
-    },
-    {
-      "task_id": "fin-025",
-      "original_answer": "6.00%",
-      "corrected_answer": "8.40%"
-    }
-  ],
-  "format_corrections": [
-    {
-      "task_id": "fin-006",
-      "original_answer": "The ROI is (240 / 800) * 100 = 30%.",
-      "corrected_answer": "30%"
-    },
-    {
-      "task_id": "fin-007",
-      "original_answer": "The net profit margin is 15%.",
-      "corrected_answer": "15%"
-    },
-    {
-      "task_id": "fin-026",
-      "original_answer": "The retention ratio is 0.75 (or 75%).",
-      "corrected_answer": "75%"
     }
   ]
 }
Lines changed: 35 additions & 19 deletions
@@ -1,50 +1,66 @@
 {
   "variant": "baseline",
   "total": 26,
-  "correct": 26,
+  "correct": 24,
   "promotions": 0,
   "quarantines": 0,
-  "new_bullets": 5,
-  "increments": 83,
+  "new_bullets": 62,
+  "increments": 16,
   "latency_ms": [],
-  "failures": [],
-  "format_corrections": [
-    {
-      "task_id": "fin-006",
-      "original_answer": "The ROI is 30%.",
-      "corrected_answer": "30%"
-    },
+  "failures": [
     {
-      "task_id": "fin-007",
-      "original_answer": "The net profit margin is 15%.",
-      "corrected_answer": "15%"
+      "task_id": "fin-009",
+      "answer": "75.54",
+      "ground_truth": "78.25"
     },
     {
       "task_id": "fin-026",
-      "original_answer": "Retention Ratio = (1200 - 300) / 1200 = 900 / 1200 = 0.75 or 75%",
-      "corrected_answer": "75%"
+      "answer": "0.75",
+      "ground_truth": "75%"
     }
   ],
   "auto_corrections": [
+    {
+      "task_id": "fin-002",
+      "original_answer": "29%",
+      "corrected_answer": "30%"
+    },
     {
       "task_id": "fin-014",
-      "original_answer": "30.0%",
+      "original_answer": "19.13%",
       "corrected_answer": "19.11%"
     },
     {
       "task_id": "fin-019",
-      "original_answer": "19.66%",
+      "original_answer": "21.64%",
       "corrected_answer": "21.65%"
     },
+    {
+      "task_id": "fin-023",
+      "original_answer": "8.33%",
+      "corrected_answer": "9.38%"
+    },
     {
       "task_id": "fin-024",
-      "original_answer": "5247.77",
+      "original_answer": "3579.14",
       "corrected_answer": "3685.69"
     },
     {
       "task_id": "fin-025",
-      "original_answer": "9.60%",
+      "original_answer": "10.00%",
       "corrected_answer": "8.40%"
     }
+  ],
+  "format_corrections": [
+    {
+      "task_id": "fin-006",
+      "original_answer": "The ROI is 30%.",
+      "corrected_answer": "30%"
+    },
+    {
+      "task_id": "fin-007",
+      "original_answer": "The net profit margin is 15.0%.",
+      "corrected_answer": "15%"
+    }
   ]
 }

scripts/run_benchmark.py

Lines changed: 21 additions & 3 deletions
@@ -4,12 +4,12 @@
 
 import argparse
 import json
+import math
 import os
 import re
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Dict, List
-import math
 
 import dspy
 from dotenv import load_dotenv
@@ -177,15 +177,33 @@ def run_variant(tasks: List[Dict], variant: VariantConfig) -> Dict:
 
     model_name = configure_lm()
 
+    default_temperature = 0.2 if variant.enable_react else 0.7
+    temperature_override = os.getenv("ACE_BENCHMARK_TEMPERATURE")
+    if temperature_override:
+        try:
+            default_temperature = float(temperature_override)
+        except ValueError:
+            logger.warning("invalid_temperature_override", value=temperature_override)
+
     generator = (
-        create_react_generator(model=model_name) if variant.enable_react else create_cot_generator(model=model_name)
+        create_react_generator(model=model_name, temperature=default_temperature)
+        if variant.enable_react
+        else create_cot_generator(model=model_name, temperature=default_temperature)
     )
     reflector = GroundedReflector(model=model_name)
     curator = CuratorService()
     merge_coordinator = MergeCoordinator(curator) if variant.enable_merge_coordinator else None
     refinement_scheduler = None
     runtime_adapter = None
 
+    use_ground_truth_env = os.getenv("ACE_BENCHMARK_USE_GROUND_TRUTH")
+    use_ground_truth = True
+    if use_ground_truth_env is not None:
+        use_ground_truth = use_ground_truth_env.strip().lower() not in {"0", "false", "off", "no"}
+
+    metrics["generator_temperature"] = default_temperature
+    metrics["reflector_use_ground_truth"] = use_ground_truth
+
     with get_session() as session:
         stage_manager = StageManager(session)
         curator_service = curator
@@ -242,7 +260,7 @@ def run_variant(tasks: List[Dict], variant: VariantConfig) -> Dict:
                 answer=original_answer,
                 confidence=result.confidence,
                 bullets_referenced=result.bullets_referenced,
-                ground_truth=task.get("ground_truth"),
+                ground_truth=task.get("ground_truth") if use_ground_truth else None,
                 domain="benchmark",
             )
 
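The two environment overrides added to `scripts/run_benchmark.py` can be read in isolation as the following standalone sketch; function names are hypothetical, and the structured logging is reduced to a comment.

```python
import os

# Values that disable ground-truth feedback, mirroring the set in the diff.
FALSY = {"0", "false", "off", "no"}

def resolve_temperature(default: float) -> float:
    """Return ACE_BENCHMARK_TEMPERATURE if it parses as a float, else the default."""
    raw = os.getenv("ACE_BENCHMARK_TEMPERATURE")
    if raw:
        try:
            return float(raw)
        except ValueError:
            pass  # the real script logs a warning and keeps the default
    return default

def resolve_use_ground_truth() -> bool:
    """Unset means True; only an explicitly falsy value disables ground truth."""
    raw = os.getenv("ACE_BENCHMARK_USE_GROUND_TRUTH")
    if raw is None:
        return True
    return raw.strip().lower() not in FALSY
```

Note the asymmetry: an unparsable temperature silently falls back to the variant default, while any non-falsy string (even a typo) leaves ground truth enabled.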
