  latency or cost.
* Feed the `ExperienceBuffer` into training dashboards or analytics systems to
  monitor adoption of new playbook bullets.

## CI Benchmarks

The repository includes a GitHub Actions workflow
(`.github/workflows/ace-benchmark.yml`) that runs the finance and agent
benchmarks under several configurations:

- Finance baseline vs ACE (with ground-truth feedback),
- Finance ACE with ground truth disabled (the reflector relies solely on
  execution cues),
- Agent baseline vs ACE on `benchmarks/agent_small.jsonl` at a higher generator
  temperature.

Each matrix entry runs in an isolated job, initialises a fresh SQLite schema,
and uploads its metrics JSON as an artifact (for example,
`ace-benchmark-finance-ace-no-gt`).
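
A matrix of that shape might look roughly like the sketch below. The job name, step layout, action versions, and the finance entry's settings are illustrative assumptions, not a copy of the real `ace-benchmark.yml`; only the dataset paths, mode, artifact-name pattern, and environment variables come from this document.

```yaml
# Sketch only: entry names, steps, and action versions are assumptions,
# not copied from .github/workflows/ace-benchmark.yml.
name: ACE Benchmark
on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
          - name: finance-ace-no-gt
            dataset: benchmarks/finance_subset.jsonl
            mode: ace_full
            use_ground_truth: "false"
          - name: agent-ace
            dataset: benchmarks/agent_small.jsonl
            mode: ace_full
            use_ground_truth: "true"
            temperature: "0.7"
    steps:
      - uses: actions/checkout@v4
      - run: |
          python scripts/run_benchmark.py ${{ matrix.dataset }} ${{ matrix.mode }} \
            --output results/${{ matrix.name }}.json
        env:
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
          ACE_BENCHMARK_TEMPERATURE: ${{ matrix.temperature }}
          ACE_BENCHMARK_USE_GROUND_TRUTH: ${{ matrix.use_ground_truth }}
      - uses: actions/upload-artifact@v4
        with:
          name: ace-benchmark-${{ matrix.name }}
          path: results/${{ matrix.name }}.json
```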

### Triggering the workflow

1. Store `OPENROUTER_API_KEY` (or another provider key) as a repository secret.
2. From the **Actions** tab, choose **ACE Benchmark → Run workflow** (manual) or
   rely on the automatic trigger for pushes to `main`.
3. After the run completes, download the artifacts. You’ll find:
   - `baseline_finance.json`, `ace_finance_gt.json`, `ace_finance_no_gt.json`,
   - `baseline_agent.json`, `ace_agent.json`.

### Environment knobs

`scripts/run_benchmark.py` respects the following environment variables (also
used by the workflow matrix):

- `ACE_BENCHMARK_TEMPERATURE` – overrides the generator temperature for both CoT
  and ReAct variants.
- `ACE_BENCHMARK_USE_GROUND_TRUTH` – set to `false`/`0`/`off` to withhold
  ground-truth answers from the reflector (accuracy is still evaluated against
  ground truth).
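
Since these knobs are plain environment variables, the truthy/falsy handling can be sketched in a few lines. The function names below are illustrative, not the script's actual API, and the real parsing in `scripts/run_benchmark.py` may differ:

```python
import os

# Values that disable ground truth, per the README: false / 0 / off.
_FALSY = {"false", "0", "off"}

def use_ground_truth(env=os.environ):
    """Interpret ACE_BENCHMARK_USE_GROUND_TRUTH; unset means enabled."""
    raw = env.get("ACE_BENCHMARK_USE_GROUND_TRUTH")
    if raw is None:
        return True
    return raw.strip().lower() not in _FALSY

def generator_temperature(default, env=os.environ):
    """Interpret ACE_BENCHMARK_TEMPERATURE; unset keeps the default."""
    raw = env.get("ACE_BENCHMARK_TEMPERATURE")
    return float(raw) if raw else default

print(use_ground_truth({"ACE_BENCHMARK_USE_GROUND_TRUTH": "off"}))  # False
print(generator_temperature(0.2, {"ACE_BENCHMARK_TEMPERATURE": "0.5"}))  # 0.5
```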

Example local invocation:

```bash
ACE_BENCHMARK_TEMPERATURE=0.5 \
ACE_BENCHMARK_USE_GROUND_TRUTH=false \
python scripts/run_benchmark.py benchmarks/finance_subset.jsonl ace_full \
  --output results/benchmark/ace_finance_no_gt.json
```

The resulting JSON files provide the raw evidence (accuracy, promotions,
increments, auto-format corrections) that mirrors the tables in the ACE paper.
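
Once downloaded, a baseline/ACE pair can be diffed with a few lines of Python. The inline payloads and the field names (`accuracy`, `promotions`) stand in for the real artifact files and are assumptions about the metrics schema, not guaranteed by `run_benchmark.py`:

```python
import json

# Illustrative payloads standing in for downloaded artifact files;
# field names are assumed, not taken from run_benchmark.py output.
baseline = json.loads('{"accuracy": 0.61, "promotions": 0}')
ace = json.loads('{"accuracy": 0.74, "promotions": 12}')

delta = ace["accuracy"] - baseline["accuracy"]
print(f"ACE accuracy delta vs baseline: {delta:+.2f}")
# prints: ACE accuracy delta vs baseline: +0.13
```

In practice you would replace `json.loads` with `json.load` over the artifact paths listed above.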