This repository provides a compact sandbox for experimenting with a simple, understandable IR system. The aim is to enable an agent to iteratively optimize indexing and ranking strategies, evaluate each iteration with trec_eval, and preserve only the changes that improve overall retrieval effectiveness while avoiding significant performance regressions.
The current accepted leader, codex/search-rerank-span, improves MAP from 0.2080 to 0.2410 (+0.0330, +15.9%). It also raises P@5 from 0.4320 to 0.4720.
| Branch | Issue | MAP | MAP Δ | P@5 | P@20 | R-prec | bpref | recall | Index (s) | Search (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| original | - | 0.2080 | baseline | 0.4320 | 0.3660 | 0.2563 | 0.2880 | 0.5634 | 9.89 | 0.42 |
| codex/search-bm25-rsj | #1 | 0.2349 | +0.0269 | 0.4440 | 0.3910 | 0.2741 | 0.3036 | 0.5986 | 9.75 | 0.24 |
| codex/search-skip-metadata-fields | #6 | 0.2350 | +0.0001 | 0.4480 | 0.3920 | 0.2758 | 0.3040 | 0.5986 | 9.71 | 0.23 |
| codex/search-headline-boost | #10 | 0.2355 | +0.0005 | 0.4520 | 0.3910 | 0.2768 | 0.3046 | 0.6007 | 8.99 | 0.19 |
| codex/search-bm25-b-030 | #14 | 0.2365 | +0.0010 | 0.4600 | 0.3980 | 0.2801 | 0.3048 | 0.6016 | 8.83 | 0.20 |
| codex/search-prf | #23 | 0.2396 | +0.0031 | 0.4640 | 0.3960 | 0.2840 | 0.3071 | 0.6031 | 10.42 | 0.21 |
| codex/search-bm25-grid-search | #25 | 0.2402 | +0.0006 | 0.4680 | 0.3950 | 0.2826 | 0.3062 | 0.6029 | 10.61 | 0.22 |
| codex/search-rerank-span | #27 | 0.2410 | +0.0008 | 0.4720 | 0.3950 | 0.2826 | 0.3065 | 0.6029 | 10.03 | 0.24 |
Legend

- MAP: Mean Average Precision. A single overall ranking-quality score across all queries; higher is better.
- P@5 and P@20: How many of the top 5 or top 20 results are relevant. Higher means better early precision.
- R-prec: Precision after retrieving R results, where R is the number of relevant documents for that query. Higher is better.
- bpref: A relevance metric that is more tolerant of incomplete judgment sets. Higher is better.
- recall (num_rel_ret / num_rel): Fraction of all judged-relevant documents that were retrieved anywhere in the run. Higher is better.
- Index (s): Median wall-clock indexing time in seconds across benchmark runs; lower is better.
- Search (s): Median wall-clock search time in seconds for the full topics file across benchmark runs; lower is better.
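To make the headline metric concrete, here is a minimal Python sketch of how Average Precision and MAP are defined. It illustrates the standard textbook definition on toy data; it is not the trec_eval implementation, and the document IDs are invented for the example.

```python
def average_precision(ranked_docs, relevant):
    """AP for one query: mean of precision@k at each rank k holding a relevant doc."""
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_set) pairs, one per query."""
    return sum(average_precision(docs, rel) for docs, rel in runs) / len(runs)

# Toy example with two queries (doc IDs are hypothetical):
q1 = (["d1", "d2", "d3", "d4"], {"d1", "d3"})  # AP = (1/1 + 2/3) / 2 ≈ 0.8333
q2 = (["d5", "d6"], {"d6"})                    # AP = (1/2) / 1 = 0.5
print(round(mean_average_precision([q1, q2]), 4))  # prints 0.6667
```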
Generated files:

- `docs/metrics/branch-comparisons.tsv`
- `docs/metrics/branch-comparisons.md`
This project is inspired by two upstream efforts:
- karpathy/autoresearch, which frames software improvement as an autonomous experiment loop driven by branch-based iteration and measurable outcomes. That repository is MIT-licensed.
- andrewtrotman/JASSjr, which provides the minimal BM25 search engine foundation and the teaching-oriented WSJ/TREC setup that this repository adapts. JASSjr is BSD-2-Clause licensed and this repo keeps that upstream attribution in derived source files and includes the BSD-2-Clause text in LICENSE.txt.
The goal here is to bring the autonomous experiment-management ideas from autoresearch into information retrieval, and to further test the hypothesis that an agent can improve any system as long as it has a measurable objective.
This project is licensed under the MIT License, except for the JASSjr-derived code included in this repository, which remains under its own open-source license; see THIRD_PARTY_NOTICES.txt for details.
- experimenting with indexing logic in `index/JASSjr_index.go`
- experimenting with ranking and query processing in `search/JASSjr_search.go`
- validating end-to-end behavior with a tiny smoke fixture
- evaluating real retrieval quality on the WSJ/TREC setup
- benchmarking indexing and search time so quality gains do not come with unacceptable slowdowns
This repo is intentionally small so an automated agent can understand the full workflow and iterate quickly.
Use these commands in order:

```shell
git checkout main
git pull --ff-only
./tests/smoke.sh
./tools/eval_wsj.sh /absolute/path/to/your/wsj.xml
./tools/benchmark_wsj.sh /absolute/path/to/your/wsj.xml
./tools/update_metrics_dashboard.sh
```

What they do:
- `./tests/smoke.sh`: Runs a tiny fixture-based smoke evaluation to catch obvious breakage.
- `./tools/eval_wsj.sh`: Builds the search engine, runs the TREC topics, and records a timestamped trec_eval summary.
- `./tools/benchmark_wsj.sh`: Runs a few indexing and search benchmarks and records timestamped timings.
- `./tools/update_metrics_dashboard.sh`: Exports the README branch metrics and refreshes the metrics table embedded in this README.
- `./tools/compare_branch_to_main.sh <branch>`: Compares the latest evaluation and benchmark artifacts for a branch against the active baseline on `main`.
- `./tools/export_metrics_history.sh [branch]`: Exports a TSV time series from saved artifacts so MAP and benchmark medians can be reviewed over time.
- `./tools/export_branch_comparisons.sh`: Exports the latest compatible artifact from every non-main branch as a branch-vs-main comparison TSV.
Artifacts are grouped by the current git branch.
Evaluation summaries are written to `experiment_evaluations/<branch>/`.

Benchmark summaries are written to `experiment_benchmarks/<branch>/`.
This makes it easy to compare experiments branch by branch while keeping raw outputs out of git history. The experiment_evaluations/original/ and experiment_benchmarks/original/ folders are immutable initialization archives and must never be refreshed or overwritten.
Each new research loop should begin by refreshing the active baseline on main:

```shell
git checkout main
git pull --ff-only
./tests/smoke.sh
./tools/eval_wsj.sh /absolute/path/to/your/wsj.xml
./tools/benchmark_wsj.sh /absolute/path/to/your/wsj.xml
```

After that, create or update an experiment branch from the refreshed main baseline and require the branch to beat the newest main evaluation and benchmark artifacts before approving a PR.
To compare a branch against the current production baseline:

```shell
./tools/compare_branch_to_main.sh codex/search-my-idea
```

To export a history TSV for the production branch:

```shell
./tools/export_metrics_history.sh > main-metrics.tsv
```

To refresh the committed dashboard assets:

```shell
./tools/update_metrics_dashboard.sh
```

The raw history export from `tools/export_metrics_history.sh` is designed for simple analysis of:

- retrieval effectiveness over time, especially `map`
- benchmark medians over time for indexing and search
That TSV also includes enough metadata to filter or separate runs:
- `collection`, `topics`, and `qrels` for evaluation rows
- `collection`, `topics`, `smoke_topics`, and `iterations` for benchmark rows
That matters because you may occasionally record toy or verification runs alongside full WSJ/TREC runs. For production reporting, filter to the real WSJ collection and the standard 51-100 topics/qrels before comparing map or the benchmark medians.
A change is worth keeping only if:
- the smoke test still passes
- `trec_eval` improves overall retrieval effectiveness relative to the latest compatible evaluation on `main`
- indexing and search benchmarks do not show a serious regression relative to the latest compatible benchmark on `main`
In practice, `map` is the main headline metric, but `Rprec`, `P_10`, `bpref`, and `recip_rank` should also be watched.
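The acceptance criteria above can be expressed as a small decision function. This is a hedged sketch, not project policy: the 10% slowdown tolerance and the dictionary keys (`index_s`, `search_s`) are illustrative assumptions.

```python
def accept_change(smoke_passed, branch_map, main_map,
                  branch_times, main_times, max_slowdown=0.10):
    """Decide whether an experiment branch is worth keeping.

    branch_times/main_times: dicts of benchmark medians, e.g.
    {"index_s": 10.03, "search_s": 0.24}. max_slowdown is an example threshold.
    """
    if not smoke_passed:
        return False                       # the smoke test must still pass
    if branch_map <= main_map:
        return False                       # must improve headline MAP
    for key, baseline in main_times.items():
        if branch_times[key] > baseline * (1 + max_slowdown):
            return False                   # reject serious benchmark regressions
    return True

# Example using numbers from the metrics table above:
print(accept_change(True, 0.2410, 0.2402,
                    {"index_s": 10.03, "search_s": 0.24},
                    {"index_s": 10.61, "search_s": 0.22}))  # prints True
```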
- `index/`: Index construction logic.
- `search/`: Query evaluation and ranking logic.
- `tests/fixtures/`: Tiny smoke-test collection and toy qrels/topics.
- `tools/smoke_eval.sh`: Lightweight shell-based smoke evaluation.
- `tools/eval_wsj.sh`: Full WSJ/TREC evaluation with timestamped reports.
- `tools/benchmark.sh`: Generic timing helper used by benchmark scripts.
- `tools/benchmark_wsj.sh`: Branch-aware indexing and search benchmark runner.
- `tools/update_metrics_dashboard.sh`: Refreshes the committed metrics TSV, generated Markdown table, and README metrics section.
- `tools/compare_branch_to_main.sh`: Branch-vs-main artifact comparison helper.
- `tools/export_metrics_history.sh`: TSV exporter for long-run metric history and reporting.
- `tools/export_branch_comparisons.sh`: TSV exporter for the README branch metrics table.
- `tools/render_metrics_table.py`: Markdown table renderer for the committed README metrics section.
- `program.md`: The agent operating plan for autonomous search experiments.