Commit 45a8058

tompakeman-oai, minh-hoque, and Copilot authored
Tompakeman/realtime evals/types tooling (#2471)
Co-authored-by: Minhajul Hoque <84698472+minh-hoque@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: minh-hoque <minh.hoque@gmail.com>
1 parent 65df55a commit 45a8058

31 files changed: +7632 -322 lines changed

AGENTS.md

Lines changed: 5 additions & 0 deletions
@@ -37,3 +37,8 @@ These are considered priority 0 issues for this repo, in addition to the normal
 - Check that code identifiers remain descriptive (no leftover placeholder names) and that repeated values are factored into constants when practical.
 - Ensure notebooks or scripts document any required environment variables instead of hard-coding secrets or keys.
 - Confirm metadata files (`registry.yaml`, `authors.yaml`) stay in sync with new or relocated content.
+
+## Recent Learnings
+
+- **Realtime eval shared imports can resolve the wrong module under pytest** -> Add `shared/__init__.py` and ensure tests prepend `examples/evals/realtime_evals` to `sys.path` before importing `shared.*` -> Prevents collection failures caused by unrelated installed packages named `shared`.
+- **Run-level grades can be overweighted by long simulations** -> Store turn-level grades on the matching turn and trace-level grades on one row per simulation instead of copying them onto every row -> Keeps `results.csv` row semantics intact and prevents summary means from favoring longer conversations.
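
Review note: the second learning above is easier to see with numbers. The sketch below is a minimal illustration, not the harness's code; the column names `simulation_id`, `turn_index`, and `trace_grade` and the two toy simulations are assumptions made for the example. It compares the mean when a trace-level grade is copied onto every turn row against the mean when it is stored on one row per simulation.

```python
from statistics import mean

# Hypothetical data: simulation "a" ran 4 turns, simulation "b" ran 1 turn.
# "simulation_id", "turn_index", and "trace_grade" are assumed column names,
# not the harness's actual results.csv schema.
TRACE_GRADES = {"a": 1.0, "b": 0.0}
TURN_COUNTS = {"a": 4, "b": 1}


def build_rows(copy_to_every_row: bool) -> list[dict]:
    """Build one row per turn, placing the trace-level grade per the chosen policy."""
    rows = []
    for simulation_id, turns in TURN_COUNTS.items():
        for turn_index in range(turns):
            # Either copy the grade everywhere, or keep it only on the first row.
            keep_grade = copy_to_every_row or turn_index == 0
            rows.append(
                {
                    "simulation_id": simulation_id,
                    "turn_index": turn_index,
                    "trace_grade": TRACE_GRADES[simulation_id] if keep_grade else None,
                }
            )
    return rows


def mean_grade(rows: list[dict]) -> float:
    """Average the grade over the rows that actually carry one."""
    return mean(row["trace_grade"] for row in rows if row["trace_grade"] is not None)


copied_mean = mean_grade(build_rows(copy_to_every_row=True))   # longer run counted 4 times
single_mean = mean_grade(build_rows(copy_to_every_row=False))  # one grade per simulation
print(copied_mean, single_mean)
```

With the copied layout the four-turn simulation contributes four of the five graded rows, so the mean drifts toward its grade; with one graded row per simulation each conversation counts once.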
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# Python bytecode and caches
+__pycache__/
+*.py[cod]
+*.pyo
+
+# Virtual environments
+.venv/
+venv/
+env/
+ENV/
+
+# Python tooling caches
+.mypy_cache/
+.pytest_cache/
+.ruff_cache/
+.coverage
+htmlcov/
+
+# Jupyter
+.ipynb_checkpoints/
+
+# Local Python version management
+.python-version
+
+# uv local cache
+.uv-cache/
+
+# Harness outputs
+*_harness/results/
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+UV_BIN := $(strip $(shell command -v uv 2>/dev/null))
+PYTHON_BIN := $(strip $(shell command -v python 2>/dev/null || command -v python3 2>/dev/null))
+VENV_DIR := .venv
+VENV_PYTHON := $(VENV_DIR)/bin/python
+
+ifeq ($(UV_BIN),)
+RUFF := $(VENV_DIR)/bin/ruff
+MYPY := $(VENV_DIR)/bin/mypy
+PYTEST := $(VENV_DIR)/bin/pytest
+else
+RUFF := uv run --with ruff ruff
+MYPY := uv run --with mypy --with pandas-stubs --with types-seaborn --with types-tqdm mypy
+PYTEST := uv run --with pytest pytest
+endif
+
+.PHONY: install format lint lint-fix typecheck test
+
+install:
+ifeq ($(UV_BIN),)
+	$(PYTHON_BIN) -m venv $(VENV_DIR)
+	$(VENV_PYTHON) -m pip install -r requirements.txt -r requirements-dev.txt
+else
+	uv venv $(VENV_DIR)
+	uv sync --group dev
+endif
+
+format:
+	$(RUFF) format .
+
+lint:
+	$(RUFF) check .
+
+lint-fix:
+	$(RUFF) check . --fix
+
+typecheck:
+	$(MYPY) --explicit-package-bases .
+
+test:
+	$(PYTEST)
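
Review note: the `test` target is where the AGENTS.md learning about `shared` imports applies, since pytest may otherwise pick up an unrelated installed package named `shared`. The sketch below shows one way a test module or `conftest.py` could prepend the harness root to `sys.path` before importing `shared.*`. The helper name `prepend_to_sys_path` and the relative path (which assumes pytest runs from the repository root) are illustrative assumptions, not code from this commit.

```python
import sys
from pathlib import Path


def prepend_to_sys_path(package_root: Path) -> str:
    """Move package_root to the front of sys.path so local packages win lookups."""
    root = str(package_root.resolve())
    # Drop any later occurrence first so the entry ends up exactly once, at index 0.
    if root in sys.path:
        sys.path.remove(root)
    sys.path.insert(0, root)
    return root


# Assumption: tests are launched from the repository root, so this relative
# path points at the directory that contains the local `shared/` package.
REALTIME_EVALS_ROOT = Path("examples/evals/realtime_evals")
prepend_to_sys_path(REALTIME_EVALS_ROOT)

# Only after the path is adjusted would a test import from the local package,
# for example: `from shared import result_types`.
```

This only affects imports that happen after the call; if another `shared` module was already imported earlier in the session, it stays cached in `sys.modules`.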

examples/evals/realtime_evals/README.md

Lines changed: 34 additions & 11 deletions
@@ -11,20 +11,33 @@ Depending on your realtime eval maturity, point Codex (or your preferred coding
 
 ## Quickstart
 
-Python 3.9+ required.
+Python 3.12+ required.
 
 ```bash
-pip install -r requirements.txt
+make install
+source .venv/bin/activate
 export OPENAI_API_KEY="your_api_key"
 ```
 
-Run a first command per harness:
+`make install` creates the local `.venv` and installs both runtime and dev dependencies. It uses `uv` when available and otherwise falls back to `python -m venv` plus `pip install -r requirements.txt -r requirements-dev.txt`.
 
-- Crawl: `python crawl_harness/run_realtime_evals.py`
+Run a first command per harness. If uv is not installed, replace `uv run` with `python` and run these scripts with your `.venv` activated:
+
+- Crawl: `uv run python crawl_harness/run_realtime_evals.py`
 - Walk: install ffmpeg (`brew install ffmpeg`), then:
-- `python walk_harness/generate_audio.py`
-- `python walk_harness/run_realtime_evals.py`
-- Run: `python run_harness/run_realtime_evals.py --max-examples 1`
+- `uv run python walk_harness/generate_audio.py`
+- `uv run python walk_harness/run_realtime_evals.py`
+- Run: `uv run python run_harness/run_realtime_evals.py --max-examples 1`
+
+## Dev commands
+Use the root `Makefile` for common checks. Run `make install` first to create `.venv`. These targets work with or without `uv`: when `uv` is installed they run through `uv run`, and otherwise they use the matching tool binaries from the local `.venv`.
+
+- `make install`
+- `make format`
+- `make lint`
+- `make lint-fix`
+- `make typecheck`
+- `make test`
 
 ## [Crawl (synthetic single-turn)](./crawl_harness)
 
@@ -38,7 +51,7 @@ Run a first command per harness:
 Run:
 
 ```
-python crawl_harness/run_realtime_evals.py
+uv run python crawl_harness/run_realtime_evals.py
 ```
 
 ## [Walk (saved audio replay)](./walk_harness)
@@ -53,8 +66,8 @@ python crawl_harness/run_realtime_evals.py
 Run:
 
 ```
-python walk_harness/generate_audio.py
-python walk_harness/run_realtime_evals.py
+uv run python walk_harness/generate_audio.py
+uv run python walk_harness/run_realtime_evals.py
 ```
 
 ## [Run (multi-turn simulation)](./run_harness)
@@ -69,7 +82,7 @@ python walk_harness/run_realtime_evals.py
 Run:
 
 ```
-python run_harness/run_realtime_evals.py
+uv run python run_harness/run_realtime_evals.py
 ```
 
 ## Results layout
@@ -78,8 +91,17 @@ Each harness writes into its own `results/` folder:
 
 - `results/<run_id>/results.csv`: per-example outputs and grades
 - `results/<run_id>/summary.json`: aggregate metrics
+- `results/<run_id>/plots/*.png`: warm-editorial summary charts for scores, latency, tokens, and run shape
 - `results/<run_id>/events/*.jsonl`: full realtime event stream per datapoint
 
+The shared typed representation of these artifacts for the crawl, walk, and run harnesses lives in `shared/result_types.py`.
+
+To render charts for an existing run after the fact:
+
+```bash
+uv run python plot_eval_results.py --run-dir run_harness/results/<run_id>
+```
+
 ## Common CLI flags
 
 All harnesses share a core set of flags so you can switch between them easily:
@@ -88,5 +110,6 @@ All harnesses share a core set of flags so you can switch between them easily:
 - `--model` (assistant under test), `--system-prompt-file`, `--tools-file`
 - `--chunk-ms`, `--sample-rate-hz`, `--real-time`
 - `--max-examples` (quick smoke tests)
+- `--skip-plots` (skip post-run PNG chart generation)
 
 Run harness adds multi-turn and simulator-specific flags (see its README).
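
Review note: the results layout described in this README can be read back with the standard library alone. The sketch below is a minimal illustration under stated assumptions: the function name `load_run` and the returned dictionary keys are hypothetical, and it uses plain dicts rather than the typed representation in `shared/result_types.py`, whose contents are not shown in this diff.

```python
import csv
import json
from pathlib import Path


def load_run(run_dir: Path) -> dict:
    """Read the artifacts of one results/<run_id>/ folder into plain Python objects."""
    # summary.json holds the aggregate metrics for the run.
    summary = json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))

    # results.csv holds per-example outputs and grades; values stay as strings.
    with (run_dir / "results.csv").open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    def list_names(subdir: str, pattern: str) -> list[str]:
        """List matching file names, tolerating a missing folder (e.g. --skip-plots)."""
        folder = run_dir / subdir
        return sorted(path.name for path in folder.glob(pattern)) if folder.is_dir() else []

    return {
        "summary": summary,
        "rows": rows,
        "plots": list_names("plots", "*.png"),
        "events": list_names("events", "*.jsonl"),
    }


# Example call, with <run_id> left as a placeholder for a real run folder:
# artifacts = load_run(Path("run_harness/results") / "<run_id>")
```

The `plots` list is empty when the folder is absent, which is what a run started with `--skip-plots` would be expected to produce.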

examples/evals/realtime_evals/crawl_harness/README.md

Lines changed: 2 additions & 0 deletions
@@ -28,6 +28,7 @@ Each run writes:
 
 - `results.csv`: Per-example outputs and grades.
 - `summary.json`: Aggregate metrics (latency + correctness).
+- `plots/*.png`: Styled score, latency, token, and status charts.
 - `audio/<example_id>/input.wav`: The TTS input audio.
 - `audio/<example_id>/output.wav`: The model output audio (if any).
 - `events/<example_id>.jsonl`: Realtime event stream for the datapoint.
@@ -66,6 +67,7 @@ Common options:
 - `--output-audio-format`: Output audio format (default `pcm16`).
 - `--real-time`: Stream audio in real-time cadence.
 - `--max-examples`: Limit number of examples for quick checks.
+- `--skip-plots`: Skip post-run PNG chart generation.
 
 ## Adapt
 