# Benchmarks (agentjson)

This document explains what `benchmarks/bench.py` measures and **why** the suites are structured the way they are, based on the Slack thread:

- “LLM outputs sometimes break JSON or wrap it (‘json입니다~ …’, markdown fences, etc.)”
- “Huge JSON (GB-scale root arrays / big data) might want parallel parsing”
- “A probabilistic parser that can return multiple candidates (Top‑K) is valuable”

> TL;DR: `agentjson` is not trying to beat `orjson` on valid JSON microbenchmarks. The primary win is **success rate** and **drop-in adoption** in LLM pipelines.

## How to run

Because `agentjson` ships a top-level `orjson` shim, you must use **two separate environments** if you want to compare with the real `orjson` package:

```bash
# Env A: real orjson
python -m venv .venv-orjson
source .venv-orjson/bin/activate
python -m pip install orjson ujson
python benchmarks/bench.py

# Env B: agentjson (includes the shim)
python -m venv .venv-agentjson
source .venv-agentjson/bin/activate
python -m pip install agentjson ujson
python benchmarks/bench.py
```

Tune run sizes with env vars:

```bash
BENCH_MICRO_NUMBER=20000 BENCH_MICRO_REPEAT=5 \
BENCH_MESSY_NUMBER=2000 BENCH_MESSY_REPEAT=5 \
BENCH_TOPK_NUMBER=500 BENCH_TOPK_REPEAT=5 \
BENCH_LARGE_MB=5,20 BENCH_LARGE_NUMBER=3 BENCH_LARGE_REPEAT=3 \
BENCH_NESTED_MB=5,20 BENCH_NESTED_NUMBER=1 BENCH_NESTED_REPEAT=3 \
BENCH_NESTED_FORCE_PARALLEL=0 \
BENCH_CLI_MMAP_MB=512 \
python benchmarks/bench.py
```

### Mapping suites to PRs (006 / 101 / 102)

- **PR‑006 (CLI mmap)**: run `cli_mmap_suite` by setting `BENCH_CLI_MMAP_MB`.
  - This runs the Rust CLI twice on the same large file: default mmap vs `--no-mmap`.
  - It uses `/usr/bin/time` when available to capture max RSS.
- **PR‑101 (parallel delimiter indexer)**: use `large_root_array_suite` and increase `BENCH_LARGE_MB` (e.g. `200,1000`) to find the crossover where parallel indexing starts paying off.
- **PR‑102 (nested huge value / corpus)**: use `nested_corpus_suite` to benchmark `scale_target_keys=["corpus"]` with `allow_parallel` on/off.

## Suite 1 — LLM messy JSON suite (primary)

### What it tests

Real LLM outputs often include things that break strict JSON parsers:

- Markdown fences: ```` ```json ... ``` ````
- Prefix/suffix junk: `"json 입니다~ {...} 감사합니다"` (Korean filler, roughly “here’s the JSON … thanks”)
- “Almost JSON”: single quotes, unquoted keys, trailing commas
- Python literals: `True/False/None`
- Missing commas
- Smart quotes: `“ ”`
- Missing closing brackets/quotes

The benchmark uses a fixed set of 10 cases (see `LLM_MESSY_CASES` in `benchmarks/bench.py`).

### Metrics

- `success`: no exception was raised
- `correct`: parsed value equals the expected Python object
- `best time / case`: the best observed per-case attempt time (lower is better)

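A minimal sketch of how one case can be scored against these metrics (the helper name and return shape are illustrative, not the actual `bench.py` code, which runs number/repeat loops per case):

```python
# Hypothetical per-case scorer: success = no exception, correct = equals the
# expected object, time = one timed attempt (bench.py keeps the best of many).
import time

def score_case(loads, raw_text, expected):
    start = time.perf_counter()
    try:
        value = loads(raw_text)
    except Exception:
        return {"success": False, "correct": False, "time_s": None}
    elapsed = time.perf_counter() - start
    return {"success": True, "correct": value == expected, "time_s": elapsed}
```
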
### Concrete example (drop-in impact)

**Input (not strict JSON):**

````text
preface```json
{"a":1}
```suffix
````

Strict parsers:

```python
import json
import orjson  # real orjson

json.loads('preface```json\n{"a":1}\n```suffix')    # -> JSONDecodeError
orjson.loads('preface```json\n{"a":1}\n```suffix')  # -> JSONDecodeError
```

With `agentjson` as an `orjson` drop-in (same call site):

```python
import os
import orjson  # agentjson shim

os.environ["JSONPROB_ORJSON_MODE"] = "auto"
orjson.loads('preface```json\n{"a":1}\n```suffix')  # -> {"a": 1}
```

This is the core PR pitch: **don’t change code**, just switch the package and flip a mode when needed.

### Example results (2025-12-13, Python 3.12.0, macOS 14.1 arm64)

| Library / mode | Success | Correct | Best time / case |
|---|---:|---:|---:|
| `json` (strict) | 0/10 | 0/10 | n/a |
| `ujson` (strict) | 0/10 | 0/10 | n/a |
| `orjson` (strict, real) | 0/10 | 0/10 | n/a |
| `orjson` (auto, agentjson shim) | 10/10 | 10/10 | 45.9 µs |
| `agentjson.parse(mode=auto)` | 10/10 | 10/10 | 39.8 µs |
| `agentjson.parse(mode=probabilistic)` | 10/10 | 10/10 | 39.7 µs |

## Suite 2 — Top‑K repair suite (secondary)

### What it tests

When input is ambiguous, a “repair” might have multiple plausible interpretations.

`agentjson` can return multiple candidates (`top_k`) with confidence scores, so downstream code can (see the sketch after this list):

- validate with schema/business rules,
- pick the best candidate,
- or decide to re-ask an LLM only when confidence is low.
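
A minimal sketch of that consumer-side flow. The `top_k=` parameter, the `.candidates` list, and the `.value`/`.confidence` attributes are assumptions about the API shape; only the concept (multiple candidates with confidence scores) is documented here:

```python
# Hypothetical Top-K consumer; attribute and parameter names are assumed.
import agentjson

def best_valid(text, validate, k=5, min_confidence=0.5):
    result = agentjson.parse(text, mode="probabilistic", top_k=k)
    for cand in result.candidates:
        if cand.confidence >= min_confidence and validate(cand.value):
            return cand.value
    return None  # nothing confident enough -> caller can re-ask the LLM
```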

### Concrete example (why Top‑K matters)

**Input (ambiguous suffix):**

```text
{"a":1,"b":2,"c":3, nonsense nonsense
```

A plausible “best effort” interpretation is to **truncate** the garbage suffix and return:

```python
{"a": 1, "b": 2, "c": 3}
```

But another plausible interpretation is to treat `nonsense` as a missing key/value pair and produce something else.

In the benchmark run, this case shows up as:

- the **Top‑1** candidate misses (it is not the expected value),
- but the **Top‑K** check (K=5) hits: the expected value is present in the candidate list.

### Example results (2025-12-13, Python 3.12.0, macOS 14.1 arm64)

| Metric | Value |
|---|---:|
| Top‑1 hit rate | 7/8 |
| Top‑K hit rate (K=5) | 8/8 |
| Avg candidates returned | 1.25 |
| Avg best confidence | 0.57 |
| Best time / case | 92.7 µs |

## Suite 3 — Large root-array parsing (big data angle)

### What it tests

This suite generates a **single large root array** like:

```json
[{"id":0,"value":"test"},{"id":0,"value":"test"}, ...]
```

and measures how long `loads(...)` takes for sizes like 5 MB and 20 MB.
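
A minimal sketch of how such a payload can be generated to a rough size target (it mirrors the shape above; `bench.py`'s actual generator may differ in details):

```python
# Hypothetical generator for an N-megabyte root array of small objects.
import json

def make_root_array(target_mb: int) -> str:
    item_len = len(json.dumps({"id": 0, "value": "test"})) + 1  # +1 for the comma
    count = (target_mb * 1024 * 1024) // item_len
    return json.dumps([{"id": i, "value": "test"} for i in range(count)])

payload = make_root_array(5)  # roughly 5 MB of JSON text
```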

For comparing `json/ujson/orjson`, use **Env A (real orjson)**. In Env B, `import orjson` is the shim.

### Example results (Env A: real `orjson`, 2025-12-13)

| Library | 5 MB | 20 MB |
|---|---:|---:|
| `json.loads(str)` | 52.3 ms | 209.6 ms |
| `ujson.loads(str)` | 42.2 ms | 176.1 ms |
| `orjson.loads(bytes)` (real) | 24.6 ms | 115.9 ms |

`benchmarks/bench.py` also measures `agentjson.scale(serial|parallel)` (Env B). On 5–20 MB inputs the parallel path is slower due to overhead; it’s intended for much larger payloads (GB‑scale root arrays).

## Suite 3b — Nested `corpus` suite (targeted huge value)

This is the “realistic Slack payload” shape:

```json
{ "corpus": [ ... huge ... ], "x": 0 }
```

Why this matters:

- Parallelizing a *root array* is useful, but many real payloads wrap the big array under a key (`corpus`, `rows`, `events`, …).
- The `scale_target_keys=["corpus"]` option exists to target that nested value.

`benchmarks/bench.py` includes `nested_corpus_suite`, which compares the following configurations (sketched below):

- `agentjson.scale_pipeline(no_target)` — baseline (no targeting)
- `agentjson.scale_pipeline(corpus, serial)` — targeting without forcing parallel
- `agentjson.scale_pipeline(corpus, parallel)` — optional: targeting with forced parallel (`allow_parallel=True`, set `BENCH_NESTED_FORCE_PARALLEL=1`)
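
A minimal sketch of those three configurations as option dicts, plus a payload in the shape above. How these options are fed to agentjson's parse/scale entry point is an assumption; only the option names (`scale_output`, `scale_target_keys`, `allow_parallel`) appear in this document:

```python
# Hypothetical setup mirroring nested_corpus_suite; only the option names
# below come from this document, the wiring is assumed.
import json

payload = json.dumps(
    {"corpus": [{"id": i, "value": "test"} for i in range(200_000)], "x": 0}
)

configs = {
    "no_target":        {"scale_output": "dom"},
    "corpus, serial":   {"scale_output": "dom", "scale_target_keys": ["corpus"]},
    "corpus, parallel": {"scale_output": "dom", "scale_target_keys": ["corpus"],
                         "allow_parallel": True},
}
```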

Important nuance:

- This suite uses **DOM** mode (`scale_output="dom"`) so `split_mode` shows whether nested targeting triggered (see `rust/src/scale.rs::try_nested_target_split`).
- Wiring nested targeting into **tape** mode (`scale_output="tape"`) is the next-step work for true “huge nested value without DOM” workloads.

## Suite 4 — Valid JSON microbench (context)

This suite is included only to set expectations: on already-valid JSON, `agentjson` is not competing with `orjson` (see the TL;DR above).

Example results (Env A: real `orjson`, 2025-12-13):

| Library | `loads` | `dumps` |
|---|---:|---:|
| `json` | 1.789 µs | 2.397 µs |
| `ujson` | 1.170 µs | 0.940 µs |
| `orjson` (real) | 0.398 µs | 0.215 µs |

Example results (Env B: `agentjson` shim, strict, 2025-12-13):

| Library | `loads` | `dumps` |
|---|---:|---:|
| `orjson` (agentjson shim, strict) | 8.764 µs | 2.923 µs |

## What to emphasize in a PR

- **LLM pipeline reliability** (Slack core): strict parsers succeed on 0/10 messy cases; `agentjson` succeeds on 10/10, covering “json입니다~ …” wrappers, markdown fences, smart quotes, trailing commas, missing closers, etc.
- **Drop-in adoption path**: keep `import orjson; orjson.loads(...)` everywhere; flip `JSONPROB_ORJSON_MODE=auto` only where you want “repair/scale on failure”.
- **Top‑K + confidence** (Noah’s “probabilistic parser”): return multiple candidates so downstream code can validate with schema/business rules before retrying an LLM.
- **Honest big-data story** (Kangwook’s skepticism): parallel split/merge has overhead; it’s meant for multi‑GB JSON (and file-backed mmap), not 5–20 MB toy inputs.

## CLI mmap benchmark (PR‑006)

If you want to benchmark “batch parsing huge files without allocating a giant `Vec<u8>`”, `benchmarks/bench.py` can compare:

- default (mmap)
- `--no-mmap` (read into memory)

Run:

```bash
cd rust && cargo build --release
cd ..
BENCH_CLI_MMAP_MB=512 python benchmarks/bench.py
```
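
To poke at the two modes by hand, a minimal sketch along these lines works on macOS, where `/usr/bin/time -l` reports max RSS. The CLI binary path and the positional input-file argument are assumptions; only `--no-mmap`, `--scale-output tape`, and `--debug` come from this document:

```python
# Hypothetical manual comparison of the CLI's mmap vs --no-mmap paths.
import subprocess

CLI = "rust/target/release/agentjson"  # assumed binary name
INPUT = "big.json"                     # any large file you generated

def run(extra_args):
    cmd = ["/usr/bin/time", "-l",      # -l: max RSS on macOS (-v on GNU time)
           CLI, "--scale-output", "tape", "--debug", *extra_args, INPUT]
    return subprocess.run(cmd, capture_output=True, text=True)

with_mmap = run([])
without_mmap = run(["--no-mmap"])
print(with_mmap.stderr)     # /usr/bin/time writes timing + RSS to stderr
print(without_mmap.stderr)
```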

Notes:

- The suite runs `--scale-output tape --debug` to avoid printing the full parsed value for huge inputs.
- In some restricted macOS environments, we can’t capture max RSS reliably; the suite reports wall time only.
- For real PR validation, use much larger sizes (GB‑scale) and multiple runs; this suite is primarily a reproducible harness.