
Commit 274a9e9

refactor: bigbang
1 parent 27064ec


41 files changed (+4573, -1030 lines)

.github/workflows/ci.yml

Lines changed: 103 additions & 0 deletions
```yaml
name: CI

on:
  push:
    branches: [main]
    tags:
      - 'v*'
  pull_request:
    branches: [main]
  workflow_dispatch:

permissions:
  contents: read

jobs:
  # Run tests
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable
      - name: Run Rust tests
        run: cd rust && cargo test
      - name: Build and install package
        run: |
          pip install maturin
          maturin develop
          pip install pytest agentjson
      - name: Run Python tests
        run: PYTHONPATH=src python -m pytest tests/ -v || true

  # Build wheels for all platforms
  build:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-13, macos-14, windows-latest]
    steps:
      - uses: actions/checkout@v4

      - name: Build wheels
        uses: PyO3/maturin-action@v1
        with:
          target: ${{ matrix.os == 'macos-14' && 'aarch64-apple-darwin' || '' }}
          args: --release --out dist
          sccache: 'true'
          manylinux: auto

      - name: Upload wheels
        uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.os }}-${{ strategy.job-index }}
          path: dist

  # Build source distribution
  sdist:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build sdist
        uses: PyO3/maturin-action@v1
        with:
          command: sdist
          args: --out dist
      - name: Upload sdist
        uses: actions/upload-artifact@v4
        with:
          name: wheels-sdist
          path: dist

  # Publish to PyPI on release
  publish:
    name: Publish to PyPI
    runs-on: ubuntu-latest
    if: startsWith(github.ref, 'refs/tags/v')
    needs: [test, build, sdist]
    permissions:
      id-token: write # For trusted publishing
    steps:
      - name: Download all artifacts
        uses: actions/download-artifact@v4
        with:
          pattern: wheels-*
          path: dist
          merge-multiple: true

      - name: List dist contents
        run: ls -la dist/

      - name: Publish to PyPI
        uses: PyO3/maturin-action@v1
        env:
          MATURIN_PYPI_TOKEN: ${{ secrets.PYPI_API_TOKEN }}
        with:
          command: upload
          args: --non-interactive --skip-existing dist/*
```

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -20,3 +20,6 @@ demo/*.result.json
 # Local PyO3 extension artifacts (do not commit)
 src/json_prob_parser_rust*.so
 src/json_prob_parser_rust*.dylib
+src/agentjson/*.so
+src/agentjson/*.dylib
+src/agentjson/*.pyd
```

BENCHMARK.md

Lines changed: 250 additions & 0 deletions
# Benchmarks (agentjson)

This document explains what `benchmarks/bench.py` measures and **why** the suites are structured the way they are, based on the Slack thread:

- “LLM outputs sometimes break JSON or wrap it (‘json입니다~ …’, markdown fences, etc.)”
- “Huge JSON (GB-scale root arrays / big data) might want parallel parsing”
- “A probabilistic parser that can return multiple candidates (Top‑K) is valuable”

> TL;DR: `agentjson` is not trying to beat `orjson` on valid-JSON microbenchmarks. The primary win is **success rate** and **drop-in adoption** in LLM pipelines.

## How to run

Because `agentjson` ships a top-level `orjson` shim, you must use **two separate environments** if you want to compare against the real `orjson` package:

```bash
# Env A: real orjson
python -m venv .venv-orjson
source .venv-orjson/bin/activate
python -m pip install orjson ujson
python benchmarks/bench.py

# Env B: agentjson (includes the shim)
python -m venv .venv-agentjson
source .venv-agentjson/bin/activate
python -m pip install agentjson ujson
python benchmarks/bench.py
```

Tune run sizes with environment variables:

```bash
BENCH_MICRO_NUMBER=20000 BENCH_MICRO_REPEAT=5 \
BENCH_MESSY_NUMBER=2000 BENCH_MESSY_REPEAT=5 \
BENCH_TOPK_NUMBER=500 BENCH_TOPK_REPEAT=5 \
BENCH_LARGE_MB=5,20 BENCH_LARGE_NUMBER=3 BENCH_LARGE_REPEAT=3 \
BENCH_NESTED_MB=5,20 BENCH_NESTED_NUMBER=1 BENCH_NESTED_REPEAT=3 \
BENCH_NESTED_FORCE_PARALLEL=0 \
BENCH_CLI_MMAP_MB=512 \
python benchmarks/bench.py
```

### Mapping suites to PRs (006 / 101 / 102)

- **PR‑006 (CLI mmap)**: run `cli_mmap_suite` by setting `BENCH_CLI_MMAP_MB`.
  - This runs the Rust CLI twice on the same large file: default mmap vs `--no-mmap`.
  - It uses `/usr/bin/time` when available to capture max RSS.
- **PR‑101 (parallel delimiter indexer)**: use `large_root_array_suite` and increase `BENCH_LARGE_MB` (e.g. `200,1000`) to find the crossover where parallel indexing starts paying off.
- **PR‑102 (nested huge value / corpus)**: use `nested_corpus_suite` to benchmark `scale_target_keys=["corpus"]` with `allow_parallel` on/off.

## Suite 1 — LLM messy JSON suite (primary)

### What it tests

Real LLM outputs often include things that break strict JSON parsers:

- Markdown fences: ```` ```json ... ``` ````
- Prefix/suffix junk: `"json 입니다~ {...} 감사합니다"` (Korean: “here is the JSON … thank you”)
- “Almost JSON”: single quotes, unquoted keys, trailing commas
- Python literals: `True/False/None`
- Missing commas
- Smart quotes: `“ ”`
- Missing closing brackets/quotes

The benchmark uses a fixed set of 10 cases (see `LLM_MESSY_CASES` in `benchmarks/bench.py`).

### Metrics

- `success`: no exception was raised
- `correct`: the parsed value equals the expected Python object
- `best time / case`: the best observed per-case attempt time (lower is better)
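For orientation, here is a minimal sketch of how a single case can be scored on these three metrics; `parse_fn`, `raw`, and `expected` are illustrative placeholders, not the exact harness code in `benchmarks/bench.py`:

```python
import timeit

def score_case(parse_fn, raw, expected, number=100, repeat=5):
    """Score one messy-JSON case: success, correctness, best per-attempt time."""
    try:
        value = parse_fn(raw)
    except Exception:
        return {"success": False, "correct": False, "best_time": None}
    # Best observed time across `repeat` runs of `number` attempts each.
    best = min(timeit.repeat(lambda: parse_fn(raw), number=number, repeat=repeat))
    return {"success": True, "correct": value == expected, "best_time": best / number}
```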
### Concrete example (drop-in impact)

**Input (not strict JSON):**

````text
preface```json
{"a":1}
```suffix
````

Strict parsers:

```python
import json
import orjson  # real orjson

json.loads('preface```json\n{"a":1}\n```suffix')    # -> JSONDecodeError
orjson.loads('preface```json\n{"a":1}\n```suffix')  # -> JSONDecodeError
```

With `agentjson` as an `orjson` drop-in (same call site):

```python
import os
import orjson  # agentjson shim

os.environ["JSONPROB_ORJSON_MODE"] = "auto"
orjson.loads('preface```json\n{"a":1}\n```suffix')  # -> {"a": 1}
```

This is the core PR pitch: **don’t change code**, just switch the package and flip a mode when needed.

### Example results (2025-12-13, Python 3.12.0, macOS 14.1 arm64)

| Library / mode | Success | Correct | Best time / case |
|---|---:|---:|---:|
| `json` (strict) | 0/10 | 0/10 | n/a |
| `ujson` (strict) | 0/10 | 0/10 | n/a |
| `orjson` (strict, real) | 0/10 | 0/10 | n/a |
| `orjson` (auto, agentjson shim) | 10/10 | 10/10 | 45.9 µs |
| `agentjson.parse(mode=auto)` | 10/10 | 10/10 | 39.8 µs |
| `agentjson.parse(mode=probabilistic)` | 10/10 | 10/10 | 39.7 µs |

## Suite 2 — Top‑K repair suite (secondary)

### What it tests

When input is ambiguous, a “repair” might have multiple plausible interpretations.

`agentjson` can return multiple candidates (`top_k`) with confidence scores, so downstream code can:

- validate with schema/business rules,
- pick the best candidate,
- or decide to re-ask an LLM only when confidence is low.
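A sketch of that downstream selection loop follows; the exact `parse` signature and candidate attributes (`value`, `confidence`) are assumptions for illustration, not a verbatim API — see `benchmarks/bench.py` for the real calls:

```python
import agentjson  # call shape assumed for illustration

def best_valid_candidate(raw, validate, k=5):
    """Return the first Top-K candidate that passes a schema/business-rule check."""
    candidates = agentjson.parse(raw, mode="probabilistic", top_k=k)  # assumed signature
    for cand in candidates:
        if validate(cand.value):          # e.g. required keys present, types correct
            return cand.value, cand.confidence
    return None, 0.0                      # nothing valid: caller may re-ask the LLM
```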
### Concrete example (why Top‑K matters)

**Input (ambiguous suffix):**

```text
{"a":1,"b":2,"c":3, nonsense nonsense
```

A plausible “best effort” interpretation is to **truncate** the garbage suffix and return:

```python
{"a": 1, "b": 2, "c": 3}
```

But another plausible interpretation is to treat `nonsense` as a missing key/value pair and produce something else.

In the benchmark run, this case shows up exactly as:

- **Top‑1 hit** misses (not the expected value),
- but **Top‑K hit (K=5)** succeeds (the expected value is present in the candidate list).

### Example results (2025-12-13, Python 3.12.0, macOS 14.1 arm64)

| Metric | Value |
|---|---:|
| Top‑1 hit rate | 7/8 |
| Top‑K hit rate (K=5) | 8/8 |
| Avg candidates returned | 1.25 |
| Avg best confidence | 0.57 |
| Best time / case | 92.7 µs |

## Suite 3 — Large root-array parsing (big data angle)

### What it tests

This suite generates a **single large root array** like:

```json
[{"id":0,"value":"test"},{"id":0,"value":"test"}, ...]
```

and measures how long `loads(...)` takes for sizes like 5 MB and 20 MB.
170+
For comparing `json/ujson/orjson`, use **Env A (real orjson)**. In Env B, `import orjson` is the shim.
171+
172+
### Example results (Env A: real `orjson`, 2025-12-13)
173+
174+
| Library | 5 MB | 20 MB |
175+
|---|---:|---:|
176+
| `json.loads(str)` | 52.3 ms | 209.6 ms |
177+
| `ujson.loads(str)` | 42.2 ms | 176.1 ms |
178+
| `orjson.loads(bytes)` (real) | 24.6 ms | 115.9 ms |
179+
180+
`benchmarks/bench.py` also measures `agentjson.scale(serial|parallel)` (Env B). On 5–20MB inputs the parallel path is slower due to overhead; it’s intended for much larger payloads (GB‑scale root arrays).
181+
182+
## Suite 3b — Nested `corpus` suite (targeted huge value)
183+
184+
This is the “realistic Slack payload” shape:
185+
186+
```json
187+
{ "corpus": [ ... huge ... ], "x": 0 }
188+
```
189+
190+
Why this matters:
191+
192+
- Parallelizing a *root array* is useful, but many real payloads wrap the big array under a key (`corpus`, `rows`, `events`, …).
193+
- The `scale_target_keys=["corpus"]` option exists to target that nested value.
194+
195+
`benchmarks/bench.py` includes `nested_corpus_suite` which compares:
196+
197+
- `agentjson.scale_pipeline(no_target)` — baseline (no targeting)
198+
- `agentjson.scale_pipeline(corpus, serial)` — targeting without forcing parallel
199+
- `agentjson.scale_pipeline(corpus, parallel)` — optional: targeting with forced parallel (`allow_parallel=True`, set `BENCH_NESTED_FORCE_PARALLEL=1`)
200+
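A hedged sketch of those three variants; only `scale_target_keys`, `scale_output`, and `allow_parallel` are named in this document, so the positional input argument and the overall call shape are assumptions:

```python
import agentjson  # call shape assumed from the suite names above

big = '{"corpus": [' + ",".join(['{"id":0,"value":"test"}'] * 100_000) + '], "x": 0}'

# Baseline: no targeting.
base = agentjson.scale_pipeline(big, scale_output="dom")

# Target the nested "corpus" value without forcing the parallel path.
serial = agentjson.scale_pipeline(
    big, scale_output="dom", scale_target_keys=["corpus"])

# Optional: force parallel (the suite enables this via BENCH_NESTED_FORCE_PARALLEL=1).
parallel = agentjson.scale_pipeline(
    big, scale_output="dom", scale_target_keys=["corpus"], allow_parallel=True)
```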
Important nuance:

- This suite uses **DOM** mode (`scale_output="dom"`) so that `split_mode` shows whether nested targeting triggered (see `rust/src/scale.rs::try_nested_target_split`).
- Wiring nested targeting into **tape** mode (`scale_output="tape"`) is the next step for true “huge nested value without DOM” workloads.

## Suite 4 — Valid JSON microbench (context)

This is included only to set expectations.

Example results (Env A: real `orjson`, 2025-12-13):

| Library | `loads` | `dumps` |
|---|---:|---:|
| `json` | 1.789 µs | 2.397 µs |
| `ujson` | 1.170 µs | 0.940 µs |
| `orjson` (real) | 0.398 µs | 0.215 µs |

Example results (Env B: `agentjson` shim, strict, 2025-12-13):

| Library | `loads` | `dumps` |
|---|---:|---:|
| `orjson` (agentjson shim, strict) | 8.764 µs | 2.923 µs |

## What to emphasize in a PR

- **LLM pipeline reliability** (Slack core): strict parsers fail 0/10; `agentjson` succeeds 10/10 on “json입니다~ …”, markdown fences, smart quotes, trailing commas, missing closers, etc.
- **Drop-in adoption path**: keep `import orjson; orjson.loads(...)` everywhere; flip `JSONPROB_ORJSON_MODE=auto` only where you want “repair/scale on failure”.
- **Top‑K + confidence** (Noah’s “probabilistic parser”): return multiple candidates so downstream code can validate with schema/business rules before retrying an LLM.
- **Honest big-data story** (Kangwook’s skepticism): parallel split/merge has overhead; it is meant for multi‑GB JSON (and file-backed mmap), not 5–20 MB toy inputs.

## CLI mmap benchmark (PR‑006)

To benchmark “batch parsing huge files without allocating a giant `Vec<u8>`”, `benchmarks/bench.py` can compare:

- default (mmap)
- `--no-mmap` (read into memory)

Run:

```bash
cd rust && cargo build --release
cd ..
BENCH_CLI_MMAP_MB=512 python benchmarks/bench.py
```

Notes:

- The suite runs with `--scale-output tape --debug` to avoid printing the full parsed value for huge inputs.
- In some restricted macOS environments, max RSS cannot be captured reliably; the suite then reports wall time only.
- For real PR validation, use much larger sizes (GB‑scale) and multiple runs; this suite is primarily a reproducible harness.
