
Commit f0bad4a

refactor: bigbang

1 parent 27064ec


41 files changed: +4576 −1030 lines

.github/workflows/ci.yml

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@

name: CI

on:
  push:
    branches: [main]
    tags:
      - 'v*'
  pull_request:
    branches: [main]
  workflow_dispatch:

permissions:
  contents: read

jobs:
  # Run tests
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable
        with:
          components: clippy
      - name: Run cargo clippy
        run: cargo clippy -- -D warnings
      - name: Run cargo test
        run: cargo test
      - name: Build and install package
        run: |
          pip install maturin pytest
          maturin develop
      - name: Run Python tests
        run: PYTHONPATH=src python -m pytest tests/ -v

  # Build wheels for all platforms
  build:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-13, macos-14, windows-latest]
    steps:
      - uses: actions/checkout@v4

      - name: Build wheels
        uses: PyO3/maturin-action@v1
        with:
          target: ${{ matrix.os == 'macos-14' && 'aarch64-apple-darwin' || '' }}
          args: --release --out dist
          sccache: 'true'
          manylinux: auto

      - name: Upload wheels
        uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.os }}-${{ strategy.job-index }}
          path: dist

  # Build source distribution
  sdist:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build sdist
        uses: PyO3/maturin-action@v1
        with:
          command: sdist
          args: --out dist
      - name: Upload sdist
        uses: actions/upload-artifact@v4
        with:
          name: wheels-sdist
          path: dist

  # Publish to PyPI on release
  publish:
    name: Publish to PyPI
    runs-on: ubuntu-latest
    if: startsWith(github.ref, 'refs/tags/v')
    needs: [test, build, sdist]
    permissions:
      id-token: write  # For trusted publishing
    steps:
      - name: Download all artifacts
        uses: actions/download-artifact@v4
        with:
          pattern: wheels-*
          path: dist
          merge-multiple: true

      - name: List dist contents
        run: ls -la dist/

      - name: Publish to PyPI
        uses: PyO3/maturin-action@v1
        env:
          MATURIN_PYPI_TOKEN: ${{ secrets.PYPI_API_TOKEN }}
        with:
          command: upload
          args: --non-interactive --skip-existing dist/*

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -20,3 +20,6 @@ demo/*.result.json
 # Local PyO3 extension artifacts (do not commit)
 src/json_prob_parser_rust*.so
 src/json_prob_parser_rust*.dylib
+src/agentjson/*.so
+src/agentjson/*.dylib
+src/agentjson/*.pyd

BENCHMARK.md

Lines changed: 250 additions & 0 deletions
@@ -0,0 +1,250 @@

# Benchmarks (agentjson)

This document explains what `benchmarks/bench.py` measures and **why** the suites are structured the way they are, based on the Slack thread:

- “LLM outputs sometimes break JSON or wrap it (‘json입니다~ …’, i.e. ‘here is the json~ …’, markdown fences, etc.)”
- “Huge JSON (GB-scale root arrays / big data) might want parallel parsing”
- “A probabilistic parser that can return multiple candidates (Top‑K) is valuable”

> TL;DR: `agentjson` is not trying to beat `orjson` on valid-JSON microbenchmarks. The primary win is **success rate** and **drop-in adoption** in LLM pipelines.

## How to run

Because `agentjson` ships a top-level `orjson` shim, you must use **two separate environments** if you want to compare against the real `orjson` package:

```bash
# Env A: real orjson
python -m venv .venv-orjson
source .venv-orjson/bin/activate
python -m pip install orjson ujson
python benchmarks/bench.py

# Env B: agentjson (includes the shim)
python -m venv .venv-agentjson
source .venv-agentjson/bin/activate
python -m pip install agentjson ujson
python benchmarks/bench.py
```

Tune run sizes with env vars:

```bash
BENCH_MICRO_NUMBER=20000 BENCH_MICRO_REPEAT=5 \
BENCH_MESSY_NUMBER=2000 BENCH_MESSY_REPEAT=5 \
BENCH_TOPK_NUMBER=500 BENCH_TOPK_REPEAT=5 \
BENCH_LARGE_MB=5,20 BENCH_LARGE_NUMBER=3 BENCH_LARGE_REPEAT=3 \
BENCH_NESTED_MB=5,20 BENCH_NESTED_NUMBER=1 BENCH_NESTED_REPEAT=3 \
BENCH_NESTED_FORCE_PARALLEL=0 \
BENCH_CLI_MMAP_MB=512 \
python benchmarks/bench.py
```
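The size variables (e.g. `BENCH_LARGE_MB=5,20`) are comma-separated lists of megabyte values. A minimal sketch of parsing one of them; the helper name is hypothetical and not taken from `benchmarks/bench.py`:

```python
import os

def read_mb_list(name, default):
    """Parse a comma-separated list of megabyte sizes from the environment.

    Hypothetical helper for illustration; benchmarks/bench.py has its own
    configuration handling.
    """
    raw = os.environ.get(name)
    if not raw:
        return default
    return [int(part) for part in raw.split(",") if part.strip()]

os.environ["BENCH_LARGE_MB"] = "5,20"
sizes = read_mb_list("BENCH_LARGE_MB", [5])
# sizes == [5, 20]
```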

### Mapping suites to PRs (006 / 101 / 102)

- **PR‑006 (CLI mmap)**: run `cli_mmap_suite` by setting `BENCH_CLI_MMAP_MB`.
  - This runs the Rust CLI twice on the same large file: default mmap vs `--no-mmap`.
  - It uses `/usr/bin/time` when available to capture max RSS.
- **PR‑101 (parallel delimiter indexer)**: use `large_root_array_suite` and increase `BENCH_LARGE_MB` (e.g. `200,1000`) to find the crossover where parallel indexing starts paying off.
- **PR‑102 (nested huge value / corpus)**: use `nested_corpus_suite` to benchmark `scale_target_keys=["corpus"]` with `allow_parallel` on/off.

## Suite 1 — LLM messy JSON suite (primary)

### What it tests

Real LLM outputs often include things that break strict JSON parsers:

- Markdown fences: ```` ```json ... ``` ````
- Prefix/suffix junk: `"json 입니다~ {...} 감사합니다"` (“here is the json~ {...} thank you”)
- “Almost JSON”: single quotes, unquoted keys, trailing commas
- Python literals: `True/False/None`
- Missing commas
- Smart quotes: `“ ”`
- Missing closing brackets/quotes

The benchmark uses a fixed set of 10 cases (see `LLM_MESSY_CASES` in `benchmarks/bench.py`).
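To see why a strict baseline scores 0/10, here is a minimal reproduction with the standard library alone. The cases below are illustrative shapes from the list above, not the actual `LLM_MESSY_CASES` fixtures:

```python
import json

# Representative messy shapes (illustrative, not the real fixtures).
cases = [
    'preface```json\n{"a": 1}\n```suffix',  # markdown fence + junk
    "{'a': 1}",                             # single quotes
    '{"a": 1,}',                            # trailing comma
    '{"flag": True}',                       # Python literal, not JSON
    '{"a": 1',                              # missing closing brace
]

failures = 0
for case in cases:
    try:
        json.loads(case)
    except json.JSONDecodeError:
        failures += 1

print(f"strict json.loads failed on {failures}/{len(cases)} cases")
# -> strict json.loads failed on 5/5 cases
```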

### Metrics

- `success`: no exception was raised
- `correct`: the parsed value equals the expected Python object
- `best time / case`: the best observed per-case attempt time (lower is better)
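A sketch of how these three metrics can be computed for one parser/case pair. This is an illustrative harness, not the actual `benchmarks/bench.py` code:

```python
import json
import time

def run_case(parse, text, expected, attempts=5):
    """Return (success, correct, best_time) for one parser on one case.

    Illustrative only; benchmarks/bench.py has its own harness.
    """
    best = None
    success = correct = False
    for _ in range(attempts):
        t0 = time.perf_counter()
        try:
            value = parse(text)
        except Exception:
            continue  # failed attempt: counts against success, not timing
        elapsed = time.perf_counter() - t0
        success = True
        correct = correct or (value == expected)
        best = elapsed if best is None else min(best, elapsed)
    return success, correct, best

ok, right, best = run_case(json.loads, '{"a": 1}', {"a": 1})
bad_ok, bad_right, _ = run_case(json.loads, "{'a': 1}", {"a": 1})
```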

### Concrete example (drop-in impact)

**Input (not strict JSON):**

````text
preface```json
{"a":1}
```suffix
````

Strict parsers:

```python
import json
import orjson  # real orjson

json.loads('preface```json\n{"a":1}\n```suffix')    # -> JSONDecodeError
orjson.loads('preface```json\n{"a":1}\n```suffix')  # -> JSONDecodeError
```

With `agentjson` as an `orjson` drop-in (same call site):

```python
import os
import orjson  # agentjson shim

os.environ["JSONPROB_ORJSON_MODE"] = "auto"
orjson.loads('preface```json\n{"a":1}\n```suffix')  # -> {"a": 1}
```

This is the core PR pitch: **don't change code**, just switch the package and flip a mode when needed.
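For intuition, one of the repairs a lenient mode has to perform (extracting the payload from a markdown fence surrounded by junk) can be sketched with the standard library. This illustrates one repair strategy only; it is NOT `agentjson`'s actual pipeline:

```python
import json
import re

# Lazy match of a (possibly ```json-tagged) fenced block.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)

def loads_with_fence_fallback(text):
    """Strict parse first; on failure, retry on the fenced payload."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = FENCE_RE.search(text)
        if match:
            return json.loads(match.group(1))
        raise

result = loads_with_fence_fallback('preface```json\n{"a":1}\n```suffix')
# -> {"a": 1}
```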

### Example results (2025-12-13, Python 3.12.0, macOS 14.1 arm64)

| Library / mode | Success | Correct | Best time / case |
|---|---:|---:|---:|
| `json` (strict) | 0/10 | 0/10 | n/a |
| `ujson` (strict) | 0/10 | 0/10 | n/a |
| `orjson` (strict, real) | 0/10 | 0/10 | n/a |
| `orjson` (auto, agentjson shim) | 10/10 | 10/10 | 45.9 µs |
| `agentjson.parse(mode=auto)` | 10/10 | 10/10 | 39.8 µs |
| `agentjson.parse(mode=probabilistic)` | 10/10 | 10/10 | 39.7 µs |
## Suite 2 — Top‑K repair suite (secondary)
116+
117+
### What it tests
118+
119+
When input is ambiguous, a “repair” might have multiple plausible interpretations.
120+
121+
`agentjson` can return multiple candidates (`top_k`) with confidence scores, so downstream code can:
122+
123+
- validate with schema/business rules,
124+
- pick the best candidate,
125+
- or decide to re-ask an LLM only when confidence is low.
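The downstream pattern above can be sketched as follows. The `(value, confidence)` pair shape and the threshold are assumptions made for illustration, not `agentjson`'s documented return type:

```python
# Hypothetical confidence floor below which the caller re-asks the LLM.
CONFIDENCE_FLOOR = 0.5

def pick_candidate(candidates, is_valid):
    """Return the highest-confidence candidate passing validation, or None.

    `candidates` is a list of (value, confidence) pairs -- an assumed shape
    for illustration only.
    """
    for value, confidence in sorted(candidates, key=lambda c: c[1], reverse=True):
        if confidence >= CONFIDENCE_FLOOR and is_valid(value):
            return value
    return None  # caller falls back to re-asking the LLM

candidates = [
    ({"a": 1, "b": 2, "c": 3}, 0.57),
    ({"a": 1, "b": 2, "c": 3, "nonsense": None}, 0.21),
]
best = pick_candidate(candidates, lambda v: isinstance(v, dict) and "a" in v)
# best == {"a": 1, "b": 2, "c": 3}
```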

### Concrete example (why Top‑K matters)

**Input (ambiguous suffix):**

```text
{"a":1,"b":2,"c":3, nonsense nonsense
```

A plausible “best effort” interpretation is to **truncate** the garbage suffix and return:

```python
{"a": 1, "b": 2, "c": 3}
```

But another plausible interpretation is to treat `nonsense` as a missing key/value pair and produce something else.

In the benchmark run, this case shows up exactly as:

- the **Top‑1 hit** misses (not the expected value),
- but the **Top‑K hit (K=5)** succeeds (the expected value is present in the candidate list).

### Example results (2025-12-13, Python 3.12.0, macOS 14.1 arm64)

| Metric | Value |
|---|---:|
| Top‑1 hit rate | 7/8 |
| Top‑K hit rate (K=5) | 8/8 |
| Avg candidates returned | 1.25 |
| Avg best confidence | 0.57 |
| Best time / case | 92.7 µs |

## Suite 3 — Large root-array parsing (big data angle)

### What it tests

This suite generates a **single large root array** like:

```json
[{"id":0,"value":"test"},{"id":0,"value":"test"}, ...]
```

and measures how long `loads(...)` takes for sizes like 5 MB and 20 MB.
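A minimal generator for this kind of payload, as an illustrative stand-in for the generator in `benchmarks/bench.py`:

```python
import json

def make_root_array(target_mb):
    """Build a single-root-array JSON string of roughly target_mb megabytes."""
    record = json.dumps({"id": 0, "value": "test"}, separators=(",", ":"))
    per_record = len(record) + 1  # +1 for the separating comma
    count = (target_mb * 1024 * 1024) // per_record
    return "[" + ",".join([record] * count) + "]"

payload = make_root_array(1)      # ~1 MB of JSON text
data = json.loads(payload)        # round-trips through the strict parser
```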

For comparing `json`/`ujson`/`orjson`, use **Env A (real orjson)**. In Env B, `import orjson` resolves to the shim.

### Example results (Env A: real `orjson`, 2025-12-13)

| Library | 5 MB | 20 MB |
|---|---:|---:|
| `json.loads(str)` | 52.3 ms | 209.6 ms |
| `ujson.loads(str)` | 42.2 ms | 176.1 ms |
| `orjson.loads(bytes)` (real) | 24.6 ms | 115.9 ms |

`benchmarks/bench.py` also measures `agentjson.scale(serial|parallel)` (Env B). On 5–20 MB inputs the parallel path is slower due to overhead; it is intended for much larger payloads (GB‑scale root arrays).

## Suite 3b — Nested `corpus` suite (targeted huge value)

This is the “realistic Slack payload” shape:

```json
{ "corpus": [ ... huge ... ], "x": 0 }
```

Why this matters:

- Parallelizing a *root array* is useful, but many real payloads wrap the big array under a key (`corpus`, `rows`, `events`, …).
- The `scale_target_keys=["corpus"]` option exists to target that nested value.

`benchmarks/bench.py` includes `nested_corpus_suite`, which compares:

- `agentjson.scale_pipeline(no_target)` — baseline (no targeting)
- `agentjson.scale_pipeline(corpus, serial)` — targeting without forcing parallel
- `agentjson.scale_pipeline(corpus, parallel)` — optional: targeting with forced parallel (`allow_parallel=True`; set `BENCH_NESTED_FORCE_PARALLEL=1`)

Important nuance:

- This suite uses **DOM** mode (`scale_output="dom"`), so `split_mode` shows whether nested targeting triggered (see `rust/src/scale.rs::try_nested_target_split`).
- Wiring nested targeting into **tape** mode (`scale_output="tape"`) is the next step for true “huge nested value without DOM” workloads.
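A small synthetic version of this nested shape, for experimentation (illustrative only; the suite's own generator scales the array to the configured size):

```python
import json

# Big array wrapped under a key, plus a trailing sibling -- the shape that
# root-array-only parallelism cannot see. Sizes kept small for illustration.
corpus = [{"id": i, "value": "test"} for i in range(100_000)]
payload = json.dumps({"corpus": corpus, "x": 0})

doc = json.loads(payload)
```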

## Suite 4 — Valid JSON microbench (context)

This suite is included only to set expectations.

Example results (Env A: real `orjson`, 2025-12-13):

| Library | `loads` | `dumps` |
|---|---:|---:|
| `json` | 1.789 µs | 2.397 µs |
| `ujson` | 1.170 µs | 0.940 µs |
| `orjson` (real) | 0.398 µs | 0.215 µs |

Example results (Env B: `agentjson` shim, strict, 2025-12-13):

| Library | `loads` | `dumps` |
|---|---:|---:|
| `orjson` (agentjson shim, strict) | 8.764 µs | 2.923 µs |
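For context, a microbench of this style takes only a few lines with the standard library. Numbers vary by machine; this is a harness sketch, not the `bench.py` implementation:

```python
import json
import timeit

# Time loads/dumps on one small document, reporting per-call microseconds.
doc = {"a": 1, "b": [1, 2, 3], "c": "hello"}
text = json.dumps(doc)

n = 10_000
loads_us = timeit.timeit(lambda: json.loads(text), number=n) / n * 1e6
dumps_us = timeit.timeit(lambda: json.dumps(doc), number=n) / n * 1e6
print(f"json.loads: {loads_us:.3f} µs  json.dumps: {dumps_us:.3f} µs")
```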

## What to emphasize in a PR

- **LLM pipeline reliability** (the core Slack ask): strict parsers fail 0/10; `agentjson` succeeds 10/10 on “json입니다~ …” wrappers, markdown fences, smart quotes, trailing commas, missing closers, etc.
- **Drop-in adoption path**: keep `import orjson; orjson.loads(...)` everywhere; flip `JSONPROB_ORJSON_MODE=auto` only where you want “repair/scale on failure”.
- **Top‑K + confidence** (Noah's “probabilistic parser”): return multiple candidates so downstream code can validate with schema/business rules before retrying an LLM.
- **Honest big-data story** (Kangwook's skepticism): parallel split/merge has overhead; it is meant for multi‑GB JSON (and file-backed mmap), not 5–20 MB toy inputs.

## CLI mmap benchmark (PR‑006)

To benchmark “batch parsing huge files without allocating a giant `Vec<u8>`”, `benchmarks/bench.py` can compare:

- default (mmap)
- `--no-mmap` (read into memory)

Run:

```bash
cd rust && cargo build --release
cd ..
BENCH_CLI_MMAP_MB=512 python benchmarks/bench.py
```

Notes:

- The suite runs with `--scale-output tape --debug` to avoid printing the full parsed value for huge inputs.
- In some restricted macOS environments, max RSS cannot be captured reliably; the suite then reports wall time only.
- For real PR validation, use much larger sizes (GB‑scale) and multiple runs; this suite is primarily a reproducible harness.
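The memory trade-off the suite probes can be demonstrated with Python's standard library: `mmap` lets the OS page the file in lazily on access, while `read()` materializes the whole buffer up front (analogous to the CLI's `Vec<u8>` path). A sketch:

```python
import mmap
import os
import tempfile

# Write a small stand-in for a "huge" JSON file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'[{"id":0}]' * 1000)
    path = f.name

# Eager path: one large allocation holding the entire file.
with open(path, "rb") as f:
    whole = f.read()

# Lazy path: pages are faulted in only as the mapping is touched.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        mapped = m[:10]  # slicing copies just these bytes out

os.unlink(path)
```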
