# Benchmarks (agentjson)

This document explains what `benchmarks/bench.py` measures and **why** the suites are structured the way they are, based on the Slack thread:

- “LLM outputs sometimes break JSON or wrap it (‘json입니다~ …’, markdown fences, etc.)”
- “Huge JSON (GB-scale root arrays / big data) might want parallel parsing”
- “A probabilistic parser that can return multiple candidates (Top‑K) is valuable”

> TL;DR: `agentjson` is not trying to beat `orjson` on valid JSON microbenchmarks. The primary win is **success rate** and **drop-in adoption** in LLM pipelines.

## How to run

Because `agentjson` ships a top-level `orjson` shim, you must use **two separate environments** if you want to compare with the real `orjson` package:

```bash
# Env A: real orjson
python -m venv .venv-orjson
source .venv-orjson/bin/activate
python -m pip install orjson ujson
python benchmarks/bench.py

# Env B: agentjson (includes the shim)
python -m venv .venv-agentjson
source .venv-agentjson/bin/activate
python -m pip install agentjson ujson
python benchmarks/bench.py
```

Tune run sizes with env vars:

```bash
BENCH_MICRO_NUMBER=20000 BENCH_MICRO_REPEAT=5 \
BENCH_MESSY_NUMBER=2000 BENCH_MESSY_REPEAT=5 \
BENCH_TOPK_NUMBER=500 BENCH_TOPK_REPEAT=5 \
BENCH_LARGE_MB=5,20 BENCH_LARGE_NUMBER=3 BENCH_LARGE_REPEAT=3 \
BENCH_NESTED_MB=5,20 BENCH_NESTED_NUMBER=1 BENCH_NESTED_REPEAT=3 \
BENCH_NESTED_FORCE_PARALLEL=0 \
BENCH_CLI_MMAP_MB=512 \
python benchmarks/bench.py
```

### Mapping suites to PRs (006 / 101 / 102)

- **PR‑006 (CLI mmap)**: run `cli_mmap_suite` by setting `BENCH_CLI_MMAP_MB`.
  - This runs the Rust CLI twice on the same large file: default mmap vs `--no-mmap`.
  - It uses `/usr/bin/time` when available to capture max RSS.
- **PR‑101 (parallel delimiter indexer)**: use `large_root_array_suite` and increase `BENCH_LARGE_MB` (e.g. `200,1000`) to find the crossover where parallel indexing starts paying off.
- **PR‑102 (nested huge value / corpus)**: use `nested_corpus_suite` to benchmark `scale_target_keys=["corpus"]` with `allow_parallel` on/off.

## Suite 1 — LLM messy JSON suite (primary)

### What it tests

Real LLM outputs often include things that break strict JSON parsers:

- Markdown fences: ```` ```json ... ``` ````
- Prefix/suffix junk: `"json 입니다~ {...} 감사합니다"` (Korean filler, roughly “here’s the JSON … thanks”)
- “Almost JSON”: single quotes, unquoted keys, trailing commas
- Python literals: `True/False/None`
- Missing commas
- Smart quotes: `“ ”`
- Missing closing brackets/quotes

The benchmark uses a fixed set of 10 cases (see `LLM_MESSY_CASES` in `benchmarks/bench.py`).

### Metrics

- `success`: no exception was raised
- `correct`: parsed value equals the expected Python object
- `best time / case`: the best observed per-case attempt time (lower is better)

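A minimal sketch of how one case can be scored against these metrics (the helper name and return shape are illustrative, not the actual `bench.py` code, which runs number/repeat loops per case):

```python
# Hypothetical per-case scorer: success = no exception, correct = equals the
# expected object, time = one timed attempt (bench.py keeps the best of many).
import time

def score_case(loads, raw_text, expected):
    start = time.perf_counter()
    try:
        value = loads(raw_text)
    except Exception:
        return {"success": False, "correct": False, "time_s": None}
    elapsed = time.perf_counter() - start
    return {"success": True, "correct": value == expected, "time_s": elapsed}
```
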
### Concrete example (drop-in impact)

**Input (not strict JSON):**

````text
preface```json
{"a":1}
```suffix
````

Strict parsers:

```python
import json
import orjson  # real orjson

json.loads('preface```json\n{"a":1}\n```suffix')    # -> JSONDecodeError
orjson.loads('preface```json\n{"a":1}\n```suffix')  # -> JSONDecodeError
```

With `agentjson` as an `orjson` drop-in (same call site):

```python
import os
import orjson  # agentjson shim

os.environ["JSONPROB_ORJSON_MODE"] = "auto"
orjson.loads('preface```json\n{"a":1}\n```suffix')  # -> {"a": 1}
```

This is the core PR pitch: **don’t change code**, just switch the package and flip a mode when needed.

### Example results (2025-12-13, Python 3.12.0, macOS 14.1 arm64)

| Library / mode | Success | Correct | Best time / case |
|---|---:|---:|---:|
| `json` (strict) | 0/10 | 0/10 | n/a |
| `ujson` (strict) | 0/10 | 0/10 | n/a |
| `orjson` (strict, real) | 0/10 | 0/10 | n/a |
| `orjson` (auto, agentjson shim) | 10/10 | 10/10 | 45.9 µs |
| `agentjson.parse(mode=auto)` | 10/10 | 10/10 | 39.8 µs |
| `agentjson.parse(mode=probabilistic)` | 10/10 | 10/10 | 39.7 µs |

## Suite 2 — Top‑K repair suite (secondary)

### What it tests

When input is ambiguous, a “repair” might have multiple plausible interpretations.

`agentjson` can return multiple candidates (`top_k`) with confidence scores, so downstream code can (see the sketch after this list):

- validate with schema/business rules,
- pick the best candidate,
- or decide to re-ask an LLM only when confidence is low.
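
A minimal sketch of that consumer-side flow. The `top_k=` parameter, the `.candidates` list, and the `.value`/`.confidence` attributes are assumptions about the API shape; only the concept (multiple candidates with confidence scores) is documented here:

```python
# Hypothetical Top-K consumer; attribute and parameter names are assumed.
import agentjson

def best_valid(text, validate, k=5, min_confidence=0.5):
    result = agentjson.parse(text, mode="probabilistic", top_k=k)
    for cand in result.candidates:
        if cand.confidence >= min_confidence and validate(cand.value):
            return cand.value
    return None  # nothing confident enough -> caller can re-ask the LLM
```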

### Concrete example (why Top‑K matters)

**Input (ambiguous suffix):**

```text
{"a":1,"b":2,"c":3, nonsense nonsense
```

A plausible “best effort” interpretation is to **truncate** the garbage suffix and return:

```python
{"a": 1, "b": 2, "c": 3}
```

But another plausible interpretation is to treat `nonsense` as a missing key/value pair and produce something else.

In the benchmark run, this case shows up as:

- the **Top‑1** candidate misses (it is not the expected value),
- but the **Top‑K** check (K=5) hits: the expected value is present in the candidate list.

### Example results (2025-12-13, Python 3.12.0, macOS 14.1 arm64)

| Metric | Value |
|---|---:|
| Top‑1 hit rate | 7/8 |
| Top‑K hit rate (K=5) | 8/8 |
| Avg candidates returned | 1.25 |
| Avg best confidence | 0.57 |
| Best time / case | 92.7 µs |

## Suite 3 — Large root-array parsing (big data angle)

### What it tests

This suite generates a **single large root array** like:

```json
[{"id":0,"value":"test"},{"id":0,"value":"test"}, ...]
```

and measures how long `loads(...)` takes for sizes like 5 MB and 20 MB.
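
A minimal sketch of how such a payload can be generated to a rough size target (it mirrors the shape above; `bench.py`'s actual generator may differ in details):

```python
# Hypothetical generator for an N-megabyte root array of small objects.
import json

def make_root_array(target_mb: int) -> str:
    item_len = len(json.dumps({"id": 0, "value": "test"})) + 1  # +1 for the comma
    count = (target_mb * 1024 * 1024) // item_len
    return json.dumps([{"id": i, "value": "test"} for i in range(count)])

payload = make_root_array(5)  # roughly 5 MB of JSON text
```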

For comparing `json/ujson/orjson`, use **Env A (real orjson)**. In Env B, `import orjson` is the shim.

### Example results (Env A: real `orjson`, 2025-12-13)

| Library | 5 MB | 20 MB |
|---|---:|---:|
| `json.loads(str)` | 52.3 ms | 209.6 ms |
| `ujson.loads(str)` | 42.2 ms | 176.1 ms |
| `orjson.loads(bytes)` (real) | 24.6 ms | 115.9 ms |

`benchmarks/bench.py` also measures `agentjson.scale(serial|parallel)` (Env B). On 5–20 MB inputs the parallel path is slower due to overhead; it’s intended for much larger payloads (GB‑scale root arrays).

## Suite 3b — Nested `corpus` suite (targeted huge value)

This is the “realistic Slack payload” shape:

```json
{ "corpus": [ ... huge ... ], "x": 0 }
```

Why this matters:

- Parallelizing a *root array* is useful, but many real payloads wrap the big array under a key (`corpus`, `rows`, `events`, …).
- The `scale_target_keys=["corpus"]` option exists to target that nested value.

`benchmarks/bench.py` includes `nested_corpus_suite`, which compares the following configurations (sketched below):

- `agentjson.scale_pipeline(no_target)` — baseline (no targeting)
- `agentjson.scale_pipeline(corpus, serial)` — targeting without forcing parallel
- `agentjson.scale_pipeline(corpus, parallel)` — optional: targeting with forced parallel (`allow_parallel=True`, set `BENCH_NESTED_FORCE_PARALLEL=1`)
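
A minimal sketch of those three configurations as option dicts, plus a payload in the shape above. How these options are fed to agentjson's parse/scale entry point is an assumption; only the option names (`scale_output`, `scale_target_keys`, `allow_parallel`) appear in this document:

```python
# Hypothetical setup mirroring nested_corpus_suite; only the option names
# below come from this document, the wiring is assumed.
import json

payload = json.dumps(
    {"corpus": [{"id": i, "value": "test"} for i in range(200_000)], "x": 0}
)

configs = {
    "no_target":        {"scale_output": "dom"},
    "corpus, serial":   {"scale_output": "dom", "scale_target_keys": ["corpus"]},
    "corpus, parallel": {"scale_output": "dom", "scale_target_keys": ["corpus"],
                         "allow_parallel": True},
}
```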

Important nuance:

- This suite uses **DOM** mode (`scale_output="dom"`) so `split_mode` shows whether nested targeting triggered (see `rust/src/scale.rs::try_nested_target_split`).
- Wiring nested targeting into **tape** mode (`scale_output="tape"`) is the next-step work for true “huge nested value without DOM” workloads.

## Suite 4 — Valid JSON microbench (context)

This suite is included only to set expectations: on already-valid JSON, `agentjson` is not competing with `orjson` (see the TL;DR above).

Example results (Env A: real `orjson`, 2025-12-13):

| Library | `loads` | `dumps` |
|---|---:|---:|
| `json` | 1.789 µs | 2.397 µs |
| `ujson` | 1.170 µs | 0.940 µs |
| `orjson` (real) | 0.398 µs | 0.215 µs |

Example results (Env B: `agentjson` shim, strict, 2025-12-13):

| Library | `loads` | `dumps` |
|---|---:|---:|
| `orjson` (agentjson shim, strict) | 8.764 µs | 2.923 µs |

## What to emphasize in a PR

- **LLM pipeline reliability** (Slack core): strict parsers succeed on 0/10 messy cases; `agentjson` succeeds on 10/10, covering “json입니다~ …” wrappers, markdown fences, smart quotes, trailing commas, missing closers, etc.
- **Drop-in adoption path**: keep `import orjson; orjson.loads(...)` everywhere; flip `JSONPROB_ORJSON_MODE=auto` only where you want “repair/scale on failure”.
- **Top‑K + confidence** (Noah’s “probabilistic parser”): return multiple candidates so downstream code can validate with schema/business rules before retrying an LLM.
- **Honest big-data story** (Kangwook’s skepticism): parallel split/merge has overhead; it’s meant for multi‑GB JSON (and file-backed mmap), not 5–20 MB toy inputs.

## CLI mmap benchmark (PR‑006)

If you want to benchmark “batch parsing huge files without allocating a giant `Vec<u8>`”, `benchmarks/bench.py` can compare:

- default (mmap)
- `--no-mmap` (read into memory)

Run:

```bash
cd rust && cargo build --release
cd ..
BENCH_CLI_MMAP_MB=512 python benchmarks/bench.py
```
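
To poke at the two modes by hand, a minimal sketch along these lines works on macOS, where `/usr/bin/time -l` reports max RSS. The CLI binary path and the positional input-file argument are assumptions; only `--no-mmap`, `--scale-output tape`, and `--debug` come from this document:

```python
# Hypothetical manual comparison of the CLI's mmap vs --no-mmap paths.
import subprocess

CLI = "rust/target/release/agentjson"  # assumed binary name
INPUT = "big.json"                     # any large file you generated

def run(extra_args):
    cmd = ["/usr/bin/time", "-l",      # -l: max RSS on macOS (-v on GNU time)
           CLI, "--scale-output", "tape", "--debug", *extra_args, INPUT]
    return subprocess.run(cmd, capture_output=True, text=True)

with_mmap = run([])
without_mmap = run(["--no-mmap"])
print(with_mmap.stderr)     # /usr/bin/time writes timing + RSS to stderr
print(without_mmap.stderr)
```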

Notes:

- The suite runs `--scale-output tape --debug` to avoid printing the full parsed value for huge inputs.
- In some restricted macOS environments, we can’t capture max RSS reliably; the suite reports wall time only.
- For real PR validation, use much larger sizes (GB‑scale) and multiple runs; this suite is primarily a reproducible harness.