
Commit af17efe

docs: add accuracy per 1k tokens report (closes #72)
1 parent 9268fdf commit af17efe

8 files changed

Lines changed: 396 additions & 163 deletions


README.md

Lines changed: 25 additions & 6 deletions
@@ -62,12 +62,14 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when
 
 ## Key Features
 
-- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON
+- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1]
 - 🤿 **LLM-friendly guardrails:** explicit lengths and fields enable validation
 - 🍱 **Minimal syntax:** removes redundant punctuation (braces, brackets, most quotes)
 - 📐 **Indentation-based structure:** like YAML, uses whitespace instead of braces
 - 🧺 **Tabular arrays:** declare keys once, stream data as rows
 
+[^1]: For flat tabular data, CSV is more compact. TOON adds minimal overhead to provide explicit structure and validation that improves LLM reliability.
+
 ## Benchmarks
 
 > [!TIP]
@@ -80,12 +82,10 @@ Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-token
 The benchmarks use datasets optimized for TOON's strengths (uniform tabular data). Real-world performance depends on your data structure.
 
 > [!NOTE]
-> CSV/TSV isn't shown in the token-efficiency chart because it doesn't encode nesting without flattening. For flat datasets, see CSV token counts in the [Retrieval Accuracy](#retrieval-accuracy) tables.
+> CSV/TSV doesn't support nested structures, so it's not included in this comparison. For flat datasets where CSV applies, see token counts and accuracy metrics in the [Retrieval Accuracy](#retrieval-accuracy) section.
 
 <!-- automd:file src="./benchmarks/results/token-efficiency.md" -->
 
-### Token Efficiency
-
 ```
 ⭐ GitHub Repositories ██████████████░░░░░░░░░░░ 8,745 tokens
    vs JSON (−42.3%)    15,145
@@ -251,9 +251,28 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
 
 <!-- /automd -->
 
+### Retrieval Accuracy
+
 <!-- automd:file src="./benchmarks/results/retrieval-accuracy.md" -->
 
-### Retrieval Accuracy
+Benchmarks test LLM comprehension across different input formats using 154 data retrieval questions on 4 models.
+
+#### Efficiency Ranking (Accuracy per 1K Tokens)
+
+Each format's overall performance, balancing accuracy against token cost:
+
+```
+toon         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.0 │ 70.1% acc │ 4,678 tokens
+csv          ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 14.3 │ 67.7% acc │ 4,745 tokens
+json-compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 11.0 │ 65.3% acc │ 5,925 tokens
+yaml         ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░  9.4 │ 66.7% acc │ 7,091 tokens
+json-pretty  ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░  7.5 │ 65.4% acc │ 8,713 tokens
+xml          ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░  6.8 │ 67.2% acc │ 9,944 tokens
+```
+
+TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.
+
+#### Per-Model Accuracy
 
 Accuracy across **4 LLMs** on 154 data retrieval questions:
 
@@ -915,7 +934,7 @@ By default, the decoder validates input strictly:
 - Format familiarity and structure matter as much as token count. TOON's tabular format requires arrays of objects with identical keys and primitive values only. When this doesn't hold (due to mixed types, non-uniform objects, or nested structures), TOON switches to list format where JSON can be more efficient at scale.
 - **TOON excels at:** Uniform arrays of objects (same fields, primitive values), especially large datasets with consistent structure.
 - **JSON is better for:** Non-uniform data, deeply nested structures, and objects with varying field sets.
-- **CSV is more compact for:** Flat, uniform tables without nesting. TOON adds minimal overhead (`[N]` length markers, delimiter scoping, deterministic quoting) to improve LLM reliability while staying close to CSV's token efficiency.
+- **CSV is more compact for:** Flat, uniform tables without nesting. TOON adds structure (`[N]` length markers, delimiter scoping, deterministic quoting) that improves LLM reliability with minimal token overhead.
 - **Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (cl100k/o200k); actual savings will differ with other models (e.g., [SentencePiece](https://github.com/google/sentencepiece)).
 - **TOON is designed for LLM input** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.

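The "Accuracy per 1K Tokens" metric introduced in the README hunk above is simple arithmetic: accuracy percentage divided by token count in thousands. A minimal sketch that reproduces the chart's figures (the helper name is illustrative, not the repo's `src/report` API):

```typescript
// Accuracy per 1K tokens: accuracy (%) divided by token count in thousands.
// Hypothetical helper for illustration; the real logic lives in src/report.
function accuracyPer1kTokens(accuracyPct: number, tokens: number): number {
  return accuracyPct / (tokens / 1000)
}

// Reproducing the chart's figures (rounded to one decimal):
const toon = accuracyPer1kTokens(70.1, 4678)        // ≈ 15.0
const jsonPretty = accuracyPer1kTokens(65.4, 8713)  // ≈ 7.5

// The "46.3% fewer tokens" claim compares toon to json-pretty:
const savings = 1 - 4678 / 8713                     // ≈ 0.463
```

Note the metric rewards both higher accuracy and lower token cost, which is why TOON (15.0) outranks CSV (14.3) despite the near-identical token counts.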
benchmarks/results/retrieval-accuracy.md

Lines changed: 18 additions & 1 deletion
@@ -1,4 +1,21 @@
-### Retrieval Accuracy
+Benchmarks test LLM comprehension across different input formats using 154 data retrieval questions on 4 models.
+
+#### Efficiency Ranking (Accuracy per 1K Tokens)
+
+Each format's overall performance, balancing accuracy against token cost:
+
+```
+toon         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.0 │ 70.1% acc │ 4,678 tokens
+csv          ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 14.3 │ 67.7% acc │ 4,745 tokens
+json-compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 11.0 │ 65.3% acc │ 5,925 tokens
+yaml         ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░  9.4 │ 66.7% acc │ 7,091 tokens
+json-pretty  ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░  7.5 │ 65.4% acc │ 8,713 tokens
+xml          ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░  6.8 │ 67.2% acc │ 9,944 tokens
+```
+
+TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.
+
+#### Per-Model Accuracy
 
 Accuracy across **4 LLMs** on 154 data retrieval questions:
 

benchmarks/results/token-efficiency.md

Lines changed: 0 additions & 2 deletions
@@ -1,5 +1,3 @@
-### Token Efficiency
-
 ```
 ⭐ GitHub Repositories ██████████████░░░░░░░░░░░ 8,745 tokens
    vs JSON (−42.3%)    15,145
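The savings in the chart above come from TOON's tabular layout, visible in the `metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:` header in the README hunk: keys are declared once with an explicit `[N]` length marker, and values stream as delimiter-separated rows. A minimal sketch of the idea on invented data (field names are illustrative, not from the benchmark datasets):

```
// JSON (compact): keys repeat in every object
[{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]

// TOON: keys declared once, [2] length marker, rows streamed
users[2]{id,name}:
  1,Alice
  2,Bob
```

The per-object key repetition is what JSON pays for at scale; the `[2]` length marker is the small overhead TOON accepts over CSV in exchange for validation.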

benchmarks/scripts/accuracy-benchmark.ts

Lines changed: 12 additions & 8 deletions
@@ -1,15 +1,17 @@
 import type { Question } from '../src/types'
+import * as fsp from 'node:fs/promises'
 import * as path from 'node:path'
 import process from 'node:process'
 import * as prompts from '@clack/prompts'
 import PQueue from 'p-queue'
-import { DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
+import { BENCHMARKS_DIR, DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
 import { datasets } from '../src/datasets'
 import { evaluateQuestion, models } from '../src/evaluate'
 import { formatters } from '../src/formatters'
 import { generateQuestions } from '../src/questions'
-import { calculateFormatResults, calculateTokenCounts, saveResults } from '../src/report'
+import { calculateFormatResults, calculateTokenCounts, generateAccuracyReport } from '../src/report'
 import { getAllModelResults, hasModelResults, saveModelResults } from '../src/storage'
+import { ensureDir } from '../src/utils'
 
 prompts.intro('Retrieval Accuracy Benchmark')
 
@@ -142,13 +144,15 @@ if (allResults.length === 0) {
   process.exit(0)
 }
 
-// Calculate token counts freshly (deterministic, no need to persist)
 const tokenCounts = calculateTokenCounts(formatters)
-
-// Calculate format statistics and save report
 const formatResults = calculateFormatResults(allResults, tokenCounts)
-const resultsDir = await saveResults(allResults, formatResults, questions, tokenCounts)
+const accuracyReport = generateAccuracyReport(allResults, formatResults, tokenCounts)
+
+const resultsDir = path.join(BENCHMARKS_DIR, 'results')
+await ensureDir(resultsDir)
+
+const outputFilePath = path.join(resultsDir, 'retrieval-accuracy.md')
+await fsp.writeFile(outputFilePath, accuracyReport)
 
-const reportPath = path.join(resultsDir, 'retrieval-accuracy.md')
-prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, reportPath)}\``)
+prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, outputFilePath)}\``)
 reportSpinner.stop('Report generation complete!')

benchmarks/scripts/token-efficiency-benchmark.ts

Lines changed: 1 addition & 1 deletion
@@ -217,4 +217,4 @@ await ensureDir(resultsDir)
 const outputFilePath = path.join(resultsDir, 'token-efficiency.md')
 await fsp.writeFile(outputFilePath, markdown, 'utf-8')
 
-prompts.log.success(`Result saved to \`${path.relative(ROOT_DIR, outputFilePath)}\``)
+prompts.log.success(`Report saved to \`${path.relative(ROOT_DIR, outputFilePath)}\``)
