
Commit af17efe

docs: add accuracy per 1k tokens report (closes #72)
1 parent 9268fdf commit af17efe

8 files changed

Lines changed: 396 additions & 163 deletions


README.md

Lines changed: 25 additions & 6 deletions
@@ -62,12 +62,14 @@ For small payloads, JSON/CSV/YAML work fine. TOON's value emerges at scale: when
 
 ## Key Features
 
-- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON
+- 💸 **Token-efficient:** typically 30–60% fewer tokens than JSON[^1]
 - 🤿 **LLM-friendly guardrails:** explicit lengths and fields enable validation
 - 🍱 **Minimal syntax:** removes redundant punctuation (braces, brackets, most quotes)
 - 📐 **Indentation-based structure:** like YAML, uses whitespace instead of braces
 - 🧺 **Tabular arrays:** declare keys once, stream data as rows
 
+[^1]: For flat tabular data, CSV is more compact. TOON adds minimal overhead to provide explicit structure and validation that improves LLM reliability.
+
 ## Benchmarks
 
 > [!TIP]
@@ -80,12 +82,10 @@ Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-token
 The benchmarks use datasets optimized for TOON's strengths (uniform tabular data). Real-world performance depends on your data structure.
 
 > [!NOTE]
-> CSV/TSV isn't shown in the token-efficiency chart because it doesn't encode nesting without flattening. For flat datasets, see CSV token counts in the [Retrieval Accuracy](#retrieval-accuracy) tables.
+> CSV/TSV doesn't support nested structures, so it's not included in this comparison. For flat datasets where CSV applies, see token counts and accuracy metrics in the [Retrieval Accuracy](#retrieval-accuracy) section.
 
 <!-- automd:file src="./benchmarks/results/token-efficiency.md" -->
 
-### Token Efficiency
-
 ```
 ⭐ GitHub Repositories ██████████████░░░░░░░░░░░ 8,745 tokens
    vs JSON (−42.3%)    15,145
@@ -251,9 +251,28 @@ metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:
 
 <!-- /automd -->
 
+### Retrieval Accuracy
+
 <!-- automd:file src="./benchmarks/results/retrieval-accuracy.md" -->
 
-### Retrieval Accuracy
+Benchmarks test LLM comprehension across different input formats using 154 data retrieval questions on 4 models.
+
+#### Efficiency Ranking (Accuracy per 1K Tokens)
+
+Each format's overall performance, balancing accuracy against token cost:
+
+```
+toon         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.0 │ 70.1% acc │ 4,678 tokens
+csv          ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 14.3 │ 67.7% acc │ 4,745 tokens
+json-compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 11.0 │ 65.3% acc │ 5,925 tokens
+yaml         ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░  9.4 │ 66.7% acc │ 7,091 tokens
+json-pretty  ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░  7.5 │ 65.4% acc │ 8,713 tokens
+xml          ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░  6.8 │ 67.2% acc │ 9,944 tokens
+```
+
+TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.
+
+#### Per-Model Accuracy
 
 Accuracy across **4 LLMs** on 154 data retrieval questions:
 
@@ -915,7 +934,7 @@ By default, the decoder validates input strictly:
 - Format familiarity and structure matter as much as token count. TOON's tabular format requires arrays of objects with identical keys and primitive values only. When this doesn't hold (due to mixed types, non-uniform objects, or nested structures), TOON switches to list format where JSON can be more efficient at scale.
 - **TOON excels at:** Uniform arrays of objects (same fields, primitive values), especially large datasets with consistent structure.
 - **JSON is better for:** Non-uniform data, deeply nested structures, and objects with varying field sets.
-- **CSV is more compact for:** Flat, uniform tables without nesting. TOON adds minimal overhead (`[N]` length markers, delimiter scoping, deterministic quoting) to improve LLM reliability while staying close to CSV's token efficiency.
+- **CSV is more compact for:** Flat, uniform tables without nesting. TOON adds structure (`[N]` length markers, delimiter scoping, deterministic quoting) that improves LLM reliability with minimal token overhead.
 - **Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (cl100k/o200k); actual savings will differ with other models (e.g., [SentencePiece](https://github.com/google/sentencepiece)).
 - **TOON is designed for LLM input** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.

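The "Accuracy per 1K Tokens" metric introduced in the README hunk above is simple arithmetic: accuracy percentage divided by token count in thousands. A minimal sketch that reproduces the chart's figures (the helper name is illustrative, not the repo's `src/report` API):

```typescript
// Accuracy per 1K tokens: accuracy (%) divided by token count in thousands.
// Hypothetical helper for illustration; the real logic lives in src/report.
function accuracyPer1kTokens(accuracyPct: number, tokens: number): number {
  return accuracyPct / (tokens / 1000)
}

// Reproducing the chart's figures (rounded to one decimal):
const toon = accuracyPer1kTokens(70.1, 4678)        // ≈ 15.0
const jsonPretty = accuracyPer1kTokens(65.4, 8713)  // ≈ 7.5

// The "46.3% fewer tokens" claim compares toon to json-pretty:
const savings = 1 - 4678 / 8713                     // ≈ 0.463
```

Note the metric rewards both higher accuracy and lower token cost, which is why TOON (15.0) outranks CSV (14.3) despite the near-identical token counts.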
benchmarks/results/retrieval-accuracy.md

Lines changed: 18 additions & 1 deletion
@@ -1,4 +1,21 @@
-### Retrieval Accuracy
+Benchmarks test LLM comprehension across different input formats using 154 data retrieval questions on 4 models.
+
+#### Efficiency Ranking (Accuracy per 1K Tokens)
+
+Each format's overall performance, balancing accuracy against token cost:
+
+```
+toon         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 15.0 │ 70.1% acc │ 4,678 tokens
+csv          ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░ 14.3 │ 67.7% acc │ 4,745 tokens
+json-compact ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░ 11.0 │ 65.3% acc │ 5,925 tokens
+yaml         ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░  9.4 │ 66.7% acc │ 7,091 tokens
+json-pretty  ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░  7.5 │ 65.4% acc │ 8,713 tokens
+xml          ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░  6.8 │ 67.2% acc │ 9,944 tokens
+```
+
+TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.
+
+#### Per-Model Accuracy
 
 Accuracy across **4 LLMs** on 154 data retrieval questions:
 

benchmarks/results/token-efficiency.md

Lines changed: 0 additions & 2 deletions
@@ -1,5 +1,3 @@
-### Token Efficiency
-
 ```
 ⭐ GitHub Repositories ██████████████░░░░░░░░░░░ 8,745 tokens
    vs JSON (−42.3%)    15,145
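The savings in the chart above come from TOON's tabular layout, visible in the `metrics[5]{date,views,clicks,conversions,revenue,bounceRate}:` header in the README hunk: keys are declared once with an explicit `[N]` length marker, and values stream as delimiter-separated rows. A minimal sketch of the idea on invented data (field names are illustrative, not from the benchmark datasets):

```
// JSON (compact): keys repeat in every object
[{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]

// TOON: keys declared once, [2] length marker, rows streamed
users[2]{id,name}:
  1,Alice
  2,Bob
```

The per-object key repetition is what JSON pays for at scale; the `[2]` length marker is the small overhead TOON accepts over CSV in exchange for validation.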

benchmarks/scripts/accuracy-benchmark.ts

Lines changed: 12 additions & 8 deletions
@@ -1,15 +1,17 @@
 import type { Question } from '../src/types'
+import * as fsp from 'node:fs/promises'
 import * as path from 'node:path'
 import process from 'node:process'
 import * as prompts from '@clack/prompts'
 import PQueue from 'p-queue'
-import { DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
+import { BENCHMARKS_DIR, DEFAULT_CONCURRENCY, DRY_RUN, DRY_RUN_LIMITS, MODEL_RPM_LIMITS, ROOT_DIR } from '../src/constants'
 import { datasets } from '../src/datasets'
 import { evaluateQuestion, models } from '../src/evaluate'
 import { formatters } from '../src/formatters'
 import { generateQuestions } from '../src/questions'
-import { calculateFormatResults, calculateTokenCounts, saveResults } from '../src/report'
+import { calculateFormatResults, calculateTokenCounts, generateAccuracyReport } from '../src/report'
 import { getAllModelResults, hasModelResults, saveModelResults } from '../src/storage'
+import { ensureDir } from '../src/utils'
 
 prompts.intro('Retrieval Accuracy Benchmark')
 
@@ -142,13 +144,15 @@ if (allResults.length === 0) {
   process.exit(0)
 }
 
-// Calculate token counts freshly (deterministic, no need to persist)
 const tokenCounts = calculateTokenCounts(formatters)
-
-// Calculate format statistics and save report
 const formatResults = calculateFormatResults(allResults, tokenCounts)
-const resultsDir = await saveResults(allResults, formatResults, questions, tokenCounts)
+const accuracyReport = generateAccuracyReport(allResults, formatResults, tokenCounts)
+
+const resultsDir = path.join(BENCHMARKS_DIR, 'results')
+await ensureDir(resultsDir)
+
+const outputFilePath = path.join(resultsDir, 'retrieval-accuracy.md')
+await fsp.writeFile(outputFilePath, accuracyReport)
 
-const reportPath = path.join(resultsDir, 'retrieval-accuracy.md')
-prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, reportPath)}\``)
+prompts.log.info(`Report saved to: \`${path.relative(ROOT_DIR, outputFilePath)}\``)
 reportSpinner.stop('Report generation complete!')

benchmarks/scripts/token-efficiency-benchmark.ts

Lines changed: 1 addition & 1 deletion
@@ -217,4 +217,4 @@ await ensureDir(resultsDir)
 const outputFilePath = path.join(resultsDir, 'token-efficiency.md')
 await fsp.writeFile(outputFilePath, markdown, 'utf-8')
 
-prompts.log.success(`Result saved to \`${path.relative(ROOT_DIR, outputFilePath)}\``)
+prompts.log.success(`Report saved to \`${path.relative(ROOT_DIR, outputFilePath)}\``)
