You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
- 📐 **Indentation-based structure:** like YAML, uses whitespace instead of braces
69
69
- 🧺 **Tabular arrays:** declare keys once, stream data as rows
70
70
71
+
[^1]: For flat tabular data, CSV is more compact. TOON adds minimal overhead to provide explicit structure and validation that improves LLM reliability.
72
+
71
73
## Benchmarks
72
74
73
75
> [!TIP]
@@ -80,12 +82,10 @@ Token counts are measured using the GPT-5 `o200k_base` tokenizer via [`gpt-token
80
82
The benchmarks use datasets optimized for TOON's strengths (uniform tabular data). Real-world performance depends on your data structure.
81
83
82
84
> [!NOTE]
83
-
> CSV/TSV isn't shown in the token-efficiency chart because it doesn't encode nesting without flattening. For flat datasets, see CSV token counts in the [Retrieval Accuracy](#retrieval-accuracy)tables.
85
+
> CSV/TSV doesn't support nested structures, so it's not included in this comparison. For flat datasets where CSV applies, see token counts and accuracy metrics in the [Retrieval Accuracy](#retrieval-accuracy)section.
xml ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 6.8 │ 67.2% acc │ 9,944 tokens
271
+
```
272
+
273
+
TOON achieves **70.1%** accuracy (vs JSON's 65.4%) while using **46.3% fewer tokens**.
274
+
275
+
#### Per-Model Accuracy
257
276
258
277
Accuracy across **4 LLMs** on 154 data retrieval questions:
259
278
@@ -915,7 +934,7 @@ By default, the decoder validates input strictly:
915
934
- Format familiarity and structure matter as much as token count. TOON's tabular format requires arrays of objects with identical keys and primitive values only. When this doesn't hold (due to mixed types, non-uniform objects, or nested structures), TOON switches to list format where JSON can be more efficient at scale.
916
935
-**TOON excels at:** Uniform arrays of objects (same fields, primitive values), especially large datasets with consistent structure.
917
936
-**JSON is better for:** Non-uniform data, deeply nested structures, and objects with varying field sets.
918
-
-**CSV is more compact for:** Flat, uniform tables without nesting. TOON adds minimal overhead (`[N]` length markers, delimiter scoping, deterministic quoting) to improve LLM reliability while staying close to CSV's token efficiency.
937
+
-**CSV is more compact for:** Flat, uniform tables without nesting. TOON adds structure (`[N]` length markers, delimiter scoping, deterministic quoting) that improves LLM reliability with minimal token overhead.
919
938
-**Token counts vary by tokenizer and model.** Benchmarks use a GPT-style tokenizer (cl100k/o200k); actual savings will differ with other models (e.g., [SentencePiece](https://github.com/google/sentencepiece)).
920
939
-**TOON is designed for LLM input** where human readability and token efficiency matter. It's **not** a drop-in replacement for JSON in APIs or storage.
0 commit comments