Ran an expanded benchmark (34 models, 3,026 tests) — some findings that might be useful #285
Replies: 2 comments
-
Many thanks for your work on providing this insight!
-
The filtering gap analysis is the most useful finding here. Tabular formats create an implicit mapping problem: the model has to track column positions back to headers, while self-contained key-value formats carry the label with every value. That's a structural difference in information density, not a model capability issue.
The prompt engineering suggestion at the end makes sense. It's the same principle as explicit block separation in system prompts: when the model has to infer context from position rather than labeled keys, accuracy on complex tasks drops. The filtering failure rate is the tabular version of what happens when role, constraints, and output format are all mixed into a single prose blob.
I've been building flompt for exactly that problem on the prompt side: a visual prompt builder that decomposes system prompts into 12 explicit semantic blocks. Open-source: github.com/Nyrok/flompt
-
Hey! I came across toon recently via a colleague, so I thought I would burn a few tokens taking a look.
I ended up running my own benchmark, and figured the results might be interesting to the project. This isn't a critique (your own benchmark is solid); I just happened to test at a wider scale, and a few patterns emerged that I hadn't seen documented elsewhere.
What I tested
toon-format Python SDK (v0.9.0b1) for encoding
Token efficiency: your numbers hold up
I got basically the same results:
Accuracy
I ran it over 34 models, and the picture shifts a little:
Then I asked Claude to process the results... and it said:
---- AI Alert: Generated text from this point! :) ----
The gap is almost entirely explained by one category, which I'll get to below. But first, the good news:
Field retrieval and structural queries are essentially format-agnostic. All five formats score 95-99%. Models parse TOON's tabular syntax just as reliably as JSON for direct lookups. This is the most common real-world use case, and TOON is a clear win here — same accuracy, 30% fewer tokens.
Finding 1: The filtering gap is real, but it's a tabular format problem, not a TOON problem
Your benchmark shows TOON at 56.8% on filtering vs JSON at 53.1% — TOON slightly ahead. Across 34 models, I found the opposite: TOON at 66.9% vs JSON at 78.7%.
But here's the thing that might be more useful than the specific numbers: I also tested CSV, and it shows the exact same weakness. CSV scores 64.7% on filtering. Both tabular formats struggle when models need to scan rows and apply conditions, because the model has to map column positions back to the header rather than reading self-contained key-value objects.
This reframes the question from "does TOON have a filtering problem?" to "do tabular formats in general have a filtering problem?" — which might inform whether the solution is format-level or prompt-level.
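To make the column-position-mapping point concrete, here's a small sketch contrasting the two shapes. The JSON side uses the real `json` module; the tabular side is written by hand as a TOON-like string (an approximation for illustration, not output from the toon-format SDK). The records and field names are invented for the example:

```python
import json

# Three records a model might need to filter, e.g. "users older than 30".
# (Hypothetical data, chosen only to illustrate the two layouts.)
users = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob", "age": 28},
    {"id": 3, "name": "Cara", "age": 41},
]

# Self-contained key-value form: every value travels with its label,
# so a filtering condition like age > 30 can be checked row-locally.
as_json = json.dumps(users, indent=2)

# Tabular form (TOON-like sketch): the header names each column exactly
# once, and every row value must be mapped back to its column position
# before the label is known again.
as_tabular = "users[3]{id,name,age}:\n" + "\n".join(
    f"  {u['id']},{u['name']},{u['age']}" for u in users
)

print(as_json)
print(as_tabular)
```

The token savings and the filtering weakness are two sides of the same property: the label `age` appears once in the tabular form but once per record in JSON.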
Finding 2: Model capability tier matters a lot
This only becomes visible with enough models. When I grouped by capability:
This creates a practical tension worth noting: the models where token savings matter most (cheaper, smaller) are the ones most likely to lose accuracy on complex queries. The models that handle TOON perfectly are already the expensive ones where 30% token savings is a smaller absolute dollar amount.
The sweet spot seems to be mid-tier models on lookup-heavy workloads — you capture the savings where they matter, on query types where accuracy holds.
Finding 3: TOON beats CSV on accuracy (and that's a good selling point)
Anyone evaluating TOON will naturally ask "why not just use CSV?" The token story is already good (TOON is only 2-6% more expensive than CSV while handling nested data). But now I can add: CSV actually performs worse than TOON on aggregation (33.8% vs 47.1%), and only marginally better on filtering (64.7% vs 66.9%). TOON offers comparable or better accuracy than CSV plus full JSON expressiveness. That's a strong argument.
Finding 4: Some model architectures favour TOON
Six models in my benchmark actually scored higher on TOON than JSON: Claude Sonnet 4, Gemma 3 27B, Mistral Large 3, Nova Pro, Llama 3.2 11B, and Nemotron Nano 9B. These aren't all from the same tier or provider — it suggests some architectures may naturally handle tabular formats better. Could be interesting to investigate why.
What I'd suggest (take it or leave it)
The "accuracy per 1K tokens" metric in your README is great — it captures the real tradeoff well. Might be worth noting that the filtering category is the main drag and that lookups/structural queries are effectively at parity.
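For readers skimming this, the metric itself is just a ratio. A minimal sketch, using the filtering accuracies from Finding 1 but with hypothetical token counts (assuming roughly the 30% savings mentioned above; these are not measured benchmark numbers):

```python
def accuracy_per_1k_tokens(accuracy_pct: float, tokens: int) -> float:
    """Accuracy points delivered per 1,000 prompt tokens spent."""
    return accuracy_pct / (tokens / 1000)

# Accuracies from the filtering category; token counts are made up
# for illustration (JSON baseline vs ~30% fewer tokens for TOON).
json_eff = accuracy_per_1k_tokens(78.7, 10_000)  # ~7.9
toon_eff = accuracy_per_1k_tokens(66.9, 7_000)   # ~9.6
```

Under these assumed token counts TOON still comes out ahead on the combined metric even in its weakest category, which is why the metric captures the tradeoff well.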
CSV as a comparison point in the benchmark could strengthen the pitch. TOON beats CSV on accuracy while being nearly as compact — that directly addresses the "why not just CSV?" objection.
Prompt engineering guidance for filtering could help adoption. Since the weakness is about column-position mapping, something like "for filtering-heavy use cases, consider including a brief explanation of the tabular format in your system prompt" might narrow the gap.
The model-tier pattern might be worth documenting — even just a note that top-tier models handle TOON with no accuracy loss, while smaller models may show a gap on filtering tasks. Helps users set expectations.