Ran an expanded benchmark (34 models, 3,026 tests) — some findings that might be useful #285
Replies: 2 comments
-
Many thanks for your work on providing this insight!
-
The filtering gap analysis is the most useful finding here. Tabular formats create an implicit mapping problem: the model has to track column positions back to headers, while self-contained key-value formats carry the label with every value. That's a structural difference in information density, not a model capability issue.
The prompt engineering suggestion at the end makes sense. It's the same principle as explicit block separation in system prompts: when the model has to infer context from position rather than labeled keys, accuracy on complex tasks drops. The filtering failure rate is the tabular version of what happens when role, constraints, and output format are all mixed into a single prose blob.
I've been building flompt for exactly that problem on the prompt side: a visual prompt builder that decomposes system prompts into 12 explicit semantic blocks. Open-source: github.com/Nyrok/flompt
-
Hey! I came across toon recently via a colleague, so I thought I would burn a few tokens taking a look.
I ended up running my own benchmark, and figured the results might be interesting to the project. This isn't a critique (your own benchmark is solid); I just happened to test at a wider scale, and a few patterns emerged that I hadn't seen documented elsewhere.
What I tested
toon-format Python SDK (v0.9.0b1) for encoding
Token efficiency: your numbers hold up
I got basically the same results:
Accuracy
I ran it over 34 models, and the picture shifts a little:
Then I asked Claude to process the results... and it said:
---- AI Alert: Generated text from this point! :) ----
The gap is almost entirely explained by one category, which I'll get to below. But first, the good news:
Field retrieval and structural queries are essentially format-agnostic. All five formats score 95-99%. Models parse TOON's tabular syntax just as reliably as JSON for direct lookups. This is the most common real-world use case, and TOON is a clear win here — same accuracy, 30% fewer tokens.
Finding 1: The filtering gap is real, but it's a tabular format problem, not a TOON problem
Your benchmark shows TOON at 56.8% on filtering vs JSON at 53.1% — TOON slightly ahead. Across 34 models, I found the opposite: TOON at 66.9% vs JSON at 78.7%.
But here's the thing that might be more useful than the specific numbers: I also tested CSV, and it shows the exact same weakness. CSV scores 64.7% on filtering. Both tabular formats struggle when models need to scan rows and apply conditions, because the model has to map column positions back to the header rather than reading self-contained key-value objects.
This reframes the question from "does TOON have a filtering problem?" to "do tabular formats in general have a filtering problem?" — which might inform whether the solution is format-level or prompt-level.
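To make the column-position-mapping point concrete, here's a small sketch contrasting the two shapes. The JSON side uses the real `json` module; the tabular side is written by hand as a TOON-like string (an approximation for illustration, not output from the toon-format SDK). The records and field names are invented for the example:

```python
import json

# Three records a model might need to filter, e.g. "users older than 30".
# (Hypothetical data, chosen only to illustrate the two layouts.)
users = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob", "age": 28},
    {"id": 3, "name": "Cara", "age": 41},
]

# Self-contained key-value form: every value travels with its label,
# so a filtering condition like age > 30 can be checked row-locally.
as_json = json.dumps(users, indent=2)

# Tabular form (TOON-like sketch): the header names each column exactly
# once, and every row value must be mapped back to its column position
# before the label is known again.
as_tabular = "users[3]{id,name,age}:\n" + "\n".join(
    f"  {u['id']},{u['name']},{u['age']}" for u in users
)

print(as_json)
print(as_tabular)
```

The token savings and the filtering weakness are two sides of the same property: the label `age` appears once in the tabular form but once per record in JSON.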
Finding 2: Model capability tier matters a lot
This only becomes visible with enough models. When I grouped by capability:
This creates a practical tension worth noting: the models where token savings matter most (cheaper, smaller) are the ones most likely to lose accuracy on complex queries. The models that handle TOON perfectly are already the expensive ones where 30% token savings is a smaller absolute dollar amount.
The sweet spot seems to be mid-tier models on lookup-heavy workloads — you capture the savings where they matter, on query types where accuracy holds.
Finding 3: TOON beats CSV on accuracy (and that's a good selling point)
Anyone evaluating TOON will naturally ask "why not just use CSV?" The token story is already good (TOON is only 2-6% more expensive than CSV while handling nested data). But now I can add: CSV actually performs worse than TOON on aggregation (33.8% vs 47.1%), and only marginally better on filtering (64.7% vs 66.9%). TOON offers comparable or better accuracy than CSV plus full JSON expressiveness. That's a strong argument.
Finding 4: Some model architectures favour TOON
Six models in my benchmark actually scored higher on TOON than JSON: Claude Sonnet 4, Gemma 3 27B, Mistral Large 3, Nova Pro, Llama 3.2 11B, and Nemotron Nano 9B. These aren't all from the same tier or provider — it suggests some architectures may naturally handle tabular formats better. Could be interesting to investigate why.
What I'd suggest (take it or leave it)
The "accuracy per 1K tokens" metric in your README is great — it captures the real tradeoff well. Might be worth noting that the filtering category is the main drag and that lookups/structural queries are effectively at parity.
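For readers skimming this, the metric itself is just a ratio. A minimal sketch, using the filtering accuracies from Finding 1 but with hypothetical token counts (assuming roughly the 30% savings mentioned above; these are not measured benchmark numbers):

```python
def accuracy_per_1k_tokens(accuracy_pct: float, tokens: int) -> float:
    """Accuracy points delivered per 1,000 prompt tokens spent."""
    return accuracy_pct / (tokens / 1000)

# Accuracies from the filtering category; token counts are made up
# for illustration (JSON baseline vs ~30% fewer tokens for TOON).
json_eff = accuracy_per_1k_tokens(78.7, 10_000)  # ~7.9
toon_eff = accuracy_per_1k_tokens(66.9, 7_000)   # ~9.6
```

Under these assumed token counts TOON still comes out ahead on the combined metric even in its weakest category, which is why the metric captures the tradeoff well.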
CSV as a comparison point in the benchmark could strengthen the pitch. TOON beats CSV on accuracy while being nearly as compact — that directly addresses the "why not just CSV?" objection.
Prompt engineering guidance for filtering could help adoption. Since the weakness is about column-position mapping, something like "for filtering-heavy use cases, consider including a brief explanation of the tabular format in your system prompt" might narrow the gap.
The model-tier pattern might be worth documenting — even just a note that top-tier models handle TOON with no accuracy loss, while smaller models may show a gap on filtering tasks. Helps users set expectations.