An empirical comparison of prompting techniques across LLM models and use cases.
This project started as Ailo — a structured prompting framework using schema-based prompts (ACT/OBJ/TAGS) to optimize AI communication. The hypothesis was that structured prompts would consistently outperform natural language.
Then we tested it.
What we found was more nuanced: schema prompting helps in some cases, but simpler techniques often win. Few-shot examples beat complex reasoning chains. Premium models don't need elaborate prompts. Token overhead from fancy techniques rarely pays off.
So we pivoted. Instead of promoting one prompting style, we built a research framework to answer: "Which prompting technique should I use for my model and use case?"
This repository contains:
- Benchmark tooling for 10 prompting techniques (including Verbalized Sampling)
- Results across budget, mid-tier, and premium models
- Data-driven recommendations by use case
Contents:
- Key Findings
- Results by Model
- Code Generation Results
- V3 TRUE Multi-Turn Results
- Verbalized Sampling (VS)
- Agentic Techniques
- Methodology
- Running Benchmarks
- What Happened to Ailo
Methodology Note: Results below use dual evaluation, combining deterministic checks with LLM-as-judge (Opus 4.5) scoring of correctness, completeness, clarity, and relevance. Combined score = 60% deterministic + 40% LLM judge.
Style Combined Score Tokens Key Insight
──────────────────────────────────────────────────────────────────────
schema 94.3% 416 Best overall, clear constraints
cot 93.7% 532 Strong reasoning, higher cost
directional 92.5% 421 Good balance accuracy/tokens
few_shot 91.5% 302 MOST EFFICIENT: -12% tokens
gen_knowledge 91.9% 533 Moderate gains, extra tokens
zero_shot 93.1% 325 Baseline: surprisingly strong
meta 87.4% 658 Often over-complicates
tot 80.2% 4,806 TRUE MULTI-TURN: Not worth cost
self_consistency 81.8% 5,491 TRUE MULTI-TURN: 17x tokens, poor ROI
verbalized_samp. 62.5% 689 CREATIVE ONLY: -37% accuracy on empirical tasks
V3 Update: ToT and Self-Consistency now use TRUE multi-turn conversations (4 API calls each) instead of single-call simulations. This increased tokens 7-17x but reduced quality scores due to context drift.
BUDGET MODELS (Mistral 7B, Nova Micro)
──────────────────────────────────────────────────────────────────────────
Style Combined Correctness Clarity Tokens Verdict
──────────────────────────────────────────────────────────────────────────
cot 95.9% 0.98 0.94 453 WINNER: Best quality
schema 94.1% 0.98 0.99 416 High clarity
zero_shot 93.1% 0.96 0.99 325 Strong baseline
few_shot 85.3%* 0.90 0.93 302 *Can hurt small models
*Note: Mistral 7B few_shot dropped to 87.5% accuracy vs 100% for other styles
MID-TIER MODELS (Claude Haiku 4.5)
──────────────────────────────────────────────────────────────────────────
Style            Combined  Correctness  Clarity  Tokens  Verdict
──────────────────────────────────────────────────────────────────────────
zero_shot 93.5% 1.00 1.00 300 WINNER: Perfect clarity
cot 92.9% 0.99 0.99 554 Great reasoning
schema 92.8% 0.99 0.99 480 Format control
self_consistency 88.0% 0.88 0.95 479 Higher cost, less gain
PREMIUM MODELS (Claude Sonnet 4.5, Mistral Large)
──────────────────────────────────────────────────────────────────────────
Style            Combined  Correctness  Clarity  Tokens  Verdict
──────────────────────────────────────────────────────────────────────────
zero_shot 95.4% 0.98 0.99 291 WINNER: Excellent baseline
cot 94.5% 0.96 0.95 530 Marginal improvement
schema 93.3% 0.99 0.99 522 Format precision
tot 88.9% 0.94 0.93 745 Overkill for these models
| Use Case | Recommended Style | Why | Token Impact |
|---|---|---|---|
| Code Generation | few_shot | Examples demonstrate structure/style | -8% tokens |
| Documentation | schema | Highest clarity (0.99), format control | +28% tokens |
| Math/Logic | cot | Step-by-step reasoning, 95.9% combined | +63% tokens |
| Agents/Agentic | zero_shot + tools | Let tools handle complexity | Baseline |
| Data Analysis | cot or gen_knowledge | Reasoning + domain context | +52-63% |
| Creative Writing | directional | Hints guide without constraining | +35% tokens |
| Creative Diversity | verbalized_sampling | When you WANT multiple varied outputs | +132% tokens |
| API/Integration | schema | Structured output, predictable format | +28% tokens |
| Quick Prototyping | zero_shot | Fast iteration, 93.1% baseline | Baseline |
Code Generation
Best: few_shot (show 1-2 examples of desired code style)
Why: Models learn naming conventions and structure from examples
Avoid: self_consistency (adds tokens, no accuracy gain)
Documentation Writing
Best: schema (ACT=Write OBJ=Documentation TAGS=[Format:Markdown])
Why: Explicit format constraints ensure consistent output
Alt: directional (provide outline hints)
Math & Logic Problems
Best: cot (Chain-of-Thought)
Why: Step-by-step reasoning catches errors, 95.9% combined
Alt: gen_knowledge (recall formulas first, then solve)
Agentic Workflows
Best: zero_shot + tool descriptions
Why: Let tools handle complexity; prompts should be simple triggers
Note: Complex prompting often interferes with tool selection
Data Analysis
Best: cot (for reasoning through data)
Alt: gen_knowledge (recall statistical concepts first)
Why: Explicit reasoning prevents calculation errors
USE CASE RECOMMENDED WHY
─────────────────────────────────────────────────────────────────────────
Code generation few_shot Examples > explanations
Documentation schema Format control, high clarity
Math/Logic cot 95.9% combined, best reasoning
Agents/Agentic zero_shot Keep prompts simple, let tools work
Data analysis cot / gen_knowledge Reasoning prevents errors
Creative directional Hints without constraints
Creative diversity verbalized_sampling When you need 5 varied options
Budget models cot > schema Step-by-step helps smaller models
Premium models zero_shot Already excellent, save tokens
Token-sensitive few_shot 302 avg tokens (lowest)
Avoid (accuracy) verbalized_sampling 62.5% accuracy, worst for correctness
Avoid (cost) self_consistency 78.2%, 17x tokens, worst ROI
- Zero-shot is better than expected — 93.1% combined score, works great on modern models
- CoT wins for reasoning — Especially on budget models (95.9% combined)
- TRUE multi-turn ToT/SC is NOT worth it — V3 tested real 4-turn conversations: tokens increased 7-17x (4,806-5,491 avg) but quality DROPPED due to context drift and repetition
- Schema provides clarity — Highest clarity scores (0.99) across models
- Few-shot can backfire — On small models like Mistral 7B, accuracy dropped 12.5%
- Multi-turn hurts small models most — Mistral 7B accuracy dropped to 75% with TRUE multi-turn ToT/SC (vs 100% single-call)
- Verbalized Sampling is for creativity, not correctness — VS dropped accuracy by 37% on empirical tasks; only use when you explicitly want diverse outputs
| Tier | Models | Cost (per 1K tokens) |
|---|---|---|
| Budget | Nova Micro, Mistral 7B, GPT-4o-mini | $0.00004 - $0.00015 |
| Mid | Claude Haiku 4.5, Nova Lite | $0.0006 - $0.001 |
| Premium | Claude Sonnet 4.5, Mistral Large, GPT-4o | $0.003 - $0.015 |
All models tested on 8 prompts × 9 styles = 72 evaluations per model. Scores combine deterministic checks (60%) + LLM judge scores (40%).
Style Combined Accuracy Correctness Completeness Clarity Tokens
───────────────────────────────────────────────────────────────────────────────────
cot 93.9% 100.0% 0.99 0.97 0.98 630 BEST
schema 94.3% 100.0% 0.99 0.99 0.99 522
zero_shot 91.7% 100.0% 0.99 1.00 1.00 282 Baseline
few_shot 89.5% 100.0% 0.99 0.96 1.00 276
directional 91.4% 100.0% 0.99 0.96 1.00 349
meta 89.9% 100.0% 0.99 0.94 0.99 688
tot 88.4% 100.0% 0.99 0.95 0.94 673
self_consistency 70.0% 100.0% 0.99 0.94 1.00 411 WORST
Key insight: 100% accuracy across all styles. CoT provides best combined quality but zero-shot is already excellent.
Style Combined Accuracy Correctness Completeness Clarity Tokens
───────────────────────────────────────────────────────────────────────────────────
gen_knowledge 97.0% 100.0% 0.99 0.95 0.96 450 BEST
zero_shot 94.0% 100.0% 1.00 0.99 1.00 296
cot 94.0% 100.0% 0.97 0.94 0.97 471
schema 94.0% 100.0% 0.99 0.97 0.99 400
tot 93.0% 100.0% 0.96 0.88 0.90 701
few_shot 92.0% 100.0% 0.99 0.99 1.00 271 EFFICIENT
self_consistency 92.0% 100.0% 0.95 0.89 0.92 702
directional 90.0% 100.0% 0.96 0.89 0.97 400
meta 87.0% 100.0% 0.85 0.74 0.85 625 WORST
Key insight: 100% accuracy across all styles. gen_knowledge achieves highest combined score. few_shot saves tokens.
Style Combined Accuracy Correctness Completeness Clarity Tokens
───────────────────────────────────────────────────────────────────────────────────
zero_shot 93.5% 100.0% 1.00 1.00 1.00 300 WINNER
cot 92.9% 100.0% 0.99 1.00 0.99 554
schema 92.8% 100.0% 0.99 0.99 0.99 480
few_shot 91.6% 100.0% 0.99 1.00 1.00 291
directional 90.7% 100.0% 1.00 0.99 0.98 353
gen_knowledge 91.3% 100.0% 0.99 1.00 0.99 471
meta 84.9% 100.0% 0.98 0.99 0.96 622
tot 91.0% 100.0% 1.00 0.98 0.95 567
self_consistency 88.0% 100.0% 0.88 0.95 0.95 479
Key insight: Perfect accuracy across all styles. Zero-shot achieves perfect clarity and completeness scores.
Style Combined Accuracy Correctness Completeness Clarity Tokens
───────────────────────────────────────────────────────────────────────────────────
few_shot 92.7% 100.0% 1.00 0.96 0.99 241 BEST ROI
zero_shot 91.0% 100.0% 1.00 0.96 0.99 495
cot 90.6% 100.0% 0.91 0.83 0.91 637
schema 91.0% 100.0% 0.99 0.97 0.98 439
directional 88.5% 87.5% 0.99 0.89 0.95 445
meta 81.9% 87.5% 0.96 0.83 0.94 637
tot 88.9% 100.0% 0.96 0.91 0.96 745
self_consistency 58.0% 100.0% 0.91 0.73 0.94 634 WORST
Key insight: Few-shot wins with fewest tokens. Self-consistency has worst combined score.
Style Combined Accuracy Correctness Completeness Clarity Tokens
───────────────────────────────────────────────────────────────────────────────────
cot 95.9% 100.0% 0.98 0.95 0.95 453 WINNER
zero_shot 94.6% 100.0% 0.96 0.97 0.99 320
schema 92.0% 100.0% 0.98 0.97 0.98 438
directional 92.2% 100.0% 0.96 0.94 0.94 423
few_shot 80.3% 87.5% 0.82 0.91 0.85 320 CAUTION
gen_knowledge 87.6% 87.5% 0.91 0.87 0.94 494
meta 83.7% 100.0% 0.93 0.79 0.96 563
tot 89.6% 100.0% 0.92 0.88 0.91 702
self_consistency 68.0% 100.0% 0.84 0.87 0.95 894
Key insight: CoT provides best quality. Few-shot HURTS accuracy on this smaller model (87.5% vs 100%).
Style Combined Accuracy Correctness Completeness Clarity Tokens
───────────────────────────────────────────────────────────────────────────────────
zero_shot 95.4% 100.0% 0.97 0.98 0.99 300 WINNER
cot 95.0% 100.0% 0.93 0.91 0.93 429
few_shot 93.0% 100.0% 0.99 0.98 1.00 290
schema 92.6% 100.0% 0.99 0.96 0.98 398
directional 93.9% 100.0% 0.98 0.96 0.99 394
gen_knowledge 94.7% 100.0% 0.96 0.96 0.96 565
meta 92.7% 100.0% 0.99 0.91 0.95 543
tot 89.4% 100.0% 0.96 0.91 0.93 691
self_consistency 58.0% 100.0% 0.91 0.87 0.95 635 WORST
Key insight: Zero-shot is best. Self-consistency provides worst combined score despite 100% accuracy.
Style Combined Accuracy Correctness Completeness Clarity Tokens
───────────────────────────────────────────────────────────────────────────────────
schema 94.1% 100.0% 0.98 0.98 0.99 440 WINNER
zero_shot 92.2% 100.0% 0.98 0.96 0.99 325
cot 91.8% 100.0% 0.96 0.92 0.95 543
few_shot 91.5% 100.0% 0.98 0.95 0.99 317
directional 92.3% 100.0% 0.98 0.96 0.98 410
gen_knowledge 90.9% 100.0% 0.96 0.93 0.95 566
meta 88.8% 100.0% 0.96 0.91 0.95 686
tot 87.9% 100.0% 0.96 0.90 0.91 746
self_consistency 90.9% 100.0% 0.93 0.94 0.94 706
Key insight: Schema provides highest combined score. All styles achieve 100% accuracy.
Tested on 4 JavaScript algorithms (factorial, fibonacci, GCD, primality) measuring similarity to reference implementations from javascript-algorithms:
Style Similarity Correctness Tokens
─────────────────────────────────────────────────────────────────────────────────────
few_shot ████████████████████ 53.9% 89.1% 315 WINNER
zero_shot ████████████████░░░░ 41.1% 84.5% 227
schema ███████████████░░░░░ 37.7% 90.3% 270
tot ██████████████░░░░░░ 35.3% 75.8% 1211
cot ███████████░░░░░░░░░ 29.0% 72.8% 646
directional ██████████░░░░░░░░░░ 24.7% 76.4% 514
self_consistency █████████░░░░░░░░░░░ 21.7% 71.4% 943
meta ████████░░░░░░░░░░░░ 18.9% 64.5% 735
gen_knowledge ███████░░░░░░░░░░░░░ 16.7% 56.1% 604
Model Best Style Similarity Correctness Tokens
─────────────────────────────────────────────────────────────────────
Gemini 2.0 Flash Few-shot 57.9% 96.4% 263
Claude Haiku Few-shot 60.2% 85.1% 262
Mistral 7B Few-shot 56.8% 92.8% 332
Nova Micro Zero-shot 51.6% 89.3% 157
Why few-shot wins for code:
- Examples demonstrate expected style and structure
- Models learn naming conventions from examples
- Avoids verbose explanations that dilute output
- Lower token overhead than reasoning techniques
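A minimal sketch of the few-shot pattern for code tasks, assuming a plain prompt-assembly helper (the function name, wording, and example bodies are illustrative, not the benchmark's exact prompts):

```python
# Assemble a few-shot code prompt: 1-2 reference implementations shown
# before the task so the model mirrors their naming and structure.
def few_shot_code_prompt(task, examples):
    """examples: list of (description, code) pairs shown before the task."""
    parts = []
    for i, (desc, code) in enumerate(examples, 1):
        parts.append(f"Example {i}: {desc}\n{code}")
    parts.append(f"Now write: {task}\nMatch the style of the examples.")
    return "\n\n".join(parts)
```

Keeping the examples short is what preserves the token advantage shown in the tables above.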
V3 updated Tree of Thoughts (ToT) and Self-Consistency to use real multi-turn conversations instead of single-call simulations. Each technique now makes 4 separate API calls with conversation history.
ToT (Tree of Thoughts) - 4 Turns:
Turn 1: "Solve using Path A (direct approach)"
Turn 2: "Now solve using Path B (alternative method)"
Turn 3: "Now solve using Path C (verification/estimation)"
Turn 4: "Evaluate all paths and give FINAL ANSWER"
Self-Consistency - 4 Turns:
Turn 1: "Solve using Method 1 (standard calculation)"
Turn 2: "Solve using Method 2 (alternative approach)"
Turn 3: "Solve using Method 3 (verification/cross-check)"
Turn 4: "Compare all methods and reconcile to FINAL ANSWER"
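The four-turn sequences above can be sketched as a loop that threads the full conversation history into each API call; `call_model` here is a stand-in for whichever provider client you use (a sketch, not the benchmark's exact implementation):

```python
# TRUE multi-turn: one API call per turn, full history resent each time.
SC_TURNS = [
    "Solve using Method 1 (standard calculation).",
    "Solve using Method 2 (alternative approach).",
    "Solve using Method 3 (verification/cross-check).",
    "Compare all methods and reconcile to FINAL ANSWER.",
]

def run_multi_turn(task, turns, call_model):
    """call_model(messages) -> assistant reply string for the conversation."""
    messages = [{"role": "user", "content": f"{task}\n\n{turns[0]}"}]
    for turn in turns[1:]:
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": turn})
    return call_model(messages)  # final synthesis/reconciliation turn
```

Because every turn resends the growing history, token use compounds, which is where the 7-17x overhead comes from.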
Model ToT Tokens ToT Score SC Tokens SC Score Accuracy
─────────────────────────────────────────────────────────────────────────────
Nova Micro 5,376 82.7% 5,576 81.6% 100%
Claude Haiku 4.5 4,756 86.2% 6,202 87.0% 100%
GPT-4o-mini 4,705 83.4% 5,152 84.6% 100%
Gemini 2.0 Flash 4,725 80.6% 5,810 82.3% 100%
Mistral Large 3,855 72.4% 4,407 80.9% 100%
Claude Sonnet 4.5 5,533 88.7% 6,963 88.9% 100%
Mistral 7B 3,693 67.2% 4,329 69.4% 75% ⚠️
- Context Drift: Models lose focus across turns, repeating earlier reasoning instead of building on it
- Token Explosion: 7-17x more tokens than single-call approaches for similar accuracy
- Clarity Degradation: Clarity scores dropped to 0.50-0.87 (vs 0.98+ for single-call styles)
- Small Model Failure: Mistral 7B accuracy dropped from 100% (single-call) to 75% (multi-turn)
Avoid TRUE multi-turn ToT/Self-Consistency for most use cases. Single-call CoT (93.7% combined, 532 tokens) outperforms multi-turn ToT (80.2% combined, 4,806 tokens) at 1/9th the cost.
Use multi-turn only when:
- You need explicit exploration of multiple solution paths for auditing
- Token cost is not a concern
- Using premium models (Claude Sonnet 4.5 maintained 88.9% quality)
After reading the Stanford paper "Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity", I implemented VS as a 10th prompting style to test its claims. The paper argues that VS can recover diverse outputs lost to RLHF alignment by asking models to generate distributions of responses with probabilities.
I knew going in that VS was designed for creative diversity, not correctness. But I wanted to see how this prompting technique would fare alongside others in empirical, correctness-focused scenarios.
Generate 5 different solutions to this problem, each with a probability score (0.0-1.0).
Format:
Response 1 (Prob: X.XX): [solution]
Response 2 (Prob: X.XX): [solution]
...
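The fixed format makes the output mechanically parseable; a sketch of a parser (the regex is an assumption matching the template above):

```python
import re

# Parse "Response N (Prob: X.XX): [solution]" lines into (prob, solution)
# pairs, highest probability first (the top-1 answer used for accuracy).
VS_LINE = re.compile(r"Response\s+\d+\s+\(Prob:\s*([01](?:\.\d+)?)\):\s*(.+)")

def parse_vs(text):
    results = []
    for line in text.splitlines():
        m = VS_LINE.match(line.strip())
        if m:
            results.append((float(m.group(1)), m.group(2).strip()))
    return sorted(results, key=lambda r: r[0], reverse=True)
```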
| Model | Style | Accuracy | Tokens | Diversity | ROI |
|---|---|---|---|---|---|
| Claude Haiku | zero_shot | 87.5% | 314 | — | baseline |
| Claude Haiku | tot | 87.5% | 5,072 | — | +0.00% per 100 tokens |
| Claude Haiku | verbalized_sampling | 62.5% | 971 | 0.50 | -3.81% per 100 tokens |
| GPT-4o-mini | zero_shot | 100.0% | 297 | — | baseline |
| GPT-4o-mini | tot | 100.0% | 4,710 | — | +0.00% per 100 tokens |
| GPT-4o-mini | verbalized_sampling | 62.5% | 689 | 0.27 | -9.57% per 100 tokens |
| Gemini 2.0 Flash | zero_shot | 87.5% | 434 | — | baseline |
| Gemini 2.0 Flash | tot | 87.5% | 5,166 | — | +0.00% per 100 tokens |
| Gemini 2.0 Flash | verbalized_sampling | 62.5% | 848 | 0.30 | -6.04% per 100 tokens |
Diversity measured using OpenAI text-embedding-3-small cosine similarity (paper methodology). Score = 1 - mean pairwise similarity. Higher = more diverse.
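The diversity score can be sketched as follows, with embeddings passed in as plain vectors (in the benchmark they come from text-embedding-3-small):

```python
from itertools import combinations
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def diversity(embeddings):
    """1 - mean pairwise cosine similarity; higher = more diverse."""
    pairs = list(combinations(embeddings, 2))
    return 1 - sum(cosine(a, b) for a, b in pairs) / len(pairs)
```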
Style Accuracy vs Zero Tokens ROI
──────────────────────────────────────────────────────────────────────────
zero_shot 100.0% baseline 297 baseline
cot 100.0% +0.0% 477 +0.00%
meta 100.0% +0.0% 650 +0.00%
gen_knowledge 100.0% +0.0% 408 +0.00%
directional 100.0% +0.0% 394 +0.00%
tot 100.0% +0.0% 4,710 +0.00%
few_shot 87.5% -12.5% 278 EFFICIENT
schema 87.5% -12.5% 440 -8.74%
self_consistency 87.5% -12.5% 5,083 -0.26%
verbalized_sampling 62.5% -37.5% 689 -9.57% WORST
Embedding Diversity: 0.273 (using OpenAI text-embedding-3-small)
Combined Diversity: 0.520 (lexical + semantic + n-gram)
Parse Success Rate: 100% (VS format reliably parsed)
Any Correct Rate: 75% (at least 1 of 5 answers correct)
Top-1 Accuracy: 62.5% (highest-probability answer correct)
- VS is NOT a cheaper ToT — ToT maintains 100% accuracy at 4,710 tokens; VS drops to 62.5% at 689 tokens
- VS hurts accuracy by 25-37% — Across all models tested, VS consistently underperformed
- VS's probability ranking is unreliable — 75% of tests had a correct answer among the 5, but VS only selected it 62.5% of the time
- Diversity is moderate, not revolutionary — 0.27-0.50 embedding diversity, not the 1.6-2.1x improvement claimed in the paper
- The paper tested creative tasks — Poems, jokes, stories have no "correct" answer; VS excels where diversity IS the goal
The paper's methodology evaluated creative writing where:
- Multiple outputs are valid (any joke about coffee is acceptable)
- Diversity IS the success metric
- Human judges rated "interestingness" not correctness
Our benchmark tests correctness-focused tasks where:
- There's ONE right answer ($11.20, not $14.00)
- VS's "distribution thinking" dilutes focus on the correct solution
- Asking for 5 alternatives spreads cognitive effort across wrong paths
VS is a real technique, but it's a creative diversity tool—NOT a general prompting improvement.
I will continue to include VS in the benchmark for completeness, but I won't use it for anything except creative tasks where I explicitly want diverse outputs (story brainstorming, UI variation generation, etc.).
For correctness-focused work: stick with zero_shot, cot, or schema.
Techniques requiring tool execution or multi-turn orchestration:
| Technique | Description | Best For |
|---|---|---|
| ReAct | Reason + Act loop with tools | Tool-heavy tasks |
| PAL | Generate & execute Python code | Math (saves 59% tokens) |
| Chaining | Multi-step orchestration | Complex workflows |
| Reflexion | Generate, critique, retry | Error recovery |
Technique Pass Rate Tokens LLM Calls Tool Calls
──────────────────────────────────────────────────────────────────────────────────────
zero_shot ████████████████████ 100% 284 1 0
PAL ████████████████████ 100% 116 1 1 BEST!
chaining ████████████████████ 100% 2014 3 0
reflexion ████████████████████ 100% 1209 2 0
react ████████████████░░░░ 80% 476 1 0
PAL saves 59% tokens by generating concise code instead of verbose reasoning.
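A minimal PAL sketch: ask the model for code instead of prose, execute it, and read off the result. The prompt wording and `generate_code` client are assumptions, and real use needs sandboxed execution:

```python
def pal_solve(problem, generate_code):
    """generate_code(prompt) -> Python source that assigns a variable `answer`."""
    prompt = (
        "Write Python that solves this problem and stores the result "
        f"in a variable named `answer`. Return only code.\n{problem}"
    )
    scope = {}
    exec(generate_code(prompt), scope)  # WARNING: sandbox untrusted code
    return scope["answer"]
```

The token saving comes from the model emitting a few lines of code rather than paragraphs of reasoning.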
Technique Pass Rate Tokens Overhead
──────────────────────────────────────────────────────────────────────
zero_shot ████████████████░░░░ 80% 263 baseline
react ████████████████░░░░ 80% 592 +125%
chaining ████████████████░░░░ 80% 1413 +438%
PAL ████████░░░░░░░░░░░░ 40% 146 -44% Code quality issues
Warning: For budget models, PAL fails more often because generated code has errors.
| Technique | Description | Token Overhead | API Calls |
|---|---|---|---|
| zero_shot | Plain natural language | Baseline | 1 |
| few_shot | 1-2 examples before task | -25% to +30% | 1 |
| cot | Step-by-step reasoning | +15% to +68% | 1 |
| schema | Structured ACT/OBJ/TAGS | +4% to +86% | 1 |
| meta | LLM designs approach first | +46% to +123% | 1 |
| gen_knowledge | Generate facts, then answer | +5% to +75% | 1 |
| directional | Hints/keywords to guide | -10% to +27% | 1 |
| tot | TRUE multi-turn: 3 paths + synthesis | +1,100% to +1,700% | 4 |
| self_consistency | TRUE multi-turn: 3 methods + reconcile | +1,200% to +2,000% | 4 |
| verbalized_sampling | Generate 5 responses with probabilities | +95% to +170% | 1 |
Every technique was tested with the same task presented in different formats:
| Technique | Prompt Structure |
|---|---|
| Zero-shot | "Calculate: 7 apples at $2 each with 20% discount" |
| Few-shot | "Example: 3 items at $5 = $15. Now solve: 7 apples..." |
| CoT | "Solve step by step: 1) Calculate total 2) Apply discount..." |
| Schema | ACT=Calculate OBJ=Price TAGS=[ShowWork] |
| Meta | "First, decide how to solve this. Then execute." |
| Gen-Knowledge | "Recall: Discount formula is... Now apply to: 7 apples..." |
| Directional | "Calculate price. HINTS: Total=$14, discount=20%" |
| ToT | 4 turns: Path A → Path B → Path C → Synthesis (TRUE multi-turn) |
| Self-Consistency | 4 turns: Method 1 → Method 2 → Method 3 → Reconcile (TRUE multi-turn) |
| Verbalized Sampling | "Generate 5 approaches with probabilities: Response 1 (Prob: 0.35)..." |
The V2 benchmark uses dual evaluation:
1. Deterministic Evaluation (60% weight)
EVAL TYPES
├── numeric - Extract numbers, compare with tolerance (±0.01)
├── contains - Check if expected answer appears in response
├── keywords - Count required keywords found (threshold 0.5)
├── exact - Normalized string match
├── fuzzy - Substring match + F1 token overlap
└── code_exec - Lint + execute against test cases
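The `numeric` check can be sketched like this (whether to accept any extracted number or only the final one is an implementation choice; this version accepts any):

```python
import re

# Matches integers and decimals, with optional sign and thousands commas.
NUM = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")

def eval_numeric(response, expected, tol=0.01):
    """Extract numbers from the response; pass if any is within tolerance."""
    nums = [float(m.group().replace(",", "")) for m in NUM.finditer(response)]
    return any(abs(n - expected) <= tol for n in nums)
```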
2. LLM-as-Judge (40% weight) - Opus 4.5
| Dimension | Question Asked | What It Measures |
|---|---|---|
| Correctness | Is the answer factually correct? | Did the model arrive at the right answer? For math, is the number right? For logic, is the conclusion valid? |
| Completeness | Does it fully address the task? | Are all parts of the question answered? Nothing missing or skipped? |
| Clarity | Is it well-organized and clear? | Easy to follow? Good structure? No rambling or disjointed reasoning? |
| Relevance | Does it stay on topic? | No tangents, unnecessary content, or off-topic information? |
Each dimension scored 0.0 to 1.0. The combined LLM judge score averages all four.
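Putting the two evaluators together, the weighting reduces to a one-liner (a direct transcription of the 60/40 rule above):

```python
def combined_score(deterministic, judge_dims):
    """deterministic in [0, 1]; judge_dims maps dimension name -> 0.0-1.0 score."""
    judge = sum(judge_dims.values()) / len(judge_dims)  # mean of the four dimensions
    return 0.6 * deterministic + 0.4 * judge
```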
Why multi-turn hurt clarity: Multi-turn ToT/SC responses scored 0.50-0.70 on clarity because the final synthesis often referenced "Path A" without restating conclusions, repeated earlier reasoning, or produced disjointed summaries assuming context from prior turns. Single-call techniques kept everything in one coherent response, scoring 0.95-1.0.
Unified Benchmark (8 test cases × 10 styles = 80 per model):
- Math: Discount calculation, percentage calculation
- Logic: Pet logic puzzle
- Writing: Executive summary
- Analysis: Framework comparison, pros/cons
- Technical: Code explanation
- Creative: Story ideas
Code Benchmark (6 algorithms):
- Factorial, Fibonacci, GCD, Primality, Reverse String, Two Sum
- Execution-based evaluation (syntax + test cases)
- Sample Size: 8 test cases per comprehensive run
- Multi-Turn Context Drift: TRUE multi-turn ToT/SC suffer from context drift, which may partially explain quality degradation
- Model Versions: Results may vary with model updates
- LLM Judge Bias: Opus 4.5 may favor certain styles (mitigated by 60/40 weighting)
cd research
# Install dependencies
pip install boto3 openai google-generativeai
# Create .env file with API keys
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
OPENAI_KEY=your_key
GEMINI_KEY=your_key

# Test a model connection
python multi_provider_client.py claude-haiku
# Run V2 benchmark WITHOUT LLM judge (faster, cheaper)
python unified_benchmark_v2.py --model gemini-2.0-flash
# Run V2 benchmark WITH LLM judge (recommended for quality)
python unified_benchmark_v2.py --model claude-sonnet --llm-judge
# Run specific styles only
python unified_benchmark_v2.py --model nova-micro --styles zero_shot few_shot cot
# Run code generation benchmark with execution
python code_benchmark_v2.py --model gpt-4o-mini
# List all available models
python multi_provider_client.py list

| Provider | Models |
|---|---|
| AWS Bedrock | Claude (Haiku, Sonnet, Opus), Nova (Micro, Lite), Mistral (7B, Large), Llama |
| OpenAI | GPT-4o, GPT-4o-mini, GPT-3.5-turbo, o1-mini |
| Gemini 2.0 Flash, 1.5 Flash, 1.5 Pro |
ailo/
├── README.md # This file
├── research/
│ ├── evaluation.py # Unified evaluation module (V2)
│ ├── multi_provider_client.py # Unified client (Bedrock, OpenAI, Gemini)
│ ├── unified_benchmark_v2.py # Main benchmark runner with LLM judge
│ ├── code_benchmark_v2.py # Code generation with execution
│ ├── test_prompts_v4.py # Multi-style prompt definitions
│ └── results/ # Raw benchmark data (JSON)
│ ├── unified_v2_*.json # V2 benchmark results
│ └── code_v2_*.json # Code benchmark results
This project started as Ailo — a structured prompting framework. The original hypothesis was that schema-based prompts would consistently outperform natural language.
CONTEXT = [Background / why you need this]
PERSONA = [Role for AI to adopt: mentor, critic, analyst...]
MODE = [Task type: Generate, Evaluate, Compare, Plan...]
ACT = [What you want done]
OBJ = [The subject/object to work on]
TAGS = [
Format: [list, table, code, JSON...]
Length: [short, 200 words, 5 bullets...]
Style: [formal, casual, technical...]
Audience: [beginner, expert, executive...]
Constraints: [no jargon, max 3 steps...]
]
OUTPUT = [Delivery format: text, code, file...]
Example:
PERSONA = Business analyst briefing an executive
MODE = Summarize
ACT = Summarize
OBJ = Climate policy report
TAGS = [Format:Bullets, Length:5, Audience:Executive, Constraints:No jargon]
OUTPUT = Text
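Rendering the schema programmatically is straightforward; a sketch (field names follow the spec above, the helper itself is illustrative):

```python
def build_schema_prompt(act, obj, tags=None, persona=None, mode=None, output=None):
    """Render ACT/OBJ/TAGS (plus optional fields) into a schema prompt string."""
    lines = []
    if persona:
        lines.append(f"PERSONA = {persona}")
    if mode:
        lines.append(f"MODE = {mode}")
    lines.append(f"ACT = {act}")
    lines.append(f"OBJ = {obj}")
    if tags:
        lines.append("TAGS = [" + ", ".join(f"{k}:{v}" for k, v in tags.items()) + "]")
    if output:
        lines.append(f"OUTPUT = {output}")
    return "\n".join(lines)
```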
Schema prompting (tested as "schema" style in our benchmarks) performs well for:
- Best combined quality — 94.3% combined score (highest average across all models)
- Clarity optimization — Consistently achieves 0.99 clarity scores
- Budget models like Nova Micro — 94.1% combined, outperforming zero-shot
But the V2 results show key nuances:
- Zero-shot is better than expected — 93.1% combined, works great on modern models
- CoT wins for reasoning — 95.9% combined on budget models
- Self-consistency is wasteful — 78.2% combined, worst ROI across all styles
- Few-shot can hurt small models — Mistral 7B accuracy dropped 12.5% with few-shot
The research shows prompting style choice depends on model tier and task type.
MIT License — Use this research freely.
V3.1: Empirical prompting research with TRUE multi-turn, Verbalized Sampling, and LLM-as-Judge evaluation, November 2025