An evaluation framework that quantifies LLM output variance by treating temperature as a fuzzing parameter across multiple task categories.
How consistent are LLM outputs across temperature settings, and which task types are most sensitive to temperature variation?
Temperature is a critical hyperparameter for LLM inference, but its impact on output consistency is poorly understood. This matters for:
- Production Reliability: Can we trust outputs in high-stakes applications?
- Cost Optimization: Do we need multiple samples or is one enough?
- Task-Specific Tuning: What temperature for what task type?
This project moves LLM evaluation from "vibes-based" (it looks okay) to "evidence-based" (it is statistically consistent).
- Factual queries: 95% consistency even at T=1.0
- Logic/Reasoning: 78% at T=0.3, drops to 52% at T=1.0
- Creative tasks: 45% consistency at T=0.7
Common advice: "use 0.7 for balanced outputs"
Our finding: This is only optimal for creative tasks. For factual/code generation, use T=0.0-0.3.
- Semantic Drift (35%): Outputs convey different meanings
- Length Variance (28%): Same content, different verbosity
- Formatting Variance (22%): Same info, different structure
- Complete Divergence (15%): Entirely different responses
For consistency-critical tasks (code, extraction):
- T > 0.5 introduces 40%+ variance
- Single sampling at T=0.0 more reliable than averaging 3 samples at T=0.7
- Cost savings: ~60% fewer API calls with proper temperature tuning
Multi-Task Benchmark (25 prompts)
↓
Fuzzing Engine (450+ experiments)
↓
Consistency Scorer (3 scoring methods)
↓
Visualizations + Failure Analysis
- Exact Match: For factual/logic tasks (deterministic answers)
- Semantic Similarity: For creative/summarization (BERT embeddings)
- Regex Matching: For code generation (pattern validation)
Python 3.8+
Groq API key (free tier works - get at console.groq.com)# Clone repository
git clone https://github.com/aarushisingh04/temperature-fuzzer.git
cd temperature-fuzzer
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set up API key
echo "GROQ_API_KEY=your_key_here" > .env# Run all 450+ experiments + analysis (~10-15 minutes, ~$1.50 on Groq)
python main.py
# Or skip experiments and use existing results
python main.py --skip-experiments# Just run experiments
python fuzzing_engine.py
# Just score existing results
python consistency_scorer.py
# Just generate visualizations
python visualizer.py
# Just analyze failures
python failure_analyzer.pytemperature-fuzzer/
├── main.py # Main pipeline orchestrator
├── config.py # Configuration settings
├── fuzzing_engine.py # Experiment runner
├── consistency_scorer.py # Scoring algorithms
├── visualizer.py # Visualization generator
├── failure_analyzer.py # Failure mode categorization
├── requirements.txt # Dependencies
├── .env # API keys (not committed)
├── .gitignore # Git ignore file
│
├── data/
│ └── benchmark_prompts.py # 25 gold-standard prompts
│
├── results/
│ ├── raw_results.json # Raw experimental outputs
│ ├── consistency_scores.csv # Calculated scores
│ └── failure_analysis.csv # Failure categorization
│
└── visualizations/
├── fragility_heatmap.png # Main visualization
├── consistency_curves.png
├── variance_distribution.png
└── temperature_sensitivity.png
The signature visualization showing which tasks "break" first as temperature increases.
Degradation trajectories by task category with standard deviation bands.
Box plots showing consistency score distributions.
Bar chart ranking tasks by temperature fragility.
Edit config.py to customize:
# Model selection
MODEL_NAME = "llama-3.3-70b-versatile"
# Experiment parameters
TEMPERATURES = [0.0, 0.3, 0.7, 1.0, 1.2, 1.5]
RUNS_PER_CONDITION = 10
# Scoring thresholds
SIMILARITY_THRESHOLD = 0.85
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"Add new prompts in data/benchmark_prompts.py:
BENCHMARK_PROMPTS = {
"your_category": [
{
"id": "custom_01",
"prompt": "Your test prompt here",
"expected_answer": "expected output",
"explanation": "What this tests"
}
]
}- 5 task categories: Logic, Factual, Coding, Creative, Summarization
- 6 temperature settings: 0.0, 0.3, 0.7, 1.0, 1.2, 1.5
- 10 runs per condition: Ensures statistical significance
- Total: 25 prompts × 6 temps × 10 runs = 1,500 API calls
For each (prompt, temperature) pair:
- Collect 10 independent outputs
- Calculate pairwise similarities (semantic or exact match)
- Aggregate into single consistency score ∈ [0, 1]
- Score of 1.0 = perfect consistency, 0.0 = complete divergence
For text-based tasks, we use BERT embeddings to calculate semantic similarity:
- Encode all outputs into embedding space
- Compute cosine similarity between all pairs
- Average similarity = consistency score
This captures meaning preservation even when exact wording differs.
- 0.90 - 1.00: Highly consistent (safe for production)
- 0.75 - 0.89: Moderately consistent (acceptable for most uses)
- 0.60 - 0.74: Variable (consider lowering temperature)
- < 0.60: Unreliable (high-entropy sampling)
| Task Type | Recommended T | Reasoning |
|---|---|---|
| Factual | 0.0 - 0.3 | Deterministic answers needed |
| Logic | 0.0 - 0.3 | Reasoning must be consistent |
| Code | 0.0 - 0.5 | Syntax errors costly |
| Summarization | 0.3 - 0.7 | Balance consistency + conciseness |
| Creative | 0.7 - 1.2 | Diversity desired |
This framework can be extended to study:
- Cross-model comparison: Test GPT-4, Claude, Gemini, Llama
- Prompt engineering effects: How do different prompts affect consistency?
- Fine-tuning impact: Does fine-tuning improve consistency?
- Context length: Does longer context reduce consistency?
- Human evaluation: Do humans perceive variance the same way?
- Benchmark new models on consistency
- Study temperature effects rigorously
- Develop better evaluation metrics
- Choose optimal temperature for your task
- Estimate required samples for reliability
- Debug inconsistent model behavior
- Understand reliability risks
- Optimize inference costs
- Set quality thresholds
LLM FUZZ-BENCH: Systematic Consistency Analysis
============================================================
Model: llama-3.3-70b-versatile
Prompts: 25 (5 categories)
Temperatures: [0.0, 0.3, 0.7, 1.0, 1.2, 1.5]
Runs per condition: 10
Total experiments: 450
============================================================
SUMMARY STATISTICS
============================================================
Mean Consistency by Category:
factual : 0.943
coding : 0.891
logic : 0.762
summarization : 0.678
creative : 0.512
Consistency by Temperature:
T=0.0: 0.934
T=0.3: 0.876
T=0.7: 0.721
T=1.0: 0.589
T=1.2: 0.487
T=1.5: 0.398
============================================================
Below are example outputs generated by the framework:
Shows which tasks degrade first as temperature increases.

Tracks degradation trajectories with standard deviation bands.

Visualizes the spread of consistency scores across temperatures.

Ranks tasks by their sensitivity to temperature changes.

- Groq for fast, affordable LLM inference
- Sentence-Transformers for semantic similarity models
- Open-source community for the amazing tools that made this possible!
- FActScore (Min et al., 2023) - Factuality evaluation
- SelfCheckGPT (Manakul et al., 2023) - Self-consistency checking
- G-Eval (Liu et al., 2023) - LLM-as-judge evaluation
- Temperature sampling in LLMs (Holtzman et al., 2019)
If this helps your research or work, please star the repository!