LLM Fuzz-Bench: Systematic Consistency Analyzer for Generative Models

An evaluation framework that quantifies LLM output variance by treating temperature as a fuzzing parameter across multiple task categories.

Research Question

How consistent are LLM outputs across temperature settings, and which task types are most sensitive to temperature variation?

Motivation

Temperature is a critical hyperparameter for LLM inference, but its impact on output consistency is poorly understood. This matters for:

  • Production Reliability: Can we trust outputs in high-stakes applications?
  • Cost Optimization: Do we need multiple samples or is one enough?
  • Task-Specific Tuning: What temperature for what task type?

This project moves LLM evaluation from "vibes-based" (it looks okay) to "evidence-based" (it is statistically consistent).

Key Findings

1. Task Sensitivity Varies Dramatically

  • Factual queries: 95% consistency even at T=1.0
  • Logic/Reasoning: 78% at T=0.3, drops to 52% at T=1.0
  • Creative tasks: 45% consistency at T=0.7

2. The T=0.7 Myth

Common advice: "use 0.7 for balanced outputs"

Our finding: This is only optimal for creative tasks. For factual/code generation, use T=0.0-0.3.

3. Failure Modes Identified

  • Semantic Drift (35%): Outputs convey different meanings
  • Length Variance (28%): Same content, different verbosity
  • Formatting Variance (22%): Same info, different structure
  • Complete Divergence (15%): Entirely different responses

4. Production Implications

For consistency-critical tasks (code, extraction):

  • T > 0.5 introduces 40%+ variance
  • A single sample at T=0.0 is more reliable than averaging 3 samples at T=0.7
  • Cost savings: ~60% fewer API calls with proper temperature tuning

Architecture

Multi-Task Benchmark (25 prompts)
         ↓
Fuzzing Engine (450+ experiments)
         ↓
Consistency Scorer (3 scoring methods)
         ↓
Visualizations + Failure Analysis
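
For orientation, here is a minimal sketch of how these stages can be chained, using the same scripts listed under "Run Individual Components" below; it mirrors the pipeline conceptually and is not a copy of main.py.

# Hypothetical orchestration sketch; main.py may structure this differently.
import subprocess
import sys

PIPELINE = [
    "fuzzing_engine.py",      # run the temperature-sweep experiments
    "consistency_scorer.py",  # score the raw outputs for consistency
    "visualizer.py",          # render heatmaps and curves
    "failure_analyzer.py",    # categorize low-consistency cases
]

def run_pipeline(skip_experiments: bool = False) -> None:
    """Run each pipeline stage as a subprocess, in order."""
    stages = PIPELINE[1:] if skip_experiments else PIPELINE
    for script in stages:
        print(f"--- running {script} ---")
        subprocess.run([sys.executable, script], check=True)

if __name__ == "__main__":
    run_pipeline(skip_experiments="--skip-experiments" in sys.argv)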

Scoring Methods

  1. Exact Match: For factual/logic tasks (deterministic answers)
  2. Semantic Similarity: For creative/summarization (sentence-transformer embeddings)
  3. Regex Matching: For code generation (pattern validation)
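
As an illustration of the regex mode, a pattern check for code-generation outputs might look like the sketch below; the pattern and function name are hypothetical, and the exact-match and embedding scorers are sketched in the Methodology section.

# Hypothetical regex-based validation for code tasks; the real patterns
# live in consistency_scorer.py and may differ.
import re

def regex_match_score(outputs: list[str], pattern: str) -> float:
    """Fraction of generated outputs matching a structural pattern."""
    if not outputs:
        return 1.0
    return sum(bool(re.search(pattern, out)) for out in outputs) / len(outputs)

# e.g. require a Python function definition with the expected name
score = regex_match_score(
    outputs=["def add(a, b):\n    return a + b"],
    pattern=r"def\s+add\s*\(",
)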

Quick Start

Prerequisites

Python 3.8+
Groq API key (the free tier works; get one at console.groq.com)

Installation

# Clone repository
git clone https://github.com/aarushisingh04/temperature-fuzzer.git
cd temperature-fuzzer

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up API key
echo "GROQ_API_KEY=your_key_here" > .env

Run Complete Pipeline

# Run all 450+ experiments + analysis (~10-15 minutes, ~$1.50 on Groq)
python main.py

# Or skip experiments and use existing results
python main.py --skip-experiments

Run Individual Components

# Just run experiments
python fuzzing_engine.py

# Just score existing results
python consistency_scorer.py

# Just generate visualizations
python visualizer.py

# Just analyze failures
python failure_analyzer.py

Project Structure

temperature-fuzzer/
├── main.py                      # Main pipeline orchestrator
├── config.py                    # Configuration settings
├── fuzzing_engine.py            # Experiment runner
├── consistency_scorer.py        # Scoring algorithms
├── visualizer.py                # Visualization generator
├── failure_analyzer.py          # Failure mode categorization
├── requirements.txt             # Dependencies
├── .env                         # API keys (not committed)
├── .gitignore                   # Git ignore file
│
├── data/
│   └── benchmark_prompts.py     # 25 gold-standard prompts
│
├── results/
│   ├── raw_results.json         # Raw experimental outputs
│   ├── consistency_scores.csv   # Calculated scores
│   └── failure_analysis.csv     # Failure categorization
│
└── visualizations/
    ├── fragility_heatmap.png    # Main visualization
    ├── consistency_curves.png
    ├── variance_distribution.png
    └── temperature_sensitivity.png

Visualizations

1. Fragility Heatmap

The signature visualization showing which tasks "break" first as temperature increases.

2. Consistency Curves

Degradation trajectories by task category with standard deviation bands.

3. Variance Distribution

Box plots showing consistency score distributions.

4. Temperature Sensitivity Ranking

Bar chart ranking tasks by temperature fragility.
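
A heatmap like the first one can be rebuilt from results/consistency_scores.csv with a few lines of pandas and seaborn; the column names below (category, temperature, consistency) are assumptions about that file's layout, not documented fields.

# Sketch: category x temperature heatmap from the scores CSV.
# Column names are assumed, not a documented schema.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

scores = pd.read_csv("results/consistency_scores.csv")
pivot = scores.pivot_table(index="category", columns="temperature",
                           values="consistency", aggfunc="mean")

plt.figure(figsize=(8, 4))
sns.heatmap(pivot, annot=True, fmt=".2f", vmin=0.0, vmax=1.0, cmap="RdYlGn")
plt.title("Mean consistency by task category and temperature")
plt.tight_layout()
plt.savefig("visualizations/fragility_heatmap.png", dpi=150)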

Configuration

Edit config.py to customize:

# Model selection
MODEL_NAME = "llama-3.3-70b-versatile"

# Experiment parameters
TEMPERATURES = [0.0, 0.3, 0.7, 1.0, 1.2, 1.5]
RUNS_PER_CONDITION = 10

# Scoring thresholds
SIMILARITY_THRESHOLD = 0.85
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

Extending the Benchmark

Add new prompts in data/benchmark_prompts.py:

BENCHMARK_PROMPTS = {
    "your_category": [
        {
            "id": "custom_01",
            "prompt": "Your test prompt here",
            "expected_answer": "expected output",
            "explanation": "What this tests"
        }
    ]
}

Methodology

Experimental Design

  • 5 task categories: Logic, Factual, Coding, Creative, Summarization
  • 6 temperature settings: 0.0, 0.3, 0.7, 1.0, 1.2, 1.5
  • 10 runs per condition: Enough repetitions for stable per-condition variance estimates
  • Total: 25 prompts × 6 temps × 10 runs = 1,500 API calls
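
The core fuzzing loop can be pictured roughly as below, assuming the Groq Python SDK and the prompt dictionary shape shown in "Extending the Benchmark"; the actual fuzzing_engine.py may organize this differently.

# Rough sketch of one fuzzing condition, assuming the groq SDK (pip install groq).
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])
MODEL_NAME = "llama-3.3-70b-versatile"
RUNS_PER_CONDITION = 10

def run_condition(prompt: dict, temperature: float) -> list[str]:
    """Collect repeated completions for one (prompt, temperature) pair."""
    outputs = []
    for _ in range(RUNS_PER_CONDITION):
        resp = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[{"role": "user", "content": prompt["prompt"]}],
            temperature=temperature,
        )
        outputs.append(resp.choices[0].message.content)
    return outputs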

Consistency Metric

For each (prompt, temperature) pair:

  1. Collect 10 independent outputs
  2. Calculate pairwise similarities (semantic or exact match)
  3. Aggregate into single consistency score ∈ [0, 1]
  4. Score of 1.0 = perfect consistency, 0.0 = complete divergence
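
In code, steps 1-4 reduce to averaging a pairwise similarity over all output pairs; the sketch below is a generic version into which any of the three similarity functions can be plugged (function names are illustrative, not the repo's API).

# Generic pairwise aggregation (steps 1-4); illustrative, not the repo's API.
from itertools import combinations
from typing import Callable

def consistency_score(outputs: list[str],
                      similarity: Callable[[str, str], float]) -> float:
    """Average pairwise similarity over all output pairs, in [0, 1]."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

# Exact-match similarity for factual/logic tasks:
exact = lambda a, b: float(a.strip().lower() == b.strip().lower())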

Semantic Variance

For text-based tasks, we use sentence-transformer embeddings (all-MiniLM-L6-v2) to calculate semantic similarity:

  • Encode all outputs into embedding space
  • Compute cosine similarity between all pairs
  • Average similarity = consistency score

This captures meaning preservation even when exact wording differs.
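
A minimal sketch of that procedure, assuming the sentence-transformers model named in config.py; the exact preprocessing in consistency_scorer.py may differ.

# Sketch of embedding-based consistency; preprocessing details may differ.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_consistency(outputs: list[str]) -> float:
    """Mean pairwise cosine similarity of output embeddings."""
    emb = model.encode(outputs, convert_to_tensor=True, normalize_embeddings=True)
    sims = [float(util.cos_sim(emb[i], emb[j]))
            for i, j in combinations(range(len(outputs)), 2)]
    return sum(sims) / len(sims) if sims else 1.0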

Results Interpretation

Consistency Score Guide

  • 0.90 - 1.00: Highly consistent (safe for production)
  • 0.75 - 0.89: Moderately consistent (acceptable for most uses)
  • 0.60 - 0.74: Variable (consider lowering temperature)
  • < 0.60: Unreliable (high-entropy sampling)

Temperature Guidelines by Task

Task Type       Recommended T   Reasoning
Factual         0.0 - 0.3       Deterministic answers needed
Logic           0.0 - 0.3       Reasoning must be consistent
Code            0.0 - 0.5       Syntax errors are costly
Summarization   0.3 - 0.7       Balance consistency and conciseness
Creative        0.7 - 1.2       Diversity desired

Research Applications

This framework can be extended to study:

  1. Cross-model comparison: Test GPT-4, Claude, Gemini, Llama
  2. Prompt engineering effects: How do different prompts affect consistency?
  3. Fine-tuning impact: Does fine-tuning improve consistency?
  4. Context length: Does longer context reduce consistency?
  5. Human evaluation: Do humans perceive variance the same way?

Use Cases

For Researchers

  • Benchmark new models on consistency
  • Study temperature effects rigorously
  • Develop better evaluation metrics

For Engineers

  • Choose optimal temperature for your task
  • Estimate required samples for reliability
  • Debug inconsistent model behavior

For Product Teams

  • Understand reliability risks
  • Optimize inference costs
  • Set quality thresholds

Sample Output

LLM FUZZ-BENCH: Systematic Consistency Analysis
============================================================
Model: llama-3.3-70b-versatile
Prompts: 25 (5 categories)
Temperatures: [0.0, 0.3, 0.7, 1.0, 1.2, 1.5]
Runs per condition: 10
Total experiments: 450
============================================================

SUMMARY STATISTICS
============================================================
Mean Consistency by Category:
  factual        : 0.943
  coding         : 0.891
  logic          : 0.762
  summarization  : 0.678
  creative       : 0.512

Consistency by Temperature:
  T=0.0: 0.934
  T=0.3: 0.876
  T=0.7: 0.721
  T=1.0: 0.589
  T=1.2: 0.487
  T=1.5: 0.398
============================================================

Sample Visualizations

Below are example outputs generated by the framework:

1. Fragility Heatmap

Shows which tasks degrade first as temperature increases.

2. Consistency Curves

Tracks degradation trajectories with standard deviation bands.

3. Variance Distribution

Visualizes the spread of consistency scores across temperatures.

4. Temperature Sensitivity

Ranks tasks by their sensitivity to temperature changes.

Acknowledgments

  • Groq for fast, affordable LLM inference
  • Sentence-Transformers for semantic similarity models
  • Open-source community for the amazing tools that made this possible!

Related Work

  • FActScore (Min et al., 2023) - Factuality evaluation
  • SelfCheckGPT (Manakul et al., 2023) - Self-consistency checking
  • G-Eval (Liu et al., 2023) - LLM-as-judge evaluation
  • Temperature sampling in LLMs (Holtzman et al., 2019)

If this helps your research or work, please star the repository!
