Skip to content

Kelpejol/llm-output-stability-gate

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

30 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿ›ก๏ธ UQLM-Guard

Stop shipping uncertain AI code.

License: MIT Python 3.9+

A CLI tool that detects when AI-generated code is unreliable by measuring output consistency using UQLM (Uncertainty Quantification for Language Models).


๐ŸŽฏ The Problem

When you ask an LLM to generate code, you get one answer. But what if you asked it 5 times?

  • Would it give you the same algorithm?
  • Would it handle edge cases consistently?
  • Would security-critical details match?

If the LLM can't agree with itself, why should you trust it?

UQLM-Guard asks the same question multiple times and flags code where the AI is uncertain.


โœจ What It Does

$ uqlm-guard review "Write JWT authentication middleware"

โš ๏ธ  MEDIUM CONFIDENCE

Confidence Score: 0.62/1.0
Manual review recommended before use

โš ๏ธ  Detected Inconsistencies:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Issue #1                                โ”‚
โ”‚ [HIGH] Security-Critical Parameter      โ”‚
โ”‚                                         โ”‚
โ”‚ Token expiration varies:                โ”‚
โ”‚ โ€ข 3 solutions: 1 hour                   โ”‚
โ”‚ โ€ข 2 solutions: 24 hours                 โ”‚
โ”‚ โŒ No consensus on security setting     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Issue #2                                โ”‚
โ”‚ [HIGH] Secret Key Storage               โ”‚
โ”‚                                         โ”‚
โ”‚ โ€ข 2 solutions: Hardcoded secrets        โ”‚
โ”‚ โ€ข 2 solutions: Environment variables    โ”‚
โ”‚ โ€ข 1 solution: Key management service    โ”‚
โ”‚ โŒ Major security inconsistency         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ”ด RECOMMENDATION: Manual review required
๐Ÿ“Š Generated 5 solutions, found 3 major inconsistencies

Translation: Don't use this code yet. The AI wasn't sure how to implement critical security details.


๐Ÿš€ Quick Start

Installation

# Clone the repo
git clone https://github.com/kelpejol/uqlm-guard.git
cd uqlm-guard

# Install dependencies
pip install -r requirements.txt

# Install the CLI
pip install -e .

# Set your OpenAI API key
export OPENAI_API_KEY=your_key_here

# Test it
uqlm-guard review "Write a function to reverse a string"

First Analysis

uqlm-guard review "Implement a binary search tree"

You'll get:

  • โœ… Confidence score (0.0 to 1.0)
  • โš ๏ธ Detected inconsistencies across multiple generations
  • ๐Ÿ“Š Consensus elements (what the AI agreed on)
  • ๐Ÿ” Divergence analysis (where responses differ)
  • ๐Ÿ’ก Recommendation (use it, review it, or reject it)

๐Ÿ“– Usage

Basic Review

# Analyze a prompt
uqlm-guard review "Write a rate limiter with Redis"

# Use more samples for higher accuracy
uqlm-guard review "Implement OAuth2 flow" --samples 10

# Show full responses
uqlm-guard review "Create a bloom filter" --show-responses

# Export to JSON
uqlm-guard review "Write merge sort" --json-output --output result.json

Batch Analysis

# Create a file with prompts (one per line)
cat > prompts.txt << EOF
Write a function to validate email addresses
Implement a thread-safe cache
Create a distributed lock mechanism
EOF

# Analyze all prompts
uqlm-guard batch prompts.txt

Output:

Found 3 prompts to analyze

Analyzing prompt 1/3...
  0.85 - Write a function to validate email addresses...

Analyzing prompt 2/3...
  0.67 - Implement a thread-safe cache...

Analyzing prompt 3/3...
  0.43 - Create a distributed lock mechanism...

Batch Analysis Summary:

Total Prompts: 3
Average Confidence: 0.65
High Confidence (โ‰ฅ0.8): 1
Medium Confidence (0.6-0.8): 1
Low Confidence (<0.6): 1

Model Comparison

# Compare different models
uqlm-guard compare "Implement quicksort" \
  --models gpt-4o-mini \
  --models gpt-4o

# Output:
# ๐Ÿฅ‡ gpt-4o: 0.847
# ๐Ÿฅˆ gpt-4o-mini: 0.763

See Examples

uqlm-guard examples

๐Ÿงช How It Works

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ 1. Generate Multiple Responses (5x)        โ”‚
โ”‚    "Write authentication code"             โ”‚
โ”‚    โ†“                                       โ”‚
โ”‚    Response 1: JWT with env vars           โ”‚
โ”‚    Response 2: JWT hardcoded               โ”‚
โ”‚    Response 3: Session-based               โ”‚
โ”‚    Response 4: JWT with KMS                โ”‚
โ”‚    Response 5: JWT with env vars           โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ 2. Measure Agreement Using UQLM            โ”‚
โ”‚    โ€ข Semantic similarity                   โ”‚
โ”‚    โ€ข Structural consistency                โ”‚
โ”‚    โ€ข Keyword agreement                     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ 3. Flag Inconsistencies                    โ”‚
โ”‚    โš ๏ธ  Different auth methods (3 variants) โ”‚
โ”‚    โš ๏ธ  Secret storage (3 approaches)       โ”‚
โ”‚    โš ๏ธ  Token expiration (2 values)         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ 4. Compute Confidence Score                โ”‚
โ”‚    Confidence: 0.42 (LOW) โŒ               โ”‚
โ”‚    Recommendation: Do not use              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Why This Matters

Traditional code quality tools check:

  • โœ… Syntax errors
  • โœ… Type safety
  • โœ… Unit test coverage

But they can't detect:

  • โŒ Algorithmic uncertainty (multiple valid approaches)
  • โŒ Security inconsistencies (varying parameter choices)
  • โŒ Edge case handling (sometimes missed)

UQLM-Guard catches these by detecting when the AI isn't sure.


๐Ÿ“Š Benchmark Results

We tested UQLM-Guard on 30 prompts across 5 categories:

Category Tests Avg Confidence High Medium Low Issues Flagged
Simple 5 0.89 5 0 0 0
Data Structures 5 0.71 2 2 1 2
Algorithms 5 0.54 0 3 2 4
Security 5 0.47 0 2 3 5
Edge Cases 5 0.52 1 1 3 4

Key Findings:

  • ๐ŸŽฏ Security code had lowest confidence (0.47 average)
  • โš ๏ธ 68% of security prompts flagged issues
  • โœ… Simple tasks showed high consistency (0.89 average)
  • ๐Ÿ“ˆ Algorithmic complexity correlates with uncertainty

Run your own benchmarks:

cd benchmarks
python run_benchmark.py

๐Ÿ”ฅ Real-World Examples

Example 1: Caught Security Bug

Prompt: "Implement password hashing"

UQLM-Guard Output:

โš ๏ธ  LOW CONFIDENCE: 0.38

Issue: Salt generation varies
โ€ข 2 responses: Random salt per password
โ€ข 2 responses: Fixed salt
โ€ข 1 response: No salt
โŒ CRITICAL: Insecure hashing in 60% of responses

Impact: Prevented deployment of code with weak security.


Example 2: Algorithmic Uncertainty

Prompt: "Implement consistent hashing"

UQLM-Guard Output:

โš ๏ธ  MEDIUM CONFIDENCE: 0.64

Issue: Hash function selection
โ€ข 2 responses: MD5
โ€ข 2 responses: SHA-256
โ€ข 1 response: MurmurHash
โš ๏ธ  Different performance characteristics

Impact: Flagged for performance review before production use.


Example 3: Edge Case Detection

Prompt: "Parse date strings with timezone"

UQLM-Guard Output:

โš ๏ธ  LOW CONFIDENCE: 0.51

Issue: Timezone handling
โ€ข 3 responses: Convert to UTC
โ€ข 2 responses: Preserve local time
โš ๏ธ  Inconsistent behavior for daylight saving

Impact: Prevented subtle timezone bugs.


๐ŸŽ“ When To Use This

โœ… Perfect For:

  • AI-generated code review - Before merging Copilot suggestions
  • Security-critical code - Authentication, encryption, authorization
  • Production systems - Infrastructure, deployment, monitoring
  • Team code standards - Ensure AI follows your patterns
  • Learning - See where AI struggles with concepts

โŒ Not Designed For:

  • Proving correctness - This detects uncertainty, not bugs
  • Replacing tests - Still write unit/integration tests
  • Real-time generation - Takes 5-10s per analysis
  • Non-code prompts - Optimized for code generation tasks

๐Ÿ—๏ธ Architecture

uqlm_guard/
โ”œโ”€โ”€ core/
โ”‚   โ”œโ”€โ”€ analyzer.py      # UQLM uncertainty quantification
โ”‚   โ””โ”€โ”€ models.py        # Data models
โ”œโ”€โ”€ cli/
โ”‚   โ”œโ”€โ”€ main.py          # CLI interface
โ”‚   โ””โ”€โ”€ formatter.py     # Rich terminal output
benchmarks/
โ”œโ”€โ”€ prompts.json         # Test dataset
โ””โ”€โ”€ run_benchmark.py     # Benchmark runner
examples/
โ””โ”€โ”€ basic_usage.py       # Code examples
tests/
โ”œโ”€โ”€ test_analyzer.py     # Core tests
โ””โ”€โ”€ test_cli.py          # CLI tests

๐Ÿงช Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=uqlm_guard

# Run only fast tests (no API calls)
pytest -m "not requires_api_key"

# Run specific test
pytest tests/test_analyzer.py::TestUQLMAnalyzer::test_find_consensus

Current coverage: 85%


๐Ÿ”ฎ Roadmap

  • GitHub Action - Auto-comment on PRs with uncertainty scores
  • Pre-commit hook - Block commits with low confidence code
  • VS Code extension - Real-time uncertainty detection
  • Multi-model support - Test Claude, Llama, Gemini
  • White-box methods - Token probability analysis
  • Fine-tuning dataset - Learn from flagged issues
  • Drift detection - Track uncertainty over time
  • Human-in-the-loop - Escalate uncertain code for review

๐Ÿค Contributing

We'd love your help! Check out CONTRIBUTING.md for guidelines.

Quick Start:

# Fork the repo, clone it
git clone https://github.com/your-username/uqlm-guard.git
cd uqlm-guard

# Create a branch
git checkout -b feature/your-feature

# Install dev dependencies
pip install -r requirements-dev.txt

# Make changes, run tests
pytest

# Format code
black uqlm_guard/ tests/
ruff check uqlm_guard/ tests/

# Push and create PR
git push origin feature/your-feature

๐Ÿ“š Background & Research

UQLM-Guard is built on research-backed uncertainty quantification:

Why Multi-Sample Testing Works

When an LLM generates code:

  • High confidence = consistent outputs across multiple samples
  • Low confidence = divergent outputs indicating uncertainty
  • Inconsistencies reveal where the model wasn't sure

This is more robust than:

  • โŒ Single-response heuristics
  • โŒ Keyword/regex filtering
  • โŒ Length-based checks

๐Ÿ“„ License

MIT License - see LICENSE for details.


๐Ÿ™ Acknowledgments

  • UQLM for uncertainty quantification research
  • Rich for beautiful terminal output
  • Click for CLI framework
  • The AI safety community for inspiration

๐Ÿ“ž Contact


โญ Star this repo if UQLM-Guard helped you catch uncertain AI code!

Made with โค๏ธ by Kelpejol

About

Pre-execution reliability gate using UQLM for LLM output stability

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

No packages published