Stop shipping uncertain AI code.
A CLI tool that detects when AI-generated code is unreliable by measuring output consistency using UQLM (Uncertainty Quantification for Language Models).
When you ask an LLM to generate code, you get one answer. But what if you asked it 5 times?
- Would it give you the same algorithm?
- Would it handle edge cases consistently?
- Would security-critical details match?
If the LLM can't agree with itself, why should you trust it?
UQLM-Guard asks the same question multiple times and flags code where the AI is uncertain.
$ uqlm-guard review "Write JWT authentication middleware"
โ ๏ธ MEDIUM CONFIDENCE
Confidence Score: 0.62/1.0
Manual review recommended before use
โ ๏ธ Detected Inconsistencies:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Issue #1 โ
โ [HIGH] Security-Critical Parameter โ
โ โ
โ Token expiration varies: โ
โ โข 3 solutions: 1 hour โ
โ โข 2 solutions: 24 hours โ
โ โ No consensus on security setting โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Issue #2 โ
โ [HIGH] Secret Key Storage โ
โ โ
โ โข 2 solutions: Hardcoded secrets โ
โ โข 2 solutions: Environment variables โ
โ โข 1 solution: Key management service โ
โ โ Major security inconsistency โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ด RECOMMENDATION: Manual review required
๐ Generated 5 solutions, found 3 major inconsistenciesTranslation: Don't use this code yet. The AI wasn't sure how to implement critical security details.
# Clone the repo
git clone https://github.com/kelpejol/uqlm-guard.git
cd uqlm-guard
# Install dependencies
pip install -r requirements.txt
# Install the CLI
pip install -e .
# Set your OpenAI API key
export OPENAI_API_KEY=your_key_here
# Test it
uqlm-guard review "Write a function to reverse a string"uqlm-guard review "Implement a binary search tree"You'll get:
- โ Confidence score (0.0 to 1.0)
โ ๏ธ Detected inconsistencies across multiple generations- ๐ Consensus elements (what the AI agreed on)
- ๐ Divergence analysis (where responses differ)
- ๐ก Recommendation (use it, review it, or reject it)
# Analyze a prompt
uqlm-guard review "Write a rate limiter with Redis"
# Use more samples for higher accuracy
uqlm-guard review "Implement OAuth2 flow" --samples 10
# Show full responses
uqlm-guard review "Create a bloom filter" --show-responses
# Export to JSON
uqlm-guard review "Write merge sort" --json-output --output result.json# Create a file with prompts (one per line)
cat > prompts.txt << EOF
Write a function to validate email addresses
Implement a thread-safe cache
Create a distributed lock mechanism
EOF
# Analyze all prompts
uqlm-guard batch prompts.txtOutput:
Found 3 prompts to analyze
Analyzing prompt 1/3...
0.85 - Write a function to validate email addresses...
Analyzing prompt 2/3...
0.67 - Implement a thread-safe cache...
Analyzing prompt 3/3...
0.43 - Create a distributed lock mechanism...
Batch Analysis Summary:
Total Prompts: 3
Average Confidence: 0.65
High Confidence (โฅ0.8): 1
Medium Confidence (0.6-0.8): 1
Low Confidence (<0.6): 1
# Compare different models
uqlm-guard compare "Implement quicksort" \
--models gpt-4o-mini \
--models gpt-4o
# Output:
# ๐ฅ gpt-4o: 0.847
# ๐ฅ gpt-4o-mini: 0.763uqlm-guard examplesโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ 1. Generate Multiple Responses (5x) โ
โ "Write authentication code" โ
โ โ โ
โ Response 1: JWT with env vars โ
โ Response 2: JWT hardcoded โ
โ Response 3: Session-based โ
โ Response 4: JWT with KMS โ
โ Response 5: JWT with env vars โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ 2. Measure Agreement Using UQLM โ
โ โข Semantic similarity โ
โ โข Structural consistency โ
โ โข Keyword agreement โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ 3. Flag Inconsistencies โ
โ โ ๏ธ Different auth methods (3 variants) โ
โ โ ๏ธ Secret storage (3 approaches) โ
โ โ ๏ธ Token expiration (2 values) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ 4. Compute Confidence Score โ
โ Confidence: 0.42 (LOW) โ โ
โ Recommendation: Do not use โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Traditional code quality tools check:
- โ Syntax errors
- โ Type safety
- โ Unit test coverage
But they can't detect:
- โ Algorithmic uncertainty (multiple valid approaches)
- โ Security inconsistencies (varying parameter choices)
- โ Edge case handling (sometimes missed)
UQLM-Guard catches these by detecting when the AI isn't sure.
We tested UQLM-Guard on 30 prompts across 5 categories:
| Category | Tests | Avg Confidence | High | Medium | Low | Issues Flagged |
|---|---|---|---|---|---|---|
| Simple | 5 | 0.89 | 5 | 0 | 0 | 0 |
| Data Structures | 5 | 0.71 | 2 | 2 | 1 | 2 |
| Algorithms | 5 | 0.54 | 0 | 3 | 2 | 4 |
| Security | 5 | 0.47 | 0 | 2 | 3 | 5 |
| Edge Cases | 5 | 0.52 | 1 | 1 | 3 | 4 |
Key Findings:
- ๐ฏ Security code had lowest confidence (0.47 average)
โ ๏ธ 68% of security prompts flagged issues- โ Simple tasks showed high consistency (0.89 average)
- ๐ Algorithmic complexity correlates with uncertainty
Run your own benchmarks:
cd benchmarks
python run_benchmark.pyPrompt: "Implement password hashing"
UQLM-Guard Output:
โ ๏ธ LOW CONFIDENCE: 0.38
Issue: Salt generation varies
โข 2 responses: Random salt per password
โข 2 responses: Fixed salt
โข 1 response: No salt
โ CRITICAL: Insecure hashing in 60% of responses
Impact: Prevented deployment of code with weak security.
Prompt: "Implement consistent hashing"
UQLM-Guard Output:
โ ๏ธ MEDIUM CONFIDENCE: 0.64
Issue: Hash function selection
โข 2 responses: MD5
โข 2 responses: SHA-256
โข 1 response: MurmurHash
โ ๏ธ Different performance characteristics
Impact: Flagged for performance review before production use.
Prompt: "Parse date strings with timezone"
UQLM-Guard Output:
โ ๏ธ LOW CONFIDENCE: 0.51
Issue: Timezone handling
โข 3 responses: Convert to UTC
โข 2 responses: Preserve local time
โ ๏ธ Inconsistent behavior for daylight saving
Impact: Prevented subtle timezone bugs.
- AI-generated code review - Before merging Copilot suggestions
- Security-critical code - Authentication, encryption, authorization
- Production systems - Infrastructure, deployment, monitoring
- Team code standards - Ensure AI follows your patterns
- Learning - See where AI struggles with concepts
- Proving correctness - This detects uncertainty, not bugs
- Replacing tests - Still write unit/integration tests
- Real-time generation - Takes 5-10s per analysis
- Non-code prompts - Optimized for code generation tasks
uqlm_guard/
โโโ core/
โ โโโ analyzer.py # UQLM uncertainty quantification
โ โโโ models.py # Data models
โโโ cli/
โ โโโ main.py # CLI interface
โ โโโ formatter.py # Rich terminal output
benchmarks/
โโโ prompts.json # Test dataset
โโโ run_benchmark.py # Benchmark runner
examples/
โโโ basic_usage.py # Code examples
tests/
โโโ test_analyzer.py # Core tests
โโโ test_cli.py # CLI tests
# Run all tests
pytest
# Run with coverage
pytest --cov=uqlm_guard
# Run only fast tests (no API calls)
pytest -m "not requires_api_key"
# Run specific test
pytest tests/test_analyzer.py::TestUQLMAnalyzer::test_find_consensusCurrent coverage: 85%
- GitHub Action - Auto-comment on PRs with uncertainty scores
- Pre-commit hook - Block commits with low confidence code
- VS Code extension - Real-time uncertainty detection
- Multi-model support - Test Claude, Llama, Gemini
- White-box methods - Token probability analysis
- Fine-tuning dataset - Learn from flagged issues
- Drift detection - Track uncertainty over time
- Human-in-the-loop - Escalate uncertain code for review
We'd love your help! Check out CONTRIBUTING.md for guidelines.
Quick Start:
# Fork the repo, clone it
git clone https://github.com/your-username/uqlm-guard.git
cd uqlm-guard
# Create a branch
git checkout -b feature/your-feature
# Install dev dependencies
pip install -r requirements-dev.txt
# Make changes, run tests
pytest
# Format code
black uqlm_guard/ tests/
ruff check uqlm_guard/ tests/
# Push and create PR
git push origin feature/your-featureUQLM-Guard is built on research-backed uncertainty quantification:
- Paper: Uncertainty Quantification for Language Models
- UQLM Library: github.com/zlin7/UQ-NLG
- Concept: Semantic negentropy measures agreement across model generations
When an LLM generates code:
- High confidence = consistent outputs across multiple samples
- Low confidence = divergent outputs indicating uncertainty
- Inconsistencies reveal where the model wasn't sure
This is more robust than:
- โ Single-response heuristics
- โ Keyword/regex filtering
- โ Length-based checks
MIT License - see LICENSE for details.
- UQLM for uncertainty quantification research
- Rich for beautiful terminal output
- Click for CLI framework
- The AI safety community for inspiration
- ๐ Issues: github.com/kelpejol/uqlm-guard/issues
- ๐ฌ Discussions: github.com/kelpejol/uqlm-guard/discussions
โญ Star this repo if UQLM-Guard helped you catch uncertain AI code!
Made with โค๏ธ by Kelpejol