This guide explains how to set up and run the LLM Judge evaluation system locally for development and testing.
Run the setup script to automatically configure everything:
```bash
./scripts/setup_local_llm_judge.sh
```

This script will:
- Check and install dependencies
- Set up Ollama and required models
- Test the LLM Judge functionality
- Run a quick evaluation
- Generate action items
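The dependency checks above can be sketched as follows; this is a hypothetical outline, and the actual script's logic may differ:

```shell
#!/bin/sh
# Sketch of the dependency checks a setup script like this performs.

# Return success if a tool is on PATH
have() { command -v "$1" >/dev/null 2>&1; }

for tool in python3 poetry ollama; do
    if have "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool (install it before continuing)"
    fi
done
```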
If you prefer to set up manually, follow these steps:
1. **Python 3.11+ and Poetry**

   ```bash
   # Install Poetry if not already installed
   curl -sSL https://install.python-poetry.org | python3 -
   ```

2. **Ollama**

   ```bash
   # Install Ollama
   curl -fsSL https://ollama.ai/install.sh | sh

   # Start Ollama service
   ollama serve

   # Pull required model
   ollama pull mistral
   ```

3. **Install dependencies**

   ```bash
   poetry install
   ```

4. **Create necessary directories**

   ```bash
   mkdir -p tests/data test_chroma_db logs
   ```

5. **Test the setup**

   ```bash
   poetry run python scripts/test_llm_judge.py
   ```
Quick evaluation:

```bash
# Using Makefile
make llm-judge-quick

# Using script directly
./scripts/run_llm_judge.sh quick auto 7.0

# Using poetry directly
poetry run python basicchat/evaluation/evaluators/check_llm_judge_smart.py --quick
```

Full evaluation:

```bash
# Using Makefile
make llm-judge

# Using script directly
./scripts/run_llm_judge.sh full auto 7.0

# Using poetry directly
poetry run python basicchat/evaluation/evaluators/check_llm_judge_smart.py
```

Forcing a specific backend:

```bash
# Force Ollama backend
LLM_JUDGE_FORCE_BACKEND=OLLAMA poetry run python basicchat/evaluation/evaluators/check_llm_judge_smart.py --quick

# Force OpenAI backend
LLM_JUDGE_FORCE_BACKEND=OPENAI poetry run python basicchat/evaluation/evaluators/check_llm_judge_smart.py --quick

# Using scripts with specific backend
./scripts/run_llm_judge.sh quick ollama 7.0
./scripts/run_llm_judge.sh quick openai 7.0
```

| Command | Description |
|---|---|
| `make llm-judge-quick` | Quick evaluation (smart backend) |
| `make llm-judge` | Full evaluation (smart backend) |
| `make llm-judge-ollama-quick` | Quick evaluation with Ollama |
| `make llm-judge-ollama` | Full evaluation with Ollama |
| `make llm-judge-openai-quick` | Quick evaluation with OpenAI |
| `make llm-judge-openai` | Full evaluation with OpenAI |
| `make test-and-evaluate` | Run tests + quick LLM judge |
| `make evaluate-all` | Run all tests + full LLM judge + performance test |
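The "auto" backend choice can be sketched as a probe that prefers a local Ollama server and falls back to OpenAI. The function name and probe endpoint below are illustrative, not the project's actual implementation:

```shell
#!/bin/sh
# Sketch of smart backend selection: prefer local Ollama, fall back
# to OpenAI if an API key is available.

OLLAMA_API_URL="${OLLAMA_API_URL:-http://localhost:11434/api}"

pick_backend() {
    # Probe the local Ollama server with a short timeout
    if curl -fsS --max-time 2 "$OLLAMA_API_URL/tags" >/dev/null 2>&1; then
        echo "OLLAMA"
    elif [ -n "$OPENAI_API_KEY" ]; then
        echo "OPENAI"
    else
        echo "NONE"
    fi
}

echo "Selected backend: $(pick_backend)"
```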
After running an evaluation, you'll get several files:
- `llm_judge_results.json` - Raw evaluation data
- `llm_judge_action_items.md` - Actionable improvement plan
- `llm_judge_improvement_tips.md` - Specific improvement tips
- `final_test_report.md` - Combined test and evaluation report
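A local quality gate can be built on the raw results. The `overall_score` field below is an assumed schema for illustration; check the actual file for the real field names:

```shell
#!/bin/sh
# Illustrative gate on evaluation results against the passing threshold.
THRESHOLD="${LLM_JUDGE_THRESHOLD:-7.0}"

# Demo input standing in for a real llm_judge_results.json
printf '{"overall_score": 7.4}\n' > demo_results.json

# Extract the (assumed) overall score
score=$(python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))["overall_score"])' demo_results.json)

# awk handles the float comparison portably
if awk -v s="$score" -v t="$THRESHOLD" 'BEGIN { exit !(s >= t) }'; then
    echo "PASS: $score >= $THRESHOLD"
else
    echo "FAIL: $score < $THRESHOLD"
fi
```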
- 10/10: Exemplary - Perfect adherence to best practices
- 8-9/10: Excellent - Minor improvements needed
- 7-8/10: Good - Some improvements needed
- 6-7/10: Acceptable - Notable issues but functional
- 5-6/10: Poor - Significant problems
- <5/10: Critical - Major issues requiring immediate attention
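The rubric above can be expressed as a small helper; the overlapping band edges in the rubric are resolved here with `>=` cutoffs:

```shell
#!/bin/sh
# Map a numeric score to its rubric label.
label() {
    awk -v s="$1" 'BEGIN {
        if (s >= 10)     print "Exemplary"
        else if (s >= 8) print "Excellent"
        else if (s >= 7) print "Good"
        else if (s >= 6) print "Acceptable"
        else if (s >= 5) print "Poor"
        else             print "Critical"
    }'
}

label 7.5   # prints "Good"
```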
- Code Quality - Structure, naming, complexity, Python best practices
- Test Coverage - Comprehensiveness, quality, effectiveness
- Documentation - README quality, inline docs, project documentation
- Architecture - Design patterns, modularity, scalability
- Security - Potential vulnerabilities, security best practices
- Performance - Code efficiency, optimization opportunities
| Variable | Default | Description |
|---|---|---|
| `LLM_JUDGE_THRESHOLD` | `7.0` | Minimum passing score |
| `LLM_JUDGE_FORCE_BACKEND` | - | Force specific backend (OLLAMA/OPENAI) |
| `OLLAMA_API_URL` | `http://localhost:11434/api` | Ollama API URL |
| `OLLAMA_MODEL` | `mistral` | Ollama model to use |
| `OPENAI_API_KEY` | - | OpenAI API key (required for OpenAI backend) |
| `OPENAI_MODEL` | `gpt-3.5-turbo` | OpenAI model to use |
The evaluation rules are defined in `basicchat/evaluation/evaluators/llm_judge_rules.json`. You can customize:
- Evaluation criteria and weights
- Best practices guidelines
- File patterns and exclusions
- Consistency checks
- Priority levels
```bash
# Solution: Use poetry to run commands
poetry run python basicchat/evaluation/evaluators/check_llm_judge.py --quick
```

```bash
# Solution: Start Ollama service
ollama serve

# Check if it's running
curl http://localhost:11434/api/tags
```

```bash
# Solution: Pull the required model
ollama pull mistral

# List available models
ollama list
```

This usually means the LLM response wasn't properly formatted. Try:
- Running again (temporary issue)
- Using a different model
- Checking Ollama logs
Check the detailed error message. Common causes:
- Ollama not running
- Model not available
- Network connectivity issues
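The common causes above can be checked quickly with a diagnostic like this (a sketch; adjust the model name and URL to your configuration):

```shell
#!/bin/sh
# Quick diagnostics for common LLM Judge failure causes.

# Is the Ollama server responding?
ollama_reachable() {
    curl -fsS --max-time 2 "${OLLAMA_API_URL:-http://localhost:11434/api}/tags" >/dev/null 2>&1
}

# Is the model pulled locally?
model_present() {
    ollama list 2>/dev/null | grep -q "$1"
}

if ollama_reachable; then echo "Ollama: reachable"; else echo "Ollama: not reachable"; fi
if model_present mistral; then echo "mistral: available"; else echo "mistral: missing"; fi
```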
Enable debug mode for more detailed output:
```bash
export LLM_JUDGE_DEBUG=1
poetry run python basicchat/evaluation/evaluators/check_llm_judge.py --quick
```

Check Ollama logs for issues:

```bash
# View Ollama logs
ollama logs

# Check system logs
journalctl -u ollama -f
```

The LLM Judge is integrated into the CI pipeline and runs:
- On every push to main branch
- On pull requests from the same repository
- After unit tests pass
- With fallback to OpenAI if Ollama fails
The CI configuration is in `.github/workflows/verify.yml` and includes:
- LLM Judge evaluation job
- Automatic fallback to OpenAI
- Artifact upload for results
- Integration with final test reports
- Run quick evaluations frequently during development
- Address critical issues immediately (score < 6)
- Plan to fix high priority issues (score 6-7)
- Use the action items as a development roadmap
- Run full evaluations before major releases
- Set up local development for all team members
- Use consistent thresholds across the team
- Review action items in team meetings
- Track progress over time
- Customize rules for your project needs
- Set appropriate thresholds for your project
- Use quick mode for faster feedback
- Configure fallback to OpenAI for reliability
- Upload artifacts for review
- Integrate with existing quality gates
1. **Run the setup script**: `./scripts/setup_local_llm_judge.sh`
2. **Try a quick evaluation**: `make llm-judge-quick`
3. **Review the action items**: Check `llm_judge_action_items.md`
4. **Implement improvements**: Follow the prioritized action plan
5. **Run regularly**: Integrate into your development workflow
- LLM Judge Evaluator Documentation
- Evaluation Rules Configuration
- GitHub Actions Workflow
- Makefile Commands
This guide covers local development setup. For production deployment and CI/CD integration, see the main `EVALUATORS.md` documentation.