This guide explains how to set up and run the LLM Judge evaluation system locally for development and testing.
Run the setup script to automatically configure everything:
```bash
./scripts/setup_local_llm_judge.sh
```

This script will:
- Check and install dependencies
- Set up Ollama and required models
- Test the LLM Judge functionality
- Run a quick evaluation
- Generate action items
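The dependency checks above can be sketched as follows; this is a hypothetical outline, and the actual script's logic may differ:

```shell
#!/bin/sh
# Sketch of the dependency checks a setup script like this performs.

# Return success if a tool is on PATH
have() { command -v "$1" >/dev/null 2>&1; }

for tool in python3 poetry ollama; do
    if have "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool (install it before continuing)"
    fi
done
```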
If you prefer to set up manually, follow these steps:
1. **Python 3.11+ and Poetry**

   ```bash
   # Install Poetry if not already installed
   curl -sSL https://install.python-poetry.org | python3 -
   ```

2. **Ollama**

   ```bash
   # Install Ollama
   curl -fsSL https://ollama.ai/install.sh | sh

   # Start Ollama service
   ollama serve

   # Pull required model
   ollama pull mistral
   ```

3. **Install dependencies**

   ```bash
   poetry install
   ```

4. **Create necessary directories**

   ```bash
   mkdir -p tests/data test_chroma_db logs
   ```

5. **Test the setup**

   ```bash
   poetry run python scripts/test_llm_judge.py
   ```
Quick evaluation:

```bash
# Using Makefile
make llm-judge-quick

# Using script directly
./scripts/run_llm_judge.sh quick auto 7.0

# Using poetry directly
poetry run python basicchat/evaluation/evaluators/check_llm_judge_smart.py --quick
```

Full evaluation:

```bash
# Using Makefile
make llm-judge

# Using script directly
./scripts/run_llm_judge.sh full auto 7.0

# Using poetry directly
poetry run python basicchat/evaluation/evaluators/check_llm_judge_smart.py
```

Forcing a specific backend:

```bash
# Force Ollama backend
LLM_JUDGE_FORCE_BACKEND=OLLAMA poetry run python basicchat/evaluation/evaluators/check_llm_judge_smart.py --quick

# Force OpenAI backend
LLM_JUDGE_FORCE_BACKEND=OPENAI poetry run python basicchat/evaluation/evaluators/check_llm_judge_smart.py --quick

# Using scripts with specific backend
./scripts/run_llm_judge.sh quick ollama 7.0
./scripts/run_llm_judge.sh quick openai 7.0
```

| Command | Description |
|---|---|
| `make llm-judge-quick` | Quick evaluation (smart backend) |
| `make llm-judge` | Full evaluation (smart backend) |
| `make llm-judge-ollama-quick` | Quick evaluation with Ollama |
| `make llm-judge-ollama` | Full evaluation with Ollama |
| `make llm-judge-openai-quick` | Quick evaluation with OpenAI |
| `make llm-judge-openai` | Full evaluation with OpenAI |
| `make test-and-evaluate` | Run tests + quick LLM judge |
| `make evaluate-all` | Run all tests + full LLM judge + performance test |
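The "auto" backend choice can be sketched as a probe that prefers a local Ollama server and falls back to OpenAI. The function name and probe endpoint below are illustrative, not the project's actual implementation:

```shell
#!/bin/sh
# Sketch of smart backend selection: prefer local Ollama, fall back
# to OpenAI if an API key is available.

OLLAMA_API_URL="${OLLAMA_API_URL:-http://localhost:11434/api}"

pick_backend() {
    # Probe the local Ollama server with a short timeout
    if curl -fsS --max-time 2 "$OLLAMA_API_URL/tags" >/dev/null 2>&1; then
        echo "OLLAMA"
    elif [ -n "$OPENAI_API_KEY" ]; then
        echo "OPENAI"
    else
        echo "NONE"
    fi
}

echo "Selected backend: $(pick_backend)"
```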
After running an evaluation, you'll get several files:
- `llm_judge_results.json` - Raw evaluation data
- `llm_judge_action_items.md` - Actionable improvement plan
- `llm_judge_improvement_tips.md` - Specific improvement tips
- `final_test_report.md` - Combined test and evaluation report
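A local quality gate can be built on the raw results. The `overall_score` field below is an assumed schema for illustration; check the actual file for the real field names:

```shell
#!/bin/sh
# Illustrative gate on evaluation results against the passing threshold.
THRESHOLD="${LLM_JUDGE_THRESHOLD:-7.0}"

# Demo input standing in for a real llm_judge_results.json
printf '{"overall_score": 7.4}\n' > demo_results.json

# Extract the (assumed) overall score
score=$(python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))["overall_score"])' demo_results.json)

# awk handles the float comparison portably
if awk -v s="$score" -v t="$THRESHOLD" 'BEGIN { exit !(s >= t) }'; then
    echo "PASS: $score >= $THRESHOLD"
else
    echo "FAIL: $score < $THRESHOLD"
fi
```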
- 10/10: Exemplary - Perfect adherence to best practices
- 8-9/10: Excellent - Minor improvements needed
- 7-8/10: Good - Some improvements needed
- 6-7/10: Acceptable - Notable issues but functional
- 5-6/10: Poor - Significant problems
- <5/10: Critical - Major issues requiring immediate attention
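The rubric above can be expressed as a small helper; the overlapping band edges in the rubric are resolved here with `>=` cutoffs:

```shell
#!/bin/sh
# Map a numeric score to its rubric label.
label() {
    awk -v s="$1" 'BEGIN {
        if (s >= 10)     print "Exemplary"
        else if (s >= 8) print "Excellent"
        else if (s >= 7) print "Good"
        else if (s >= 6) print "Acceptable"
        else if (s >= 5) print "Poor"
        else             print "Critical"
    }'
}

label 7.5   # prints "Good"
```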
- Code Quality - Structure, naming, complexity, Python best practices
- Test Coverage - Comprehensiveness, quality, effectiveness
- Documentation - README quality, inline docs, project documentation
- Architecture - Design patterns, modularity, scalability
- Security - Potential vulnerabilities, security best practices
- Performance - Code efficiency, optimization opportunities
| Variable | Default | Description |
|---|---|---|
| `LLM_JUDGE_THRESHOLD` | `7.0` | Minimum passing score |
| `LLM_JUDGE_FORCE_BACKEND` | - | Force specific backend (OLLAMA/OPENAI) |
| `OLLAMA_API_URL` | `http://localhost:11434/api` | Ollama API URL |
| `OLLAMA_MODEL` | `mistral` | Ollama model to use |
| `OPENAI_API_KEY` | - | OpenAI API key (required for OpenAI backend) |
| `OPENAI_MODEL` | `gpt-3.5-turbo` | OpenAI model to use |
The evaluation rules are defined in `basicchat/evaluation/evaluators/llm_judge_rules.json`. You can customize:
- Evaluation criteria and weights
- Best practices guidelines
- File patterns and exclusions
- Consistency checks
- Priority levels
```bash
# Solution: Use poetry to run commands
poetry run python basicchat/evaluation/evaluators/check_llm_judge.py --quick
```

```bash
# Solution: Start Ollama service
ollama serve

# Check if it's running
curl http://localhost:11434/api/tags
```

```bash
# Solution: Pull the required model
ollama pull mistral

# List available models
ollama list
```

This usually means the LLM response wasn't properly formatted. Try:
- Running again (temporary issue)
- Using a different model
- Checking Ollama logs
Check the detailed error message. Common causes:
- Ollama not running
- Model not available
- Network connectivity issues
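The common causes above can be checked quickly with a diagnostic like this (a sketch; adjust the model name and URL to your configuration):

```shell
#!/bin/sh
# Quick diagnostics for common LLM Judge failure causes.

# Is the Ollama server responding?
ollama_reachable() {
    curl -fsS --max-time 2 "${OLLAMA_API_URL:-http://localhost:11434/api}/tags" >/dev/null 2>&1
}

# Is the model pulled locally?
model_present() {
    ollama list 2>/dev/null | grep -q "$1"
}

if ollama_reachable; then echo "Ollama: reachable"; else echo "Ollama: not reachable"; fi
if model_present mistral; then echo "mistral: available"; else echo "mistral: missing"; fi
```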
Enable debug mode for more detailed output:
```bash
export LLM_JUDGE_DEBUG=1
poetry run python basicchat/evaluation/evaluators/check_llm_judge.py --quick
```

Check Ollama logs for issues:

```bash
# View Ollama logs
ollama logs

# Check system logs
journalctl -u ollama -f
```

The LLM Judge is integrated into the CI pipeline and runs:
- On every push to main branch
- On pull requests from the same repository
- After unit tests pass
- With fallback to OpenAI if Ollama fails
The CI configuration is in `.github/workflows/verify.yml` and includes:
- LLM Judge evaluation job
- Automatic fallback to OpenAI
- Artifact upload for results
- Integration with final test reports
- Run quick evaluations frequently during development
- Address critical issues immediately (score < 6)
- Plan to fix high priority issues (score 6-7)
- Use the action items as a development roadmap
- Run full evaluations before major releases
- Set up local development for all team members
- Use consistent thresholds across the team
- Review action items in team meetings
- Track progress over time
- Customize rules for your project needs
- Set appropriate thresholds for your project
- Use quick mode for faster feedback
- Configure fallback to OpenAI for reliability
- Upload artifacts for review
- Integrate with existing quality gates
1. **Run the setup script**: `./scripts/setup_local_llm_judge.sh`
2. **Try a quick evaluation**: `make llm-judge-quick`
3. **Review the action items**: Check `llm_judge_action_items.md`
4. **Implement improvements**: Follow the prioritized action plan
5. **Run regularly**: Integrate into your development workflow
- LLM Judge Evaluator Documentation
- Evaluation Rules Configuration
- GitHub Actions Workflow
- Makefile Commands
This guide covers local development setup. For production deployment and CI/CD integration, see the main `EVALUATORS.md` documentation.