
LLM-as-a-Judge for RCA Agent Validation

A comprehensive evaluation system for validating the accuracy and quality of AI-generated Root Cause Analysis (RCA) outputs from test automation failures.

Features

🔢 Quantitative Metrics

  • Semantic Similarity: Using sentence transformers for meaning comparison (see the sketch after this list)
  • ROUGE Scores: Text overlap and n-gram matching
  • BERT Score: Deep semantic understanding evaluation
  • Category Accuracy: Classification correctness
  • Evidence Quality: Coverage, relevance, and completeness analysis
  • Confidence Alignment: How well confidence levels match expectations
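
As a minimal illustration of the semantic-similarity metric (not necessarily the repository's exact implementation), cosine similarity over sentence-transformer embeddings can be computed as follows; the model name here is a common default and is an assumption:

# Minimal sketch: cosine similarity between two RCA descriptions.
# Requires the sentence-transformers package; the model choice is
# illustrative, not necessarily what RCAEvaluationMetrics uses internally.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
expected = "Modal dialog blocking interaction"
generated = "A modal overlay prevented the click from reaching the element"

emb = model.encode([expected, generated], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()  # roughly in [-1, 1]
print(f"Semantic similarity: {similarity:.3f}")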

🤖 LLM-based Qualitative Evaluation

  • Logical Reasoning: Evaluates causal chain and consistency
  • Actionability: Assesses practicality and specificity of recommendations
  • Domain Expertise: Tests technical accuracy and best practices knowledge
  • Comprehensive Grading: A+ to F scoring with detailed feedback

📊 Evaluation Components

  1. Primary Root Cause Analysis

    • Category classification accuracy
    • Description semantic similarity
    • Confidence level alignment
  2. Supporting Evidence Assessment

    • Coverage of expected evidence points
    • Relevance of generated evidence
    • Completeness balance
  3. Alternative Causes Evaluation

    • Accuracy of alternative hypotheses
    • Completeness of consideration
  4. Recommendations Quality

    • Actionability and specificity
    • Feasibility and relevance
    • Clarity and prioritization

Installation

  1. Clone and set up:
git clone <repository-url>
cd llm-as-judge
pip install -r requirements.txt
  2. Environment setup: Create a .env file:
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here  # Optional
  3. Download NLTK data (from a Python shell):
import nltk
nltk.download('punkt')

Quick Start

Running the Web UI

To start the web interface for interactive RCA evaluation:

source venv/bin/activate && python web_ui.py

The web UI will be available at http://localhost:8000 and provides:

  • Interactive single RCA evaluation
  • Batch processing with Excel file upload
  • LLM-as-a-Judge integration
  • Export results to Excel

Basic Usage

from models import (
    TestFailureContext, RCAOutput, RootCause, EvaluationInput,
    TestType, FailureCategory, ConfidenceLevel,  # enums used below; adjust the import path if they live elsewhere
)
from llm_judge import LLMJudge, LLMJudgeConfig
from evaluation_metrics import RCAEvaluationMetrics

# Create test context
test_context = TestFailureContext(
    test_name="test_user_login",
    test_type=TestType.UI,
    error_message="ElementNotInteractableException: Element is not clickable",
    logs=["Error: Modal overlay blocking interaction"]
)

# Define expected and generated RCA
expected_rca = RCAOutput(
    primary_root_cause=RootCause(
        category=FailureCategory.LOCATOR,
        description="Modal dialog blocking interaction",
        confidence=ConfidenceLevel.HIGH
    ),
    # ... other fields
)

generated_rca = RCAOutput(
    # Your RCA agent's output
)

# Create evaluation input
eval_input = EvaluationInput(
    test_context=test_context,
    expected_rca=expected_rca,
    generated_rca=generated_rca
)

# Run quantitative evaluation
metrics = RCAEvaluationMetrics()
results = metrics.comprehensive_evaluation(eval_input)
print(f"Overall Score: {results['overall_score']:.3f}")

# Run LLM-based evaluation (requires API key)
config = LLMJudgeConfig(provider="openai", model="gpt-4")
judge = LLMJudge(config)
llm_results = judge.comprehensive_llm_evaluation(eval_input)
print(f"Final Grade: {llm_results['final_assessment']['grade']}")

Running Examples

python example_usage.py

Evaluation Metrics

Overall Score Calculation

  • Primary Cause (35%): Category accuracy + description similarity + confidence alignment
  • Evidence Quality (20%): Coverage + relevance + completeness
  • Alternative Causes (15%): Accuracy + completeness of alternatives
  • Analysis Summary (20%): Semantic similarity + BERT score + ROUGE
  • Recommendations (10%): Coverage + relevance + actionability
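
These weights are applied inside RCAEvaluationMetrics.comprehensive_evaluation; purely as an illustration of how they combine (the component keys below are assumptions, loosely matching the quantitative output example later in this README), the overall score amounts to a weighted sum:

# Illustration only: a weighted sum over per-component scores in [0, 1].
# The component keys are assumed names, not necessarily those used internally.
WEIGHTS = {
    "primary_cause": 0.35,
    "evidence": 0.20,
    "alternative_causes": 0.15,
    "analysis_summary": 0.20,
    "recommendations": 0.10,
}

def weighted_overall_score(component_scores: dict) -> float:
    # component_scores maps each component name to a score in [0, 1]
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)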

LLM Judge Scoring

  • Quantitative Metrics (40%)
  • Logical Reasoning (25%)
  • Actionability (20%)
  • Domain Expertise (15%)

Configuration Options

LLM Provider Configuration

# OpenAI Configuration
config = LLMJudgeConfig(
    provider="openai",
    model="gpt-4",
    temperature=0.1,
    max_tokens=2000
)

# Anthropic Configuration
config = LLMJudgeConfig(
    provider="anthropic",
    model="claude-3-opus-20240229",
    temperature=0.1,
    max_tokens=2000
)

Metrics Configuration

# Custom sentence transformer model
metrics = RCAEvaluationMetrics(model_name="all-mpnet-base-v2")

API Reference

Core Classes

RCAEvaluationMetrics

Quantitative evaluation using semantic similarity and text metrics.

Key Methods:

  • comprehensive_evaluation(evaluation_input): Full quantitative assessment
  • semantic_similarity(text1, text2): Semantic similarity score
  • evaluate_evidence_quality(expected, generated): Evidence assessment
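
A quick standalone call, assuming semantic_similarity returns a float score as the Output Examples suggest:

# Standalone metric call; signature taken from the method list above.
metrics = RCAEvaluationMetrics()
score = metrics.semantic_similarity(
    "Modal dialog blocking interaction",
    "A modal overlay prevented the click from reaching the element",
)
print(f"Semantic similarity: {score:.3f}")  # assumed to be a float in [0, 1]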

LLMJudge

LLM-powered qualitative evaluation and reasoning assessment.

Key Methods:

  • comprehensive_llm_evaluation(evaluation_input): Full LLM assessment
  • evaluate_logical_reasoning(evaluation_input): Logic and causality
  • evaluate_actionability(evaluation_input): Recommendation quality
  • evaluate_domain_expertise(evaluation_input): Technical accuracy
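
Each dimension can also be scored on its own; a sketch reusing the eval_input built in Basic Usage (the return shapes are not documented here, so nothing specific is read from them):

# Per-dimension evaluation; each method takes the same EvaluationInput.
judge = LLMJudge(LLMJudgeConfig(provider="openai", model="gpt-4"))
reasoning_result = judge.evaluate_logical_reasoning(eval_input)
actionability_result = judge.evaluate_actionability(eval_input)
expertise_result = judge.evaluate_domain_expertise(eval_input)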

Data Models

TestFailureContext

Test execution context and failure information.

RCAOutput

Complete RCA analysis structure with primary cause, alternatives, and recommendations.

EvaluationInput

Input structure containing test context, expected RCA, and generated RCA.

Use Cases

1. RCA Agent Development

Validate and improve your RCA agent during development:

# Score candidate RCA agent outputs and keep the best-performing iteration.
# rca_iterations, evaluate_rca, threshold and deploy_model are placeholders
# for your own pipeline; evaluate_rca would typically wrap
# metrics.comprehensive_evaluation and return the overall score.
for iteration in rca_iterations:
    score = evaluate_rca(iteration)
    if score > threshold:
        deploy_model(iteration)

2. Continuous Monitoring

Monitor RCA quality in production:

# Automated quality checks; quality_threshold and alert_team are placeholders.
def monitor_rca_quality(test_context, expected_rca, generated_rca):
    eval_input = EvaluationInput(
        test_context=test_context,
        expected_rca=expected_rca,
        generated_rca=generated_rca,
    )
    result = judge.comprehensive_llm_evaluation(eval_input)
    if result['final_assessment']['final_score'] < quality_threshold:
        alert_team("RCA quality degradation detected")

3. Batch Evaluation

Evaluate multiple test cases:

results = []
for test_case in test_suite:  # test_suite: a list of EvaluationInput objects
    result = metrics.comprehensive_evaluation(test_case)
    results.append(result)

average_score = sum(r['overall_score'] for r in results) / len(results)

Output Examples

Quantitative Results

{
  "overall_score": 0.847,
  "primary_cause": {
    "category_accuracy": 1.0,
    "description_similarity": 0.923,
    "confidence_alignment": 1.0
  },
  "evidence": {
    "coverage": 0.856,
    "relevance": 0.891,
    "completeness": 0.778
  }
}
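
The quantitative result is a nested dict, so the scores shown above can be read directly (continuing the Basic Usage example):

# Read nested scores from the quantitative result dict (keys as shown above)
results = metrics.comprehensive_evaluation(eval_input)
print(f"Overall:           {results['overall_score']:.3f}")
print(f"Category accuracy: {results['primary_cause']['category_accuracy']:.2f}")
print(f"Evidence coverage: {results['evidence']['coverage']:.3f}")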

LLM Assessment

{
  "final_assessment": {
    "final_score": 0.823,
    "grade": "A",
    "strengths": [
      "Correctly identified failure category",
      "Strong logical reasoning and causal analysis"
    ],
    "weaknesses": [
      "Could provide more specific recommendations"
    ]
  }
}
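
The LLM verdict follows the same pattern; for example, to surface the grade and feedback from the result returned in Basic Usage:

# Pull the grade and feedback lists out of the LLM judge result
assessment = llm_results['final_assessment']
print(f"Grade: {assessment['grade']} (score {assessment['final_score']:.3f})")
for strength in assessment['strengths']:
    print(f"  + {strength}")
for weakness in assessment['weaknesses']:
    print(f"  - {weakness}")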

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Support

For issues and questions:

  • Create an issue on GitHub
  • Check the example usage for common patterns
  • Review the API documentation

Note: This system requires API keys for LLM-based evaluation. Quantitative metrics work without external APIs.
