A comprehensive evaluation system for validating the accuracy and quality of AI-generated Root Cause Analysis (RCA) outputs from test automation failures.
- Semantic Similarity: Using sentence transformers for meaning comparison
- ROUGE Scores: Text overlap and n-gram matching
- BERT Score: Deep semantic understanding evaluation
- Category Accuracy: Classification correctness
- Evidence Quality: Coverage, relevance, and completeness analysis
- Confidence Alignment: How well confidence levels match expectations
- Logical Reasoning: Evaluates causal chain and consistency
- Actionability: Assesses practicality and specificity of recommendations
- Domain Expertise: Tests technical accuracy and best practices knowledge
- Comprehensive Grading: A+ to F scoring with detailed feedback
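As a rough illustration of what the text-similarity metrics above measure, here is a minimal sketch using the open-source `sentence-transformers` and `rouge-score` packages; this is not the repository's own implementation, just a demonstration of the underlying ideas:

```python
# Illustration only: what "semantic similarity" and "ROUGE" mean here.
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer

expected = "Modal dialog blocking interaction with the login button"
generated = "A modal overlay prevented the click on the login button"

# Semantic similarity: cosine similarity of sentence embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([expected, generated], convert_to_tensor=True)
semantic_sim = util.cos_sim(emb[0], emb[1]).item()

# ROUGE: n-gram / longest-common-subsequence overlap
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(expected, generated)

print(f"semantic similarity: {semantic_sim:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```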
- Primary Root Cause Analysis
  - Category classification accuracy
  - Description semantic similarity
  - Confidence level alignment
- Supporting Evidence Assessment
  - Coverage of expected evidence points
  - Relevance of generated evidence
  - Completeness balance
- Alternative Causes Evaluation
  - Accuracy of alternative hypotheses
  - Completeness of consideration
- Recommendations Quality
  - Actionability and specificity
  - Feasibility and relevance
  - Clarity and prioritization
- Clone and set up:
  ```bash
  git clone <repository-url>
  cd llm-as-judge
  pip install -r requirements.txt
  ```
- Environment setup: create a `.env` file with your API keys:
  ```
  OPENAI_API_KEY=your_openai_api_key_here
  ANTHROPIC_API_KEY=your_anthropic_api_key_here  # Optional
  ```
- Download NLTK data:
  ```python
  import nltk
  nltk.download('punkt')
  ```
To start the web interface for interactive RCA evaluation:
```bash
source venv/bin/activate && python web_ui.py
```
The web UI will be available at http://localhost:8000 and provides:
- Interactive single RCA evaluation
- Batch processing with Excel file upload
- LLM-as-a-Judge integration
- Export results to Excel
```python
# TestType, FailureCategory, and ConfidenceLevel are assumed to be exported
# by models alongside the data classes below.
from models import (
    TestFailureContext, RCAOutput, RootCause, EvaluationInput,
    TestType, FailureCategory, ConfidenceLevel,
)
from llm_judge import LLMJudge, LLMJudgeConfig
from evaluation_metrics import RCAEvaluationMetrics

# Create test context
test_context = TestFailureContext(
    test_name="test_user_login",
    test_type=TestType.UI,
    error_message="ElementNotInteractableException: Element is not clickable",
    logs=["Error: Modal overlay blocking interaction"]
)

# Define expected and generated RCA
expected_rca = RCAOutput(
    primary_root_cause=RootCause(
        category=FailureCategory.LOCATOR,
        description="Modal dialog blocking interaction",
        confidence=ConfidenceLevel.HIGH
    ),
    # ... other fields
)

generated_rca = RCAOutput(
    # Your RCA agent's output
)

# Create evaluation input
eval_input = EvaluationInput(
    test_context=test_context,
    expected_rca=expected_rca,
    generated_rca=generated_rca
)

# Run quantitative evaluation
metrics = RCAEvaluationMetrics()
results = metrics.comprehensive_evaluation(eval_input)
print(f"Overall Score: {results['overall_score']:.3f}")

# Run LLM-based evaluation (requires API key)
config = LLMJudgeConfig(provider="openai", model="gpt-4")
judge = LLMJudge(config)
llm_results = judge.comprehensive_llm_evaluation(eval_input)
print(f"Final Grade: {llm_results['final_assessment']['grade']}")
```
Run the complete example script:
```bash
python example_usage.py
```
The quantitative overall score is a weighted combination of:
- Primary Cause (35%): Category accuracy + description similarity + confidence alignment
- Evidence Quality (20%): Coverage + relevance + completeness
- Alternative Causes (15%): Accuracy + completeness of alternatives
- Analysis Summary (20%): Semantic similarity + BERT score + ROUGE
- Recommendations (10%): Coverage + relevance + actionability
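The component scores above roll up into the overall quantitative score as a simple weighted sum. A minimal sketch, with made-up component values for illustration:

```python
# Weighted combination of the quantitative components listed above.
# The component scores here are invented purely for illustration.
weights = {
    "primary_cause": 0.35,
    "evidence": 0.20,
    "alternative_causes": 0.15,
    "analysis_summary": 0.20,
    "recommendations": 0.10,
}
component_scores = {
    "primary_cause": 0.95,
    "evidence": 0.84,
    "alternative_causes": 0.70,
    "analysis_summary": 0.88,
    "recommendations": 0.75,
}
overall = sum(weights[k] * component_scores[k] for k in weights)
print(f"overall_score: {overall:.3f}")  # 0.857 for these example values
```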
The LLM-as-a-Judge final assessment then blends the quantitative score with the qualitative dimensions:
- Quantitative Metrics (40%)
- Logical Reasoning (25%)
- Actionability (20%)
- Domain Expertise (15%)
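A sketch of how these four dimensions could be blended into the final score and mapped to a letter grade; the dimension scores and grade cutoffs below are illustrative assumptions, not the judge's actual values:

```python
# Blend the four assessment dimensions listed above into a final score,
# then map it to a letter grade. Scores and cutoffs are assumptions.
dimension_weights = {
    "quantitative_metrics": 0.40,
    "logical_reasoning": 0.25,
    "actionability": 0.20,
    "domain_expertise": 0.15,
}
dimension_scores = {
    "quantitative_metrics": 0.86,
    "logical_reasoning": 0.80,
    "actionability": 0.75,
    "domain_expertise": 0.82,
}
final_score = sum(dimension_weights[k] * dimension_scores[k] for k in dimension_weights)

def to_grade(score: float) -> str:
    # Assumed cutoffs, purely for illustration (A+ .. F).
    for cutoff, grade in [(0.95, "A+"), (0.85, "A"), (0.75, "B"), (0.65, "C"), (0.55, "D")]:
        if score >= cutoff:
            return grade
    return "F"

print(f"final_score: {final_score:.3f}, grade: {to_grade(final_score)}")
```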
```python
# OpenAI Configuration
config = LLMJudgeConfig(
    provider="openai",
    model="gpt-4",
    temperature=0.1,
    max_tokens=2000
)

# Anthropic Configuration
config = LLMJudgeConfig(
    provider="anthropic",
    model="claude-3-opus-20240229",
    temperature=0.1,
    max_tokens=2000
)
```

```python
# Custom sentence transformer model for the quantitative metrics
metrics = RCAEvaluationMetrics(model_name="all-mpnet-base-v2")
```
`RCAEvaluationMetrics`: Quantitative evaluation using semantic similarity and text metrics.
Key Methods:
- `comprehensive_evaluation(evaluation_input)`: Full quantitative assessment
- `semantic_similarity(text1, text2)`: Semantic similarity score
- `evaluate_evidence_quality(expected, generated)`: Evidence assessment
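For instance, the similarity helper can be called directly on two root-cause descriptions. A small sketch, assuming `semantic_similarity` returns a float score as listed above:

```python
from evaluation_metrics import RCAEvaluationMetrics

metrics = RCAEvaluationMetrics()

# Compare an expected and a generated root-cause description.
# Assumes the method returns a plain float similarity score.
score = metrics.semantic_similarity(
    "Modal dialog blocking interaction",
    "A modal overlay prevented the element from being clicked",
)
print(f"description similarity: {score:.3f}")
```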
`LLMJudge`: LLM-powered qualitative evaluation and reasoning assessment.
Key Methods:
- `comprehensive_llm_evaluation(evaluation_input)`: Full LLM assessment
- `evaluate_logical_reasoning(evaluation_input)`: Logic and causality
- `evaluate_actionability(evaluation_input)`: Recommendation quality
- `evaluate_domain_expertise(evaluation_input)`: Technical accuracy
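The individual judge methods can also be run one at a time, for example when only reasoning quality matters. A sketch that reuses the `eval_input` built in the quick-start example; the exact shape of each returned result is an assumption, so it is simply printed:

```python
from llm_judge import LLMJudge, LLMJudgeConfig

judge = LLMJudge(LLMJudgeConfig(provider="openai", model="gpt-4"))

# eval_input is the EvaluationInput constructed in the quick-start example.
reasoning = judge.evaluate_logical_reasoning(eval_input)
actionability = judge.evaluate_actionability(eval_input)
expertise = judge.evaluate_domain_expertise(eval_input)

for name, result in [("logical_reasoning", reasoning),
                     ("actionability", actionability),
                     ("domain_expertise", expertise)]:
    print(name, result)
```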
- `TestFailureContext`: Test execution context and failure information.
- `RCAOutput`: Complete RCA analysis structure with primary cause, alternatives, and recommendations.
- `EvaluationInput`: Input structure containing test context, expected RCA, and generated RCA.
Validate and improve your RCA agent during development:
```python
# Test multiple iterations; evaluate_rca, deploy_model, rca_iterations, and
# threshold are placeholders for your own pipeline.
for iteration in rca_iterations:
    score = evaluate_rca(iteration)
    if score > threshold:
        deploy_model(iteration)
```
Monitor RCA quality in production:
```python
# Automated quality check; judge, quality_threshold, and alert_team are
# placeholders for your own monitoring setup.
def monitor_rca_quality(generated_rca, expected_rca):
    score = judge.evaluate(generated_rca, expected_rca)
    if score < quality_threshold:
        alert_team("RCA quality degradation detected")
```
Evaluate multiple test cases:
```python
# Run the quantitative evaluation over a suite of EvaluationInput cases.
results = []
for test_case in test_suite:
    result = metrics.comprehensive_evaluation(test_case)
    results.append(result)

average_score = sum(r['overall_score'] for r in results) / len(results)
```
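To persist batch results (for example, alongside the web UI's Excel export), here is a minimal sketch using pandas; flattening to one `overall_score` row per case is an assumption about the result layout shown in the sample output below:

```python
import pandas as pd

# One row per evaluated case; assumes each result dict exposes "overall_score".
rows = [
    {"case": i, "overall_score": r["overall_score"]}
    for i, r in enumerate(results)
]
pd.DataFrame(rows).to_excel("rca_evaluation_results.xlsx", index=False)
```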
Sample quantitative evaluation output:
```json
{
  "overall_score": 0.847,
  "primary_cause": {
    "category_accuracy": 1.0,
    "description_similarity": 0.923,
    "confidence_alignment": 1.0
  },
  "evidence": {
    "coverage": 0.856,
    "relevance": 0.891,
    "completeness": 0.778
  }
}
```
Sample LLM-as-a-Judge assessment:
```json
{
  "final_assessment": {
    "final_score": 0.823,
    "grade": "A",
    "strengths": [
      "Correctly identified failure category",
      "Strong logical reasoning and causal analysis"
    ],
    "weaknesses": [
      "Could provide more specific recommendations"
    ]
  }
}
```
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
MIT License - see LICENSE file for details.
For issues and questions:
- Create an issue on GitHub
- Check the example usage for common patterns
- Review the API documentation
Note: This system requires API keys for LLM-based evaluation. Quantitative metrics work without external APIs.