"I write the code, then I write a unit test to prove it works."
In traditional software engineering, developers write implementation code (functions, classes, algorithms) and then write unit tests to verify the implementation works correctly. This approach works well for deterministic systems.
"In a probabilistic world, you can't write a unit test that covers every creative variation of an AI's answer."
When working with AI systems, traditional unit testing breaks down. An AI might give different (but equally valid) responses to the same input. Testing becomes about evaluating quality across multiple dimensions rather than checking for exact matches.
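To make the contrast concrete, here is a toy illustration in plain Python (no framework assumed): two equally valid answers fail an exact-match assertion, but a dimension-based check scores both as correct.

```python
# Two equally valid AI answers to the same question
answer_a = "The answer is 4."
answer_b = "Two plus two equals 4."

# Exact-match unit testing breaks down: the strings differ
assert answer_a != answer_b

# Evaluation checks a quality dimension instead of an exact string.
# A toy correctness check: does the response contain the required fact?
def correctness(answer: str) -> float:
    return 1.0 if "4" in answer else 0.0

assert correctness(answer_a) == 1.0
assert correctness(answer_b) == 1.0
```

Real evaluators are richer than a substring check, but the principle is the same: score a dimension, don't compare strings.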
"If the AI is the Coder, the Human is the Examiner. We don't write the implementation anymore. We write the Exam."
Evaluation Engineering produces the most valuable code a Senior Engineer writes today. Instead of writing parseDate(), you write:
- Golden Dataset: 50 tricky, malformed date strings with expected outputs
- Scoring Rubric: "If the answer is correct but rude, score 5/10. If incorrect but polite, score 0/10."
- Pass Threshold: AI must score >90% to pass
This is the evolution of TDD (Test-Driven Development) into Eval-DD (Evaluation-Driven Development): You write the "Golden Dataset" first and let the AI iterate until it scores >90% against your rubric.
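The Eval-DD cycle can be sketched in a few lines of plain Python; the tiny dataset, scoring function, and stand-in "AI" below are all illustrative, not part of any real framework.

```python
# Golden Dataset first: (input, expected) pairs define what "correct" means
golden_dataset = [
    ("2024-01-15", "2024-01-15"),
    ("Jan 15, 2024", "2024-01-15"),
    ("2024-13-01", "ERROR"),
]

def score(ai, dataset):
    """Fraction of cases where the AI's output matches the expectation."""
    return sum(1 for inp, expected in dataset if ai(inp) == expected) / len(dataset)

PASS_THRESHOLD = 0.9

# Stand-in "AI": a lookup table playing the role of a model that has
# iterated (re-prompted, retrained) until it handles every case
def candidate_ai(text):
    lookup = {"2024-01-15": "2024-01-15",
              "Jan 15, 2024": "2024-01-15",
              "2024-13-01": "ERROR"}
    return lookup.get(text, "ERROR")

# Eval-DD gate: the AI must clear the threshold before it ships
assert score(candidate_ai, golden_dataset) > PASS_THRESHOLD
```

In practice the loop between scoring and improving is where the iteration happens: each failed case drives a prompt tweak, more training data, or a model swap, and then the score is recomputed.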
The Golden Dataset is your test suite. It defines what "good" looks like through examples.
from evaluation_engineering import EvaluationDataset

dataset = EvaluationDataset(
    name="Date Parsing Golden Dataset",
    description="50 tricky, malformed date strings"
)

# Instead of writing parseDate(), write examples
dataset.add_case(
    id="parse_date_001",
    input="Parse this date: 2024-01-15",
    expected_output="2024-01-15",
    tags=["basic", "iso"]
)

dataset.add_case(
    id="parse_date_002",
    input="Parse this date: Jan 15, 2024",
    expected_output="2024-01-15",
    tags=["readable"]
)

# Edge cases
dataset.add_case(
    id="parse_date_edge_001",
    input="Parse this date: 2024-13-01",
    expected_output="ERROR",
    tags=["invalid", "edge_case"],
    difficulty="hard"
)

The Scoring Rubric defines HOW to evaluate responses. This is where you encode your engineering judgment.
from evaluation_engineering import ScoringRubric

rubric = ScoringRubric(
    name="Customer Service Rubric",
    description="Evaluates correctness AND tone"
)

# Multi-dimensional evaluation
rubric.add_criteria(
    dimension="correctness",
    weight=0.5,  # 50% of score
    description="Does it solve the problem?",
    evaluator=correctness_evaluator
)

rubric.add_criteria(
    dimension="tone",
    weight=0.4,  # 40% of score - tone is almost as important!
    description="Is it polite and empathetic?",
    evaluator=tone_evaluator
)

rubric.add_criteria(
    dimension="safety",
    weight=0.1,  # 10% of score
    description="No harmful content",
    evaluator=safety_evaluator
)

rubric.set_pass_threshold(0.85)  # Must score 85% to pass

The Evaluation Runner executes your AI against the Golden Dataset and scores it with your Rubric.
from evaluation_engineering import EvaluationRunner

# Your AI function (could be GPT-4, Claude, custom model, etc.)
def my_ai_function(input_text: str) -> str:
    # AI implementation here
    return ai_response

# Run the evaluation
runner = EvaluationRunner(dataset, rubric, my_ai_function)
results = runner.run(verbose=True)

print(f"Pass Rate: {results['pass_rate']*100:.1f}%")
print(f"Average Score: {results['average_score']:.2f}")

if results['overall_passed']:
    print("🎉 AI meets requirements!")
else:
    print("❌ AI needs improvement")

# Debug failed cases
failed = runner.get_failed_cases()
for case in failed:
    print(f"Failed: {case.case_id}")
    print(f"  Input: {case.input}")
    print(f"  Expected: {case.expected_output}")
    print(f"  Got: {case.actual_output}")

The most important insight: Correctness alone is not enough.
# Example: Customer Service Agent
# Scenario: Customer asks "How do I reset my password?"
# Response A: "Just click forgot password. Duh."
# Correctness: 100% (answer is correct)
# Tone: 0% (rude)
# Overall: 50% (assuming 50/50 weight)
# Response B: "I'm happy to help! Click 'Forgot Password' on the login page."
# Correctness: 100% (answer is correct)
# Tone: 100% (polite and helpful)
# Overall: 100%
# Response C: "I apologize for any confusion. Let me help you with that."
# Correctness: 0% (doesn't answer the question)
# Tone: 100% (very polite)
# Overall: 50% (assuming 50/50 weight)

This is why the rubric matters: "If correct but rude, score 5/10. If incorrect but polite, score 0/10."
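The overall numbers above are just a weighted average of the dimension scores. A minimal sketch of that computation (the function name is illustrative):

```python
def overall_score(correctness: float, tone: float,
                  w_correctness: float = 0.5, w_tone: float = 0.5) -> float:
    """Weighted average across quality dimensions."""
    return w_correctness * correctness + w_tone * tone

assert overall_score(1.0, 0.0) == 0.5   # Response A: correct but rude
assert overall_score(1.0, 1.0) == 1.0   # Response B: correct and polite
assert overall_score(0.0, 1.0) == 0.5   # Response C: polite but no answer
```

Shifting the weights changes the verdict: at 70/30 in favor of correctness, Response A outscores Response C, which is exactly the judgment the rubric is meant to encode.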
In traditional development, you write a specification document, then write code to implement it. In Eval-DD, the dataset IS the specification.
# Traditional Spec:
# "parseDate() should accept ISO 8601 format, US format (MM/DD/YYYY),
# EU format (DD/MM/YYYY), and readable formats like 'Jan 15, 2024'.
# It should return dates in ISO format (YYYY-MM-DD).
# Invalid dates should return 'ERROR'."
# Eval-DD Spec (via dataset):
dataset.add_case(id="iso", input="2024-01-15", expected_output="2024-01-15")
dataset.add_case(id="us", input="01/15/2024", expected_output="2024-01-15")
dataset.add_case(id="eu", input="15/01/2024", expected_output="2024-01-15")
dataset.add_case(id="readable", input="Jan 15, 2024", expected_output="2024-01-15")
dataset.add_case(id="invalid", input="2024-13-01", expected_output="ERROR")

The dataset is clearer, more precise, and easier to maintain than prose.
In Eval-DD, you focus on edge cases from day one.
# In traditional TDD, you might start with the happy path:
def test_parse_date():
    assert parseDate("2024-01-15") == "2024-01-15"

# In Eval-DD, you enumerate edge cases immediately:
dataset.add_case(id="edge_001", input="2024.01.15", expected_output="2024-01-15")  # Dots
dataset.add_case(id="edge_002", input="15-JAN-2024", expected_output="2024-01-15")  # Uppercase
dataset.add_case(id="edge_003", input="january 15 2024", expected_output="2024-01-15")  # Lowercase
dataset.add_case(id="edge_004", input="1/5/24", expected_output="2024-01-05")  # Short format
dataset.add_case(id="edge_005", input="32/01/2024", expected_output="ERROR")  # Invalid day

from evaluation_engineering import EvaluationDataset, ScoringRubric, EvaluationRunner
# 1. Write the Golden Dataset (this is your "code")
dataset = EvaluationDataset("Date Parsing", "Test suite for date parsing")

# Add 50 test cases covering all edge cases
for i, (input_str, expected) in enumerate(test_cases):
    dataset.add_case(
        id=f"date_{i:03d}",
        input=f"Parse: {input_str}",
        expected_output=expected
    )

# 2. Write the Scoring Rubric
rubric = ScoringRubric("Date Parser Rubric", "Correctness + Clarity")
rubric.add_criteria("correctness", 0.7, "Correct parse", correctness_eval)
rubric.add_criteria("clarity", 0.3, "Clear response", clarity_eval)
rubric.set_pass_threshold(0.9)

# 3. Test your AI
runner = EvaluationRunner(dataset, rubric, my_ai_parser)
results = runner.run()

# 4. Iterate until passing
if not results['overall_passed']:
    print("Failed cases:")
    for case in runner.get_failed_cases():
        print(f"  {case.case_id}: {case.input}")
    # Improve AI (retrain, adjust prompt, etc.)
    # Run again

# 1. Golden Dataset with behavioral expectations
dataset = EvaluationDataset("Customer Service", "CS agent evaluation")
dataset.add_case(
    id="complaint_001",
    input="My order is late!",
    expected_behavior="Apologize, show empathy, offer solution",
    tags=["complaint", "empathy"]
)
dataset.add_case(
    id="technical_001",
    input="How do I reset password?",
    expected_output="Click 'Forgot Password', check email",
    tags=["technical"]
)

# 2. Multi-dimensional Rubric
rubric = ScoringRubric("CS Agent Rubric", "Quality + Tone + Safety")
rubric.add_criteria("correctness", 0.5, "Solves problem", solve_eval)
rubric.add_criteria("tone", 0.4, "Polite and helpful", tone_eval)
rubric.add_criteria("safety", 0.1, "No harmful content", safety_eval)
rubric.set_pass_threshold(0.85)

# 3. Run evaluation
runner = EvaluationRunner(dataset, rubric, cs_agent)
results = runner.run()

# Evaluate code generation on multiple dimensions
dataset = EvaluationDataset("Code Generation", "Python function generation")
dataset.add_case(
    id="code_001",
    input="Write a function to calculate factorial",
    expected_behavior="Correct implementation, handles edge cases, includes docstring",
    tags=["python", "recursion"]
)

rubric = ScoringRubric("Code Quality", "Multi-dimensional code evaluation")
rubric.add_criteria("correctness", 0.4, "Code works", correctness_eval)
rubric.add_criteria("efficiency", 0.2, "Time complexity", efficiency_eval)
rubric.add_criteria("readability", 0.2, "Clear and documented", readability_eval)
rubric.add_criteria("safety", 0.2, "Handles edge cases", safety_eval)

You can write custom evaluators for any dimension:
from typing import Any

def tone_evaluator(input_text: str, output_text: str, context: Any) -> float:
    """
    Evaluate the tone of the response.
    Returns 0.0 (rude) to 1.0 (polite).
    """
    output_lower = output_text.lower()

    # Negative indicators
    rude_words = ["stupid", "idiot", "obviously", "duh"]
    if any(word in output_lower for word in rude_words):
        return 0.0

    # Positive indicators
    polite_words = ["please", "thank you", "happy to help", "sorry"]
    polite_count = sum(1 for word in polite_words if word in output_lower)
    if polite_count >= 2:
        return 1.0
    elif polite_count == 1:
        return 0.7
    else:
        return 0.5  # Neutral

rubric.add_criteria(
    dimension="tone",
    weight=0.4,
    description="Politeness and empathy",
    evaluator=tone_evaluator
)

Before writing any code, create your golden dataset. This forces you to think about:
- What inputs should the system handle?
- What outputs are expected?
- What edge cases exist?
- What quality dimensions matter?
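Answering those questions can be as lightweight as sketching the dataset with plain dicts, before any framework is involved; the field names below are illustrative:

```python
# A dataset-first sketch: each question above becomes concrete cases
golden_dataset = [
    # What inputs should the system handle? What outputs are expected?
    {"id": "happy_001", "input": "2024-01-15", "expected_output": "2024-01-15"},
    # What edge cases exist?
    {"id": "edge_001", "input": "", "expected_output": "ERROR"},
    {"id": "edge_002", "input": "2024-13-01", "expected_output": "ERROR"},
    # What quality dimensions matter? Tag cases so scores can be sliced later.
    {"id": "tone_001", "input": "My order is late!",
     "dimensions": ["correctness", "tone"]},
]

# Sanity check: every case is identifiable and has an input
assert all("id" in case and "input" in case for case in golden_dataset)
```

Gaps in this sketch (a missing edge case, an untagged dimension) are visible immediately, which is the point of writing it before any implementation exists.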
Don't rely on "it should handle edge cases." Enumerate them:
# Bad (vague)
# "The system should handle invalid inputs"
# Good (explicit)
dataset.add_case(id="empty_input", input="", expected_output="ERROR")
dataset.add_case(id="null_input", input="null", expected_output="ERROR")
dataset.add_case(id="special_chars", input="@#$%", expected_output="ERROR")
dataset.add_case(id="very_long", input="x" * 10000, expected_output="ERROR")

Think carefully about the relative importance of each dimension:
# Customer-facing system: Tone matters a lot
rubric.add_criteria("correctness", 0.5, ...)
rubric.add_criteria("tone", 0.4, ...) # Almost as important!
rubric.add_criteria("safety", 0.1, ...)
# Internal tool: Correctness matters most
rubric.add_criteria("correctness", 0.8, ...)
rubric.add_criteria("efficiency", 0.15, ...)
rubric.add_criteria("clarity", 0.05, ...)

Treat datasets like source code:
dataset.metadata = {
    "version": "2.0",
    "created_at": "2024-01-15",
    "author": "engineering_team",
    "changes": "Added 20 new edge cases for emoji handling"
}

When cases fail, use them to improve:
# Run evaluation
runner = EvaluationRunner(dataset, rubric, ai_function)
results = runner.run()

# Analyze failures
for case in runner.get_failed_cases():
    print(f"\nFailed: {case.case_id}")
    print(f"Score: {case.scores['overall_score']:.2f}")
    print(f"Input: {case.input}")
    print(f"Expected: {case.expected_output}")
    print(f"Got: {case.actual_output}")

    # Dimension breakdown
    for dim, scores in case.scores['dimension_scores'].items():
        print(f"  {dim}: {scores['score']:.2f}")

# Use failures to:
# 1. Add more training data
# 2. Adjust prompts
# 3. Fine-tune models
# 4. Update rubric weights

Save results to track improvement:
# After each iteration
runner.save_results(f"results_{version}_{timestamp}.json")
# Compare versions
v1_results = load_results("results_v1.json")
v2_results = load_results("results_v2.json")
print(f"V1 Pass Rate: {v1_results['pass_rate']:.1%}")
print(f"V2 Pass Rate: {v2_results['pass_rate']:.1%}")
print(f"Improvement: {(v2_results['pass_rate'] - v1_results['pass_rate']):.1%}")

The "Source Code" of the future isn't the application logic; it's the Evaluation Suite that constrains it.
In a world where AI systems generate the implementation, the engineer's role shifts from writing code to:
- Defining Quality: What does "good" look like across multiple dimensions?
- Enumerating Edge Cases: What are all the tricky scenarios?
- Setting Standards: What pass threshold is acceptable?
- Iterating Based on Data: How do we improve based on failures?
This is Evaluation Engineering - and it's the most valuable work a Senior Engineer does today.
See the code documentation in evaluation_engineering.py for detailed API reference.
- EvaluationDataset: Container for test cases
- EvaluationCase: Individual test case
- ScoringRubric: Multi-dimensional evaluation criteria
- ScoringCriteria: Single evaluation dimension
- EvaluationRunner: Executes evaluations and generates reports
- exact_match_evaluator: Check for exact string match
- contains_keywords_evaluator: Check for required keywords
- length_check_evaluator: Validate response length
- no_harmful_content_evaluator: Screen for harmful content
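The actual signatures live in evaluation_engineering.py; as a rough illustration of what the first two evaluators do, here are toy versions (these are assumptions for illustration, not the real implementations):

```python
def exact_match_evaluator(input_text: str, output_text: str, context: dict) -> float:
    # Toy sketch (not the real implementation): full credit only when the
    # output matches the expected string exactly, ignoring surrounding whitespace
    return 1.0 if output_text.strip() == context["expected_output"].strip() else 0.0

def contains_keywords_evaluator(input_text: str, output_text: str, context: dict) -> float:
    # Toy sketch: partial credit for each required keyword found in the output
    keywords = context["keywords"]
    found = sum(1 for kw in keywords if kw.lower() in output_text.lower())
    return found / len(keywords)

assert exact_match_evaluator("", " 2024-01-15 ", {"expected_output": "2024-01-15"}) == 1.0
assert contains_keywords_evaluator("", "Click Forgot Password",
                                   {"keywords": ["forgot", "password"]}) == 1.0
```

Both follow the same (input, output, context) -> score shape as the custom tone_evaluator shown earlier, so built-in and custom evaluators plug into a rubric interchangeably.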
# Run the main demonstration
python example_evaluation_engineering.py
# Run tests
python test_evaluation_engineering.py

The Evaluation Engineering framework integrates naturally with the self-evolving agent system:
from agent import DoerAgent
from observer import ObserverAgent
from evaluation_engineering import EvaluationDataset, ScoringRubric, EvaluationRunner
# Create evaluation suite
dataset = create_my_dataset()
rubric = create_my_rubric()
# Wrap DoerAgent for evaluation
def ai_function(input_text: str) -> str:
    doer = DoerAgent()
    result = doer.run(input_text, verbose=False)
    return result['response']

# Run evaluation
runner = EvaluationRunner(dataset, rubric, ai_function)
results = runner.run()

# If not passing, trigger evolution
if not results['overall_passed']:
    observer = ObserverAgent()
    observer.process_events()  # Learn from failures

This creates a closed loop: Evaluation → Evolution → Re-evaluation.
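A toy version of that closed loop, with a lookup-table "agent" standing in for the real DoerAgent/ObserverAgent pair (all names and logic here are illustrative):

```python
def closed_loop(memory: dict, dataset: list, threshold: float = 1.0,
                max_rounds: int = 10) -> int:
    """Evaluate, evolve on failures, re-evaluate. Returns rounds used."""
    for round_num in range(max_rounds):
        # Evaluation: which cases does the current agent fail?
        failures = [(inp, exp) for inp, exp in dataset if memory.get(inp) != exp]
        if 1 - len(failures) / len(dataset) >= threshold:
            return round_num  # passed the threshold
        # Evolution: "learn" from one failure per round
        # (real systems retrain, adjust prompts, or update skills here)
        inp, expected = failures[0]
        memory[inp] = expected
    return max_rounds

dataset = [("a", "1"), ("b", "2"), ("c", "3")]
print(closed_loop({}, dataset))  # 3: one evolution round per failed case
```

The key property this sketch preserves is that evolution is driven by evaluation failures, not by intuition: nothing changes unless a case in the Golden Dataset demands it.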
Evaluation Engineering represents a fundamental shift in how we build AI systems. Instead of writing parseDate(), we write the exam that parseDate() must pass. Instead of writing customer service responses, we write the rubric that defines quality.
This is the evolution of Test-Driven Development into Evaluation-Driven Development - and it's how Senior Engineers add value in the age of AI.