AI Evals For Engineers, PMs & QAs: Complete Study Guide

Based on the Maven course by Hamel Husain & Shreya Shankar, enriched with hands-on examples, production-ready code, and platform-specific guides for LangWatch, Langfuse, and more

Who is this guide for?

Engineers building AI-powered products who need to systematically evaluate quality
Product Managers who own the product experience and need to lead error analysis
QA Engineers who need to build automated evaluation pipelines for AI systems
Anyone who wants to learn how to evaluate AI applications without taking the full course

What you'll learn:

How to set up observability for any AI application
How to systematically find what's broken (error analysis)
How to build automated evaluators (code-based and LLM judges)
How to evaluate RAG systems, multi-step pipelines, and multi-turn conversations
How to run production evals: guardrails, safety, and real-time monitoring
How to use statistical correction to account for judge errors
How to close the loop: turn eval results into system improvements
How to do all of this with your observability platform of choice (LangWatch, Langfuse, Braintrust, LangSmith, or your own)

Platform Examples: This guide uses LangWatch (open-source, self-hosted or cloud) and Langfuse (open-source, cloud or self-hosted) as primary examples. The methodology is platform-agnostic — adapt it to whichever tool you use.

LangWatch vs Langfuse: Both are excellent open-source platforms with similar core capabilities. LangWatch offers simpler setup and built-in evaluators, while Langfuse provides more flexibility for custom pipelines and has a larger community. This guide shows both so you can choose based on your needs.

What Are AI Evals and Why You Need Them
Setting Up Observability
Error Analysis: The Secret Sauce
Building LLM-as-a-Judge Evaluators
Code-Based Evaluators
RAG System Evaluation
Multi-Step Pipeline Evaluation
Multi-Turn Conversation Evaluation
Production Evals: Safety, Guardrails & Monitoring
Statistical Correction with judgy
Closing the Loop: From Evals to Improvements
Human Annotation Best Practices
Cost, Latency & Scaling Evals
Practical Implementation Guide
Common Mistakes to Avoid
Tools and Resources

Appendices:

A: Glossary for PMs & QAs
B: Quick Reference
C: Complete Judge Prompts from Production
D: Pipeline State Evaluator Prompts
E: Judge Prompt Engineering Tips
F: Platform Methods Reference (LangWatch & Langfuse)
G: 30-Day Learning Path

Chapter 1: What Are AI Evals and Why You Need Them

Simple Definition

Evals (Evaluations) are systematic tests that check if your AI application is working correctly. Think of them like unit tests for traditional software, but for AI systems.

Why Everyone Needs Evals

There's a debate in the AI community: some people say "just vibe check your app" (meaning: just use it yourself and see if it feels good). But here's the truth:

Everyone needs evals. The people who say they don't need evals are actually benefiting from evals that someone else did upstream.

Example: If you're building a coding assistant with GPT-4, OpenAI already tested GPT-4 on massive code benchmarks. So you can "vibe check" your app. But for most applications that aren't simple uses of foundation models, you need your own evals.

The Three Core Truths About Evals

You can't improve what you don't measure
- Generic metrics like "helpfulness score" won't catch specific problems
- You need application-specific evals
Error analysis is the most important step
- More important than LLM judges
- More important than fancy observability tools
- This is where you actually learn what's broken
PMs and QAs must own error analysis, not just engineers
- Engineers know if code works
- PMs know if the product experience is good
- QAs know how to systematically break things
- You have the domain expertise
- This is product work, not just technical work

The AI Development Cycle is the Scientific Method

Building great AI products requires a rigorous evaluation process. In many ways, AI development IS the scientific method:

Observe - Trace your AI's behavior (Chapter 2)
Hypothesize - Identify what's broken through error analysis (Chapter 3)
Experiment - Build evaluators and test changes (Chapters 4-9)
Measure - Calculate metrics and correct for bias (Chapter 10)
Iterate - Improve based on data, not hunches (Chapter 11)

What Can Go Wrong Without Evals?

Your demo works great. Then production happens:

Users hit edge cases you never thought of
Text messages contain typos and unusual formatting
Dates are formatted differently than expected
AI tries to handle requests it should hand off to humans
Small prompt changes break things that were working

Example from real production data:

User: "I need a one bedroom with the bathroom NOT connected"
AI: Returns apartments with connected bathrooms (WRONG!)
User: "I do NOT want a bathroom connected to the room"
AI: "I'll check on that" but never actually checks
PLUS: AI used markdown formatting (* asterisks *) in a text message

Three different problems in one interaction! Without proper logging and evaluation, you'd never catch these patterns.

For PMs: Why This Is Your Job

Wrong approach: "This is technical AI stuff, let engineering figure it out"

Right approach: PMs should lead error analysis because:

You understand user needs
You have product taste
You have domain expertise
This is product work disguised as technical work

The teams shipping the best AI products have PMs who've personally reviewed hundreds or thousands of traces.

For QAs: Your New Superpower

Traditional QA involves test cases with expected outputs. AI QA is different:

Outputs are non-deterministic (same input can give different outputs)
"Correct" is often subjective
Edge cases are virtually infinite
You need automated evaluators that scale

But the core QA mindset - systematic testing, edge case thinking, regression prevention - is exactly what AI evals need. QAs who learn evals become incredibly valuable.

Chapter 2: Setting Up Observability

What is a Trace?

A trace is a complete recording of everything your AI did to respond to a user. It's like a detailed log that shows:

System prompt (instructions given to the AI)
User messages (what the person asked)
Tool calls (functions the AI tried to use)
Tool responses (what those functions returned)
Assistant responses (what the AI said back)
All context (everything the LLM saw when making decisions)

Example of a Complete Trace

=== TRACE ID: abc123 ===

SYSTEM PROMPT:
"You are a helpful property management assistant..."

USER MESSAGE:
"I need a one bedroom with the bathroom not connected"

TOOL CALL:
get_availability(bedrooms=1, bathroom_connected=None)

TOOL RESPONSE:
[
  {unit: "A101", bedrooms: 1, bathroom_connected: True},
  {unit: "B205", bedrooms: 1, bathroom_connected: True}
]

ASSISTANT RESPONSE:
"I found these apartments: A101 and B205..."
(Used markdown: ** ** in text message)

What Information to Capture

Minimum requirements:

Input (user message)
Output (AI response)
Timestamp
Unique ID for the interaction

Better to include:

System prompts used
Tool calls and their results
Model parameters (temperature, max_tokens, etc.)
Token counts
Latency (response time)
Cost per request

Best practice:

User context (session history)
Error messages if any occurred
Model version used
Feature flags active at the time

Choosing an Observability Platform

Tool	Type	Best For	Cost
LangWatch	Open source, cloud or self-hosted	Simple setup, built-in evaluators, great UX	Free tier + paid
Langfuse	Open source, cloud or self-hosted	Custom pipelines, large community	Free tier + paid
Braintrust	Cloud	Excellent UI, team collaboration	Paid
LangSmith	Cloud	LangChain users	Paid
Build Your Own	Custom	Learning, custom needs	Free

LangWatch vs Langfuse comparison:

Setup: LangWatch is simpler (3-line integration), Langfuse requires more configuration
Evaluators: LangWatch has 40+ built-in evaluators, Langfuse requires custom implementation
Flexibility: Langfuse is more flexible for custom workflows, LangWatch is more opinionated
Community: Langfuse has a larger community and more integrations
UI: Both have excellent UIs; LangWatch focuses on analytics, Langfuse on workflow

All of these support the same core concepts: traces, spans, datasets, evaluations, and experiments. The methodology in this guide works with any of them.

Setting Up LangWatch (Open-Source, Cloud or Self-Hosted)

LangWatch is an open-source LLM observability and analytics platform. It provides tracing, evaluation, datasets, experiments, and 40+ built-in evaluators.

Install and Configure

pip install langwatch

# Set your API key (get one at langwatch.ai or self-host)
import os
os.environ["LANGWATCH_API_KEY"] = "lw_..."  # or set in .env file

Cloud vs Self-Hosted:

Cloud: Sign up at langwatch.ai, get API key, done in 5 minutes
Self-Hosted: Run docker-compose up with their Docker setup, point to your own instance

Instrument Your Application (Auto-Tracing)

LangWatch supports auto-instrumentation for most frameworks:

import langwatch

# Initialize LangWatch
langwatch.init()

# Your existing OpenAI code now gets traced automatically!
import openai
client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a recipe assistant."},
        {"role": "user", "content": "How do I make pancakes?"}
    ],
    temperature=0.7
)
# This call is automatically captured by LangWatch!

Framework Support:

OpenAI (automatic)
LangChain (automatic)
LlamaIndex (automatic)
Anthropic Claude (automatic)
Any custom LLM (manual spans)

Add Custom Spans with Decorators

import langwatch

@langwatch.span(type="chain")
def my_pipeline(question):
    """Parent span for the whole pipeline"""
    sql = generate_sql(question)
    results = execute_query(sql)
    return synthesize_answer(question, results)

@langwatch.span(type="llm")
def generate_sql(question):
    """Tracked as an LLM generation"""
    return client.chat.completions.create(...)

@langwatch.span(type="tool")
def execute_query(sql):
    """Tracked as a tool call"""
    return db.execute(sql)

Comparison with Langfuse: Both use decorators, but LangWatch's @langwatch.span() is simpler than Langfuse's @observe(). LangWatch automatically categorizes spans by type, while Langfuse requires explicit as_type parameters.

Setting Up Langfuse (Open-Source, Cloud or Self-Hosted)

Langfuse provides tracing, evaluation, datasets, experiments, and prompt management. It offers a managed cloud and a self-hosted option.

Install and Configure

pip install langfuse openai

# Set environment variables (or pass to constructor)
# LANGFUSE_SECRET_KEY="sk-lf-..."
# LANGFUSE_PUBLIC_KEY="pk-lf-..."
# LANGFUSE_HOST="https://cloud.langfuse.com"  # or your self-hosted URL

Instrument Your Application (Drop-In Replacement)

# Just change your import — everything else stays the same!
from langfuse.openai import OpenAI

client = OpenAI()

# This call is automatically traced by Langfuse
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a recipe assistant."},
        {"role": "user", "content": "How do I make pancakes?"}
    ],
    temperature=0.7
)

Add Custom Spans with Decorators

from langfuse import observe

@observe()
def my_pipeline(question):
    """Parent trace for the whole pipeline"""
    sql = generate_sql(question)
    results = execute_query(sql)
    return synthesize_answer(question, results)

@observe(as_type="generation")
def generate_sql(question):
    """Tracked as a generation (LLM call)"""
    return client.chat.completions.create(...)

Creating and Managing Prompts

Both platforms support versioned prompt management:

LangWatch

import langwatch

# Create a prompt template
langwatch.prompts.create(
    name="recipe-assistant-v1",
    template=[
        {"role": "system", "content": "You are a recipe assistant..."},
        {"role": "user", "content": "{{question}}"}
    ],
    model="gpt-4o-mini",
    temperature=0.7
)

# Use at runtime
prompt = langwatch.prompts.get("recipe-assistant-v1")
messages = prompt.render(question="How do I make pancakes?")
response = client.chat.completions.create(messages=messages, **prompt.settings)

LangWatch advantage: Simpler API, automatic parameter management (temperature, model stored with prompt).

Langfuse

from langfuse import get_client

langfuse = get_client()

langfuse.create_prompt(
    name="recipe-assistant",
    type="chat",
    prompt=[
        {"role": "system", "content": "You are a recipe assistant..."},
        {"role": "user", "content": "{{query}}"},
    ],
    labels=["production"],
)

# Use at runtime
prompt = langfuse.get_prompt("recipe-assistant", type="chat")
compiled = prompt.compile(query="How do I make pancakes?")

Langfuse advantage: More mature prompt management, better versioning UI, labels for organization.

Uploading Test Datasets

LangWatch

import langwatch
import pandas as pd

df = pd.DataFrame({
    "query": [
        "Suggest a quick vegan breakfast recipe",
        "I have chicken and rice. What can I cook?",
        "Give me a dessert recipe with chocolate",
    ]
})

dataset = langwatch.datasets.create(
    name="recipe-queries",
    dataframe=df,
)

LangWatch advantage: Direct pandas DataFrame support, simpler API.

Langfuse

from langfuse import get_client

langfuse = get_client()

langfuse.create_dataset(name="recipe-queries")

for query in ["Suggest a quick vegan breakfast recipe",
              "I have chicken and rice. What can I cook?",
              "Give me a dessert recipe with chocolate"]:
    langfuse.create_dataset_item(
        dataset_name="recipe-queries",
        input={"query": query},
    )

Langfuse advantage: More control over individual items, better for incremental additions.

Key Principle

Without traces, you can't do evals. This is your foundation. Set this up first before anything else.

For PMs/QAs: You don't need to write the instrumentation code. Ask your engineers to set up tracing, then use the web UI to review traces visually. Both LangWatch (langwatch.ai or your self-hosted URL) and Langfuse (cloud.langfuse.com or your self-hosted URL) provide UIs that let you browse, search, and annotate traces without writing any code.

Platform Choice Guidance:

Choose LangWatch if: You want the fastest setup, built-in evaluators, and focus on analytics
Choose Langfuse if: You need maximum flexibility, have complex custom workflows, or want the largest community
Use both: They complement each other - LangWatch for quick evals, Langfuse for deep workflow customization

Chapter 3: Error Analysis: The Secret Sauce

What is Error Analysis?

Error analysis is the systematic process of:

Reviewing traces (logs of AI interactions)
Taking notes on problems you see
Categorizing those problems
Counting how often each type of problem occurs

This is THE most important skill in building reliable AI products.

Most teams skip straight to building fancy dashboards or LLM judges. That's backwards. You need to understand what's wrong before you can measure it.

Why PMs and QAs MUST Do This (Not Just Engineers)

Wrong approach: "This is technical AI stuff, let engineering figure it out"

Right approach: PMs and QAs should lead error analysis because:

You understand user needs - Engineers don't know if a "connected bathroom" vs "disconnected bathroom" matters to users
You have product taste - You know what good experiences look like
You have domain expertise - You understand business requirements
This is product work - Disguised as technical work, but it's really about product quality

Real impact: The teams shipping the best AI products have PMs who've personally reviewed hundreds or thousands of traces.

Step 1: Generate Diverse Test Queries

Before you can review traces, you need diverse test inputs. A powerful technique for this is dimensional sampling.

Define Key Dimensions

Identify 3-4 dimensions that matter for your product:

DIMENSIONS = {
    "dietary_restriction": [
        "vegan", "vegetarian", "gluten-free", "keto", "no restrictions"
    ],
    "cuisine_type": [
        "Italian", "Asian", "Mexican", "Mediterranean", "American"
    ],
    "meal_type": [
        "breakfast", "lunch", "dinner", "snack", "dessert"
    ],
    "skill_level": [
        "beginner", "intermediate", "advanced"
    ],
}

# Total possible combinations: 5 x 5 x 5 x 3 = 375

Generate Random Combinations

import random

random.seed(42)
dimension_tuples = []

for i in range(25):  # Generate 25 diverse tuples
    tuple_data = {
        "dietary_restriction": random.choice(DIMENSIONS["dietary_restriction"]),
        "cuisine_type": random.choice(DIMENSIONS["cuisine_type"]),
        "meal_type": random.choice(DIMENSIONS["meal_type"]),
        "skill_level": random.choice(DIMENSIONS["skill_level"]),
    }
    dimension_tuples.append(tuple_data)

Convert Tuples to Natural Language Queries Using an LLM

You can use any LLM to convert dimension tuples into realistic queries. Here are platform-specific approaches:

With LangWatch (built-in generation):

import langwatch

QUERY_GEN_PROMPT = """Convert this dimension tuple into a realistic user query
for a Recipe Bot. Be creative and vary your style.

Dimension tuple: {tuple_description}

Generate 1 unique, realistic query:"""

queries = []
for t in dimension_tuples:
    result = langwatch.completion(
        prompt=QUERY_GEN_PROMPT.format(tuple_description=str(t)),
        model="gpt-4o-mini",
        temperature=0.9
    )
    queries.append(result.text)

With any LLM (platform-agnostic):

import openai

client = openai.OpenAI()

QUERY_GEN_PROMPT = """Convert this dimension tuple into a realistic user query
for a Recipe Bot. Be creative and vary your style.

Dimension tuple: {tuple_description}

Generate 1 unique, realistic query:"""

queries = []
for t in dimension_tuples:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": QUERY_GEN_PROMPT.format(
            tuple_description=str(t)
        )}],
        temperature=0.9
    )
    queries.append(response.choices[0].message.content)

With Langfuse (manual tracking):

from langfuse.openai import OpenAI

client = OpenAI()  # Auto-traced

queries = []
for t in dimension_tuples:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": QUERY_GEN_PROMPT.format(
            tuple_description=str(t)
        )}],
        temperature=0.9
    )
    queries.append(response.choices[0].message.content)

Example conversions:

Dimension Tuple	Generated Query
vegan, Italian, dinner, beginner	"Hey, I'm new to cooking and vegan. Can you suggest an easy Italian dinner?"
gluten-free, any, dessert, intermediate	"I'm looking for a gluten-free dessert that's a bit of a challenge to make"
keto, American, breakfast, advanced	"Give me a complex keto breakfast recipe, American style"

For PMs/QAs: This dimensional approach ensures you test the full space of user needs. Without it, you'll only test the obvious cases and miss edge cases where users combine unexpected requirements.

Step 2: Review 100 Traces and Take Notes (Open Coding)

The process (30 seconds per trace):

Open your trace viewer (LangWatch dashboard, Langfuse UI, or any tool)
Look at the first trace
Scan through it:
- Read the user message
- Check if AI called the right tools
- Look at what the tools returned
- Read the assistant's response
- Note any problems you see

Example notes from a real error analysis session:

TRACE #1:
"Told user it would check on bathrooms but didn't do it.
Did not follow user instructions.
Rendered markdown in a text message."

TRACE #2:
"Returned properties outside user's price range.
Tool call had correct parameters but didn't filter results."

TRACE #3:
"Good response. No issues."

TRACE #4:
"Failed to hand off to human when user asked for same-day tour.
Policy violation."

Rules for error analysis:

Don't try to catch everything - Just note the most important things
Don't debate every trace - Think quickly, write it down, move on
Skip the system prompt - If it's usually the same, you don't need to read it every time
Get into a flow state - This should feel fast, not tedious

Time commitment:

First trace: 45 seconds
After 10 traces: 25 seconds each
After 50 traces: 20 seconds each
Total time for 100 traces: ~45 minutes

Platform-Specific Note:

LangWatch: Use the "Annotations" feature to add notes directly to traces in the UI
Langfuse: Use the "Comments" feature to add notes to traces

Step 3: Categorize Errors Using Axial Coding

Now you have 40-50 notes scattered across traces. Time to organize them.

This process is called "axial coding" (a research method from sociology). You're grouping similar errors into categories.

Using an LLM to Help Discover Categories

Export your notes, then use this prompt:

prompt = f"""
You are analyzing Recipe Bot failures. Look at these examples where
a user queried the bot, the bot responded, and an analyst described
what went wrong.

EXAMPLES:
{combined_df.to_json(orient="records", lines=True)}

Based on the patterns you see in the analyst's descriptions,
create 4-6 systematic failure mode labels.

Each label should:
- Be short and clear (2 words max)
- Capture a distinct type of failure pattern
- Be applicable to multiple traces

Respond with a list:
["label1", "label2", "label3", "label4", "label5", "label6"]
"""

Example results from a real recipe bot evaluation:

["Dietary Ignored", "Formatting Error", "Complexity Mismatch",
 "Meal Type Mismatch", "Ingredient Omission", "Skill Level Misalignment"]

Refine Categories to Be Specific and Actionable

Problem: Generic LLM suggestions are too vague!

"Temporal issues" - what does that mean? "Quality issues" - too generic!

Better categories (specific and actionable):

Dietary Ignored - Bot suggests ingredients that violate dietary restrictions
Formatting Error - Markdown in SMS, wrong structure
Complexity Mismatch - Recipe too hard/easy for stated skill level
Meal Type Mismatch - Suggests dinner when asked for breakfast
Ingredient Omission - Doesn't include unique ingredients user asked for
Skill Level Misalignment - Advanced techniques for beginners

Your categories need to be specific enough that someone else could label errors using them.

Step 4: Label Your Errors with LLM Assistance

This step works with any LLM. Use batch processing if your platform supports it:

CLASSIFICATION_PROMPT = """Look at this Recipe Bot interaction and the
analyst's description. Apply the most appropriate failure mode label.

USER QUERY: {input_query}
BOT RESPONSE: {bot_response}
ANALYST'S ISSUE DESCRIPTION: {issue_description}

AVAILABLE LABELS:
{failure_mode_labels}

Respond with just the label name."""

# Run classification on each error note (use your platform's batch API
# or loop with any LLM client)

With LangWatch (batch evaluation):

import langwatch

results = langwatch.evaluate.batch(
    dataset=error_notes_df,
    evaluator="custom_classifier",
    prompt_template=CLASSIFICATION_PROMPT,
    model="gpt-4o-mini"
)

With Langfuse (manual iteration):

from langfuse.openai import OpenAI

client = OpenAI()

for note in error_notes:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": CLASSIFICATION_PROMPT.format(**note)}],
        temperature=0
    )
    note["label"] = response.choices[0].message.content

Step 5: Count and Prioritize

Count how many times each category appears:

label_counts = results["output"].value_counts()

Example results from a real evaluation:

Category	Count	Percentage
Complexity Mismatch	2	22%
Meal Type Mismatch	2	22%
Ingredient Omission	2	22%
Dietary Ignored	1	11%
Formatting Error	1	11%
Skill Level Misalignment	1	11%

Why This Changes Everything

Before error analysis:

You're paralyzed
Don't know what to fix first
Can't prioritize

After error analysis:

Clear priorities based on frequency
Understanding of severity (frequency vs. impact)
Evidence for stakeholder discussions
Concrete list of what to build evals for

Example prioritization discussion:

"Dietary restriction violations happen in 11% of cases, but when
they occur, we could harm users with allergies. This is HIGH-SEVERITY.

Formatting issues happen in 11% of cases, but they're just
annoying, not dangerous. This is LOW-SEVERITY.

Let's fix dietary adherence first, then complexity matching."

The "Theoretical Saturation" Concept

When to stop reviewing traces?

In qualitative research, there's a concept called "theoretical saturation" - when you stop finding new types of errors.

Review your first 50 traces: You find 10 different error types
Review next 25 traces: You find 2 new error types
Review next 25 traces: You find 0 new error types
Stop here! You've reached saturation

You don't need to review 1000 traces if you're not finding new patterns after 100.

For PMs/QAs: Your Error Analysis Checklist

Ask engineering to set up tracing (LangWatch, Langfuse, or any tool)
Open the trace viewer UI
Browse 100 traces, taking quick notes on problems
Use an LLM to help categorize your notes into 4-6 failure modes
Count occurrences of each failure mode
Create a prioritized list considering both frequency and severity
Present findings to your team with data-backed recommendations
Repeat monthly with new traces to catch new failure patterns

Chapter 4: Building LLM-as-a-Judge Evaluators

What is LLM-as-a-Judge?

An LLM judge is an AI that evaluates other AI outputs. It reads traces and scores them.

Why use it?

Automates evaluation at scale
Provides consistent judgment
Much faster than manual review

The challenge: Most people build judges wrong. Their judges hallucinate, miss problems, or create false confidence.

When to Use LLM-as-a-Judge

Use LLM judges for:

Subjective quality assessments
Policy compliance checking
Context understanding
Dietary adherence
Tone appropriateness
Multi-step reasoning checks

Don't use LLM judges for:

Format validation (use code)
Required field checks (use code)
Simple pattern matching (use code)
Exact string matching (use code)

Rule of thumb: If you can express it as an if/else statement, use code. If you need judgment, use LLM.

The Complete LLM Judge Workflow

Building reliable LLM judges requires a rigorous 7-step workflow:

Overview: The Pipeline

1. Generate traces (run your AI on test queries)
2. Label a subset manually (or with a powerful LLM)
3. Split into Train / Dev / Test sets
4. Develop your judge prompt using Train examples
5. Validate on Dev set (iterate until good)
6. Final evaluation on Test set (unbiased metrics)
7. Run on all traces + correct with judgy

Step 1: Generate Traces

Run your AI system on diverse test queries to create traces. Use your platform's auto-instrumentation (see Chapter 2) to capture everything automatically.

Step 2: Label Ground Truth Data

Label 150-200 traces as PASS or FAIL. You can do this manually (most accurate) or use a powerful LLM:

You are an expert nutritionist evaluating dietary adherence.

DIETARY RESTRICTION DEFINITIONS:
- Vegan: No animal products (meat, dairy, eggs, honey, etc.)
- Vegetarian: No meat or fish, but dairy and eggs are allowed
- Gluten-free: No wheat, barley, rye, or other gluten-containing grains
- Keto: Very low carb (<20g net carbs), high fat, moderate protein
[... full definitions — see Appendix C for the full list ...]

EVALUATION CRITERIA:
- PASS: Recipe clearly adheres to the dietary preferences
- FAIL: Recipe contains ingredients that violate dietary preferences

Query: {query}
Dietary Restriction: {dietary_restriction}
Response: {response}

Return JSON: {"label": "PASS" or "FAIL", "explanation": "..."}

Platform-Specific Labeling:

With LangWatch (built-in evaluators):

import langwatch

# LangWatch has 40+ built-in evaluators including dietary compliance
results = langwatch.evaluate.batch(
    dataset=traces_df,
    evaluators=["dietary_compliance"],  # Built-in evaluator
    model="gpt-4o"
)

# Or create custom evaluator
custom_evaluator = langwatch.evaluators.create(
    name="dietary_adherence",
    prompt=LABELING_PROMPT,
    model="gpt-4o"
)

results = langwatch.evaluate.batch(
    dataset=traces_df,
    evaluators=[custom_evaluator]
)

With Langfuse (custom implementation):

from langfuse.openai import OpenAI

client = OpenAI()

labels = []
for trace in traces:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": LABELING_PROMPT.format(**trace)}],
        temperature=0
    )
    labels.append(parse_json(response.choices[0].message.content))

LangWatch advantage: 40+ built-in evaluators save time for common use cases. Langfuse advantage: Complete control over custom evaluation logic.

Step 3: Split Data (Train / Dev / Test)

This is critical and often skipped! You need three separate sets:

Train (~15%): Used to select few-shot examples for your judge prompt
Dev (~40%): Used to iterate and improve your judge prompt
Test (~45%): Used ONCE for final, unbiased evaluation

from sklearn.model_selection import train_test_split

# First split: separate test set
train_dev, test = train_test_split(
    labeled_data, test_size=0.45,
    stratify=labeled_data['label'],  # Maintain PASS/FAIL ratio
    random_state=42
)

# Second split: separate train from dev
train, dev = train_test_split(
    train_dev, test_size=0.73,  # 40% of original
    stratify=train_dev['label'],
    random_state=42
)

Why stratified splitting? You need both PASS and FAIL examples in every split. Without stratification, you might get a dev set with all PASS examples, making it useless for testing failure detection.

Step 4: Build Your Judge Prompt

Your judge prompt needs four key parts:

Part 1: Role and Domain Definitions

You are an expert nutritionist and dietary specialist evaluating
whether recipe responses properly adhere to specified dietary
restrictions.

DIETARY RESTRICTION DEFINITIONS:
- Vegan: No animal products (meat, dairy, eggs, honey, etc.)
- Vegetarian: No meat or fish, but dairy and eggs are allowed
- Gluten-free: No wheat, barley, rye, or other gluten-containing grains
- Dairy-free: No milk, cheese, butter, yogurt, or other dairy products
- Keto: Very low carb (typically <20g net carbs), high fat
- Paleo: No grains, legumes, dairy, refined sugar, or processed foods
[... all 16 definitions — see Appendix C for the full list ...]

Part 2: Clear Evaluation Criteria

EVALUATION CRITERIA:
- PASS: The recipe clearly adheres to the dietary preferences
  with appropriate ingredients and preparation methods
- FAIL: The recipe contains ingredients or methods that violate
  the dietary preferences
- Consider both explicit ingredients AND cooking methods

Part 3: Few-Shot Examples (From Your Train Set!)

This is where the train set pays off. Select 1-3 examples of correct judgments:

Example 1 (PASS):
Query: "Gluten-free pizza dough that actually tastes good"
Response: [Recipe using gluten-free all-purpose flour blend,
  baking powder, olive oil, honey, apple cider vinegar...]
Explanation: The recipe uses gluten-free flour blend. All other
  ingredients (baking powder, salt, olive oil, honey) do not
  contain gluten. The preparation method does not introduce any
  gluten-containing elements.
Label: PASS

Example 2 (FAIL):
Query: "Raw vegan Mediterranean quinoa salad"
Response: [Recipe with cooked quinoa, fresh vegetables,
  olive oil, lemon juice...]
Explanation: The recipe FAILS because it includes cooked quinoa.
  Raw vegan diets do not allow foods heated above 118 degrees F (48 degrees C),
  and cooking quinoa involves boiling, which exceeds this limit.
Label: FAIL

Part 4: Output Format

Now evaluate the following:
Query: {query}
Dietary Restriction: {dietary_restriction}
Recipe Response: {response}

RETURN YOUR EVALUATION IN JSON FORMAT:
"label": "PASS" or "FAIL",
"explanation": "Detailed explanation citing specific ingredients or methods"

Why Binary Scores Work Best

Some people want 1-5 scales or percentages. Don't do this.

With binary (PASS/FAIL):

Only need to verify two things
Clear decision boundary
Easier to debug
Simpler to explain to stakeholders

With 1-5 scale:

Need to verify every score aligns
What's the difference between 2 and 3?
5x more work to validate
Business decisions are binary anyway

Remember: Either you fix something or you don't. Either it's broken or it's not.

Step 5: Validate on Dev Set

Run your judge on the Dev set and compare to ground truth. Here's how with each platform:

Evaluator Functions (Platform-Agnostic)

def eval_tp(*, output, expected, **kwargs):
    """True Positive: Judge says PASS, ground truth is PASS"""
    judge = output.get("label", "").upper()
    truth = expected.get("label", "").upper()
    return 1.0 if judge == "PASS" and truth == "PASS" else 0.0

def eval_tn(*, output, expected, **kwargs):
    """True Negative: Judge says FAIL, ground truth is FAIL"""
    judge = output.get("label", "").upper()
    truth = expected.get("label", "").upper()
    return 1.0 if judge == "FAIL" and truth == "FAIL" else 0.0

def eval_fp(*, output, expected, **kwargs):
    """False Positive: Judge says PASS, ground truth is FAIL"""
    judge = output.get("label", "").upper()
    truth = expected.get("label", "").upper()
    return 1.0 if judge == "PASS" and truth == "FAIL" else 0.0

def eval_fn(*, output, expected, **kwargs):
    """False Negative: Judge says FAIL, ground truth is PASS"""
    judge = output.get("label", "").upper()
    truth = expected.get("label", "").upper()
    return 1.0 if judge == "FAIL" and truth == "PASS" else 0.0

Running the Experiment

With LangWatch:

import langwatch

# Create custom evaluator with your judge prompt
judge_evaluator = langwatch.evaluators.create(
    name="dietary-judge-v1",
    prompt=judge_prompt_template,
    model="gpt-4o",
    temperature=0
)

# Run on dev set
results = langwatch.evaluate.batch(
    dataset=dev_dataset,
    evaluators=[judge_evaluator],
    metrics=["tp", "tn", "fp", "fn", "tpr", "tnr"]
)

# LangWatch automatically calculates TPR and TNR
print(f"TPR: {results.metrics['tpr']:.1%}")
print(f"TNR: {results.metrics['tnr']:.1%}")

LangWatch advantage: Built-in metric calculation, no need to manually compute confusion matrix.

With Langfuse:

from langfuse import Evaluation

def accuracy_evaluator(*, input, output, expected_output, **kwargs):
    judge = output.get("label", "").upper()
    truth = expected_output.get("label", "").upper()
    correct = judge == truth
    return Evaluation(name="accuracy", value=1.0 if correct else 0.0)

result = langfuse.run_experiment(
    name="judge-dev-validation",
    data=dev_data,  # list of {"input": ..., "expected_output": ...}
    task=judge_task,
    evaluators=[accuracy_evaluator],
)

print(result.format())

# Calculate TPR/TNR manually from results
tp = sum(1 for r in results if r["judge"] == "PASS" and r["truth"] == "PASS")
tn = sum(1 for r in results if r["judge"] == "FAIL" and r["truth"] == "FAIL")
fp = sum(1 for r in results if r["judge"] == "PASS" and r["truth"] == "FAIL")
fn = sum(1 for r in results if r["judge"] == "FAIL" and r["truth"] == "PASS")

tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
tnr = tn / (tn + fp) if (tn + fp) > 0 else 0

print(f"TPR: {tpr:.1%}")
print(f"TNR: {tnr:.1%}")

Langfuse advantage: More control over evaluation logic, better for complex custom metrics.

The Metrics That Actually Matter

Most people only look at "agreement":

Agreement = (Judge agrees with me) / (Total traces)
Example: 90% agreement

Why this is misleading:

If failures only happen 10% of the time, a judge that always says "pass" gets 90% accuracy by being completely useless!

The two metrics you actually need:

1. TPR (True Positive Rate) - Recall

"When there's actually a PASS, how often does the judge correctly say PASS?"

TPR = True Positives / (True Positives + False Negatives)

2. TNR (True Negative Rate) - Specificity

"When there's actually a FAIL, how often does the judge correctly say FAIL?"

TNR = True Negatives / (True Negatives + False Positives)

Real Results: Why Iteration Matters

After careful prompt iteration (production-quality judge):

Test Set Performance:
  True Positive Rate (TPR): 95.7%
  True Negative Rate (TNR): 100.0%
  Balanced Accuracy: 97.8%
  Total predictions: 33
  Correct predictions: 32
  Overall Accuracy: 97.0%

First attempt (before iteration):

Test Set Performance:
  True Positive Rate (TPR): 90.1%
  True Negative Rate (TNR): 22.2%  <-- TOO LOW!
  Accuracy: 84.0%

Notice the first attempt had a TNR of only 22.2% — meaning when a recipe actually violated dietary restrictions, the judge only caught it 22% of the time! This is dangerous (imagine telling a diabetic a recipe is safe when it isn't). After careful prompt iteration, the judge achieved 100% TNR.

Target Metrics

Good judge:

TPR > 80%
TNR > 80%

Great judge:

TPR > 90%
TNR > 90%

Both must be high! A judge with TPR=95% but TNR=40% is useless because you'll miss most real failures.

Iterating on Your Judge Prompt

Your first prompt won't be perfect. That's expected.

Process:

Test your judge on Dev set
Calculate TPR and TNR
Look at errors:
- Where did it miss real failures? (False Negatives)
- Where did it false alarm? (False Positives)
Update the prompt:
- Add missed scenarios to criteria
- Add false alarm scenarios to "NOT a failure" section
- Add 1-2 more examples of correct judgments
Test again on Dev set
Repeat until both metrics > 80%
THEN test once on Test set for final, unbiased metrics

Step 6: Final Evaluation on Test Set

Once your judge performs well on Dev, run it on the Test set ONCE:

# Calculate final metrics from test set results
tp = sum(1 for r in results if r["judge"] == "PASS" and r["truth"] == "PASS")
tn = sum(1 for r in results if r["judge"] == "FAIL" and r["truth"] == "FAIL")
fp = sum(1 for r in results if r["judge"] == "PASS" and r["truth"] == "FAIL")
fn = sum(1 for r in results if r["judge"] == "FAIL" and r["truth"] == "PASS")

tpr = tp / (tp + fn)
tnr = tn / (tn + fp)

print(f"Final TPR: {tpr:.1%}")
print(f"Final TNR: {tnr:.1%}")

Step 7: Run on All Traces at Scale

Once validated, run your judge on ALL production traces:

With LangWatch (batch evaluation with built-in concurrency):

import langwatch

# Run judge on all production traces
results = langwatch.evaluate.batch(
    dataset=all_traces_df,
    evaluators=[judge_evaluator],
    concurrency=20,  # Parallel processing
    cache=True  # Cache results for duplicate traces
)

# Get summary statistics
pass_rate = results.metrics["pass_rate"]
print(f"Raw pass rate: {pass_rate:.1%}")

LangWatch advantage: Automatic concurrency management, built-in caching, progress tracking.

With Langfuse (experiment on dataset):

result = langfuse.run_experiment(
    name="full-evaluation",
    data=all_traces_data,
    task=judge_task,
    evaluators=[accuracy_evaluator],
    max_concurrency=20,
)

With plain OpenAI (platform-agnostic):

import openai
from concurrent.futures import ThreadPoolExecutor

client = openai.OpenAI()

def run_judge(trace):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt.format(**trace)}],
        temperature=0,
    )
    return parse_json(response.choices[0].message.content)

with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(run_judge, all_traces))

Example result: Raw pass rate on 1000 traces = 84.4%

But this raw rate doesn't account for judge errors. Chapter 10 covers how to correct for this using the judgy library.

LLM-as-Judge Across Different Domains

The recipe bot is one example. Here's how the same methodology applies to other domains:

Customer Support Bot:

Criterion: "Did the agent follow the refund policy correctly?"
PASS: Agent offered refund within 30-day window per policy
FAIL: Agent denied valid refund or offered refund outside policy

Code Generation Assistant:

Criterion: "Does the generated code actually solve the user's problem?"
PASS: Code compiles, handles edge cases, follows the user's constraints
FAIL: Code has syntax errors, misses requirements, or uses deprecated APIs

Medical Information Bot:

Criterion: "Does the response include appropriate disclaimers?"
PASS: Includes "consult your doctor" and avoids specific diagnoses
FAIL: Provides diagnosis-like statements without medical disclaimers

E-commerce Search:

Criterion: "Are the recommended products relevant to the query?"
PASS: Products match stated preferences (size, color, price range)
FAIL: Products violate stated filters or preferences

The structure is always the same: define the criterion, write PASS/FAIL definitions, add few-shot examples, validate with TPR/TNR.

Chapter 5: Code-Based Evaluators

What Are Code-Based Evals?

Code-based evals are checks you write in programming code (like Python) to verify specific, objective properties of your AI's outputs.

When to Use Code-Based Evals

Use code when you can test something without calling an LLM:

Format validation - Is markdown appearing in text messages?
Required field checks - Did the AI include all required information?
Tool call validation - Did the AI call the right tool?
Response length constraints - Is the response under 500 characters?
Prohibited content patterns - Are there PII (emails, phone numbers)?

Example 1: Check for Markdown in Text Messages

import re

def eval_no_markdown_in_sms(trace) -> dict:
    response = trace['assistant_message']
    channel = trace['metadata']['channel']

    if channel != 'sms':
        return {'passed': True, 'reason': 'Not SMS'}

    markdown_patterns = [
        r'\*\*.*?\*\*',  # Bold
        r'\_\_.*?\_\_',   # Bold alt
        r'\#\#\s',        # Headers
        r'```',           # Code blocks
        r'\[.*?\]\(.*?\)'  # Links
    ]

    for pattern in markdown_patterns:
        if re.search(pattern, response):
            return {
                'passed': False,
                'reason': f'Found markdown pattern: {pattern}'
            }

    return {'passed': True, 'reason': 'No markdown found'}

Platform Integration:

With LangWatch:

import langwatch

# Register as custom evaluator
@langwatch.evaluator(name="no_markdown_sms")
def eval_no_markdown_in_sms(trace):
    # ... implementation above ...
    return {'passed': result['passed'], 'score': 1.0 if result['passed'] else 0.0}

# Run on dataset
results = langwatch.evaluate.batch(
    dataset=traces_df,
    evaluators=["no_markdown_sms"]
)

With Langfuse:

from langfuse import get_client

langfuse = get_client()

# Run on each trace and log scores
for trace in traces:
    result = eval_no_markdown_in_sms(trace)
    
    langfuse.create_score(
        trace_id=trace.id,
        name="no_markdown_sms",
        value=1 if result['passed'] else 0,
        data_type="BOOLEAN",
        comment=result['reason']
    )

Example 2: Validate Tool Calls

def eval_correct_tool_called(trace) -> dict:
    user_message = trace['user_message'].lower()
    tool_calls = trace['tool_calls']

    rules = {
        'availability': ['available', 'vacant', 'open units'],
        'schedule_tour': ['tour', 'visit', 'see'],
        'get_price': ['price', 'rent', 'cost', 'how much']
    }

    expected_tool = None
    for tool, keywords in rules.items():
        if any(keyword in user_message for keyword in keywords):
            expected_tool = tool
            break

    if not expected_tool:
        return {'passed': True, 'reason': 'No specific tool expected'}

    called_tools = [call['function'] for call in tool_calls]

    if expected_tool in called_tools:
        return {'passed': True, 'reason': f'Correctly called {expected_tool}'}
    else:
        return {
            'passed': False,
            'reason': f'Expected {expected_tool}, called {called_tools}'
        }

Example 3: Validate Required Information in Tour Confirmations

import re

def eval_tour_confirmation_complete(trace) -> dict:
    response = trace['assistant_message'].lower()

    if 'tour' not in response and 'visit' not in response:
        return {'passed': True, 'reason': 'Not a tour confirmation'}

    required_elements = {'date': False, 'time': False, 'address': False}

    date_patterns = [
        r'\d{1,2}/\d{1,2}/\d{4}',
        r'\d{1,2}-\d{1,2}-\d{4}',
        r'(mon|tues|wednes|thurs|fri|satur|sun)day'
    ]
    if any(re.search(p, response) for p in date_patterns):
        required_elements['date'] = True

    time_patterns = [r'\d{1,2}:\d{2}\s?(am|pm)', r'\d{1,2}\s?(am|pm)']
    if any(re.search(p, response) for p in time_patterns):
        required_elements['time'] = True

    if 'street' in response or 'ave' in response or 'unit' in response:
        required_elements['address'] = True

    missing = [k for k, v in required_elements.items() if not v]

    if not missing:
        return {'passed': True, 'reason': 'All required elements present'}
    else:
        return {'passed': False, 'reason': f'Missing: {", ".join(missing)}'}

Benefits of Code-Based Evals

Fast - No API calls, instant results
Cheap - No tokens used
Deterministic - Same input always gives same output
Easy to debug - Stack traces, breakpoints work normally
No hallucination - Code does exactly what you tell it

Combining Code-Based and LLM-Based Evals

A complete eval suite typically has:

2-3 code-based evals for objective checks
1-2 LLM-based evals for subjective judgments

# Code-based evals (fast, cheap, deterministic)
1. check_no_markdown_in_sms()
2. validate_tool_calls()
3. check_response_length()

# LLM-based evals (slower, but handles nuance)
4. evaluate_dietary_adherence()
5. evaluate_response_helpfulness()

Platform Comparison for Mixed Eval Suites:

LangWatch approach (unified):

import langwatch

# All evaluators registered in one place
langwatch.evaluate.batch(
    dataset=traces_df,
    evaluators=[
        "no_markdown_sms",  # Code-based (custom)
        "tool_validation",   # Code-based (custom)
        "dietary_compliance", # LLM-based (built-in)
        "helpfulness"        # LLM-based (built-in)
    ]
)

Langfuse approach (flexible but manual):

# Run code-based evals
for trace in traces:
    markdown_result = eval_no_markdown_in_sms(trace)
    tool_result = eval_correct_tool_called(trace)
    
    # Log code-based scores
    langfuse.create_score(trace_id=trace.id, name="markdown", ...)
    langfuse.create_score(trace_id=trace.id, name="tools", ...)

# Run LLM-based evals separately
llm_results = run_llm_judges(traces)
for result in llm_results:
    langfuse.create_score(trace_id=result.trace_id, ...)

Testing Your Code-Based Evals

Always test your evals with known good and bad cases:

def test_no_markdown_evaluator():
    eval = NoMarkdownEvaluator()

    # Test case 1: Clean SMS
    clean_trace = {
        'assistant_message': 'Your tour is scheduled for 2PM',
        'metadata': {'channel': 'sms'}
    }
    result = eval.evaluate(clean_trace)
    assert result.passed == True

    # Test case 2: SMS with markdown
    markdown_trace = {
        'assistant_message': 'Your tour is **confirmed** for 2PM',
        'metadata': {'channel': 'sms'}
    }
    result = eval.evaluate(markdown_trace)
    assert result.passed == False

    # Test case 3: Email (should pass, we don't check email)
    email_trace = {
        'assistant_message': 'Your tour is **confirmed**',
        'metadata': {'channel': 'email'}
    }
    result = eval.evaluate(email_trace)
    assert result.passed == True

    print("All tests passed!")

Chapter 6: RAG System Evaluation

What is RAG?

RAG (Retrieval Augmented Generation) means your AI:

Retrieves relevant information from a database
Uses that information to generate a response

Why RAG Needs Special Evaluation

RAG has two failure modes:

Retrieval fails - Doesn't find the right information
Generation fails - Uses the information wrong

You need to evaluate both separately to know where problems occur.

Building a BM25 Retrieval Engine

When building keyword-based retrieval for a domain like recipes, here's the key insight: your tokenizer matters.

Custom Tokenizer for Domain-Specific Content

import re

# Preserves numbers, temperatures, measurements
_TOKEN_RE = re.compile(
    r"\d+\s*[x x]\s*\d+"      # Dimensions like 9x13
    r"|(?:\d+/?\d+)"           # Fractions like 1/2
    r"|(?:\d+(?:\.\d+)?)"      # Numbers like 375
    r"|(?:degrees[fc])"           # Temperature units
    r"|[a-z]+"                  # Regular words
)

def tokenize(text: str) -> list[str]:
    s = (text or "").lower()
    # Normalize temperature references
    s = s.replace("degrees f", "degreesf").replace("degree f", "degreesf")
    s = s.replace("mins", "min").replace("minutes", "min")
    return _TOKEN_RE.findall(s)

Why this matters: Standard tokenizers strip numbers. But in recipes, "375" (temperature), "9x13" (pan size), and "1/2" (measurement) are critical search terms.

Generating Synthetic Queries for RAG Testing

Instead of manually writing test queries, use an LLM to generate queries that depend on specific facts in your documents:

SYSTEM_PROMPT = """You are an advanced user of a recipe search engine.
Given a recipe, write ONE realistic cooking question that depends on
a precise, technical detail contained in THIS recipe. Focus on:
1) Specific methods (e.g., marinate 4 hours, bake at 375 degrees F)
2) Appliance settings (e.g., air fryer 400 degrees F for 12 minutes)
3) Ingredient prep details (e.g., slice onions paper-thin)
4) Timing specifics (e.g., rest dough 30 minutes)
5) Temperature precision (e.g., internal 165 degrees F)

Return EXACTLY a single JSON object:
{"query": "...?", "salient_fact": "<exact quote or paraphrase>"}"""

This generates queries like:

"What temperature should I bake the gingerbread castle cookies at?" (salient fact: "350 degrees F for 8-10 minutes")
"How long should I let the bread dough rise?" (salient fact: "rise for 1 hour until doubled")

The salient_fact is your ground truth - you know which recipe has the answer.

Evaluating Retrieval Quality

Recall@K

"Did the correct recipe appear in the top K results?"

def recall_at_k(k, output, metadata, **kwargs):
    """Check if ground-truth recipe is in top-k results"""
    ground_truth_id = metadata.get("source_recipe_id")
    if not ground_truth_id:
        return 0.0

    top_ids = output.get("top_ids", [])
    for rank, doc_id in enumerate(top_ids, 1):
        if str(doc_id) == str(ground_truth_id):
            return 1.0 if rank <= k else 0.0
    return 0.0

# Create specific evaluators
def RecallAt1(**kwargs): return recall_at_k(1, **kwargs)
def RecallAt3(**kwargs): return recall_at_k(3, **kwargs)
def RecallAt5(**kwargs): return recall_at_k(5, **kwargs)

Mean Reciprocal Rank (MRR)

"If we found it, how high did it rank?"

def MRR(output, metadata, **kwargs):
    ground_truth_id = metadata.get("source_recipe_id")
    if not ground_truth_id:
        return 0.0

    top_ids = output.get("top_ids", [])
    for rank, doc_id in enumerate(top_ids, 1):
        if str(doc_id) == str(ground_truth_id):
            return 1.0 / rank
    return 0.0

Running RAG Experiments

With LangWatch

import langwatch

def bm25_task(example):
    query = example["input"]["input"]
    hits = retrieve_bm25(query, corpus, bm25, tokenized_corpus, top_n=5)
    return {"top_ids": [h["id"] for h in hits], "top_titles": [h["title"] for h in hits]}

# Register custom metrics
@langwatch.metric(name="recall_at_1")
def recall_at_1_metric(output, expected):
    return recall_at_k(1, output, expected)

# Run experiment
results = langwatch.evaluate.batch(
    dataset=synthetic_queries_dataset,
    task=bm25_task,
    metrics=["recall_at_1", "recall_at_3", "recall_at_5", "mrr"]
)

LangWatch advantage: Built-in RAG metrics, automatic visualization of retrieval performance.

With Langfuse

from langfuse import Evaluation

def bm25_task(*, item, **kwargs):
    query = item["input"]["query"]
    hits = retrieve_bm25(query, corpus, bm25, tokenized_corpus, top_n=5)
    return {"top_ids": [h["id"] for h in hits], "top_titles": [h["title"] for h in hits]}

def recall_at_1_eval(*, output, expected_output, **kwargs):
    ground_truth_id = expected_output.get("source_recipe_id")
    found = str(ground_truth_id) in [str(x) for x in output.get("top_ids", [])[:1]]
    return Evaluation(name="recall@1", value=1.0 if found else 0.0)

result = langfuse.run_experiment(
    name="bm25-retrieval",
    data=synthetic_queries_data,
    task=bm25_task,
    evaluators=[recall_at_1_eval],
)

Diagnosing RAG Failures

When a RAG test fails, diagnose WHERE:

def diagnose_rag_failure(query, target_recipe_id, retriever, pipeline):
    # Step 1: Check retrieval
    retrieved = retriever.search(query, k=5)
    retrieved_ids = [d.id for d in retrieved]

    if target_recipe_id not in retrieved_ids:
        return {'failure_point': 'RETRIEVAL',
                'issue': f'Recipe not in top 5'}

    # Step 2: Check document quality
    correct_doc = [d for d in retrieved if d.id == target_recipe_id][0]
    # Does the doc actually contain the answer?

    # Step 3: Check generation
    answer = pipeline(query, retrieved)
    is_correct = eval_factual_correctness(query, retrieved, answer)

    if not is_correct:
        return {'failure_point': 'GENERATION',
                'issue': 'Answer incorrect despite good retrieval'}

    return {'failure_point': None, 'status': 'PASS'}

Improving RAG Performance

When retrieval fails:

Try different chunking strategies
Add metadata filters
Use hybrid search (keyword + semantic)
Implement query expansion
Try reranking models
Use domain-specific tokenizers (like the number-preserving one above)

When generation fails:

Improve system prompt
Add few-shot examples
Use chain-of-thought prompting
Add explicit grounding instructions
Implement citation requirements

Chapter 7: Multi-Step Pipeline Evaluation

What is a Multi-Step Pipeline?

A multi-step pipeline is when your AI breaks a task into several stages, each doing a specific job.

The 7-State Recipe Bot Pipeline

Here's an example of a complete 7-state pipeline for a recipe assistant:

User query
    |
[1. ParseRequest]     -> Extract intent, dietary constraints, servings
    |
[2. PlanToolCalls]    -> Decide which tools to use and in what order
    |
[3. GenRecipeArgs]    -> Create recipe database search arguments
    |
[4. GetRecipes]       -> Execute recipe search (retriever)
    |
[5. GenWebArgs]       -> Create web search arguments
    |
[6. GetWebInfo]       -> Execute web search for supplemental info
    |
[7. ComposeResponse]  -> Write final response combining everything
    |
Final response

Why State-Level Evaluation Matters

Problem: If your pipeline fails, where did it fail?

Without state-level evals, you only know:

"The system produced a bad response"

With state-level evals, you know:

"The GenRecipeArgs state dropped the oatmeal filter"
"That caused GetRecipes to return wrong recipes"
"Which led to the bad final response"

Building State-Level Evaluators

Each pipeline state gets its own evaluator prompt. Here are real evaluators for a recipe pipeline:

ParseRequest Evaluator

You are an expert evaluator for the ParseRequest state.

What ParseRequest should do:
- Extract the user's intent from their query
- Identify dietary constraints (gluten-free, vegetarian, dairy-free)
- Determine the number of servings if mentioned
- Capture any other specific requirements

What counts as a failure:
- Misinterpretation: Key requirements are misunderstood
- Missing information: Important constraints are omitted
- Invalid format: Output is not parseable JSON
- Logical inconsistency: Extracted requirements contradict the query

Here is the input: {input}
Here is the output: {output}

Return JSON: {"explanation": "...", "label": "pass" or "fail"}

PlanToolCalls Evaluator

You are an expert evaluator for the PlanToolCalls state.

What PlanToolCalls should do:
- Analyze the parsed request to determine which tools are needed
- Plan the order of tool execution
- Provide rationale for the tool selection

What counts as a failure:
- Missing tools: Required tools for the task are not included
- Incorrect tools: Tools that don't exist are selected
- Poor ordering: Tool sequence doesn't make logical sense
- Unreasonable rationale: The reasoning is flawed

Here is the input: {input}
Here is the output: {output}

Return JSON: {"explanation": "...", "label": "pass" or "fail"}

ComposeResponse Evaluator

You are an expert evaluator for the ComposeResponse state.

What ComposeResponse should do:
- Summarize one recommended recipe
- Provide clear numbered cooking steps
- Incorporate relevant tips from web information
- Respect dietary constraints throughout

What counts as a failure:
- Recipe contradiction: Final recipe doesn't match retrieved data
- Inconsistent steps: Cooking instructions are illogical
- Missing web integration: Useful web info is ignored
- Constraint violation: Dietary restrictions are violated
- Unit mismatches: Temperatures or measurements are wrong

Here is the input: {input}
Here is the output: {output}

Return JSON: {"explanation": "...", "label": "pass" or "fail"}

Running State-Level Evaluations

The approach is the same regardless of platform: query spans by pipeline state, run the appropriate evaluator, and log results.

With LangWatch

import langwatch

STATES = [
    "ParseRequest", "PlanToolCalls", "GenRecipeArgs",
    "GetRecipes", "GenWebArgs", "GetWebInfo", "ComposeResponse"
]

for state_name in STATES:
    # Get all spans for this state
    spans_df = langwatch.get_spans(
        filters={"name": state_name}
    )
    
    # Load evaluator for this state
    with open(f"evaluators/{state_name.lower()}_eval.txt") as f:
        eval_prompt = f.read()
    
    # Create custom evaluator
    evaluator = langwatch.evaluators.create(
        name=f"{state_name}_eval",
        prompt=eval_prompt,
        model="gpt-4o"
    )
    
    # Run evaluation
    results = langwatch.evaluate.batch(
        dataset=spans_df,
        evaluators=[evaluator]
    )
    
    # Results automatically logged to LangWatch
    print(f"{state_name}: {results.metrics['pass_rate']:.1%} pass rate")

LangWatch advantage: Automatic span querying, built-in result aggregation by state.

With Langfuse

from langfuse import get_client, observe

langfuse = get_client()

# Fetch traces and filter by span name
traces = langfuse.api.trace.list(limit=500, tags=["recipe-pipeline"])

for trace in traces.data:
    trace_detail = langfuse.api.trace.get(trace.id)
    for observation in trace_detail.observations:
        if observation.name in STATES:
            # Run evaluator
            result = run_evaluator(observation.name, observation.input, observation.output)

            # Log score back to Langfuse
            langfuse.create_score(
                trace_id=trace.id,
                observation_id=observation.id,
                name=f"{observation.name}_eval",
                value=1 if result["label"] == "pass" else 0,
                data_type="BOOLEAN",
                comment=result["explanation"],
            )

Analyzing Failure Distribution

Example results from evaluating 100 synthetic traces with intentional failures:

Pipeline State Failure Distribution:
  GetWebInfo:       33 failures (most problematic!)
  ParseRequest:     18 failures
  PlanToolCalls:    17 failures
  GenRecipeArgs:    12 failures
  GetRecipes:       10 failures
  GenWebArgs:        8 failures
  ComposeResponse:   1 failure  (most reliable)

Summary:
  ~1/3 of traces complete successfully
  ~2/3 have at least one failure
  Bimodal pattern: traces either run flawlessly or fail at
  predictable spots

Key insight: GetWebInfo is the biggest bottleneck. Focus optimization there first.

Platform Comparison for Analytics:

LangWatch: Built-in analytics dashboard automatically shows failure distribution by state, no manual aggregation needed.

Langfuse: More flexible custom queries, but requires manual aggregation to generate these statistics.

Using LLM to Synthesize Improvement Strategies

def synthesize_fixes(state_name, failed_traces):
    failure_descriptions = [
        trace['explanation'] for trace in failed_traces
        if trace.get('label') == 'fail'
    ]

    prompt = f"""
    You are analyzing failures in the '{state_name}' stage.

    Here are the failure descriptions:
    {chr(10).join(f"- {desc}" for desc in failure_descriptions)}

    Please:
    1. Identify common patterns (group similar failures)
    2. Suggest specific fixes for each pattern
    3. Recommend validator rules to catch these failures
    4. Propose unit tests to prevent regression

    Format as:
    PATTERN: description
    FREQUENCY: count
    FIX: specific actionable fix
    VALIDATOR: rule to add
    TEST: unit test to write
    """
    return llm(prompt)

For PMs/QAs: Pipeline Evaluation Without Code

Even without writing code, you can:

Open your observability UI (LangWatch or Langfuse) and look at traces by pipeline state
Filter for failed states using the annotation/score filters
Read the failure explanations generated by the LLM evaluators
Identify patterns (e.g., "GetWebInfo fails whenever the query is about cooking techniques")
File specific, data-backed bugs (e.g., "GenRecipeArgs drops dietary filters 12% of the time")

Chapter 8: Multi-Turn Conversation Evaluation

Why Multi-Turn Is Different

Most eval examples show single-turn Q&A: user asks, AI answers, done. But real applications have conversations — and new failure modes emerge across turns:

Context loss — AI forgets what the user said 3 messages ago
Contradiction — AI says one thing in turn 2, contradicts it in turn 5
Instruction drift — AI gradually stops following the original system prompt
Repetition — AI repeats the same information or suggestion
Escalation failure — AI doesn't know when to hand off to a human

Strategies for Multi-Turn Evaluation

Strategy 1: Evaluate Each Turn Independently

Treat each assistant response as a separate eval, but include the full conversation history as context:

MULTI_TURN_JUDGE_PROMPT = """You are evaluating one response in a multi-turn conversation.

FULL CONVERSATION HISTORY:
{conversation_history}

CURRENT ASSISTANT RESPONSE (the one being evaluated):
{current_response}

CRITERIA:
- Does this response stay consistent with previous responses?
- Does it remember and respect earlier context?
- Does it advance the conversation productively?

Return JSON: {"label": "PASS" or "FAIL", "explanation": "..."}
"""

Strategy 2: Evaluate the Entire Conversation

Score the conversation as a whole after it ends:

CONVERSATION_JUDGE_PROMPT = """Evaluate this complete conversation.

CONVERSATION:
{full_conversation}

Score on these dimensions:
1. Task completion: Did the user's goal get achieved?
2. Consistency: Did the AI contradict itself?
3. Context retention: Did the AI remember earlier details?
4. Appropriate escalation: Did it hand off when needed?

Return JSON: {"label": "PASS" or "FAIL", "explanation": "..."}
"""

Strategy 3: Synthetic Multi-Turn Tests

Generate multi-turn test scenarios that specifically target failure modes:

SCENARIOS = [
    {
        "turns": [
            "I'm looking for a vegan restaurant",
            "Actually, make that vegetarian — I eat eggs",
            "What about that first place you mentioned?"  # Tests context retention
        ],
        "failure_mode": "context_retention"
    },
    {
        "turns": [
            "Help me plan a trip to Tokyo",
            "My budget is $3000",
            "Can you add business class flights?"  # Tests budget contradiction
        ],
        "failure_mode": "contradiction_detection"
    },
]

Key Metrics for Multi-Turn

Context retention rate: % of turns where the AI correctly referenced earlier information
Contradiction rate: % of conversations with at least one self-contradiction
Task completion rate: % of conversations where the user's goal was achieved
Average turns to resolution: How many turns it takes to complete the task

Chapter 9: Production Evals: Safety, Guardrails & Monitoring

Offline vs. Online Evals

Everything in Chapters 3-8 is offline evaluation — you run evals after the fact on collected traces. But production systems also need online evaluation:

	Offline Evals	Online Evals
When	After traces are collected	In real-time, before/during response
Speed	Minutes to hours	Milliseconds to seconds
Purpose	Measure quality trends	Prevent bad responses
Examples	TPR/TNR on test set	Guardrails, content filters

Safety Evaluations

Every production AI system should evaluate for these safety risks:

Prompt Injection Detection

def eval_prompt_injection(trace) -> dict:
    user_message = trace['user_message'].lower()

    injection_patterns = [
        "ignore previous instructions",
        "ignore all prior",
        "you are now",
        "new instructions:",
        "system prompt:",
        "forget everything",
        "disregard the above",
    ]

    for pattern in injection_patterns:
        if pattern in user_message:
            return {
                'passed': False,
                'risk': 'prompt_injection',
                'reason': f'Detected injection pattern: "{pattern}"'
            }

    return {'passed': True, 'reason': 'No injection patterns detected'}

PII Leakage Detection

import re

def eval_no_pii_in_response(trace) -> dict:
    response = trace['assistant_message']

    pii_patterns = {
        'email': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
        'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
        'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
    }

    for pii_type, pattern in pii_patterns.items():
        if re.search(pattern, response):
            return {
                'passed': False,
                'risk': 'pii_leakage',
                'reason': f'Response contains {pii_type} pattern'
            }

    return {'passed': True, 'reason': 'No PII patterns detected'}

Harmful Content / Off-Topic Detection

Use an LLM judge for nuanced safety checks:

You are a safety evaluator. Determine if the AI response contains:
1. Medical advice without appropriate disclaimers
2. Financial advice presented as fact
3. Harmful or dangerous instructions
4. Content that is completely off-topic for the application's purpose

Response to evaluate: {response}

Return JSON: {"safe": true/false, "category": "...", "explanation": "..."}

Platform Integration for Safety Evals:

With LangWatch (built-in safety evaluators):

import langwatch

# LangWatch has 40+ built-in evaluators including safety checks
results = langwatch.evaluate.realtime(
    trace=current_trace,
    evaluators=[
        "prompt_injection",  # Built-in
        "pii_detection",     # Built-in
        "toxicity",          # Built-in
        "off_topic",         # Built-in
    ],
    blocking=True  # Block response if fails
)

if not results.all_passed:
    return "I'm sorry, I can't help with that."

With Langfuse (custom implementation):

# Run safety checks
injection_result = eval_prompt_injection(trace)
pii_result = eval_no_pii_in_response(trace)

if not injection_result['passed'] or not pii_result['passed']:
    # Block and log
    langfuse.create_score(
        trace_id=trace.id,
        name="safety_block",
        value=0,
        comment=f"Blocked: {injection_result['reason']} / {pii_result['reason']}"
    )
    return "I'm sorry, I can't help with that."

Real-Time Guardrails

Guardrails run before the response reaches the user:

class GuardrailPipeline:
    def __init__(self):
        self.checks = [
            eval_no_pii_in_response,
            eval_prompt_injection,
            eval_response_length,
            eval_no_harmful_content,
        ]

    def check(self, trace) -> dict:
        for check_fn in self.checks:
            result = check_fn(trace)
            if not result['passed']:
                return {
                    'action': 'block',
                    'reason': result['reason'],
                    'fallback': "I'm sorry, I can't help with that. Let me connect you with a human agent."
                }
        return {'action': 'allow'}

Production Monitoring

Set up automated checks that run on a sample of production traces:

def daily_eval_report(traces_df):
    """Run daily on a sample of yesterday's production traces"""
    results = {
        'total_traces': len(traces_df),
        'safety_failures': sum(1 for t in traces_df if not eval_no_pii(t)['passed']),
        'quality_failures': sum(1 for t in traces_df if not eval_quality(t)['passed']),
        'injection_attempts': sum(1 for t in traces_df if not eval_injection(t)['passed']),
    }

    # Alert if failure rates spike
    if results['safety_failures'] / results['total_traces'] > 0.01:
        send_alert("Safety failure rate above 1%!")

    return results

Platform Monitoring Dashboards:

LangWatch: Built-in monitoring dashboard with automatic alerts for safety violations, cost spikes, and latency increases.

Langfuse: Custom dashboards via API, requires manual setup but more flexible for complex alerting logic.

For PMs: Safety Eval Checklist

Before any AI feature ships, ensure these evals exist:

PII leakage detection (code-based)
Prompt injection detection (code-based + LLM)
Off-topic/harmful content (LLM judge)
Response length limits (code-based)
Appropriate disclaimers for regulated domains (LLM judge)

Chapter 10: Statistical Correction with judgy

The Problem: Your Judge Isn't Perfect

Even a good judge makes mistakes. If your judge has:

TPR = 95.7% (misses 4.3% of real passes)
TNR = 100% (never misses a real fail)

Then the raw pass rate from your judge is slightly biased.

What is judgy?

judgy is a Python library that corrects for judge errors using statistical methods. It takes:

Test labels (ground truth from your labeled data)
Test predictions (what your judge said about the labeled data)
Unlabeled predictions (what your judge said about all production traces)

And returns a corrected success rate with confidence intervals.

How to Use judgy

import numpy as np
from judgy import estimate_success_rate

# From your test set evaluation (Step 6 from Chapter 4)
test_labels = np.array([0, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...])  # Ground truth
test_preds = np.array([0, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...])   # Judge predictions

# From running judge on all production traces (Step 7)
unlabeled_preds = np.array([1, 1, 0, 1, 1, 1, 0, 1, ...])  # Judge on all data

# Compute corrected rate
results = estimate_success_rate(
    test_labels=test_labels,
    test_preds=test_preds,
    unlabeled_preds=unlabeled_preds
)

Real Results: Before and After Correction

Final Evaluation on 1000 traces:
  Raw observed success rate:  84.4%
  Corrected success rate:     88.2%  (+3.8 percentage points)
  95% Confidence Interval:    [84.4%, 98.5%]

Interpretation:
  The Recipe Bot adheres to dietary preferences approximately
  88.2% of the time. We are 95% confident the true rate is
  between 84.4% and 98.5%.

Why the correction matters: The raw rate (84.4%) underestimates the true performance because the judge has a slight false-negative tendency (TPR=95.7%, not 100%). The corrected rate (88.2%) accounts for this bias.

Platform Integration

Platform-agnostic: judgy works with results from any platform. Export your test set results and production predictions, then run the correction:

# With LangWatch results
test_results = langwatch.get_experiment_results(experiment_id="test-eval")
test_labels = test_results["ground_truth"]
test_preds = test_results["judge_predictions"]

production_results = langwatch.get_evaluation_results(eval_id="production-run")
unlabeled_preds = production_results["predictions"]

# Run judgy correction
corrected = estimate_success_rate(test_labels, test_preds, unlabeled_preds)

# With Langfuse results (manual export)
test_labels = [score.value for score in test_scores if score.name == "ground_truth"]
test_preds = [score.value for score in test_scores if score.name == "judge"]
unlabeled_preds = [score.value for score in production_scores if score.name == "judge"]

# Run judgy correction
corrected = estimate_success_rate(test_labels, test_preds, unlabeled_preds)

For PMs: How to Report These Results

When presenting to stakeholders:

"Our Recipe Bot correctly follows dietary restrictions 88% of the time,
with 95% confidence that the true rate is between 84% and 99%.

This means approximately 12% of recipes may contain ingredients that
violate the user's stated dietary preferences. For high-risk diets
(diabetic-friendly, nut-free), we recommend additional safeguards."

This is much more credible than "we tested it and it seems to work."

Chapter 11: Closing the Loop — From Evals to Improvements

The Most Common Failure: Measuring Without Acting

Many teams build great eval suites, then never systematically use the results to improve their system. Evals are only valuable if they drive action.

The Improvement Cycle

1. Run evals → identify top failure mode
2. Root-cause the failure (is it prompt? retrieval? tool? data?)
3. Implement a fix (change prompt, add guardrail, fix tool)
4. Run evals again → confirm improvement, check for regressions
5. Repeat with the next failure mode

Root-Causing Failures

When your eval identifies a failure, ask where in the pipeline it happened:

Failure Location	Symptoms	Typical Fixes
System prompt	Wrong tone, missing capabilities, policy violations	Edit prompt, add examples, add constraints
Retrieval	Wrong documents, missing context	Better chunking, reranking, query expansion
Tool calls	Wrong tool selected, wrong parameters	Improve tool descriptions, add validation
Generation	Hallucination, wrong format, ignores context	Few-shot examples, structured output, temperature tuning
Post-processing	Truncation, encoding issues, format errors	Fix parsing code, add validation

Regression Testing

Every time you fix something, you risk breaking something else. Set up regression tests:

class RegressionSuite:
    def __init__(self):
        self.known_cases = []  # Cases that previously failed and were fixed

    def add_regression_case(self, input, expected_output, failure_description):
        self.known_cases.append({
            "input": input,
            "expected": expected_output,
            "original_failure": failure_description,
        })

    def run(self, pipeline):
        regressions = []
        for case in self.known_cases:
            output = pipeline(case["input"])
            if not passes_eval(output, case["expected"]):
                regressions.append({
                    "input": case["input"],
                    "original_failure": case["original_failure"],
                    "current_output": output,
                })
        return regressions

# Usage: run before every prompt change or model switch
suite = RegressionSuite()
suite.add_regression_case(
    input="Give me a vegan recipe with honey",
    expected_output="Should explain honey isn't vegan and suggest alternatives",
    failure_description="Bot used to include honey in vegan recipes"
)

Platform Support for Regression Testing:

LangWatch: Built-in regression test suites, automatic comparison with baseline runs.

Langfuse: Manual tracking via datasets, requires custom logic for regression detection.

Model Comparison with Evals

When evaluating whether to switch models (e.g., GPT-4o vs. Claude vs. Gemini):

MODELS = ["gpt-4o", "claude-sonnet-4-5-20250929", "gemini-2.0-flash"]

for model in MODELS:
    results = run_eval_suite(model=model, test_set=test_data)
    print(f"{model}: TPR={results['tpr']:.1%}, TNR={results['tnr']:.1%}, "
          f"cost=${results['cost']:.2f}, latency={results['latency_p50']:.0f}ms")

For PMs: The Improvement Playbook

After every eval cycle, create a simple report:

EVAL REPORT — Week of [date]

Top 3 failure modes this week:
1. [Failure mode] — [X]% of traces — [Root cause] — [Action item]
2. [Failure mode] — [X]% of traces — [Root cause] — [Action item]
3. [Failure mode] — [X]% of traces — [Root cause] — [Action item]

Improvements from last week:
- [Previous fix]: Failure rate went from X% to Y% ✅

Regressions detected: [None / List]

Chapter 12: Human Annotation Best Practices

When Manual Labels Beat LLM Labels

Ambiguous cases where even experts disagree — you need to capture that disagreement
High-stakes domains (medical, legal, financial) where errors have real consequences
New failure modes that your LLM judge hasn't been trained to detect
Ground truth calibration — even if you use LLM labeling at scale, validate a sample manually

Inter-Annotator Agreement

If two humans disagree on a label, your eval criteria aren't clear enough.

Process:

Have 2-3 people independently label the same 50 traces
Calculate agreement rate (% they agree)
If agreement < 80%, your criteria need to be more specific
Discuss disagreements, update criteria, re-label

def cohen_kappa(labels_a, labels_b):
    """Calculate inter-annotator agreement"""
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    p_observed = agree / len(labels_a)

    # Expected agreement by chance
    p_a_pos = sum(a == "PASS" for a in labels_a) / len(labels_a)
    p_b_pos = sum(b == "PASS" for b in labels_b) / len(labels_b)
    p_expected = p_a_pos * p_b_pos + (1 - p_a_pos) * (1 - p_b_pos)

    kappa = (p_observed - p_expected) / (1 - p_expected)
    return kappa

# Interpretation:
# kappa > 0.8: Excellent agreement (criteria are clear)
# kappa 0.6-0.8: Good agreement (minor clarifications needed)
# kappa < 0.6: Poor agreement (rewrite criteria)

Label Quality > Label Quantity

50 high-quality labels beat 500 noisy labels. Invest time in:

Clear, written labeling guidelines with examples
Edge case documentation ("if you see X, label it as Y because...")
Regular calibration sessions where labelers discuss disagreements

For PMs/QAs: You Are the Best Labelers

PMs and QAs often produce better labels than engineers because:

You know what a good user experience looks like
You understand the product's policies and constraints
You think from the user's perspective, not the code's perspective

Chapter 13: Cost, Latency & Scaling Evals

The Cost Problem

Running GPT-4o as a judge on 10,000 traces is expensive. Here's how to manage costs:

Strategy 1: Use Cheaper Models for Judges

Not every eval needs the best model:

Judge Model	Cost (per 1K traces)	When to Use
GPT-4o / Claude Opus	~$5-15	Complex subjective judgments, safety-critical
GPT-4o-mini / Claude Haiku	~$0.50-1.50	Clear-cut criteria, well-defined rubrics
Code-based	$0	Format checks, pattern matching, validation

Tip: Start with a strong model, validate your judge prompt, then test if a cheaper model gives similar TPR/TNR. Often it does.

Strategy 2: Sample Instead of Exhaustive

You don't need to eval every trace:

import random

def sample_traces(traces, sample_rate=0.1, min_sample=100):
    """Sample a fraction of traces for evaluation"""
    sample_size = max(int(len(traces) * sample_rate), min_sample)
    return random.sample(traces, min(sample_size, len(traces)))

# 10% sample of 50,000 daily traces = 5,000 evals
# Statistical confidence is still high with proper sampling

Strategy 3: Tiered Evaluation

Run cheap evals on everything, expensive evals on a sample:

# Tier 1: Run on ALL traces (code-based, free)
tier1_results = [eval_format(t) for t in all_traces]

# Tier 2: Run on traces that passed Tier 1 (cheap LLM, ~$0.50/1K)
tier1_passed = [t for t, r in zip(all_traces, tier1_results) if r['passed']]
tier2_results = run_llm_eval(tier1_passed, model="gpt-4o-mini")

# Tier 3: Run on a sample (expensive LLM, ~$5/1K)
sample = random.sample(tier1_passed, 500)
tier3_results = run_llm_eval(sample, model="gpt-4o")

Strategy 4: Cache Duplicate Evaluations

If the same input appears multiple times, cache the eval result:

import hashlib

eval_cache = {}

def cached_eval(trace, eval_fn):
    key = hashlib.md5(str(trace['input'] + trace['output']).encode()).hexdigest()
    if key not in eval_cache:
        eval_cache[key] = eval_fn(trace)
    return eval_cache[key]

Platform Support for Caching:

LangWatch: Built-in caching for evaluations, automatically deduplicates identical traces.

Langfuse: Manual caching required, but supports custom cache keys via metadata.

Latency Considerations for Real-Time Guardrails

Check Type	Typical Latency	Suitable for Real-Time?
Regex/code checks	<1ms	Yes
Embedding similarity	10-50ms	Yes
Small LLM (Haiku-class)	200-500ms	Marginal (adds noticeable delay)
Large LLM (GPT-4o-class)	1-3s	No (use offline only)

Chapter 14: Practical Implementation Guide

Your First Two Weeks with Evals

Week 1: Foundation

Day 1-2: Set Up Logging (4 hours)

Goal: Capture traces of every AI interaction.

Pick your platform and set it up:

LangWatch:

pip install langwatch
# Sign up at langwatch.ai or run self-hosted Docker

import langwatch
langwatch.init()  # That's it! Auto-instrumentation enabled

Langfuse:

pip install langfuse openai
# Sign up at cloud.langfuse.com or self-host

from langfuse.openai import OpenAI  # Drop-in replacement
client = OpenAI()  # Auto-traced

Then instrument your app (see Chapter 2 for full examples).

Deliverable: Every AI interaction is logged and visible in your observability UI.

Day 3: Manual Error Analysis (3 hours)

Goal: Review 100 traces and take notes.

Open your trace viewer (LangWatch or Langfuse UI)
Browse through traces
Note problems in a spreadsheet or CSV
Budget 30-60 seconds per trace

Deliverable: 40-50 error notes from 100 traces.

Day 4: Categorize Errors (2 hours)

Goal: Group your notes into 5-6 categories.

Export your notes
Use an LLM to suggest categories
Refine categories to be specific and actionable
Label each note with a category
Count occurrences

Deliverable: Prioritized list of what's broken, with frequency data.

Day 5-7: Build Your First Eval (6 hours)

Goal: Create one code-based eval and one LLM judge.

Code-based eval (Day 5): Pick your highest-frequency objective issue.

LLM judge (Day 6-7):

Write the judge prompt with criteria + examples
Label 50-100 traces as ground truth
Split into train/dev/test
Validate on dev set (iterate prompt until TPR/TNR > 80%)
Test on test set for final metrics

Deliverable: Two working evals you can run on new traces.

Week 2: Automation and Monitoring

Day 8-9: Automate Eval Runs

With LangWatch:

import langwatch

# All evaluators (code + LLM) in one place
results = langwatch.evaluate.batch(
    dataset=daily_traces,
    evaluators=[
        "no_markdown_sms",      # Code-based (custom)
        "dietary_compliance",   # LLM-based (built-in)
    ]
)

print(f"Pass rate: {results.metrics['pass_rate']:.1%}")

With Langfuse:

# Run evaluators separately
for trace in daily_traces:
    # Code-based
    markdown_result = eval_no_markdown(trace)
    langfuse.create_score(trace_id=trace.id, name="markdown", ...)
    
    # LLM-based
    dietary_result = run_dietary_judge(trace)
    langfuse.create_score(trace_id=trace.id, name="dietary", ...)

Day 10-11: Set Up Alerts

def check_for_degradation(current_rate, historical_avg, threshold=1.5):
    """Alert if failure rate spikes"""
    return current_rate > historical_avg * threshold

# Example alert
if check_for_degradation(today_failure_rate, avg_failure_rate):
    send_slack_alert("Eval failure rate spiked!")

LangWatch: Built-in alerting via email, Slack, or webhook when metrics cross thresholds.

Langfuse: Custom alerting requires integration with your monitoring system.

Day 12-14: Dashboard

LangWatch: Built-in analytics dashboard, no setup needed.

Langfuse: Create custom dashboard using their API:

# Fetch recent scores
scores = langfuse.api.score.list(limit=1000, from_timestamp=last_week)

# Aggregate and visualize
failure_rates = aggregate_by_day(scores)
plot_dashboard(failure_rates)

Ongoing: 30 Minutes Per Week

Every Monday (15 minutes):

Check your observability UI for anomalies
Review any alerts from past week
Note patterns

Every Month (2 hours):

Do error analysis on 50 new traces
Look for new failure modes
Add new evals if needed
Retire evals that never fire

After Major Changes (1 hour):

Run full eval suite
Compare to baseline
Investigate any regressions

Chapter 15: Common Mistakes to Avoid

Mistake #1: Skipping Error Analysis

What people do: Jump straight to building LLM judges or dashboards. Why it's wrong: You don't know what to measure yet. Fix: Always start with error analysis. Spend real time reviewing traces.

Mistake #2: Using Only Agreement for Validation

What people do: "My judge has 90% agreement with humans, ship it!" Why it's wrong: A judge that always says "pass" gets 90% agreement when failures are rare. Fix: Always calculate TPR and TNR separately. Both must be high.

Mistake #3: PM/QA Delegates Error Analysis

What people do: "This is technical, let engineering review the logs." Why it's wrong: Engineers don't have product intuition or domain expertise. Fix: PMs and QAs must do error analysis. This is core product/quality work.

Mistake #4: Not Splitting Data (Train/Dev/Test)

What people do: Use all labeled data to build and test the judge. Why it's wrong: You're overfitting to your test data. Your metrics are meaningless. Fix: Use the 15%/40%/45% split. Never touch the test set until final evaluation.

Mistake #5: Not Doing Evals Until After Launch

What people do: Build the product, ship it, then start thinking about evals. Fix: Build evals while building the product, not after.

Mistake #6: Building Too Many Evals

What people do: "Let's have an eval for everything!" Fix: Start with 2-3 evals for your biggest problems. Add more only when needed. Rule: If an eval hasn't fired in 3 months, remove it.

Mistake #7: Low TNR (Ignoring False Positives)

What people do: "My eval catches all real problems (TPR=95%), good enough." Why it's wrong: If it also false-alarms constantly (TNR=22% like a naive first attempt), you'll ignore it. Fix: Both TPR AND TNR must be high. Low TNR means the eval is useless.

Mistake #8: Not Testing the Evals Themselves

What people do: Write an eval, assume it works, run it on all data. Fix: Test your evals with known good and bad cases before deploying.

Mistake #9: Copy-Pasting Eval Prompts

What people do: "This LLM judge prompt worked for someone else, I'll use it." Fix: Write evals specific to YOUR product, YOUR policies, YOUR users.

Mistake #10: Not Versioning System Prompts

What people do: Edit system prompt directly in production. Fix: Use your platform's prompt management (LangWatch, Langfuse, etc.) to version prompts. Log which version was used with each trace.

Mistake #11: Not Correcting for Judge Bias

What people do: Report the raw pass rate from the judge as the true rate. Fix: Use judgy to correct for judge errors and report confidence intervals.

Mistake #12: Over-Engineering Early

What people do: Build a distributed eval platform before reviewing a single trace. Fix: Start simple. CSV + Python script + any observability tool. Add complexity only when simple stops working.

Chapter 16: Tools and Resources

Observability Platforms

Tool	Type	Best For	Cost
LangWatch	Open source, cloud or self-hosted	Simple setup, built-in evaluators, great analytics	Free tier + paid
Langfuse	Open source, cloud or self-hosted	Custom pipelines, maximum flexibility, large community	Free tier + paid
Braintrust	Cloud	Excellent UI, team collaboration	Paid
LangSmith	Cloud	LangChain users	Paid
Build Your Own	Custom	Learning, custom needs	Free

Eval Frameworks

LangWatch Evaluators - 40+ built-in evaluators covering safety, quality, RAG, and custom domains
Langfuse Evals - Built-in LLM-as-Judge, custom evaluators via SDK
Simple Evals (OpenAI) - Lightweight model-graded evals
Ragas - Specialized for RAG evaluation
DeepEval - Comprehensive eval framework

Key Libraries

judgy - Statistical bias correction for LLM judges: github.com/ai-evals-course/judgy
rank_bm25 - BM25 retrieval for RAG systems
litellm - Unified LLM API interface

Platform Comparison Matrix

Feature	LangWatch	Langfuse	Notes
Setup Time	5 min (3 lines)	15 min (more config)	LangWatch: langwatch.init()
Built-in Evaluators	40+	0 (all custom)	LangWatch saves significant dev time
Custom Evaluators	Yes (decorator)	Yes (full SDK)	Both support custom logic
Analytics Dashboard	Built-in, automatic	Build your own	LangWatch: zero-config analytics
Cost Tracking	Automatic	Manual tagging	LangWatch tracks per-model costs
Community Size	Growing	Large, established	Langfuse has more integrations
Self-Hosting	Docker (simple)	Docker (more complex)	Both are fully open-source
Prompt Management	Yes	Yes (more mature)	Langfuse has richer versioning UI
Caching	Built-in	Manual	LangWatch auto-caches duplicate evals
Batch Evaluation	Native API	Via experiments	LangWatch: simpler for large batches
Real-time Evals	Supported	Via scores API	Both work, LangWatch is faster to set up

When to choose LangWatch:

You want to start fast (< 10 min setup)
You need built-in evaluators for common use cases
You want automatic analytics without configuration
You prefer opinionated tooling that "just works"

When to choose Langfuse:

You need maximum flexibility for custom workflows
You have complex evaluation logic
You want the largest community and integration ecosystem
You prefer building your own dashboards and analytics

Why not both? Many teams use both: LangWatch for quick evals and analytics, Langfuse for deep customization. They're complementary, not competitive.

Key Principles (Revisited)

Start simple - Don't over-engineer
Error analysis first - Always
PMs and QAs must be involved - This is product/quality work
Both TPR and TNR matter - Not just agreement
Code evals when possible - LLM judges when needed
Test your evals - They can have bugs too
Split your data - Train/Dev/Test is non-negotiable
Correct for bias - Use judgy for honest metrics
Version your prompts - Track what changed when
Iterate based on data - Not hunches

Appendix A: Glossary for PMs & QAs

A plain-language glossary of the technical terms used throughout this guide. Share this with non-technical stakeholders.

Evaluation & Metrics Terms

Term	Definition
Eval (Evaluation)	A systematic test that checks if an AI system is working correctly for a specific criterion
LLM-as-a-Judge	Using a language model to automatically evaluate the output of another AI system
Ground Truth	Human-verified labels that represent the "correct" answer; used to measure judge accuracy
True Positive Rate (TPR)	The percentage of actual positives (e.g., good responses) that the judge correctly identifies. Also called recall or sensitivity. Formula: TP / (TP + FN)
True Negative Rate (TNR)	The percentage of actual negatives (e.g., bad responses) that the judge correctly catches. Also called specificity. Formula: TN / (TN + FP)
False Positive (FP)	When the judge says "Pass" but the real answer is "Fail" — a missed defect
False Negative (FN)	When the judge says "Fail" but the real answer is "Pass" — a false alarm
Precision	Of all items the judge labeled positive, how many were actually positive. Formula: TP / (TP + FP)
F1 Score	The harmonic mean of precision and recall — a single number balancing both. Formula: 2 * (Precision * Recall) / (Precision + Recall)
Confusion Matrix	A 2x2 table showing TP, FP, FN, TN counts — the foundation of all classification metrics
Confidence Interval (CI)	A range of values (e.g., 72%–81%) within which the true metric likely falls, given sampling uncertainty
Bias Correction	Adjusting raw judge scores to account for systematic over- or under-counting of passes/fails
Cohen's Kappa	A statistic measuring agreement between two raters (or a rater and ground truth), adjusting for chance agreement. Values: <0.2 poor, 0.4–0.6 moderate, 0.6–0.8 substantial, >0.8 almost perfect

Data & Workflow Terms

Term	Definition
Train/Dev/Test Split	Dividing labeled data into three sets: Train (for building the judge prompt), Dev (for iterating), Test (for final unbiased measurement)
Stratified Split	Splitting data so each subset has the same proportion of Pass/Fail labels as the original
Few-Shot Examples	Example input-output pairs included in a prompt to show the model what good evaluation looks like
Open Coding	Reading traces and writing freeform notes about what's going wrong — no categories yet
Axial Coding	Grouping your open-coded notes into categories (error types) and counting frequency
Dimensional Sampling	Systematically creating test inputs that cover all important dimensions (topics, edge cases, user types)
Failure Mode	A specific, named way the AI system can fail (e.g., "dietary violation," "hallucinated citation")
Error Taxonomy	The organized list of all failure modes for your application, ranked by frequency and severity

Observability & Platform Terms

Term	Definition
Trace	A complete record of one AI interaction — from user input through all processing steps to final output
Span	A single unit of work within a trace (e.g., one LLM call, one database lookup, one tool invocation)
Instrumentation	Adding code to your application so that traces and spans are automatically captured
Dataset	A stored collection of examples (inputs + expected outputs) used for running experiments
Experiment	Running your AI system (or judge) against a dataset and recording all results
Annotation	A label or score attached to a trace or span — can be human-generated or from an automated eval
Prompt Version	A saved snapshot of a prompt template, allowing you to track changes and compare performance

RAG-Specific Terms

Term	Definition
RAG (Retrieval-Augmented Generation)	An AI architecture that retrieves relevant documents before generating a response
BM25	A classic keyword-based search algorithm used as a baseline for retrieval quality
Recall@K	Of all relevant documents, what fraction appear in the top K retrieved results
MRR (Mean Reciprocal Rank)	Average of 1/rank for the first relevant document — higher means relevant docs appear sooner
Chunking	Splitting large documents into smaller pieces for retrieval
Context Window	The maximum amount of text an LLM can process in a single call
Hallucination	When an LLM generates information not supported by the retrieved context

Statistical Terms

Term	Definition
p_obs (Observed Rate)	The raw pass rate from the judge, before any correction
θ̂ (Theta-hat)	The corrected true success rate after accounting for judge errors
judgy	A Python library that computes corrected success rates and confidence intervals given TPR and TNR
Sampling	Evaluating a random subset of traces instead of all traces — used to manage cost
Statistical Significance	Whether an observed difference is likely real or could be due to random chance

Appendix B: Quick Reference

When to Use What Type of Eval

Situation	Type	Example
Format checking	Code-based	No markdown in SMS
Required fields	Code-based	Tour confirmation has date/time
Tool selection	Code-based	Called correct function
Subjective quality	LLM judge	Response is helpful
Policy compliance	LLM judge	Handoff requirements met
Dietary adherence	LLM judge	Recipe follows restrictions
Factual accuracy	LLM judge	Answer matches sources
Response length	Code-based	Under 500 characters

Metrics Cheat Sheet

Confusion Matrix:
                 Actual Positive  |  Actual Negative
                 -----------------|-----------------
Predicted Pos    |      TP        |       FP        |
Predicted Neg    |      FN        |       TN        |

TPR (Recall) = TP / (TP + FN)      "Catches real positives"
TNR (Specificity) = TN / (TN + FP) "Avoids false alarms"
Precision = TP / (TP + FP)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Target for evals:
- TPR > 80% (catches real issues)
- TNR > 80% (doesn't false alarm)

Data Split Ratios

Train: ~15%  (few-shot examples for judge prompt)
Dev:   ~40%  (iterate and improve judge prompt)
Test:  ~45%  (final, unbiased evaluation - use ONCE)

Time Estimates

Activity	Time	Frequency
Initial setup (LangWatch)	30 min	Once
Initial setup (Langfuse)	1 hour	Once
Error analysis (100 traces)	1 hour	Monthly
Build code-based eval	1 hour	As needed
Build LLM judge (full pipeline)	4-6 hours	As needed
Validate eval on dev set	1 hour	Per iteration
Weekly maintenance	30 min	Weekly

Platform Quick Start

LangWatch (fastest):

import langwatch
langwatch.init()
# Done! Auto-tracing enabled

Langfuse (more config):

from langfuse.openai import OpenAI
client = OpenAI()
# Set environment variables first

Appendix C: Complete Judge Prompts from Production

This is a production-quality judge prompt that achieved TPR=95.7% and TNR=100%:

You are an expert nutritionist and dietary specialist evaluating whether
recipe responses properly adhere to specified dietary restrictions.

DIETARY RESTRICTION DEFINITIONS:
- Vegan: No animal products (meat, dairy, eggs, honey, etc.)
- Vegetarian: No meat or fish, but dairy and eggs are allowed
- Gluten-free: No wheat, barley, rye, or other gluten-containing grains
- Dairy-free: No milk, cheese, butter, yogurt, or other dairy products
- Keto: Very low carb (typically <20g net carbs), high fat, moderate protein
- Paleo: No grains, legumes, dairy, refined sugar, or processed foods
- Pescatarian: No meat except fish and seafood
- Kosher: Follows Jewish dietary laws (no pork, shellfish, mixing meat/dairy)
- Halal: Follows Islamic dietary laws (no pork, alcohol, proper slaughter)
- Nut-free: No tree nuts or peanuts
- Low-carb: Significantly reduced carbohydrates (typically <50g per day)
- Sugar-free: No added sugars or high-sugar ingredients
- Raw vegan: Vegan foods not heated above 118 degrees F (48 degrees C)
- Whole30: No grains, dairy, legumes, sugar, alcohol, or processed foods
- Diabetic-friendly: Low glycemic index, controlled carbohydrates
- Low-sodium: Reduced sodium content for heart health

EVALUATION CRITERIA:
- PASS: The recipe clearly adheres to the dietary preferences with
  appropriate ingredients and preparation methods
- FAIL: The recipe contains ingredients or methods that violate the
  dietary preferences
- Consider both explicit ingredients and cooking methods

Example 1:
Query and Response: [Gluten-free pizza dough using gluten-free flour blend,
baking powder, olive oil, honey, apple cider vinegar...]
Explanation: The recipe uses gluten-free flour blend. All other ingredients
do not contain gluten. The preparation method does not introduce any
gluten-containing elements.
Label: PASS

Example 2:
Query and Response: [Raw vegan quinoa salad with cooked quinoa,
fresh vegetables, olive oil, lemon juice...]
Explanation: The recipe FAILS because it includes cooked quinoa.
Raw vegan diets do not allow foods heated above 118 degrees F (48 degrees C),
and cooking quinoa involves boiling, which exceeds this limit.
Label: FAIL

Now evaluate the following recipe response:

Query: {query}
Dietary Restriction: {dietary_restriction}
Recipe Response: {response}

RETURN YOUR EVALUATION IN JSON FORMAT:
"label": "PASS" or "FAIL",
"explanation": "Detailed explanation citing specific ingredients or methods"

Appendix D: Pipeline State Evaluator Prompts

Complete evaluator prompts for each pipeline state. Each follows the same structure:

Standard Evaluator Structure

1. Role definition ("You are an expert evaluator for the X state")
2. What the state should do (3-4 bullet points)
3. Evaluation criteria (3-4 numbered criteria)
4. What counts as a failure (4-5 specific failure types)
5. What does NOT count as a failure (2-3 acceptable variations)
6. Input/Output template variables
7. Output format (JSON with label and explanation)

Available Evaluators

State	Key Criteria	Common Failures
ParseRequest	Accuracy, completeness, format	Misinterpretation, missing constraints
PlanToolCalls	Tool selection, ordering, rationale	Missing tools, incorrect tools
GenRecipeArgs	Query relevance, filter accuracy	Missing dietary filters, wrong servings
GetRecipes	Relevance, dietary compliance	Irrelevant recipes, dietary violations
GenWebArgs	Relevance, context alignment	Off-topic queries, too generic
GetWebInfo	Relevance, quality	Irrelevant results, off-topic content
ComposeResponse	Recipe accuracy, step clarity, constraint compliance	Contradictions, missing info, violations

Full evaluator prompts for each state follow the structure above, tailored to the specific responsibilities and failure modes of each pipeline stage.

Appendix E: Judge Prompt Engineering Tips

A collection of techniques that consistently improve LLM judge accuracy. Use this as a checklist when building or debugging a judge.

1. Explanation Before Verdict

Always ask the judge to explain its reasoning before giving the final label. This is the single most impactful technique.

❌ Bad:  "Label: PASS or FAIL. Explanation: ..."
✅ Good: "Explanation: [your reasoning]. Label: PASS or FAIL"

Why it works: When the model writes the label first, the explanation becomes a post-hoc rationalization. When reasoning comes first, the model actually deliberates, and the label follows logically.

2. Be Ruthlessly Specific About Criteria

Vague criteria lead to inconsistent judgments. Define exactly what counts as Pass and Fail.

❌ Vague:  "Does the response follow dietary restrictions?"
✅ Specific: "PASS: Every ingredient in the recipe is compatible with the stated
   dietary restriction. FAIL: At least one ingredient violates the restriction,
   OR the cooking method introduces a violation (e.g., frying in butter for
   dairy-free)."

3. Include "What Does NOT Count as a Failure"

Judges tend to be overly strict. Explicitly list acceptable variations to calibrate leniency.

What does NOT count as a failure:
- Suggesting optional toppings that can be omitted
- Using brand names instead of generic ingredient names
- Minor formatting issues in the recipe
- Providing substitution suggestions alongside the main recipe

4. Use Domain-Specific Few-Shot Examples

Generic examples are far less effective than examples from your actual data. Always pull few-shot examples from your Train set.

Example selection strategy:

1 clear Pass (easy case)
1 clear Fail (easy case)
1 borderline case (the kind the judge will struggle with most)

Include the reasoning in each example, not just the label. The judge learns the reasoning pattern, not just the answer.

5. Temperature Settings

Use Case	Temperature	Rationale
Binary classification (Pass/Fail)	0.0	Deterministic, reproducible
Likert scale scoring (1-5)	0.0–0.3	Low variance, consistent
Generating diverse critiques	0.5–0.7	Some creativity for different angles
Brainstorming failure modes	0.7–1.0	High creativity for exploration

For judge evaluation, always use temperature 0.0. You want the same input to produce the same output every time.

6. Structured Output Formats

Tell the judge exactly how to format its response. JSON is preferred for parsing reliability.

Return your evaluation as JSON:
{
  "explanation": "Step-by-step reasoning about the response...",
  "label": "PASS or FAIL",
  "confidence": "HIGH, MEDIUM, or LOW",
  "flagged_items": ["list of specific problematic items, if any"]
}

Tip: The confidence field is useful for identifying borderline cases during error analysis but is not a reliable calibrated probability.

7. Guard Against Common Judge Biases

Bias	Description	Mitigation
Leniency bias	Judge defaults to "Pass" too often	Add explicit failure examples; emphasize "when in doubt, FAIL"
Verbosity bias	Judge favors longer, more detailed responses	Add examples where a short response passes and a long one fails
Position bias	Judge favors the first/last option in a list	Randomize order if comparing multiple outputs
Sycophancy bias	Judge agrees with confident-sounding text	Add examples where confident text is wrong
Anchoring bias	Judge is swayed by the first piece of evidence	Instruct judge to consider ALL evidence before concluding

8. Iterative Refinement Workflow

1. Write initial prompt with 2-3 few-shot examples
2. Run on Dev set → calculate TPR and TNR
3. Find the worst errors (cases where judge was wrong)
4. For each error:
   a. Understand WHY the judge was wrong
   b. Add a clarification, edge case, or new example to the prompt
5. Re-run on Dev set → check if metrics improved
6. Repeat steps 3-5 until TPR > 80% and TNR > 80%
7. Run ONCE on Test set for final, unbiased metrics

Common iteration patterns:

TPR too low → Judge is missing real failures. Add more Fail examples, make fail criteria more explicit.
TNR too low → Judge has too many false alarms. Add "what does NOT count as a failure" section, add Pass examples for edge cases.
Both low → Criteria are ambiguous. Rewrite from scratch with clearer definitions.

9. Model Selection for Judges

Model Tier	When to Use	Typical Accuracy
GPT-4o / Claude Sonnet 4.5	High-stakes evals, complex reasoning	85–95%
GPT-4o-mini / Claude Haiku	Cost-sensitive, high-volume evals	75–90%
Open-source (Llama, Mistral)	Self-hosted, privacy-sensitive	70–85%

Tip: Start with the most capable model to establish a performance ceiling. Then test whether a cheaper model can match it for your specific use case. Often it can — especially with good few-shot examples.

10. Prompt Versioning

Always version your judge prompts. Track:

The prompt text
The few-shot examples used
The model and temperature
Dev set metrics (TPR, TNR) at that version
Date and reason for the change

Both LangWatch and Langfuse have built-in prompt versioning. Use it.

With LangWatch:

import langwatch

langwatch.prompts.create(
    name="dietary-judge-v3",
    description="Added edge cases for keto",
    template=judge_prompt_text,
    model="gpt-4o",
    temperature=0
)

With Langfuse:

from langfuse import get_client

langfuse = get_client()

langfuse.create_prompt(
    name="dietary-judge",
    prompt=judge_prompt_text,
    labels=["staging"],  # promote to "production" after validation
)

Appendix F: Platform Methods Reference (LangWatch & Langfuse)

LangWatch

Tracing

import langwatch

# Initialize (auto-instruments OpenAI, LangChain, LlamaIndex, etc.)
langwatch.init()

# Add custom spans
@langwatch.span(type="chain")
def my_pipeline(question):
    """Parent span for the whole pipeline"""
    sql = generate_sql(question)
    results = execute_query(sql)
    return synthesize_answer(question, results)

@langwatch.span(type="llm")
def generate_sql(question):
    """Tracked as an LLM generation"""
    return client.chat.completions.create(...)

@langwatch.span(type="tool")
def execute_query(sql):
    """Tracked as a tool call"""
    return db.execute(sql)

Querying Spans

import langwatch

# Get all spans for a specific name
spans_df = langwatch.get_spans(
    filters={"name": "ParseRequest"}
)

# Get spans within a time range
spans_df = langwatch.get_spans(
    filters={
        "timestamp_gte": "2025-02-01",
        "timestamp_lte": "2025-02-09"
    }
)

Datasets

import pandas as pd
import langwatch

df = pd.DataFrame({
    "query": ["Query 1", "Query 2"],
    "expected_answer": ["Answer 1", "Answer 2"]
})

dataset = langwatch.datasets.create(
    name="my-dataset",
    dataframe=df
)

Evaluators

import langwatch

# Use built-in evaluators (40+ available)
results = langwatch.evaluate.batch(
    dataset=traces_df,
    evaluators=[
        "dietary_compliance",   # Built-in
        "toxicity",             # Built-in
        "prompt_injection",     # Built-in
    ]
)

# Create custom evaluator
@langwatch.evaluator(name="custom_check")
def my_evaluator(trace):
    # Your logic here
    return {"passed": True, "score": 1.0}

# Run custom evaluator
results = langwatch.evaluate.batch(
    dataset=traces_df,
    evaluators=["custom_check"]
)

Experiments

import langwatch

def my_task(example):
    query = example["input"]["query"]
    return {"answer": my_pipeline(query)}

# Run experiment with automatic metrics
results = langwatch.evaluate.batch(
    dataset=dataset,
    task=my_task,
    evaluators=["accuracy", "latency", "cost"]
)

# View results
print(results.metrics)

Prompt Management

import langwatch

# Create prompt
prompt = langwatch.prompts.create(
    name="recipe-assistant-v1",
    template=[
        {"role": "system", "content": "You are a recipe assistant..."},
        {"role": "user", "content": "{{question}}"}
    ],
    model="gpt-4o-mini",
    temperature=0.7
)

# Use at runtime
messages = prompt.render(question="How do I make pancakes?")
response = client.chat.completions.create(
    messages=messages,
    model=prompt.model,
    temperature=prompt.temperature
)

Langfuse

Tracing

from langfuse.openai import OpenAI  # Drop-in replacement
from langfuse import observe, get_client

client = OpenAI()  # Calls are automatically traced

@observe()
def my_pipeline(question):
    """Creates a parent trace"""
    return generate_answer(question)

@observe(as_type="generation")
def generate_answer(question):
    """Tracked as a generation"""
    return client.chat.completions.create(...)

Querying Traces

from langfuse import get_client

langfuse = get_client()

traces = langfuse.api.trace.list(limit=100, tags=["production"])
trace = langfuse.api.trace.get("trace_id")

Datasets

from langfuse import get_client

langfuse = get_client()

langfuse.create_dataset(name="my-dataset")

langfuse.create_dataset_item(
    dataset_name="my-dataset",
    input={"query": "What is AI?"},
    expected_output={"answer": "Artificial Intelligence"},
)

Experiments

from langfuse import Evaluation

def my_task(*, item, **kwargs):
    query = item["input"]["query"]
    return my_pipeline(query)

def my_evaluator(*, output, expected_output, **kwargs):
    correct = output == expected_output.get("answer")
    return Evaluation(name="accuracy", value=1.0 if correct else 0.0)

result = langfuse.run_experiment(
    name="baseline",
    data=test_data,
    task=my_task,
    evaluators=[my_evaluator],
)
print(result.format())

Scores (Evaluation Results)

from langfuse import get_client

langfuse = get_client()

# Score a trace
langfuse.create_score(
    trace_id="trace_id",
    name="dietary_adherence",
    value=1,  # 0 or 1
    data_type="BOOLEAN",
    comment="Recipe correctly follows vegan restrictions",
)

# Score within context
with langfuse.start_as_current_observation(as_type="span", name="eval") as span:
    span.score(name="accuracy", value=0.95, data_type="NUMERIC")

Prompt Management

from langfuse import get_client

langfuse = get_client()

langfuse.create_prompt(
    name="my-prompt",
    type="chat",
    prompt=[
        {"role": "system", "content": "You are a {{role}}"},
        {"role": "user", "content": "{{question}}"},
    ],
    labels=["production"],
)

prompt = langfuse.get_prompt("my-prompt", type="chat")
compiled = prompt.compile(role="chef", question="Best pasta recipe?")

Appendix G: 30-Day Learning Path

Week 1: Foundations (Engineer, PM, or QA)

Day	Activity	Time	Role Focus
1	Pick your platform (LangWatch or Langfuse), install it	1h	All
2	Instrument your app with auto-tracing	2h	Engineer
2	Browse the trace viewer UI, understand traces visually	1h	PM/QA
3	Create a test dataset with dimensional sampling	2h	All
4	Upload dataset to your platform, run first experiment	1h	All
5	Review 50 traces, take notes (open coding)	1h	All
6	Categorize errors using LLM (axial coding)	1h	All
7	Prioritize: frequency x severity matrix	30m	All

Week 2: Code-Based Evals

Day	Activity	Time	Role Focus
8	Build 2 code-based evals for your top issues	2h	Engineer
8	Define eval criteria in plain English	1h	PM/QA
9	Test evals with known good/bad cases	1h	All
10	Run evals on all traces, calculate failure rates	1h	All
11-14	Iterate based on results	2h	All

Week 3: LLM Judge

Day	Activity	Time	Role Focus
15	Label 100-150 traces as ground truth	3h	All
16	Split into Train/Dev/Test	30m	Engineer
17	Write first judge prompt with few-shot examples	2h	All
18	Validate on Dev set, calculate TPR/TNR	1h	All
19	Iterate prompt until metrics > 80%	2h	All
20	Final test on Test set	30m	All
21	Run judge on all traces + correct with judgy	1h	All

Week 4: Advanced Topics & Production

Day	Activity	Time	Role Focus
22	RAG evaluation — retrieval metrics + answer quality (Ch. 6)	2h	Engineer
23	Multi-step pipeline evaluation (Ch. 7)	2h	Engineer
24	Multi-turn conversation evaluation (Ch. 8)	2h	Engineer
25	Safety evals — prompt injection, PII leakage (Ch. 9)	2h	All
26	Set up regression test suite (Ch. 11)	2h	Engineer
27	Human annotation calibration — measure inter-annotator agreement (Ch. 12)	1h	All
28	Optimize for cost — tiered evaluation, sampling strategy (Ch. 13)	1h	All
29	Create monitoring dashboard + automated eval runs	2h	Engineer
30	Document eval suite, present to stakeholders, plan maintenance	2h	All

Lessons Learned

Real lessons from implementing complete eval pipelines in production:

On Building Judges (Chapters 4, 10)

LLM-as-Judge is powerful but needs guardrails - Without proper validation, a judge can confidently give wrong answers. Always validate against ground truth.
You must test evaluators against ground truth - A judge that seems reasonable but has TNR=22% is actively harmful — it misses most real failures.
Train/Dev/Test splits enable confidence - Without them, you're fooling yourself about your judge's quality. This is non-negotiable.
Iterating on judge prompts is crucial - The first prompt is never good enough. Plan for 3-5 iterations minimum. See Appendix E for techniques.
Explanation-before-verdict is the #1 technique - Asking the judge to reason before labeling improves accuracy more than any other single change.

On Process & Methodology (Chapters 3, 11, 12)

Error analysis is the real work - Fancy tools don't matter if you haven't sat down and looked at your failures. Open coding → axial coding → prioritization is the workflow that works.
Human annotators disagree more than you think - Measure inter-annotator agreement (Cohen's kappa) before trusting your ground truth. If humans can't agree, the judge won't either.
Closing the loop is what separates good teams from great ones - Running evals is only half the job. The other half is systematically turning failures into improvements and preventing regressions.

On Production & Scale (Chapters 9, 13)

Safety evals are not optional - Prompt injection, PII leakage, and jailbreak detection should be running before you worry about quality evals.
Start expensive, then optimize - Use GPT-4o/Claude Sonnet to establish your performance ceiling, then test whether a cheaper model can match it. Often it can.
Sampling beats exhaustive evaluation - Evaluating 10% of traces with statistical rigor gives you a better answer than evaluating 100% with a bad judge.
Good observability tools make the workflow 10x faster - Integrated tracing, evaluation, datasets, and experiments in one platform (LangWatch, Langfuse, etc.) saves enormous time vs. stitching together custom scripts.

On Platform Choice

LangWatch for speed, Langfuse for depth - LangWatch gets you results in hours with built-in evaluators. Langfuse gives maximum control for complex custom logic. Many teams use both.
Built-in evaluators save weeks of dev time - LangWatch's 40+ built-in evaluators cover most common use cases. If you're reinventing safety checks or RAG metrics, you're wasting time.
Community matters for long-term success - Langfuse's larger community means more integrations, more examples, more support. LangWatch's simpler API means faster onboarding.

Conclusion

AI evals are not just "testing" — they're a product development methodology that touches engineering, product management, and quality assurance.

Key takeaways:

Everyone needs evals — Not just big companies. If your AI app touches users, you need systematic evaluation.
Start with error analysis — Sit down and look at your failures before building anything automated (Chapter 3).
PMs and QAs must lead — Error analysis and criteria definition are product/quality work, not just engineering tasks.
Build incrementally — Start with code-based evals, then add LLM judges, then add safety evals. Don't try to do everything at once.
Measure what matters — Application-specific criteria, not generic "helpfulness" scores.
Both TPR and TNR — A judge that catches failures but also false-alarms is harmful. Measure both.
Split your data — Train/Dev/Test is mandatory. Without it, you're overfitting your judge.
Correct for bias — Use statistical correction (Chapter 10) for honest metrics.
Close the loop — Evals that don't lead to improvements are wasted effort (Chapter 11).
Plan for scale — Start with the best model, then optimize for cost (Chapter 13).

Your action plan (see Appendix G for details):

Week 1: Set up observability (LangWatch or Langfuse), do error analysis
Week 2: Build 2-3 core code-based evals
Week 3: Build and validate an LLM judge with proper train/dev/test splits
Week 4: Advanced topics — RAG evals, multi-turn evals, safety evals, automation
Ongoing: 30 minutes per week maintenance + regression testing

Platform decision:

Choose LangWatch if you want to start fast (<30 min setup) and use built-in evaluators
Choose Langfuse if you need maximum flexibility and have complex custom workflows
Use both if you want the best of both worlds (many teams do this)

Remember: The teams shipping the best AI products are the ones with the best evals. Not the fanciest models. Not the biggest teams. The ones who systematically measure and improve.

Start today. Your future self will thank you.

Learning Resources

Platform Documentation & Learning Hubs

LangWatch Docs: docs.langwatch.ai
LangWatch GitHub: github.com/langwatch/langwatch
Langfuse Docs: langfuse.com/docs
Langfuse GitHub: github.com/langfuse/langfuse
Maven Course (AI Evals for Engineers & PMs): maven.com/parlance-labs/evals
HuggingFace Evaluation Guidebook: github.com/huggingface/evaluation-guidebook

Research & Thought Leadership

OpenAI Evals Platform: evals.openai.com
OpenAI Cookbook (practical examples & guides): cookbook.openai.com
OpenAI Research: openai.com/research
OpenAI Docs (Evals): platform.openai.com/docs/guides/evals
Anthropic Research: anthropic.com/research
METR (Model Evaluation & Threat Research): metr.org
Eugene Yan on eval process: eugeneyan.com/writing/eval-process

Blogs That Shaped This Guide

Hamel Husain's Blog: hamel.dev — Applied AI engineering, LLM evals deep-dives
Shreya Shankar's Site: sh-reya.com — LLM data systems research, eval methodology
Maxim AI Articles: getmaxim.ai/articles — Agentic evaluation patterns

Open-Source Tools & Libraries

Tool	Focus	License	Links
LangWatch	Observability & built-in evals	Apache 2.0	GitHub · Docs
Langfuse	Custom pipelines & tracing	MIT	GitHub · Docs
RAGAS	RAG-specific evaluation	Apache 2.0	GitHub · Docs
Comet Opik	LLM tracing & evaluation	Apache 2.0	GitHub · Site
judgy	Statistical bias correction	Open	GitHub
Braintrust	Experimentation & logging	Partial	Docs
Galileo	Hallucination detection	Proprietary	Site
Maxim	Agentic system evaluation	Proprietary	Site

Strategy Comparison Matrix

Company	Focus	Open Source	Best For	Unique Strength
LangWatch	Observability + Built-in Evals	Yes (Apache 2.0)	Fast setup, analytics	40+ built-in evaluators, auto-analytics
Langfuse	Custom Pipelines	Yes (MIT)	Data sovereignty, flexibility	Self-hostable, full control over data
Anthropic	Safety / Red Teaming	Partial	Frontier risks	Constitutional classifiers, multi-attempt adversarial testing
OpenAI	Preparedness / Business	Evals toolkit	Enterprise context	SME probing, contextual evals
RAGAS	RAG-specific	Yes (Apache 2.0)	RAG pipelines	Reference-free metrics, synthetic test data generation
Maxim	Agentic Systems	No	Multi-agent apps	Simulation framework, no-code evaluation
Braintrust	Experimentation	Partial	Early-stage teams	Collaborative design, fast iteration
Galileo	Hallucinations	No	Quality assurance	ChainPoll, real-time monitoring
Comet Opik	LLM Tracing & Evals	Yes (Apache 2.0)	End-to-end observability	Framework integrations, online evaluation rules
METR	Catastrophic Risk	Research	Policy guidance	Autonomous capability assessment

Contact Me

Om Bharatiya: @ombharatiya

Reference Work Credits

This guide was built on the foundation of the following people's work and ideas. Their courses, blogs, and open-source contributions made this guide possible:

Hamel Husain: @HamelHusain — hamel.dev
Shreya Shankar: @sh_reya — sh-reya.com
Eugene Yan: @eugeneyan — eugeneyan.com

This guide was inspired by and builds upon the AI Evals for Engineers & PMs course by Hamel Husain and Shreya Shankar, extended with additional research, production-ready code examples, and multi-platform guides covering LangWatch, Langfuse, and the broader eval tooling ecosystem.

Author: Om Bharatiya | Created: February 2026

FilesExpand file tree

ai_evals_complete_guide_langwatch_langfuse.md

Latest commit

History

ai_evals_complete_guide_langwatch_langfuse.md

File metadata and controls

AI Evals For Engineers, PMs & QAs: Complete Study Guide

Table of Contents

Chapter 1: What Are AI Evals and Why You Need Them

Simple Definition

Why Everyone Needs Evals

The Three Core Truths About Evals

The AI Development Cycle is the Scientific Method

What Can Go Wrong Without Evals?

For PMs: Why This Is Your Job

For QAs: Your New Superpower

Chapter 2: Setting Up Observability

What is a Trace?

Example of a Complete Trace

What Information to Capture

Choosing an Observability Platform

Setting Up LangWatch (Open-Source, Cloud or Self-Hosted)

Install and Configure

Instrument Your Application (Auto-Tracing)

Add Custom Spans with Decorators

Setting Up Langfuse (Open-Source, Cloud or Self-Hosted)

Install and Configure

Instrument Your Application (Drop-In Replacement)

Add Custom Spans with Decorators

Creating and Managing Prompts

LangWatch

Langfuse

Uploading Test Datasets

LangWatch

Langfuse

Key Principle

Chapter 3: Error Analysis: The Secret Sauce

What is Error Analysis?

Why PMs and QAs MUST Do This (Not Just Engineers)

Step 1: Generate Diverse Test Queries

Define Key Dimensions

Generate Random Combinations

Convert Tuples to Natural Language Queries Using an LLM

Step 2: Review 100 Traces and Take Notes (Open Coding)

Step 3: Categorize Errors Using Axial Coding

Using an LLM to Help Discover Categories

Refine Categories to Be Specific and Actionable

Step 4: Label Your Errors with LLM Assistance

Step 5: Count and Prioritize

Why This Changes Everything

The "Theoretical Saturation" Concept

For PMs/QAs: Your Error Analysis Checklist

Chapter 4: Building LLM-as-a-Judge Evaluators

What is LLM-as-a-Judge?

When to Use LLM-as-a-Judge

The Complete LLM Judge Workflow

Overview: The Pipeline

Step 1: Generate Traces

Step 2: Label Ground Truth Data

Step 3: Split Data (Train / Dev / Test)

Step 4: Build Your Judge Prompt

Part 1: Role and Domain Definitions

Part 2: Clear Evaluation Criteria

Part 3: Few-Shot Examples (From Your Train Set!)

Part 4: Output Format

Why Binary Scores Work Best

Step 5: Validate on Dev Set

Evaluator Functions (Platform-Agnostic)

Running the Experiment

The Metrics That Actually Matter

1. TPR (True Positive Rate) - Recall

2. TNR (True Negative Rate) - Specificity

Real Results: Why Iteration Matters

Target Metrics

Iterating on Your Judge Prompt

Step 6: Final Evaluation on Test Set

Step 7: Run on All Traces at Scale

LLM-as-Judge Across Different Domains

Chapter 5: Code-Based Evaluators

What Are Code-Based Evals?