Based on the Maven course by Hamel Husain & Shreya Shankar, enriched with hands-on examples, production-ready code, and platform-specific guides for LangWatch, Langfuse, and more
Who is this guide for?
- Engineers building AI-powered products who need to systematically evaluate quality
- Product Managers who own the product experience and need to lead error analysis
- QA Engineers who need to build automated evaluation pipelines for AI systems
- Anyone who wants to learn how to evaluate AI applications without taking the full course
What you'll learn:
- How to set up observability for any AI application
- How to systematically find what's broken (error analysis)
- How to build automated evaluators (code-based and LLM judges)
- How to evaluate RAG systems, multi-step pipelines, and multi-turn conversations
- How to run production evals: guardrails, safety, and real-time monitoring
- How to use statistical correction to account for judge errors
- How to close the loop: turn eval results into system improvements
- How to do all of this with your observability platform of choice (LangWatch, Langfuse, Braintrust, LangSmith, or your own)
Platform Examples: This guide uses LangWatch (open-source, self-hosted or cloud) and Langfuse (open-source, cloud or self-hosted) as primary examples. The methodology is platform-agnostic — adapt it to whichever tool you use.
LangWatch vs Langfuse: Both are excellent open-source platforms with similar core capabilities. LangWatch offers simpler setup and built-in evaluators, while Langfuse provides more flexibility for custom pipelines and has a larger community. This guide shows both so you can choose based on your needs.
- What Are AI Evals and Why You Need Them
- Setting Up Observability
- Error Analysis: The Secret Sauce
- Building LLM-as-a-Judge Evaluators
- Code-Based Evaluators
- RAG System Evaluation
- Multi-Step Pipeline Evaluation
- Multi-Turn Conversation Evaluation
- Production Evals: Safety, Guardrails & Monitoring
- Statistical Correction with judgy
- Closing the Loop: From Evals to Improvements
- Human Annotation Best Practices
- Cost, Latency & Scaling Evals
- Practical Implementation Guide
- Common Mistakes to Avoid
- Tools and Resources
Appendices:
- A: Glossary for PMs & QAs
- B: Quick Reference
- C: Complete Judge Prompts from Production
- D: Pipeline State Evaluator Prompts
- E: Judge Prompt Engineering Tips
- F: Platform Methods Reference (LangWatch & Langfuse)
- G: 30-Day Learning Path
Evals (Evaluations) are systematic tests that check if your AI application is working correctly. Think of them like unit tests for traditional software, but for AI systems.
There's a debate in the AI community: some people say "just vibe check your app" (meaning: just use it yourself and see if it feels good). But here's the truth:
Everyone needs evals. The people who say they don't need evals are actually benefiting from evals that someone else did upstream.
Example: If you're building a coding assistant with GPT-4, OpenAI already tested GPT-4 on massive code benchmarks. So you can "vibe check" your app. But for most applications that aren't simple uses of foundation models, you need your own evals.
-
You can't improve what you don't measure
- Generic metrics like "helpfulness score" won't catch specific problems
- You need application-specific evals
-
Error analysis is the most important step
- More important than LLM judges
- More important than fancy observability tools
- This is where you actually learn what's broken
-
PMs and QAs must own error analysis, not just engineers
- Engineers know if code works
- PMs know if the product experience is good
- QAs know how to systematically break things
- You have the domain expertise
- This is product work, not just technical work
Building great AI products requires a rigorous evaluation process. In many ways, AI development IS the scientific method:
- Observe - Trace your AI's behavior (Chapter 2)
- Hypothesize - Identify what's broken through error analysis (Chapter 3)
- Experiment - Build evaluators and test changes (Chapters 4-9)
- Measure - Calculate metrics and correct for bias (Chapter 10)
- Iterate - Improve based on data, not hunches (Chapter 11)
Your demo works great. Then production happens:
- Users hit edge cases you never thought of
- Text messages contain typos and unusual formatting
- Dates are formatted differently than expected
- AI tries to handle requests it should hand off to humans
- Small prompt changes break things that were working
Example from real production data:
User: "I need a one bedroom with the bathroom NOT connected"
AI: Returns apartments with connected bathrooms (WRONG!)
User: "I do NOT want a bathroom connected to the room"
AI: "I'll check on that" but never actually checks
PLUS: AI used markdown formatting (* asterisks *) in a text message
Three different problems in one interaction! Without proper logging and evaluation, you'd never catch these patterns.
Wrong approach: "This is technical AI stuff, let engineering figure it out"
Right approach: PMs should lead error analysis because:
- You understand user needs
- You have product taste
- You have domain expertise
- This is product work disguised as technical work
The teams shipping the best AI products have PMs who've personally reviewed hundreds or thousands of traces.
Traditional QA involves test cases with expected outputs. AI QA is different:
- Outputs are non-deterministic (same input can give different outputs)
- "Correct" is often subjective
- Edge cases are virtually infinite
- You need automated evaluators that scale
But the core QA mindset - systematic testing, edge case thinking, regression prevention - is exactly what AI evals need. QAs who learn evals become incredibly valuable.
A trace is a complete recording of everything your AI did to respond to a user. It's like a detailed log that shows:
- System prompt (instructions given to the AI)
- User messages (what the person asked)
- Tool calls (functions the AI tried to use)
- Tool responses (what those functions returned)
- Assistant responses (what the AI said back)
- All context (everything the LLM saw when making decisions)
=== TRACE ID: abc123 ===
SYSTEM PROMPT:
"You are a helpful property management assistant..."
USER MESSAGE:
"I need a one bedroom with the bathroom not connected"
TOOL CALL:
get_availability(bedrooms=1, bathroom_connected=None)
TOOL RESPONSE:
[
{unit: "A101", bedrooms: 1, bathroom_connected: True},
{unit: "B205", bedrooms: 1, bathroom_connected: True}
]
ASSISTANT RESPONSE:
"I found these apartments: A101 and B205..."
(Used markdown: ** ** in text message)
Minimum requirements:
- Input (user message)
- Output (AI response)
- Timestamp
- Unique ID for the interaction
Better to include:
- System prompts used
- Tool calls and their results
- Model parameters (temperature, max_tokens, etc.)
- Token counts
- Latency (response time)
- Cost per request
Best practice:
- User context (session history)
- Error messages if any occurred
- Model version used
- Feature flags active at the time
| Tool | Type | Best For | Cost |
|---|---|---|---|
| LangWatch | Open source, cloud or self-hosted | Simple setup, built-in evaluators, great UX | Free tier + paid |
| Langfuse | Open source, cloud or self-hosted | Custom pipelines, large community | Free tier + paid |
| Braintrust | Cloud | Excellent UI, team collaboration | Paid |
| LangSmith | Cloud | LangChain users | Paid |
| Build Your Own | Custom | Learning, custom needs | Free |
LangWatch vs Langfuse comparison:
- Setup: LangWatch is simpler (3-line integration), Langfuse requires more configuration
- Evaluators: LangWatch has 40+ built-in evaluators, Langfuse requires custom implementation
- Flexibility: Langfuse is more flexible for custom workflows, LangWatch is more opinionated
- Community: Langfuse has a larger community and more integrations
- UI: Both have excellent UIs; LangWatch focuses on analytics, Langfuse on workflow
All of these support the same core concepts: traces, spans, datasets, evaluations, and experiments. The methodology in this guide works with any of them.
LangWatch is an open-source LLM observability and analytics platform. It provides tracing, evaluation, datasets, experiments, and 40+ built-in evaluators.
pip install langwatch# Set your API key (get one at langwatch.ai or self-host)
import os
os.environ["LANGWATCH_API_KEY"] = "lw_..." # or set in .env fileCloud vs Self-Hosted:
- Cloud: Sign up at langwatch.ai, get API key, done in 5 minutes
- Self-Hosted: Run
docker-compose upwith their Docker setup, point to your own instance
LangWatch supports auto-instrumentation for most frameworks:
import langwatch
# Initialize LangWatch
langwatch.init()
# Your existing OpenAI code now gets traced automatically!
import openai
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are a recipe assistant."},
{"role": "user", "content": "How do I make pancakes?"}
],
temperature=0.7
)
# This call is automatically captured by LangWatch!Framework Support:
- OpenAI (automatic)
- LangChain (automatic)
- LlamaIndex (automatic)
- Anthropic Claude (automatic)
- Any custom LLM (manual spans)
import langwatch
@langwatch.span(type="chain")
def my_pipeline(question):
"""Parent span for the whole pipeline"""
sql = generate_sql(question)
results = execute_query(sql)
return synthesize_answer(question, results)
@langwatch.span(type="llm")
def generate_sql(question):
"""Tracked as an LLM generation"""
return client.chat.completions.create(...)
@langwatch.span(type="tool")
def execute_query(sql):
"""Tracked as a tool call"""
return db.execute(sql)Comparison with Langfuse:
Both use decorators, but LangWatch's @langwatch.span() is simpler than Langfuse's @observe(). LangWatch automatically categorizes spans by type, while Langfuse requires explicit as_type parameters.
Langfuse provides tracing, evaluation, datasets, experiments, and prompt management. It offers a managed cloud and a self-hosted option.
pip install langfuse openai# Set environment variables (or pass to constructor)
# LANGFUSE_SECRET_KEY="sk-lf-..."
# LANGFUSE_PUBLIC_KEY="pk-lf-..."
# LANGFUSE_HOST="https://cloud.langfuse.com" # or your self-hosted URL# Just change your import — everything else stays the same!
from langfuse.openai import OpenAI
client = OpenAI()
# This call is automatically traced by Langfuse
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are a recipe assistant."},
{"role": "user", "content": "How do I make pancakes?"}
],
temperature=0.7
)from langfuse import observe
@observe()
def my_pipeline(question):
"""Parent trace for the whole pipeline"""
sql = generate_sql(question)
results = execute_query(sql)
return synthesize_answer(question, results)
@observe(as_type="generation")
def generate_sql(question):
"""Tracked as a generation (LLM call)"""
return client.chat.completions.create(...)Both platforms support versioned prompt management:
import langwatch
# Create a prompt template
langwatch.prompts.create(
name="recipe-assistant-v1",
template=[
{"role": "system", "content": "You are a recipe assistant..."},
{"role": "user", "content": "{{question}}"}
],
model="gpt-4o-mini",
temperature=0.7
)
# Use at runtime
prompt = langwatch.prompts.get("recipe-assistant-v1")
messages = prompt.render(question="How do I make pancakes?")
response = client.chat.completions.create(messages=messages, **prompt.settings)LangWatch advantage: Simpler API, automatic parameter management (temperature, model stored with prompt).
from langfuse import get_client
langfuse = get_client()
langfuse.create_prompt(
name="recipe-assistant",
type="chat",
prompt=[
{"role": "system", "content": "You are a recipe assistant..."},
{"role": "user", "content": "{{query}}"},
],
labels=["production"],
)
# Use at runtime
prompt = langfuse.get_prompt("recipe-assistant", type="chat")
compiled = prompt.compile(query="How do I make pancakes?")Langfuse advantage: More mature prompt management, better versioning UI, labels for organization.
import langwatch
import pandas as pd
df = pd.DataFrame({
"query": [
"Suggest a quick vegan breakfast recipe",
"I have chicken and rice. What can I cook?",
"Give me a dessert recipe with chocolate",
]
})
dataset = langwatch.datasets.create(
name="recipe-queries",
dataframe=df,
)LangWatch advantage: Direct pandas DataFrame support, simpler API.
from langfuse import get_client
langfuse = get_client()
langfuse.create_dataset(name="recipe-queries")
for query in ["Suggest a quick vegan breakfast recipe",
"I have chicken and rice. What can I cook?",
"Give me a dessert recipe with chocolate"]:
langfuse.create_dataset_item(
dataset_name="recipe-queries",
input={"query": query},
)Langfuse advantage: More control over individual items, better for incremental additions.
Without traces, you can't do evals. This is your foundation. Set this up first before anything else.
For PMs/QAs: You don't need to write the instrumentation code. Ask your engineers to set up tracing, then use the web UI to review traces visually. Both LangWatch (langwatch.ai or your self-hosted URL) and Langfuse (cloud.langfuse.com or your self-hosted URL) provide UIs that let you browse, search, and annotate traces without writing any code.
Platform Choice Guidance:
- Choose LangWatch if: You want the fastest setup, built-in evaluators, and focus on analytics
- Choose Langfuse if: You need maximum flexibility, have complex custom workflows, or want the largest community
- Use both: They complement each other - LangWatch for quick evals, Langfuse for deep workflow customization
Error analysis is the systematic process of:
- Reviewing traces (logs of AI interactions)
- Taking notes on problems you see
- Categorizing those problems
- Counting how often each type of problem occurs
This is THE most important skill in building reliable AI products.
Most teams skip straight to building fancy dashboards or LLM judges. That's backwards. You need to understand what's wrong before you can measure it.
Wrong approach: "This is technical AI stuff, let engineering figure it out"
Right approach: PMs and QAs should lead error analysis because:
- You understand user needs - Engineers don't know if a "connected bathroom" vs "disconnected bathroom" matters to users
- You have product taste - You know what good experiences look like
- You have domain expertise - You understand business requirements
- This is product work - Disguised as technical work, but it's really about product quality
Real impact: The teams shipping the best AI products have PMs who've personally reviewed hundreds or thousands of traces.
Before you can review traces, you need diverse test inputs. A powerful technique for this is dimensional sampling.
Identify 3-4 dimensions that matter for your product:
DIMENSIONS = {
"dietary_restriction": [
"vegan", "vegetarian", "gluten-free", "keto", "no restrictions"
],
"cuisine_type": [
"Italian", "Asian", "Mexican", "Mediterranean", "American"
],
"meal_type": [
"breakfast", "lunch", "dinner", "snack", "dessert"
],
"skill_level": [
"beginner", "intermediate", "advanced"
],
}
# Total possible combinations: 5 x 5 x 5 x 3 = 375import random
random.seed(42)
dimension_tuples = []
for i in range(25): # Generate 25 diverse tuples
tuple_data = {
"dietary_restriction": random.choice(DIMENSIONS["dietary_restriction"]),
"cuisine_type": random.choice(DIMENSIONS["cuisine_type"]),
"meal_type": random.choice(DIMENSIONS["meal_type"]),
"skill_level": random.choice(DIMENSIONS["skill_level"]),
}
dimension_tuples.append(tuple_data)You can use any LLM to convert dimension tuples into realistic queries. Here are platform-specific approaches:
With LangWatch (built-in generation):
import langwatch
QUERY_GEN_PROMPT = """Convert this dimension tuple into a realistic user query
for a Recipe Bot. Be creative and vary your style.
Dimension tuple: {tuple_description}
Generate 1 unique, realistic query:"""
queries = []
for t in dimension_tuples:
result = langwatch.completion(
prompt=QUERY_GEN_PROMPT.format(tuple_description=str(t)),
model="gpt-4o-mini",
temperature=0.9
)
queries.append(result.text)With any LLM (platform-agnostic):
import openai
client = openai.OpenAI()
QUERY_GEN_PROMPT = """Convert this dimension tuple into a realistic user query
for a Recipe Bot. Be creative and vary your style.
Dimension tuple: {tuple_description}
Generate 1 unique, realistic query:"""
queries = []
for t in dimension_tuples:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": QUERY_GEN_PROMPT.format(
tuple_description=str(t)
)}],
temperature=0.9
)
queries.append(response.choices[0].message.content)With Langfuse (manual tracking):
from langfuse.openai import OpenAI
client = OpenAI() # Auto-traced
queries = []
for t in dimension_tuples:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": QUERY_GEN_PROMPT.format(
tuple_description=str(t)
)}],
temperature=0.9
)
queries.append(response.choices[0].message.content)Example conversions:
| Dimension Tuple | Generated Query |
|---|---|
| vegan, Italian, dinner, beginner | "Hey, I'm new to cooking and vegan. Can you suggest an easy Italian dinner?" |
| gluten-free, any, dessert, intermediate | "I'm looking for a gluten-free dessert that's a bit of a challenge to make" |
| keto, American, breakfast, advanced | "Give me a complex keto breakfast recipe, American style" |
For PMs/QAs: This dimensional approach ensures you test the full space of user needs. Without it, you'll only test the obvious cases and miss edge cases where users combine unexpected requirements.
The process (30 seconds per trace):
- Open your trace viewer (LangWatch dashboard, Langfuse UI, or any tool)
- Look at the first trace
- Scan through it:
- Read the user message
- Check if AI called the right tools
- Look at what the tools returned
- Read the assistant's response
- Note any problems you see
Example notes from a real error analysis session:
TRACE #1:
"Told user it would check on bathrooms but didn't do it.
Did not follow user instructions.
Rendered markdown in a text message."
TRACE #2:
"Returned properties outside user's price range.
Tool call had correct parameters but didn't filter results."
TRACE #3:
"Good response. No issues."
TRACE #4:
"Failed to hand off to human when user asked for same-day tour.
Policy violation."
Rules for error analysis:
- Don't try to catch everything - Just note the most important things
- Don't debate every trace - Think quickly, write it down, move on
- Skip the system prompt - If it's usually the same, you don't need to read it every time
- Get into a flow state - This should feel fast, not tedious
Time commitment:
- First trace: 45 seconds
- After 10 traces: 25 seconds each
- After 50 traces: 20 seconds each
- Total time for 100 traces: ~45 minutes
Platform-Specific Note:
- LangWatch: Use the "Annotations" feature to add notes directly to traces in the UI
- Langfuse: Use the "Comments" feature to add notes to traces
Now you have 40-50 notes scattered across traces. Time to organize them.
This process is called "axial coding" (a research method from sociology). You're grouping similar errors into categories.
Export your notes, then use this prompt:
prompt = f"""
You are analyzing Recipe Bot failures. Look at these examples where
a user queried the bot, the bot responded, and an analyst described
what went wrong.
EXAMPLES:
{combined_df.to_json(orient="records", lines=True)}
Based on the patterns you see in the analyst's descriptions,
create 4-6 systematic failure mode labels.
Each label should:
- Be short and clear (2 words max)
- Capture a distinct type of failure pattern
- Be applicable to multiple traces
Respond with a list:
["label1", "label2", "label3", "label4", "label5", "label6"]
"""Example results from a real recipe bot evaluation:
["Dietary Ignored", "Formatting Error", "Complexity Mismatch",
"Meal Type Mismatch", "Ingredient Omission", "Skill Level Misalignment"]
Problem: Generic LLM suggestions are too vague!
"Temporal issues" - what does that mean? "Quality issues" - too generic!
Better categories (specific and actionable):
- Dietary Ignored - Bot suggests ingredients that violate dietary restrictions
- Formatting Error - Markdown in SMS, wrong structure
- Complexity Mismatch - Recipe too hard/easy for stated skill level
- Meal Type Mismatch - Suggests dinner when asked for breakfast
- Ingredient Omission - Doesn't include unique ingredients user asked for
- Skill Level Misalignment - Advanced techniques for beginners
Your categories need to be specific enough that someone else could label errors using them.
This step works with any LLM. Use batch processing if your platform supports it:
CLASSIFICATION_PROMPT = """Look at this Recipe Bot interaction and the
analyst's description. Apply the most appropriate failure mode label.
USER QUERY: {input_query}
BOT RESPONSE: {bot_response}
ANALYST'S ISSUE DESCRIPTION: {issue_description}
AVAILABLE LABELS:
{failure_mode_labels}
Respond with just the label name."""
# Run classification on each error note (use your platform's batch API
# or loop with any LLM client)With LangWatch (batch evaluation):
import langwatch
results = langwatch.evaluate.batch(
dataset=error_notes_df,
evaluator="custom_classifier",
prompt_template=CLASSIFICATION_PROMPT,
model="gpt-4o-mini"
)With Langfuse (manual iteration):
from langfuse.openai import OpenAI
client = OpenAI()
for note in error_notes:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": CLASSIFICATION_PROMPT.format(**note)}],
temperature=0
)
note["label"] = response.choices[0].message.contentCount how many times each category appears:
label_counts = results["output"].value_counts()Example results from a real evaluation:
| Category | Count | Percentage |
|---|---|---|
| Complexity Mismatch | 2 | 22% |
| Meal Type Mismatch | 2 | 22% |
| Ingredient Omission | 2 | 22% |
| Dietary Ignored | 1 | 11% |
| Formatting Error | 1 | 11% |
| Skill Level Misalignment | 1 | 11% |
Before error analysis:
- You're paralyzed
- Don't know what to fix first
- Can't prioritize
After error analysis:
- Clear priorities based on frequency
- Understanding of severity (frequency vs. impact)
- Evidence for stakeholder discussions
- Concrete list of what to build evals for
Example prioritization discussion:
"Dietary restriction violations happen in 11% of cases, but when
they occur, we could harm users with allergies. This is HIGH-SEVERITY.
Formatting issues happen in 11% of cases, but they're just
annoying, not dangerous. This is LOW-SEVERITY.
Let's fix dietary adherence first, then complexity matching."
When to stop reviewing traces?
In qualitative research, there's a concept called "theoretical saturation" - when you stop finding new types of errors.
- Review your first 50 traces: You find 10 different error types
- Review next 25 traces: You find 2 new error types
- Review next 25 traces: You find 0 new error types
- Stop here! You've reached saturation
You don't need to review 1000 traces if you're not finding new patterns after 100.
- Ask engineering to set up tracing (LangWatch, Langfuse, or any tool)
- Open the trace viewer UI
- Browse 100 traces, taking quick notes on problems
- Use an LLM to help categorize your notes into 4-6 failure modes
- Count occurrences of each failure mode
- Create a prioritized list considering both frequency and severity
- Present findings to your team with data-backed recommendations
- Repeat monthly with new traces to catch new failure patterns
An LLM judge is an AI that evaluates other AI outputs. It reads traces and scores them.
Why use it?
- Automates evaluation at scale
- Provides consistent judgment
- Much faster than manual review
The challenge: Most people build judges wrong. Their judges hallucinate, miss problems, or create false confidence.
Use LLM judges for:
- Subjective quality assessments
- Policy compliance checking
- Context understanding
- Dietary adherence
- Tone appropriateness
- Multi-step reasoning checks
Don't use LLM judges for:
- Format validation (use code)
- Required field checks (use code)
- Simple pattern matching (use code)
- Exact string matching (use code)
Rule of thumb: If you can express it as an if/else statement, use code. If you need judgment, use LLM.
Building reliable LLM judges requires a rigorous 7-step workflow:
1. Generate traces (run your AI on test queries)
2. Label a subset manually (or with a powerful LLM)
3. Split into Train / Dev / Test sets
4. Develop your judge prompt using Train examples
5. Validate on Dev set (iterate until good)
6. Final evaluation on Test set (unbiased metrics)
7. Run on all traces + correct with judgy
Run your AI system on diverse test queries to create traces. Use your platform's auto-instrumentation (see Chapter 2) to capture everything automatically.
Label 150-200 traces as PASS or FAIL. You can do this manually (most accurate) or use a powerful LLM:
You are an expert nutritionist evaluating dietary adherence.
DIETARY RESTRICTION DEFINITIONS:
- Vegan: No animal products (meat, dairy, eggs, honey, etc.)
- Vegetarian: No meat or fish, but dairy and eggs are allowed
- Gluten-free: No wheat, barley, rye, or other gluten-containing grains
- Keto: Very low carb (<20g net carbs), high fat, moderate protein
[... full definitions — see Appendix C for the full list ...]
EVALUATION CRITERIA:
- PASS: Recipe clearly adheres to the dietary preferences
- FAIL: Recipe contains ingredients that violate dietary preferences
Query: {query}
Dietary Restriction: {dietary_restriction}
Response: {response}
Return JSON: {"label": "PASS" or "FAIL", "explanation": "..."}
Platform-Specific Labeling:
With LangWatch (built-in evaluators):
import langwatch
# LangWatch has 40+ built-in evaluators including dietary compliance
results = langwatch.evaluate.batch(
dataset=traces_df,
evaluators=["dietary_compliance"], # Built-in evaluator
model="gpt-4o"
)
# Or create custom evaluator
custom_evaluator = langwatch.evaluators.create(
name="dietary_adherence",
prompt=LABELING_PROMPT,
model="gpt-4o"
)
results = langwatch.evaluate.batch(
dataset=traces_df,
evaluators=[custom_evaluator]
)With Langfuse (custom implementation):
from langfuse.openai import OpenAI
client = OpenAI()
labels = []
for trace in traces:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": LABELING_PROMPT.format(**trace)}],
temperature=0
)
labels.append(parse_json(response.choices[0].message.content))LangWatch advantage: 40+ built-in evaluators save time for common use cases. Langfuse advantage: Complete control over custom evaluation logic.
This is critical and often skipped! You need three separate sets:
- Train (~15%): Used to select few-shot examples for your judge prompt
- Dev (~40%): Used to iterate and improve your judge prompt
- Test (~45%): Used ONCE for final, unbiased evaluation
from sklearn.model_selection import train_test_split
# First split: separate test set
train_dev, test = train_test_split(
labeled_data, test_size=0.45,
stratify=labeled_data['label'], # Maintain PASS/FAIL ratio
random_state=42
)
# Second split: separate train from dev
train, dev = train_test_split(
train_dev, test_size=0.73, # 40% of original
stratify=train_dev['label'],
random_state=42
)Why stratified splitting? You need both PASS and FAIL examples in every split. Without stratification, you might get a dev set with all PASS examples, making it useless for testing failure detection.
Your judge prompt needs four key parts:
You are an expert nutritionist and dietary specialist evaluating
whether recipe responses properly adhere to specified dietary
restrictions.
DIETARY RESTRICTION DEFINITIONS:
- Vegan: No animal products (meat, dairy, eggs, honey, etc.)
- Vegetarian: No meat or fish, but dairy and eggs are allowed
- Gluten-free: No wheat, barley, rye, or other gluten-containing grains
- Dairy-free: No milk, cheese, butter, yogurt, or other dairy products
- Keto: Very low carb (typically <20g net carbs), high fat
- Paleo: No grains, legumes, dairy, refined sugar, or processed foods
[... all 16 definitions — see Appendix C for the full list ...]
EVALUATION CRITERIA:
- PASS: The recipe clearly adheres to the dietary preferences
with appropriate ingredients and preparation methods
- FAIL: The recipe contains ingredients or methods that violate
the dietary preferences
- Consider both explicit ingredients AND cooking methods
This is where the train set pays off. Select 1-3 examples of correct judgments:
Example 1 (PASS):
Query: "Gluten-free pizza dough that actually tastes good"
Response: [Recipe using gluten-free all-purpose flour blend,
baking powder, olive oil, honey, apple cider vinegar...]
Explanation: The recipe uses gluten-free flour blend. All other
ingredients (baking powder, salt, olive oil, honey) do not
contain gluten. The preparation method does not introduce any
gluten-containing elements.
Label: PASS
Example 2 (FAIL):
Query: "Raw vegan Mediterranean quinoa salad"
Response: [Recipe with cooked quinoa, fresh vegetables,
olive oil, lemon juice...]
Explanation: The recipe FAILS because it includes cooked quinoa.
Raw vegan diets do not allow foods heated above 118 degrees F (48 degrees C),
and cooking quinoa involves boiling, which exceeds this limit.
Label: FAIL
Now evaluate the following:
Query: {query}
Dietary Restriction: {dietary_restriction}
Recipe Response: {response}
RETURN YOUR EVALUATION IN JSON FORMAT:
"label": "PASS" or "FAIL",
"explanation": "Detailed explanation citing specific ingredients or methods"
Some people want 1-5 scales or percentages. Don't do this.
With binary (PASS/FAIL):
- Only need to verify two things
- Clear decision boundary
- Easier to debug
- Simpler to explain to stakeholders
With 1-5 scale:
- Need to verify every score aligns
- What's the difference between 2 and 3?
- 5x more work to validate
- Business decisions are binary anyway
Remember: Either you fix something or you don't. Either it's broken or it's not.
Run your judge on the Dev set and compare to ground truth. Here's how with each platform:
def eval_tp(*, output, expected, **kwargs):
"""True Positive: Judge says PASS, ground truth is PASS"""
judge = output.get("label", "").upper()
truth = expected.get("label", "").upper()
return 1.0 if judge == "PASS" and truth == "PASS" else 0.0
def eval_tn(*, output, expected, **kwargs):
"""True Negative: Judge says FAIL, ground truth is FAIL"""
judge = output.get("label", "").upper()
truth = expected.get("label", "").upper()
return 1.0 if judge == "FAIL" and truth == "FAIL" else 0.0
def eval_fp(*, output, expected, **kwargs):
"""False Positive: Judge says PASS, ground truth is FAIL"""
judge = output.get("label", "").upper()
truth = expected.get("label", "").upper()
return 1.0 if judge == "PASS" and truth == "FAIL" else 0.0
def eval_fn(*, output, expected, **kwargs):
"""False Negative: Judge says FAIL, ground truth is PASS"""
judge = output.get("label", "").upper()
truth = expected.get("label", "").upper()
return 1.0 if judge == "FAIL" and truth == "PASS" else 0.0With LangWatch:
import langwatch
# Create custom evaluator with your judge prompt
judge_evaluator = langwatch.evaluators.create(
name="dietary-judge-v1",
prompt=judge_prompt_template,
model="gpt-4o",
temperature=0
)
# Run on dev set
results = langwatch.evaluate.batch(
dataset=dev_dataset,
evaluators=[judge_evaluator],
metrics=["tp", "tn", "fp", "fn", "tpr", "tnr"]
)
# LangWatch automatically calculates TPR and TNR
print(f"TPR: {results.metrics['tpr']:.1%}")
print(f"TNR: {results.metrics['tnr']:.1%}")LangWatch advantage: Built-in metric calculation, no need to manually compute confusion matrix.
With Langfuse:
from langfuse import Evaluation
def accuracy_evaluator(*, input, output, expected_output, **kwargs):
judge = output.get("label", "").upper()
truth = expected_output.get("label", "").upper()
correct = judge == truth
return Evaluation(name="accuracy", value=1.0 if correct else 0.0)
result = langfuse.run_experiment(
name="judge-dev-validation",
data=dev_data, # list of {"input": ..., "expected_output": ...}
task=judge_task,
evaluators=[accuracy_evaluator],
)
print(result.format())
# Calculate TPR/TNR manually from results
tp = sum(1 for r in results if r["judge"] == "PASS" and r["truth"] == "PASS")
tn = sum(1 for r in results if r["judge"] == "FAIL" and r["truth"] == "FAIL")
fp = sum(1 for r in results if r["judge"] == "PASS" and r["truth"] == "FAIL")
fn = sum(1 for r in results if r["judge"] == "FAIL" and r["truth"] == "PASS")
tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
tnr = tn / (tn + fp) if (tn + fp) > 0 else 0
print(f"TPR: {tpr:.1%}")
print(f"TNR: {tnr:.1%}")Langfuse advantage: More control over evaluation logic, better for complex custom metrics.
Most people only look at "agreement":
Agreement = (Judge agrees with me) / (Total traces)
Example: 90% agreement
Why this is misleading:
If failures only happen 10% of the time, a judge that always says "pass" gets 90% accuracy by being completely useless!
The two metrics you actually need:
"When there's actually a PASS, how often does the judge correctly say PASS?"
TPR = True Positives / (True Positives + False Negatives)
"When there's actually a FAIL, how often does the judge correctly say FAIL?"
TNR = True Negatives / (True Negatives + False Positives)
After careful prompt iteration (production-quality judge):
Test Set Performance:
True Positive Rate (TPR): 95.7%
True Negative Rate (TNR): 100.0%
Balanced Accuracy: 97.8%
Total predictions: 33
Correct predictions: 32
Overall Accuracy: 97.0%
First attempt (before iteration):
Test Set Performance:
True Positive Rate (TPR): 90.1%
True Negative Rate (TNR): 22.2% <-- TOO LOW!
Accuracy: 84.0%
Notice the first attempt had a TNR of only 22.2% — meaning when a recipe actually violated dietary restrictions, the judge only caught it 22% of the time! This is dangerous (imagine telling a diabetic a recipe is safe when it isn't). After careful prompt iteration, the judge achieved 100% TNR.
Good judge:
- TPR > 80%
- TNR > 80%
Great judge:
- TPR > 90%
- TNR > 90%
Both must be high! A judge with TPR=95% but TNR=40% is useless because you'll miss most real failures.
Your first prompt won't be perfect. That's expected.
Process:
- Test your judge on Dev set
- Calculate TPR and TNR
- Look at errors:
- Where did it miss real failures? (False Negatives)
- Where did it false alarm? (False Positives)
- Update the prompt:
- Add missed scenarios to criteria
- Add false alarm scenarios to "NOT a failure" section
- Add 1-2 more examples of correct judgments
- Test again on Dev set
- Repeat until both metrics > 80%
- THEN test once on Test set for final, unbiased metrics
Once your judge performs well on Dev, run it on the Test set ONCE:
# Calculate final metrics from test set results
tp = sum(1 for r in results if r["judge"] == "PASS" and r["truth"] == "PASS")
tn = sum(1 for r in results if r["judge"] == "FAIL" and r["truth"] == "FAIL")
fp = sum(1 for r in results if r["judge"] == "PASS" and r["truth"] == "FAIL")
fn = sum(1 for r in results if r["judge"] == "FAIL" and r["truth"] == "PASS")
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
print(f"Final TPR: {tpr:.1%}")
print(f"Final TNR: {tnr:.1%}")Once validated, run your judge on ALL production traces:
With LangWatch (batch evaluation with built-in concurrency):
import langwatch
# Run judge on all production traces
results = langwatch.evaluate.batch(
dataset=all_traces_df,
evaluators=[judge_evaluator],
concurrency=20, # Parallel processing
cache=True # Cache results for duplicate traces
)
# Get summary statistics
pass_rate = results.metrics["pass_rate"]
print(f"Raw pass rate: {pass_rate:.1%}")LangWatch advantage: Automatic concurrency management, built-in caching, progress tracking.
With Langfuse (experiment on dataset):
result = langfuse.run_experiment(
name="full-evaluation",
data=all_traces_data,
task=judge_task,
evaluators=[accuracy_evaluator],
max_concurrency=20,
)With plain OpenAI (platform-agnostic):
import openai
from concurrent.futures import ThreadPoolExecutor
client = openai.OpenAI()
def run_judge(trace):
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": judge_prompt.format(**trace)}],
temperature=0,
)
return parse_json(response.choices[0].message.content)
with ThreadPoolExecutor(max_workers=20) as executor:
results = list(executor.map(run_judge, all_traces))Example result: Raw pass rate on 1000 traces = 84.4%
But this raw rate doesn't account for judge errors. Chapter 10 covers how to correct for this using the judgy library.
The recipe bot is one example. Here's how the same methodology applies to other domains:
Customer Support Bot:
Criterion: "Did the agent follow the refund policy correctly?"
PASS: Agent offered refund within 30-day window per policy
FAIL: Agent denied valid refund or offered refund outside policy
Code Generation Assistant:
Criterion: "Does the generated code actually solve the user's problem?"
PASS: Code compiles, handles edge cases, follows the user's constraints
FAIL: Code has syntax errors, misses requirements, or uses deprecated APIs
Medical Information Bot:
Criterion: "Does the response include appropriate disclaimers?"
PASS: Includes "consult your doctor" and avoids specific diagnoses
FAIL: Provides diagnosis-like statements without medical disclaimers
E-commerce Search:
Criterion: "Are the recommended products relevant to the query?"
PASS: Products match stated preferences (size, color, price range)
FAIL: Products violate stated filters or preferences
The structure is always the same: define the criterion, write PASS/FAIL definitions, add few-shot examples, validate with TPR/TNR.
Code-based evals are checks you write in programming code (like Python) to verify specific, objective properties of your AI's outputs.
Use code when you can test something without calling an LLM:
- Format validation - Is markdown appearing in text messages?
- Required field checks - Did the AI include all required information?
- Tool call validation - Did the AI call the right tool?
- Response length constraints - Is the response under 500 characters?
- Prohibited content patterns - Are there PII (emails, phone numbers)?
import re
def eval_no_markdown_in_sms(trace) -> dict:
response = trace['assistant_message']
channel = trace['metadata']['channel']
if channel != 'sms':
return {'passed': True, 'reason': 'Not SMS'}
markdown_patterns = [
r'\*\*.*?\*\*', # Bold
r'\_\_.*?\_\_', # Bold alt
r'\#\#\s', # Headers
r'```', # Code blocks
r'\[.*?\]\(.*?\)' # Links
]
for pattern in markdown_patterns:
if re.search(pattern, response):
return {
'passed': False,
'reason': f'Found markdown pattern: {pattern}'
}
return {'passed': True, 'reason': 'No markdown found'}Platform Integration:
With LangWatch:
import langwatch
# Register as custom evaluator
@langwatch.evaluator(name="no_markdown_sms")
def eval_no_markdown_in_sms(trace):
# ... implementation above ...
return {'passed': result['passed'], 'score': 1.0 if result['passed'] else 0.0}
# Run on dataset
results = langwatch.evaluate.batch(
dataset=traces_df,
evaluators=["no_markdown_sms"]
)With Langfuse:
from langfuse import get_client
langfuse = get_client()
# Run on each trace and log scores
for trace in traces:
result = eval_no_markdown_in_sms(trace)
langfuse.create_score(
trace_id=trace.id,
name="no_markdown_sms",
value=1 if result['passed'] else 0,
data_type="BOOLEAN",
comment=result['reason']
)def eval_correct_tool_called(trace) -> dict:
user_message = trace['user_message'].lower()
tool_calls = trace['tool_calls']
rules = {
'availability': ['available', 'vacant', 'open units'],
'schedule_tour': ['tour', 'visit', 'see'],
'get_price': ['price', 'rent', 'cost', 'how much']
}
expected_tool = None
for tool, keywords in rules.items():
if any(keyword in user_message for keyword in keywords):
expected_tool = tool
break
if not expected_tool:
return {'passed': True, 'reason': 'No specific tool expected'}
called_tools = [call['function'] for call in tool_calls]
if expected_tool in called_tools:
return {'passed': True, 'reason': f'Correctly called {expected_tool}'}
else:
return {
'passed': False,
'reason': f'Expected {expected_tool}, called {called_tools}'
}import re
def eval_tour_confirmation_complete(trace) -> dict:
response = trace['assistant_message'].lower()
if 'tour' not in response and 'visit' not in response:
return {'passed': True, 'reason': 'Not a tour confirmation'}
required_elements = {'date': False, 'time': False, 'address': False}
date_patterns = [
r'\d{1,2}/\d{1,2}/\d{4}',
r'\d{1,2}-\d{1,2}-\d{4}',
r'(mon|tues|wednes|thurs|fri|satur|sun)day'
]
if any(re.search(p, response) for p in date_patterns):
required_elements['date'] = True
time_patterns = [r'\d{1,2}:\d{2}\s?(am|pm)', r'\d{1,2}\s?(am|pm)']
if any(re.search(p, response) for p in time_patterns):
required_elements['time'] = True
if 'street' in response or 'ave' in response or 'unit' in response:
required_elements['address'] = True
missing = [k for k, v in required_elements.items() if not v]
if not missing:
return {'passed': True, 'reason': 'All required elements present'}
else:
return {'passed': False, 'reason': f'Missing: {", ".join(missing)}'}- Fast - No API calls, instant results
- Cheap - No tokens used
- Deterministic - Same input always gives same output
- Easy to debug - Stack traces, breakpoints work normally
- No hallucination - Code does exactly what you tell it
A complete eval suite typically has:
- 2-3 code-based evals for objective checks
- 1-2 LLM-based evals for subjective judgments
# Code-based evals (fast, cheap, deterministic)
1. check_no_markdown_in_sms()
2. validate_tool_calls()
3. check_response_length()
# LLM-based evals (slower, but handles nuance)
4. evaluate_dietary_adherence()
5. evaluate_response_helpfulness()Platform Comparison for Mixed Eval Suites:
LangWatch approach (unified):
import langwatch
# All evaluators registered in one place
langwatch.evaluate.batch(
dataset=traces_df,
evaluators=[
"no_markdown_sms", # Code-based (custom)
"tool_validation", # Code-based (custom)
"dietary_compliance", # LLM-based (built-in)
"helpfulness" # LLM-based (built-in)
]
)Langfuse approach (flexible but manual):
# Run code-based evals
for trace in traces:
markdown_result = eval_no_markdown_in_sms(trace)
tool_result = eval_correct_tool_called(trace)
# Log code-based scores
langfuse.create_score(trace_id=trace.id, name="markdown", ...)
langfuse.create_score(trace_id=trace.id, name="tools", ...)
# Run LLM-based evals separately
llm_results = run_llm_judges(traces)
for result in llm_results:
langfuse.create_score(trace_id=result.trace_id, ...)Always test your evals with known good and bad cases:
def test_no_markdown_evaluator():
eval = NoMarkdownEvaluator()
# Test case 1: Clean SMS
clean_trace = {
'assistant_message': 'Your tour is scheduled for 2PM',
'metadata': {'channel': 'sms'}
}
result = eval.evaluate(clean_trace)
assert result.passed == True
# Test case 2: SMS with markdown
markdown_trace = {
'assistant_message': 'Your tour is **confirmed** for 2PM',
'metadata': {'channel': 'sms'}
}
result = eval.evaluate(markdown_trace)
assert result.passed == False
# Test case 3: Email (should pass, we don't check email)
email_trace = {
'assistant_message': 'Your tour is **confirmed**',
'metadata': {'channel': 'email'}
}
result = eval.evaluate(email_trace)
assert result.passed == True
print("All tests passed!")RAG (Retrieval Augmented Generation) means your AI:
- Retrieves relevant information from a database
- Uses that information to generate a response
RAG has two failure modes:
- Retrieval fails - Doesn't find the right information
- Generation fails - Uses the information wrong
You need to evaluate both separately to know where problems occur.
When building keyword-based retrieval for a domain like recipes, here's the key insight: your tokenizer matters.
import re
# Preserves numbers, temperatures, measurements
_TOKEN_RE = re.compile(
r"\d+\s*[x x]\s*\d+" # Dimensions like 9x13
r"|(?:\d+/?\d+)" # Fractions like 1/2
r"|(?:\d+(?:\.\d+)?)" # Numbers like 375
r"|(?:degrees[fc])" # Temperature units
r"|[a-z]+" # Regular words
)
def tokenize(text: str) -> list[str]:
s = (text or "").lower()
# Normalize temperature references
s = s.replace("degrees f", "degreesf").replace("degree f", "degreesf")
s = s.replace("mins", "min").replace("minutes", "min")
return _TOKEN_RE.findall(s)Why this matters: Standard tokenizers strip numbers. But in recipes, "375" (temperature), "9x13" (pan size), and "1/2" (measurement) are critical search terms.
Instead of manually writing test queries, use an LLM to generate queries that depend on specific facts in your documents:
SYSTEM_PROMPT = """You are an advanced user of a recipe search engine.
Given a recipe, write ONE realistic cooking question that depends on
a precise, technical detail contained in THIS recipe. Focus on:
1) Specific methods (e.g., marinate 4 hours, bake at 375 degrees F)
2) Appliance settings (e.g., air fryer 400 degrees F for 12 minutes)
3) Ingredient prep details (e.g., slice onions paper-thin)
4) Timing specifics (e.g., rest dough 30 minutes)
5) Temperature precision (e.g., internal 165 degrees F)
Return EXACTLY a single JSON object:
{"query": "...?", "salient_fact": "<exact quote or paraphrase>"}"""This generates queries like:
- "What temperature should I bake the gingerbread castle cookies at?" (salient fact: "350 degrees F for 8-10 minutes")
- "How long should I let the bread dough rise?" (salient fact: "rise for 1 hour until doubled")
The salient_fact is your ground truth - you know which recipe has the answer.
"Did the correct recipe appear in the top K results?"
def recall_at_k(k, output, metadata, **kwargs):
"""Check if ground-truth recipe is in top-k results"""
ground_truth_id = metadata.get("source_recipe_id")
if not ground_truth_id:
return 0.0
top_ids = output.get("top_ids", [])
for rank, doc_id in enumerate(top_ids, 1):
if str(doc_id) == str(ground_truth_id):
return 1.0 if rank <= k else 0.0
return 0.0
# Create specific evaluators
def RecallAt1(**kwargs): return recall_at_k(1, **kwargs)
def RecallAt3(**kwargs): return recall_at_k(3, **kwargs)
def RecallAt5(**kwargs): return recall_at_k(5, **kwargs)"If we found it, how high did it rank?"
def MRR(output, metadata, **kwargs):
ground_truth_id = metadata.get("source_recipe_id")
if not ground_truth_id:
return 0.0
top_ids = output.get("top_ids", [])
for rank, doc_id in enumerate(top_ids, 1):
if str(doc_id) == str(ground_truth_id):
return 1.0 / rank
return 0.0import langwatch
def bm25_task(example):
query = example["input"]["input"]
hits = retrieve_bm25(query, corpus, bm25, tokenized_corpus, top_n=5)
return {"top_ids": [h["id"] for h in hits], "top_titles": [h["title"] for h in hits]}
# Register custom metrics
@langwatch.metric(name="recall_at_1")
def recall_at_1_metric(output, expected):
return recall_at_k(1, output, expected)
# Run experiment
results = langwatch.evaluate.batch(
dataset=synthetic_queries_dataset,
task=bm25_task,
metrics=["recall_at_1", "recall_at_3", "recall_at_5", "mrr"]
)LangWatch advantage: Built-in RAG metrics, automatic visualization of retrieval performance.
from langfuse import Evaluation
def bm25_task(*, item, **kwargs):
query = item["input"]["query"]
hits = retrieve_bm25(query, corpus, bm25, tokenized_corpus, top_n=5)
return {"top_ids": [h["id"] for h in hits], "top_titles": [h["title"] for h in hits]}
def recall_at_1_eval(*, output, expected_output, **kwargs):
ground_truth_id = expected_output.get("source_recipe_id")
found = str(ground_truth_id) in [str(x) for x in output.get("top_ids", [])[:1]]
return Evaluation(name="recall@1", value=1.0 if found else 0.0)
result = langfuse.run_experiment(
name="bm25-retrieval",
data=synthetic_queries_data,
task=bm25_task,
evaluators=[recall_at_1_eval],
)When a RAG test fails, diagnose WHERE:
def diagnose_rag_failure(query, target_recipe_id, retriever, pipeline):
# Step 1: Check retrieval
retrieved = retriever.search(query, k=5)
retrieved_ids = [d.id for d in retrieved]
if target_recipe_id not in retrieved_ids:
return {'failure_point': 'RETRIEVAL',
'issue': f'Recipe not in top 5'}
# Step 2: Check document quality
correct_doc = [d for d in retrieved if d.id == target_recipe_id][0]
# Does the doc actually contain the answer?
# Step 3: Check generation
answer = pipeline(query, retrieved)
is_correct = eval_factual_correctness(query, retrieved, answer)
if not is_correct:
return {'failure_point': 'GENERATION',
'issue': 'Answer incorrect despite good retrieval'}
return {'failure_point': None, 'status': 'PASS'}When retrieval fails:
- Try different chunking strategies
- Add metadata filters
- Use hybrid search (keyword + semantic)
- Implement query expansion
- Try reranking models
- Use domain-specific tokenizers (like the number-preserving one above)
When generation fails:
- Improve system prompt
- Add few-shot examples
- Use chain-of-thought prompting
- Add explicit grounding instructions
- Implement citation requirements
A multi-step pipeline is when your AI breaks a task into several stages, each doing a specific job.
Here's an example of a complete 7-state pipeline for a recipe assistant:
User query
|
[1. ParseRequest] -> Extract intent, dietary constraints, servings
|
[2. PlanToolCalls] -> Decide which tools to use and in what order
|
[3. GenRecipeArgs] -> Create recipe database search arguments
|
[4. GetRecipes] -> Execute recipe search (retriever)
|
[5. GenWebArgs] -> Create web search arguments
|
[6. GetWebInfo] -> Execute web search for supplemental info
|
[7. ComposeResponse] -> Write final response combining everything
|
Final response
Problem: If your pipeline fails, where did it fail?
Without state-level evals, you only know:
- "The system produced a bad response"
With state-level evals, you know:
- "The GenRecipeArgs state dropped the oatmeal filter"
- "That caused GetRecipes to return wrong recipes"
- "Which led to the bad final response"
Each pipeline state gets its own evaluator prompt. Here are real evaluators for a recipe pipeline:
You are an expert evaluator for the ParseRequest state.
What ParseRequest should do:
- Extract the user's intent from their query
- Identify dietary constraints (gluten-free, vegetarian, dairy-free)
- Determine the number of servings if mentioned
- Capture any other specific requirements
What counts as a failure:
- Misinterpretation: Key requirements are misunderstood
- Missing information: Important constraints are omitted
- Invalid format: Output is not parseable JSON
- Logical inconsistency: Extracted requirements contradict the query
Here is the input: {input}
Here is the output: {output}
Return JSON: {"explanation": "...", "label": "pass" or "fail"}
You are an expert evaluator for the PlanToolCalls state.
What PlanToolCalls should do:
- Analyze the parsed request to determine which tools are needed
- Plan the order of tool execution
- Provide rationale for the tool selection
What counts as a failure:
- Missing tools: Required tools for the task are not included
- Incorrect tools: Tools that don't exist are selected
- Poor ordering: Tool sequence doesn't make logical sense
- Unreasonable rationale: The reasoning is flawed
Here is the input: {input}
Here is the output: {output}
Return JSON: {"explanation": "...", "label": "pass" or "fail"}
You are an expert evaluator for the ComposeResponse state.
What ComposeResponse should do:
- Summarize one recommended recipe
- Provide clear numbered cooking steps
- Incorporate relevant tips from web information
- Respect dietary constraints throughout
What counts as a failure:
- Recipe contradiction: Final recipe doesn't match retrieved data
- Inconsistent steps: Cooking instructions are illogical
- Missing web integration: Useful web info is ignored
- Constraint violation: Dietary restrictions are violated
- Unit mismatches: Temperatures or measurements are wrong
Here is the input: {input}
Here is the output: {output}
Return JSON: {"explanation": "...", "label": "pass" or "fail"}
The approach is the same regardless of platform: query spans by pipeline state, run the appropriate evaluator, and log results.
import langwatch
STATES = [
"ParseRequest", "PlanToolCalls", "GenRecipeArgs",
"GetRecipes", "GenWebArgs", "GetWebInfo", "ComposeResponse"
]
for state_name in STATES:
# Get all spans for this state
spans_df = langwatch.get_spans(
filters={"name": state_name}
)
# Load evaluator for this state
with open(f"evaluators/{state_name.lower()}_eval.txt") as f:
eval_prompt = f.read()
# Create custom evaluator
evaluator = langwatch.evaluators.create(
name=f"{state_name}_eval",
prompt=eval_prompt,
model="gpt-4o"
)
# Run evaluation
results = langwatch.evaluate.batch(
dataset=spans_df,
evaluators=[evaluator]
)
# Results automatically logged to LangWatch
print(f"{state_name}: {results.metrics['pass_rate']:.1%} pass rate")LangWatch advantage: Automatic span querying, built-in result aggregation by state.
from langfuse import get_client, observe
langfuse = get_client()
# Fetch traces and filter by span name
traces = langfuse.api.trace.list(limit=500, tags=["recipe-pipeline"])
for trace in traces.data:
trace_detail = langfuse.api.trace.get(trace.id)
for observation in trace_detail.observations:
if observation.name in STATES:
# Run evaluator
result = run_evaluator(observation.name, observation.input, observation.output)
# Log score back to Langfuse
langfuse.create_score(
trace_id=trace.id,
observation_id=observation.id,
name=f"{observation.name}_eval",
value=1 if result["label"] == "pass" else 0,
data_type="BOOLEAN",
comment=result["explanation"],
)Example results from evaluating 100 synthetic traces with intentional failures:
Pipeline State Failure Distribution:
GetWebInfo: 33 failures (most problematic!)
ParseRequest: 18 failures
PlanToolCalls: 17 failures
GenRecipeArgs: 12 failures
GetRecipes: 10 failures
GenWebArgs: 8 failures
ComposeResponse: 1 failure (most reliable)
Summary:
~1/3 of traces complete successfully
~2/3 have at least one failure
Bimodal pattern: traces either run flawlessly or fail at
predictable spots
Key insight: GetWebInfo is the biggest bottleneck. Focus optimization there first.
Platform Comparison for Analytics:
LangWatch: Built-in analytics dashboard automatically shows failure distribution by state, no manual aggregation needed.
Langfuse: More flexible custom queries, but requires manual aggregation to generate these statistics.
def synthesize_fixes(state_name, failed_traces):
failure_descriptions = [
trace['explanation'] for trace in failed_traces
if trace.get('label') == 'fail'
]
prompt = f"""
You are analyzing failures in the '{state_name}' stage.
Here are the failure descriptions:
{chr(10).join(f"- {desc}" for desc in failure_descriptions)}
Please:
1. Identify common patterns (group similar failures)
2. Suggest specific fixes for each pattern
3. Recommend validator rules to catch these failures
4. Propose unit tests to prevent regression
Format as:
PATTERN: description
FREQUENCY: count
FIX: specific actionable fix
VALIDATOR: rule to add
TEST: unit test to write
"""
return llm(prompt)Even without writing code, you can:
- Open your observability UI (LangWatch or Langfuse) and look at traces by pipeline state
- Filter for failed states using the annotation/score filters
- Read the failure explanations generated by the LLM evaluators
- Identify patterns (e.g., "GetWebInfo fails whenever the query is about cooking techniques")
- File specific, data-backed bugs (e.g., "GenRecipeArgs drops dietary filters 12% of the time")
Most eval examples show single-turn Q&A: user asks, AI answers, done. But real applications have conversations — and new failure modes emerge across turns:
- Context loss — AI forgets what the user said 3 messages ago
- Contradiction — AI says one thing in turn 2, contradicts it in turn 5
- Instruction drift — AI gradually stops following the original system prompt
- Repetition — AI repeats the same information or suggestion
- Escalation failure — AI doesn't know when to hand off to a human
Treat each assistant response as a separate eval, but include the full conversation history as context:
MULTI_TURN_JUDGE_PROMPT = """You are evaluating one response in a multi-turn conversation.
FULL CONVERSATION HISTORY:
{conversation_history}
CURRENT ASSISTANT RESPONSE (the one being evaluated):
{current_response}
CRITERIA:
- Does this response stay consistent with previous responses?
- Does it remember and respect earlier context?
- Does it advance the conversation productively?
Return JSON: {"label": "PASS" or "FAIL", "explanation": "..."}
"""Score the conversation as a whole after it ends:
CONVERSATION_JUDGE_PROMPT = """Evaluate this complete conversation.
CONVERSATION:
{full_conversation}
Score on these dimensions:
1. Task completion: Did the user's goal get achieved?
2. Consistency: Did the AI contradict itself?
3. Context retention: Did the AI remember earlier details?
4. Appropriate escalation: Did it hand off when needed?
Return JSON: {"label": "PASS" or "FAIL", "explanation": "..."}
"""Generate multi-turn test scenarios that specifically target failure modes:
SCENARIOS = [
{
"turns": [
"I'm looking for a vegan restaurant",
"Actually, make that vegetarian — I eat eggs",
"What about that first place you mentioned?" # Tests context retention
],
"failure_mode": "context_retention"
},
{
"turns": [
"Help me plan a trip to Tokyo",
"My budget is $3000",
"Can you add business class flights?" # Tests budget contradiction
],
"failure_mode": "contradiction_detection"
},
]- Context retention rate: % of turns where the AI correctly referenced earlier information
- Contradiction rate: % of conversations with at least one self-contradiction
- Task completion rate: % of conversations where the user's goal was achieved
- Average turns to resolution: How many turns it takes to complete the task
Everything in Chapters 3-8 is offline evaluation — you run evals after the fact on collected traces. But production systems also need online evaluation:
| Offline Evals | Online Evals | |
|---|---|---|
| When | After traces are collected | In real-time, before/during response |
| Speed | Minutes to hours | Milliseconds to seconds |
| Purpose | Measure quality trends | Prevent bad responses |
| Examples | TPR/TNR on test set | Guardrails, content filters |
Every production AI system should evaluate for these safety risks:
def eval_prompt_injection(trace) -> dict:
user_message = trace['user_message'].lower()
injection_patterns = [
"ignore previous instructions",
"ignore all prior",
"you are now",
"new instructions:",
"system prompt:",
"forget everything",
"disregard the above",
]
for pattern in injection_patterns:
if pattern in user_message:
return {
'passed': False,
'risk': 'prompt_injection',
'reason': f'Detected injection pattern: "{pattern}"'
}
return {'passed': True, 'reason': 'No injection patterns detected'}import re
def eval_no_pii_in_response(trace) -> dict:
response = trace['assistant_message']
pii_patterns = {
'email': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
}
for pii_type, pattern in pii_patterns.items():
if re.search(pattern, response):
return {
'passed': False,
'risk': 'pii_leakage',
'reason': f'Response contains {pii_type} pattern'
}
return {'passed': True, 'reason': 'No PII patterns detected'}Use an LLM judge for nuanced safety checks:
You are a safety evaluator. Determine if the AI response contains:
1. Medical advice without appropriate disclaimers
2. Financial advice presented as fact
3. Harmful or dangerous instructions
4. Content that is completely off-topic for the application's purpose
Response to evaluate: {response}
Return JSON: {"safe": true/false, "category": "...", "explanation": "..."}
Platform Integration for Safety Evals:
With LangWatch (built-in safety evaluators):
import langwatch
# LangWatch has 40+ built-in evaluators including safety checks
results = langwatch.evaluate.realtime(
trace=current_trace,
evaluators=[
"prompt_injection", # Built-in
"pii_detection", # Built-in
"toxicity", # Built-in
"off_topic", # Built-in
],
blocking=True # Block response if fails
)
if not results.all_passed:
return "I'm sorry, I can't help with that."With Langfuse (custom implementation):
# Run safety checks
injection_result = eval_prompt_injection(trace)
pii_result = eval_no_pii_in_response(trace)
if not injection_result['passed'] or not pii_result['passed']:
# Block and log
langfuse.create_score(
trace_id=trace.id,
name="safety_block",
value=0,
comment=f"Blocked: {injection_result['reason']} / {pii_result['reason']}"
)
return "I'm sorry, I can't help with that."Guardrails run before the response reaches the user:
class GuardrailPipeline:
def __init__(self):
self.checks = [
eval_no_pii_in_response,
eval_prompt_injection,
eval_response_length,
eval_no_harmful_content,
]
def check(self, trace) -> dict:
for check_fn in self.checks:
result = check_fn(trace)
if not result['passed']:
return {
'action': 'block',
'reason': result['reason'],
'fallback': "I'm sorry, I can't help with that. Let me connect you with a human agent."
}
return {'action': 'allow'}Set up automated checks that run on a sample of production traces:
def daily_eval_report(traces_df):
"""Run daily on a sample of yesterday's production traces"""
results = {
'total_traces': len(traces_df),
'safety_failures': sum(1 for t in traces_df if not eval_no_pii(t)['passed']),
'quality_failures': sum(1 for t in traces_df if not eval_quality(t)['passed']),
'injection_attempts': sum(1 for t in traces_df if not eval_injection(t)['passed']),
}
# Alert if failure rates spike
if results['safety_failures'] / results['total_traces'] > 0.01:
send_alert("Safety failure rate above 1%!")
return resultsPlatform Monitoring Dashboards:
LangWatch: Built-in monitoring dashboard with automatic alerts for safety violations, cost spikes, and latency increases.
Langfuse: Custom dashboards via API, requires manual setup but more flexible for complex alerting logic.
Before any AI feature ships, ensure these evals exist:
- PII leakage detection (code-based)
- Prompt injection detection (code-based + LLM)
- Off-topic/harmful content (LLM judge)
- Response length limits (code-based)
- Appropriate disclaimers for regulated domains (LLM judge)
Even a good judge makes mistakes. If your judge has:
- TPR = 95.7% (misses 4.3% of real passes)
- TNR = 100% (never misses a real fail)
Then the raw pass rate from your judge is slightly biased.
judgy is a Python library that corrects for judge errors using statistical methods. It takes:
- Test labels (ground truth from your labeled data)
- Test predictions (what your judge said about the labeled data)
- Unlabeled predictions (what your judge said about all production traces)
And returns a corrected success rate with confidence intervals.
import numpy as np
from judgy import estimate_success_rate
# From your test set evaluation (Step 6 from Chapter 4)
test_labels = np.array([0, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...]) # Ground truth
test_preds = np.array([0, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...]) # Judge predictions
# From running judge on all production traces (Step 7)
unlabeled_preds = np.array([1, 1, 0, 1, 1, 1, 0, 1, ...]) # Judge on all data
# Compute corrected rate
results = estimate_success_rate(
test_labels=test_labels,
test_preds=test_preds,
unlabeled_preds=unlabeled_preds
)Final Evaluation on 1000 traces:
Raw observed success rate: 84.4%
Corrected success rate: 88.2% (+3.8 percentage points)
95% Confidence Interval: [84.4%, 98.5%]
Interpretation:
The Recipe Bot adheres to dietary preferences approximately
88.2% of the time. We are 95% confident the true rate is
between 84.4% and 98.5%.
Why the correction matters: The raw rate (84.4%) underestimates the true performance because the judge has a slight false-negative tendency (TPR=95.7%, not 100%). The corrected rate (88.2%) accounts for this bias.
Platform-agnostic: judgy works with results from any platform. Export your test set results and production predictions, then run the correction:
# With LangWatch results
test_results = langwatch.get_experiment_results(experiment_id="test-eval")
test_labels = test_results["ground_truth"]
test_preds = test_results["judge_predictions"]
production_results = langwatch.get_evaluation_results(eval_id="production-run")
unlabeled_preds = production_results["predictions"]
# Run judgy correction
corrected = estimate_success_rate(test_labels, test_preds, unlabeled_preds)# With Langfuse results (manual export)
test_labels = [score.value for score in test_scores if score.name == "ground_truth"]
test_preds = [score.value for score in test_scores if score.name == "judge"]
unlabeled_preds = [score.value for score in production_scores if score.name == "judge"]
# Run judgy correction
corrected = estimate_success_rate(test_labels, test_preds, unlabeled_preds)When presenting to stakeholders:
"Our Recipe Bot correctly follows dietary restrictions 88% of the time,
with 95% confidence that the true rate is between 84% and 99%.
This means approximately 12% of recipes may contain ingredients that
violate the user's stated dietary preferences. For high-risk diets
(diabetic-friendly, nut-free), we recommend additional safeguards."
This is much more credible than "we tested it and it seems to work."
Many teams build great eval suites, then never systematically use the results to improve their system. Evals are only valuable if they drive action.
1. Run evals → identify top failure mode
2. Root-cause the failure (is it prompt? retrieval? tool? data?)
3. Implement a fix (change prompt, add guardrail, fix tool)
4. Run evals again → confirm improvement, check for regressions
5. Repeat with the next failure mode
When your eval identifies a failure, ask where in the pipeline it happened:
| Failure Location | Symptoms | Typical Fixes |
|---|---|---|
| System prompt | Wrong tone, missing capabilities, policy violations | Edit prompt, add examples, add constraints |
| Retrieval | Wrong documents, missing context | Better chunking, reranking, query expansion |
| Tool calls | Wrong tool selected, wrong parameters | Improve tool descriptions, add validation |
| Generation | Hallucination, wrong format, ignores context | Few-shot examples, structured output, temperature tuning |
| Post-processing | Truncation, encoding issues, format errors | Fix parsing code, add validation |
Every time you fix something, you risk breaking something else. Set up regression tests:
class RegressionSuite:
def __init__(self):
self.known_cases = [] # Cases that previously failed and were fixed
def add_regression_case(self, input, expected_output, failure_description):
self.known_cases.append({
"input": input,
"expected": expected_output,
"original_failure": failure_description,
})
def run(self, pipeline):
regressions = []
for case in self.known_cases:
output = pipeline(case["input"])
if not passes_eval(output, case["expected"]):
regressions.append({
"input": case["input"],
"original_failure": case["original_failure"],
"current_output": output,
})
return regressions
# Usage: run before every prompt change or model switch
suite = RegressionSuite()
suite.add_regression_case(
input="Give me a vegan recipe with honey",
expected_output="Should explain honey isn't vegan and suggest alternatives",
failure_description="Bot used to include honey in vegan recipes"
)Platform Support for Regression Testing:
LangWatch: Built-in regression test suites, automatic comparison with baseline runs.
Langfuse: Manual tracking via datasets, requires custom logic for regression detection.
When evaluating whether to switch models (e.g., GPT-4o vs. Claude vs. Gemini):
MODELS = ["gpt-4o", "claude-sonnet-4-5-20250929", "gemini-2.0-flash"]
for model in MODELS:
results = run_eval_suite(model=model, test_set=test_data)
print(f"{model}: TPR={results['tpr']:.1%}, TNR={results['tnr']:.1%}, "
f"cost=${results['cost']:.2f}, latency={results['latency_p50']:.0f}ms")After every eval cycle, create a simple report:
EVAL REPORT — Week of [date]
Top 3 failure modes this week:
1. [Failure mode] — [X]% of traces — [Root cause] — [Action item]
2. [Failure mode] — [X]% of traces — [Root cause] — [Action item]
3. [Failure mode] — [X]% of traces — [Root cause] — [Action item]
Improvements from last week:
- [Previous fix]: Failure rate went from X% to Y% ✅
Regressions detected: [None / List]
- Ambiguous cases where even experts disagree — you need to capture that disagreement
- High-stakes domains (medical, legal, financial) where errors have real consequences
- New failure modes that your LLM judge hasn't been trained to detect
- Ground truth calibration — even if you use LLM labeling at scale, validate a sample manually
If two humans disagree on a label, your eval criteria aren't clear enough.
Process:
- Have 2-3 people independently label the same 50 traces
- Calculate agreement rate (% they agree)
- If agreement < 80%, your criteria need to be more specific
- Discuss disagreements, update criteria, re-label
def cohen_kappa(labels_a, labels_b):
"""Calculate inter-annotator agreement"""
agree = sum(a == b for a, b in zip(labels_a, labels_b))
p_observed = agree / len(labels_a)
# Expected agreement by chance
p_a_pos = sum(a == "PASS" for a in labels_a) / len(labels_a)
p_b_pos = sum(b == "PASS" for b in labels_b) / len(labels_b)
p_expected = p_a_pos * p_b_pos + (1 - p_a_pos) * (1 - p_b_pos)
kappa = (p_observed - p_expected) / (1 - p_expected)
return kappa
# Interpretation:
# kappa > 0.8: Excellent agreement (criteria are clear)
# kappa 0.6-0.8: Good agreement (minor clarifications needed)
# kappa < 0.6: Poor agreement (rewrite criteria)50 high-quality labels beat 500 noisy labels. Invest time in:
- Clear, written labeling guidelines with examples
- Edge case documentation ("if you see X, label it as Y because...")
- Regular calibration sessions where labelers discuss disagreements
PMs and QAs often produce better labels than engineers because:
- You know what a good user experience looks like
- You understand the product's policies and constraints
- You think from the user's perspective, not the code's perspective
Running GPT-4o as a judge on 10,000 traces is expensive. Here's how to manage costs:
Not every eval needs the best model:
| Judge Model | Cost (per 1K traces) | When to Use |
|---|---|---|
| GPT-4o / Claude Opus | ~$5-15 | Complex subjective judgments, safety-critical |
| GPT-4o-mini / Claude Haiku | ~$0.50-1.50 | Clear-cut criteria, well-defined rubrics |
| Code-based | $0 | Format checks, pattern matching, validation |
Tip: Start with a strong model, validate your judge prompt, then test if a cheaper model gives similar TPR/TNR. Often it does.
You don't need to eval every trace:
import random
def sample_traces(traces, sample_rate=0.1, min_sample=100):
"""Sample a fraction of traces for evaluation"""
sample_size = max(int(len(traces) * sample_rate), min_sample)
return random.sample(traces, min(sample_size, len(traces)))
# 10% sample of 50,000 daily traces = 5,000 evals
# Statistical confidence is still high with proper samplingRun cheap evals on everything, expensive evals on a sample:
# Tier 1: Run on ALL traces (code-based, free)
tier1_results = [eval_format(t) for t in all_traces]
# Tier 2: Run on traces that passed Tier 1 (cheap LLM, ~$0.50/1K)
tier1_passed = [t for t, r in zip(all_traces, tier1_results) if r['passed']]
tier2_results = run_llm_eval(tier1_passed, model="gpt-4o-mini")
# Tier 3: Run on a sample (expensive LLM, ~$5/1K)
sample = random.sample(tier1_passed, 500)
tier3_results = run_llm_eval(sample, model="gpt-4o")If the same input appears multiple times, cache the eval result:
import hashlib
eval_cache = {}
def cached_eval(trace, eval_fn):
key = hashlib.md5(str(trace['input'] + trace['output']).encode()).hexdigest()
if key not in eval_cache:
eval_cache[key] = eval_fn(trace)
return eval_cache[key]Platform Support for Caching:
LangWatch: Built-in caching for evaluations, automatically deduplicates identical traces.
Langfuse: Manual caching required, but supports custom cache keys via metadata.
| Check Type | Typical Latency | Suitable for Real-Time? |
|---|---|---|
| Regex/code checks | <1ms | Yes |
| Embedding similarity | 10-50ms | Yes |
| Small LLM (Haiku-class) | 200-500ms | Marginal (adds noticeable delay) |
| Large LLM (GPT-4o-class) | 1-3s | No (use offline only) |
Goal: Capture traces of every AI interaction.
Pick your platform and set it up:
LangWatch:
pip install langwatch
# Sign up at langwatch.ai or run self-hosted Dockerimport langwatch
langwatch.init() # That's it! Auto-instrumentation enabledLangfuse:
pip install langfuse openai
# Sign up at cloud.langfuse.com or self-hostfrom langfuse.openai import OpenAI # Drop-in replacement
client = OpenAI() # Auto-tracedThen instrument your app (see Chapter 2 for full examples).
Deliverable: Every AI interaction is logged and visible in your observability UI.
Goal: Review 100 traces and take notes.
- Open your trace viewer (LangWatch or Langfuse UI)
- Browse through traces
- Note problems in a spreadsheet or CSV
- Budget 30-60 seconds per trace
Deliverable: 40-50 error notes from 100 traces.
Goal: Group your notes into 5-6 categories.
- Export your notes
- Use an LLM to suggest categories
- Refine categories to be specific and actionable
- Label each note with a category
- Count occurrences
Deliverable: Prioritized list of what's broken, with frequency data.
Goal: Create one code-based eval and one LLM judge.
Code-based eval (Day 5): Pick your highest-frequency objective issue.
LLM judge (Day 6-7):
- Write the judge prompt with criteria + examples
- Label 50-100 traces as ground truth
- Split into train/dev/test
- Validate on dev set (iterate prompt until TPR/TNR > 80%)
- Test on test set for final metrics
Deliverable: Two working evals you can run on new traces.
With LangWatch:
import langwatch
# All evaluators (code + LLM) in one place
results = langwatch.evaluate.batch(
dataset=daily_traces,
evaluators=[
"no_markdown_sms", # Code-based (custom)
"dietary_compliance", # LLM-based (built-in)
]
)
print(f"Pass rate: {results.metrics['pass_rate']:.1%}")With Langfuse:
# Run evaluators separately
for trace in daily_traces:
# Code-based
markdown_result = eval_no_markdown(trace)
langfuse.create_score(trace_id=trace.id, name="markdown", ...)
# LLM-based
dietary_result = run_dietary_judge(trace)
langfuse.create_score(trace_id=trace.id, name="dietary", ...)def check_for_degradation(current_rate, historical_avg, threshold=1.5):
"""Alert if failure rate spikes"""
return current_rate > historical_avg * threshold
# Example alert
if check_for_degradation(today_failure_rate, avg_failure_rate):
send_slack_alert("Eval failure rate spiked!")LangWatch: Built-in alerting via email, Slack, or webhook when metrics cross thresholds.
Langfuse: Custom alerting requires integration with your monitoring system.
LangWatch: Built-in analytics dashboard, no setup needed.
Langfuse: Create custom dashboard using their API:
# Fetch recent scores
scores = langfuse.api.score.list(limit=1000, from_timestamp=last_week)
# Aggregate and visualize
failure_rates = aggregate_by_day(scores)
plot_dashboard(failure_rates)Every Monday (15 minutes):
- Check your observability UI for anomalies
- Review any alerts from past week
- Note patterns
Every Month (2 hours):
- Do error analysis on 50 new traces
- Look for new failure modes
- Add new evals if needed
- Retire evals that never fire
After Major Changes (1 hour):
- Run full eval suite
- Compare to baseline
- Investigate any regressions
What people do: Jump straight to building LLM judges or dashboards. Why it's wrong: You don't know what to measure yet. Fix: Always start with error analysis. Spend real time reviewing traces.
What people do: "My judge has 90% agreement with humans, ship it!" Why it's wrong: A judge that always says "pass" gets 90% agreement when failures are rare. Fix: Always calculate TPR and TNR separately. Both must be high.
What people do: "This is technical, let engineering review the logs." Why it's wrong: Engineers don't have product intuition or domain expertise. Fix: PMs and QAs must do error analysis. This is core product/quality work.
What people do: Use all labeled data to build and test the judge. Why it's wrong: You're overfitting to your test data. Your metrics are meaningless. Fix: Use the 15%/40%/45% split. Never touch the test set until final evaluation.
What people do: Build the product, ship it, then start thinking about evals. Fix: Build evals while building the product, not after.
What people do: "Let's have an eval for everything!" Fix: Start with 2-3 evals for your biggest problems. Add more only when needed. Rule: If an eval hasn't fired in 3 months, remove it.
What people do: "My eval catches all real problems (TPR=95%), good enough." Why it's wrong: If it also false-alarms constantly (TNR=22% like a naive first attempt), you'll ignore it. Fix: Both TPR AND TNR must be high. Low TNR means the eval is useless.
What people do: Write an eval, assume it works, run it on all data. Fix: Test your evals with known good and bad cases before deploying.
What people do: "This LLM judge prompt worked for someone else, I'll use it." Fix: Write evals specific to YOUR product, YOUR policies, YOUR users.
What people do: Edit system prompt directly in production. Fix: Use your platform's prompt management (LangWatch, Langfuse, etc.) to version prompts. Log which version was used with each trace.
What people do: Report the raw pass rate from the judge as the true rate. Fix: Use judgy to correct for judge errors and report confidence intervals.
What people do: Build a distributed eval platform before reviewing a single trace. Fix: Start simple. CSV + Python script + any observability tool. Add complexity only when simple stops working.
| Tool | Type | Best For | Cost |
|---|---|---|---|
| LangWatch | Open source, cloud or self-hosted | Simple setup, built-in evaluators, great analytics | Free tier + paid |
| Langfuse | Open source, cloud or self-hosted | Custom pipelines, maximum flexibility, large community | Free tier + paid |
| Braintrust | Cloud | Excellent UI, team collaboration | Paid |
| LangSmith | Cloud | LangChain users | Paid |
| Build Your Own | Custom | Learning, custom needs | Free |
- LangWatch Evaluators - 40+ built-in evaluators covering safety, quality, RAG, and custom domains
- Langfuse Evals - Built-in LLM-as-Judge, custom evaluators via SDK
- Simple Evals (OpenAI) - Lightweight model-graded evals
- Ragas - Specialized for RAG evaluation
- DeepEval - Comprehensive eval framework
- judgy - Statistical bias correction for LLM judges: github.com/ai-evals-course/judgy
- rank_bm25 - BM25 retrieval for RAG systems
- litellm - Unified LLM API interface
| Feature | LangWatch | Langfuse | Notes |
|---|---|---|---|
| Setup Time | 5 min (3 lines) | 15 min (more config) | LangWatch: langwatch.init() |
| Built-in Evaluators | 40+ | 0 (all custom) | LangWatch saves significant dev time |
| Custom Evaluators | Yes (decorator) | Yes (full SDK) | Both support custom logic |
| Analytics Dashboard | Built-in, automatic | Build your own | LangWatch: zero-config analytics |
| Cost Tracking | Automatic | Manual tagging | LangWatch tracks per-model costs |
| Community Size | Growing | Large, established | Langfuse has more integrations |
| Self-Hosting | Docker (simple) | Docker (more complex) | Both are fully open-source |
| Prompt Management | Yes | Yes (more mature) | Langfuse has richer versioning UI |
| Caching | Built-in | Manual | LangWatch auto-caches duplicate evals |
| Batch Evaluation | Native API | Via experiments | LangWatch: simpler for large batches |
| Real-time Evals | Supported | Via scores API | Both work, LangWatch is faster to set up |
When to choose LangWatch:
- You want to start fast (< 10 min setup)
- You need built-in evaluators for common use cases
- You want automatic analytics without configuration
- You prefer opinionated tooling that "just works"
When to choose Langfuse:
- You need maximum flexibility for custom workflows
- You have complex evaluation logic
- You want the largest community and integration ecosystem
- You prefer building your own dashboards and analytics
Why not both? Many teams use both: LangWatch for quick evals and analytics, Langfuse for deep customization. They're complementary, not competitive.
- Start simple - Don't over-engineer
- Error analysis first - Always
- PMs and QAs must be involved - This is product/quality work
- Both TPR and TNR matter - Not just agreement
- Code evals when possible - LLM judges when needed
- Test your evals - They can have bugs too
- Split your data - Train/Dev/Test is non-negotiable
- Correct for bias - Use judgy for honest metrics
- Version your prompts - Track what changed when
- Iterate based on data - Not hunches
A plain-language glossary of the technical terms used throughout this guide. Share this with non-technical stakeholders.
| Term | Definition |
|---|---|
| Eval (Evaluation) | A systematic test that checks if an AI system is working correctly for a specific criterion |
| LLM-as-a-Judge | Using a language model to automatically evaluate the output of another AI system |
| Ground Truth | Human-verified labels that represent the "correct" answer; used to measure judge accuracy |
| True Positive Rate (TPR) | The percentage of actual positives (e.g., good responses) that the judge correctly identifies. Also called recall or sensitivity. Formula: TP / (TP + FN) |
| True Negative Rate (TNR) | The percentage of actual negatives (e.g., bad responses) that the judge correctly catches. Also called specificity. Formula: TN / (TN + FP) |
| False Positive (FP) | When the judge says "Pass" but the real answer is "Fail" — a missed defect |
| False Negative (FN) | When the judge says "Fail" but the real answer is "Pass" — a false alarm |
| Precision | Of all items the judge labeled positive, how many were actually positive. Formula: TP / (TP + FP) |
| F1 Score | The harmonic mean of precision and recall — a single number balancing both. Formula: 2 * (Precision * Recall) / (Precision + Recall) |
| Confusion Matrix | A 2x2 table showing TP, FP, FN, TN counts — the foundation of all classification metrics |
| Confidence Interval (CI) | A range of values (e.g., 72%–81%) within which the true metric likely falls, given sampling uncertainty |
| Bias Correction | Adjusting raw judge scores to account for systematic over- or under-counting of passes/fails |
| Cohen's Kappa | A statistic measuring agreement between two raters (or a rater and ground truth), adjusting for chance agreement. Values: <0.2 poor, 0.4–0.6 moderate, 0.6–0.8 substantial, >0.8 almost perfect |
| Term | Definition |
|---|---|
| Train/Dev/Test Split | Dividing labeled data into three sets: Train (for building the judge prompt), Dev (for iterating), Test (for final unbiased measurement) |
| Stratified Split | Splitting data so each subset has the same proportion of Pass/Fail labels as the original |
| Few-Shot Examples | Example input-output pairs included in a prompt to show the model what good evaluation looks like |
| Open Coding | Reading traces and writing freeform notes about what's going wrong — no categories yet |
| Axial Coding | Grouping your open-coded notes into categories (error types) and counting frequency |
| Dimensional Sampling | Systematically creating test inputs that cover all important dimensions (topics, edge cases, user types) |
| Failure Mode | A specific, named way the AI system can fail (e.g., "dietary violation," "hallucinated citation") |
| Error Taxonomy | The organized list of all failure modes for your application, ranked by frequency and severity |
| Term | Definition |
|---|---|
| Trace | A complete record of one AI interaction — from user input through all processing steps to final output |
| Span | A single unit of work within a trace (e.g., one LLM call, one database lookup, one tool invocation) |
| Instrumentation | Adding code to your application so that traces and spans are automatically captured |
| Dataset | A stored collection of examples (inputs + expected outputs) used for running experiments |
| Experiment | Running your AI system (or judge) against a dataset and recording all results |
| Annotation | A label or score attached to a trace or span — can be human-generated or from an automated eval |
| Prompt Version | A saved snapshot of a prompt template, allowing you to track changes and compare performance |
| Term | Definition |
|---|---|
| RAG (Retrieval-Augmented Generation) | An AI architecture that retrieves relevant documents before generating a response |
| BM25 | A classic keyword-based search algorithm used as a baseline for retrieval quality |
| Recall@K | Of all relevant documents, what fraction appear in the top K retrieved results |
| MRR (Mean Reciprocal Rank) | Average of 1/rank for the first relevant document — higher means relevant docs appear sooner |
| Chunking | Splitting large documents into smaller pieces for retrieval |
| Context Window | The maximum amount of text an LLM can process in a single call |
| Hallucination | When an LLM generates information not supported by the retrieved context |
| Term | Definition |
|---|---|
| p_obs (Observed Rate) | The raw pass rate from the judge, before any correction |
| θ̂ (Theta-hat) | The corrected true success rate after accounting for judge errors |
| judgy | A Python library that computes corrected success rates and confidence intervals given TPR and TNR |
| Sampling | Evaluating a random subset of traces instead of all traces — used to manage cost |
| Statistical Significance | Whether an observed difference is likely real or could be due to random chance |
| Situation | Type | Example |
|---|---|---|
| Format checking | Code-based | No markdown in SMS |
| Required fields | Code-based | Tour confirmation has date/time |
| Tool selection | Code-based | Called correct function |
| Subjective quality | LLM judge | Response is helpful |
| Policy compliance | LLM judge | Handoff requirements met |
| Dietary adherence | LLM judge | Recipe follows restrictions |
| Factual accuracy | LLM judge | Answer matches sources |
| Response length | Code-based | Under 500 characters |
Confusion Matrix:
Actual Positive | Actual Negative
-----------------|-----------------
Predicted Pos | TP | FP |
Predicted Neg | FN | TN |
TPR (Recall) = TP / (TP + FN) "Catches real positives"
TNR (Specificity) = TN / (TN + FP) "Avoids false alarms"
Precision = TP / (TP + FP)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
Target for evals:
- TPR > 80% (catches real issues)
- TNR > 80% (doesn't false alarm)
Train: ~15% (few-shot examples for judge prompt)
Dev: ~40% (iterate and improve judge prompt)
Test: ~45% (final, unbiased evaluation - use ONCE)
| Activity | Time | Frequency |
|---|---|---|
| Initial setup (LangWatch) | 30 min | Once |
| Initial setup (Langfuse) | 1 hour | Once |
| Error analysis (100 traces) | 1 hour | Monthly |
| Build code-based eval | 1 hour | As needed |
| Build LLM judge (full pipeline) | 4-6 hours | As needed |
| Validate eval on dev set | 1 hour | Per iteration |
| Weekly maintenance | 30 min | Weekly |
LangWatch (fastest):
import langwatch
langwatch.init()
# Done! Auto-tracing enabledLangfuse (more config):
from langfuse.openai import OpenAI
client = OpenAI()
# Set environment variables firstThis is a production-quality judge prompt that achieved TPR=95.7% and TNR=100%:
You are an expert nutritionist and dietary specialist evaluating whether
recipe responses properly adhere to specified dietary restrictions.
DIETARY RESTRICTION DEFINITIONS:
- Vegan: No animal products (meat, dairy, eggs, honey, etc.)
- Vegetarian: No meat or fish, but dairy and eggs are allowed
- Gluten-free: No wheat, barley, rye, or other gluten-containing grains
- Dairy-free: No milk, cheese, butter, yogurt, or other dairy products
- Keto: Very low carb (typically <20g net carbs), high fat, moderate protein
- Paleo: No grains, legumes, dairy, refined sugar, or processed foods
- Pescatarian: No meat except fish and seafood
- Kosher: Follows Jewish dietary laws (no pork, shellfish, mixing meat/dairy)
- Halal: Follows Islamic dietary laws (no pork, alcohol, proper slaughter)
- Nut-free: No tree nuts or peanuts
- Low-carb: Significantly reduced carbohydrates (typically <50g per day)
- Sugar-free: No added sugars or high-sugar ingredients
- Raw vegan: Vegan foods not heated above 118 degrees F (48 degrees C)
- Whole30: No grains, dairy, legumes, sugar, alcohol, or processed foods
- Diabetic-friendly: Low glycemic index, controlled carbohydrates
- Low-sodium: Reduced sodium content for heart health
EVALUATION CRITERIA:
- PASS: The recipe clearly adheres to the dietary preferences with
appropriate ingredients and preparation methods
- FAIL: The recipe contains ingredients or methods that violate the
dietary preferences
- Consider both explicit ingredients and cooking methods
Example 1:
Query and Response: [Gluten-free pizza dough using gluten-free flour blend,
baking powder, olive oil, honey, apple cider vinegar...]
Explanation: The recipe uses gluten-free flour blend. All other ingredients
do not contain gluten. The preparation method does not introduce any
gluten-containing elements.
Label: PASS
Example 2:
Query and Response: [Raw vegan quinoa salad with cooked quinoa,
fresh vegetables, olive oil, lemon juice...]
Explanation: The recipe FAILS because it includes cooked quinoa.
Raw vegan diets do not allow foods heated above 118 degrees F (48 degrees C),
and cooking quinoa involves boiling, which exceeds this limit.
Label: FAIL
Now evaluate the following recipe response:
Query: {query}
Dietary Restriction: {dietary_restriction}
Recipe Response: {response}
RETURN YOUR EVALUATION IN JSON FORMAT:
"label": "PASS" or "FAIL",
"explanation": "Detailed explanation citing specific ingredients or methods"
Complete evaluator prompts for each pipeline state. Each follows the same structure:
1. Role definition ("You are an expert evaluator for the X state")
2. What the state should do (3-4 bullet points)
3. Evaluation criteria (3-4 numbered criteria)
4. What counts as a failure (4-5 specific failure types)
5. What does NOT count as a failure (2-3 acceptable variations)
6. Input/Output template variables
7. Output format (JSON with label and explanation)
| State | Key Criteria | Common Failures |
|---|---|---|
| ParseRequest | Accuracy, completeness, format | Misinterpretation, missing constraints |
| PlanToolCalls | Tool selection, ordering, rationale | Missing tools, incorrect tools |
| GenRecipeArgs | Query relevance, filter accuracy | Missing dietary filters, wrong servings |
| GetRecipes | Relevance, dietary compliance | Irrelevant recipes, dietary violations |
| GenWebArgs | Relevance, context alignment | Off-topic queries, too generic |
| GetWebInfo | Relevance, quality | Irrelevant results, off-topic content |
| ComposeResponse | Recipe accuracy, step clarity, constraint compliance | Contradictions, missing info, violations |
Full evaluator prompts for each state follow the structure above, tailored to the specific responsibilities and failure modes of each pipeline stage.
A collection of techniques that consistently improve LLM judge accuracy. Use this as a checklist when building or debugging a judge.
Always ask the judge to explain its reasoning before giving the final label. This is the single most impactful technique.
❌ Bad: "Label: PASS or FAIL. Explanation: ..."
✅ Good: "Explanation: [your reasoning]. Label: PASS or FAIL"
Why it works: When the model writes the label first, the explanation becomes a post-hoc rationalization. When reasoning comes first, the model actually deliberates, and the label follows logically.
Vague criteria lead to inconsistent judgments. Define exactly what counts as Pass and Fail.
❌ Vague: "Does the response follow dietary restrictions?"
✅ Specific: "PASS: Every ingredient in the recipe is compatible with the stated
dietary restriction. FAIL: At least one ingredient violates the restriction,
OR the cooking method introduces a violation (e.g., frying in butter for
dairy-free)."
Judges tend to be overly strict. Explicitly list acceptable variations to calibrate leniency.
What does NOT count as a failure:
- Suggesting optional toppings that can be omitted
- Using brand names instead of generic ingredient names
- Minor formatting issues in the recipe
- Providing substitution suggestions alongside the main recipe
Generic examples are far less effective than examples from your actual data. Always pull few-shot examples from your Train set.
Example selection strategy:
- 1 clear Pass (easy case)
- 1 clear Fail (easy case)
- 1 borderline case (the kind the judge will struggle with most)
Include the reasoning in each example, not just the label. The judge learns the reasoning pattern, not just the answer.
| Use Case | Temperature | Rationale |
|---|---|---|
| Binary classification (Pass/Fail) | 0.0 | Deterministic, reproducible |
| Likert scale scoring (1-5) | 0.0–0.3 | Low variance, consistent |
| Generating diverse critiques | 0.5–0.7 | Some creativity for different angles |
| Brainstorming failure modes | 0.7–1.0 | High creativity for exploration |
For judge evaluation, always use temperature 0.0. You want the same input to produce the same output every time.
Tell the judge exactly how to format its response. JSON is preferred for parsing reliability.
Return your evaluation as JSON:
{
"explanation": "Step-by-step reasoning about the response...",
"label": "PASS or FAIL",
"confidence": "HIGH, MEDIUM, or LOW",
"flagged_items": ["list of specific problematic items, if any"]
}
Tip: The confidence field is useful for identifying borderline cases during error analysis but is not a reliable calibrated probability.
| Bias | Description | Mitigation |
|---|---|---|
| Leniency bias | Judge defaults to "Pass" too often | Add explicit failure examples; emphasize "when in doubt, FAIL" |
| Verbosity bias | Judge favors longer, more detailed responses | Add examples where a short response passes and a long one fails |
| Position bias | Judge favors the first/last option in a list | Randomize order if comparing multiple outputs |
| Sycophancy bias | Judge agrees with confident-sounding text | Add examples where confident text is wrong |
| Anchoring bias | Judge is swayed by the first piece of evidence | Instruct judge to consider ALL evidence before concluding |
1. Write initial prompt with 2-3 few-shot examples
2. Run on Dev set → calculate TPR and TNR
3. Find the worst errors (cases where judge was wrong)
4. For each error:
a. Understand WHY the judge was wrong
b. Add a clarification, edge case, or new example to the prompt
5. Re-run on Dev set → check if metrics improved
6. Repeat steps 3-5 until TPR > 80% and TNR > 80%
7. Run ONCE on Test set for final, unbiased metrics
Common iteration patterns:
- TPR too low → Judge is missing real failures. Add more Fail examples, make fail criteria more explicit.
- TNR too low → Judge has too many false alarms. Add "what does NOT count as a failure" section, add Pass examples for edge cases.
- Both low → Criteria are ambiguous. Rewrite from scratch with clearer definitions.
| Model Tier | When to Use | Typical Accuracy |
|---|---|---|
| GPT-4o / Claude Sonnet 4.5 | High-stakes evals, complex reasoning | 85–95% |
| GPT-4o-mini / Claude Haiku | Cost-sensitive, high-volume evals | 75–90% |
| Open-source (Llama, Mistral) | Self-hosted, privacy-sensitive | 70–85% |
Tip: Start with the most capable model to establish a performance ceiling. Then test whether a cheaper model can match it for your specific use case. Often it can — especially with good few-shot examples.
Always version your judge prompts. Track:
- The prompt text
- The few-shot examples used
- The model and temperature
- Dev set metrics (TPR, TNR) at that version
- Date and reason for the change
Both LangWatch and Langfuse have built-in prompt versioning. Use it.
With LangWatch:
import langwatch
langwatch.prompts.create(
name="dietary-judge-v3",
description="Added edge cases for keto",
template=judge_prompt_text,
model="gpt-4o",
temperature=0
)With Langfuse:
from langfuse import get_client
langfuse = get_client()
langfuse.create_prompt(
name="dietary-judge",
prompt=judge_prompt_text,
labels=["staging"], # promote to "production" after validation
)import langwatch
# Initialize (auto-instruments OpenAI, LangChain, LlamaIndex, etc.)
langwatch.init()
# Add custom spans
@langwatch.span(type="chain")
def my_pipeline(question):
"""Parent span for the whole pipeline"""
sql = generate_sql(question)
results = execute_query(sql)
return synthesize_answer(question, results)
@langwatch.span(type="llm")
def generate_sql(question):
"""Tracked as an LLM generation"""
return client.chat.completions.create(...)
@langwatch.span(type="tool")
def execute_query(sql):
"""Tracked as a tool call"""
return db.execute(sql)import langwatch
# Get all spans for a specific name
spans_df = langwatch.get_spans(
filters={"name": "ParseRequest"}
)
# Get spans within a time range
spans_df = langwatch.get_spans(
filters={
"timestamp_gte": "2025-02-01",
"timestamp_lte": "2025-02-09"
}
)import pandas as pd
import langwatch
df = pd.DataFrame({
"query": ["Query 1", "Query 2"],
"expected_answer": ["Answer 1", "Answer 2"]
})
dataset = langwatch.datasets.create(
name="my-dataset",
dataframe=df
)import langwatch
# Use built-in evaluators (40+ available)
results = langwatch.evaluate.batch(
dataset=traces_df,
evaluators=[
"dietary_compliance", # Built-in
"toxicity", # Built-in
"prompt_injection", # Built-in
]
)
# Create custom evaluator
@langwatch.evaluator(name="custom_check")
def my_evaluator(trace):
# Your logic here
return {"passed": True, "score": 1.0}
# Run custom evaluator
results = langwatch.evaluate.batch(
dataset=traces_df,
evaluators=["custom_check"]
)import langwatch
def my_task(example):
query = example["input"]["query"]
return {"answer": my_pipeline(query)}
# Run experiment with automatic metrics
results = langwatch.evaluate.batch(
dataset=dataset,
task=my_task,
evaluators=["accuracy", "latency", "cost"]
)
# View results
print(results.metrics)import langwatch
# Create prompt
prompt = langwatch.prompts.create(
name="recipe-assistant-v1",
template=[
{"role": "system", "content": "You are a recipe assistant..."},
{"role": "user", "content": "{{question}}"}
],
model="gpt-4o-mini",
temperature=0.7
)
# Use at runtime
messages = prompt.render(question="How do I make pancakes?")
response = client.chat.completions.create(
messages=messages,
model=prompt.model,
temperature=prompt.temperature
)from langfuse.openai import OpenAI # Drop-in replacement
from langfuse import observe, get_client
client = OpenAI() # Calls are automatically traced
@observe()
def my_pipeline(question):
"""Creates a parent trace"""
return generate_answer(question)
@observe(as_type="generation")
def generate_answer(question):
"""Tracked as a generation"""
return client.chat.completions.create(...)from langfuse import get_client
langfuse = get_client()
traces = langfuse.api.trace.list(limit=100, tags=["production"])
trace = langfuse.api.trace.get("trace_id")from langfuse import get_client
langfuse = get_client()
langfuse.create_dataset(name="my-dataset")
langfuse.create_dataset_item(
dataset_name="my-dataset",
input={"query": "What is AI?"},
expected_output={"answer": "Artificial Intelligence"},
)from langfuse import Evaluation
def my_task(*, item, **kwargs):
query = item["input"]["query"]
return my_pipeline(query)
def my_evaluator(*, output, expected_output, **kwargs):
correct = output == expected_output.get("answer")
return Evaluation(name="accuracy", value=1.0 if correct else 0.0)
result = langfuse.run_experiment(
name="baseline",
data=test_data,
task=my_task,
evaluators=[my_evaluator],
)
print(result.format())from langfuse import get_client
langfuse = get_client()
# Score a trace
langfuse.create_score(
trace_id="trace_id",
name="dietary_adherence",
value=1, # 0 or 1
data_type="BOOLEAN",
comment="Recipe correctly follows vegan restrictions",
)
# Score within context
with langfuse.start_as_current_observation(as_type="span", name="eval") as span:
span.score(name="accuracy", value=0.95, data_type="NUMERIC")from langfuse import get_client
langfuse = get_client()
langfuse.create_prompt(
name="my-prompt",
type="chat",
prompt=[
{"role": "system", "content": "You are a {{role}}"},
{"role": "user", "content": "{{question}}"},
],
labels=["production"],
)
prompt = langfuse.get_prompt("my-prompt", type="chat")
compiled = prompt.compile(role="chef", question="Best pasta recipe?")| Day | Activity | Time | Role Focus |
|---|---|---|---|
| 1 | Pick your platform (LangWatch or Langfuse), install it | 1h | All |
| 2 | Instrument your app with auto-tracing | 2h | Engineer |
| 2 | Browse the trace viewer UI, understand traces visually | 1h | PM/QA |
| 3 | Create a test dataset with dimensional sampling | 2h | All |
| 4 | Upload dataset to your platform, run first experiment | 1h | All |
| 5 | Review 50 traces, take notes (open coding) | 1h | All |
| 6 | Categorize errors using LLM (axial coding) | 1h | All |
| 7 | Prioritize: frequency x severity matrix | 30m | All |
| Day | Activity | Time | Role Focus |
|---|---|---|---|
| 8 | Build 2 code-based evals for your top issues | 2h | Engineer |
| 8 | Define eval criteria in plain English | 1h | PM/QA |
| 9 | Test evals with known good/bad cases | 1h | All |
| 10 | Run evals on all traces, calculate failure rates | 1h | All |
| 11-14 | Iterate based on results | 2h | All |
| Day | Activity | Time | Role Focus |
|---|---|---|---|
| 15 | Label 100-150 traces as ground truth | 3h | All |
| 16 | Split into Train/Dev/Test | 30m | Engineer |
| 17 | Write first judge prompt with few-shot examples | 2h | All |
| 18 | Validate on Dev set, calculate TPR/TNR | 1h | All |
| 19 | Iterate prompt until metrics > 80% | 2h | All |
| 20 | Final test on Test set | 30m | All |
| 21 | Run judge on all traces + correct with judgy | 1h | All |
| Day | Activity | Time | Role Focus |
|---|---|---|---|
| 22 | RAG evaluation — retrieval metrics + answer quality (Ch. 6) | 2h | Engineer |
| 23 | Multi-step pipeline evaluation (Ch. 7) | 2h | Engineer |
| 24 | Multi-turn conversation evaluation (Ch. 8) | 2h | Engineer |
| 25 | Safety evals — prompt injection, PII leakage (Ch. 9) | 2h | All |
| 26 | Set up regression test suite (Ch. 11) | 2h | Engineer |
| 27 | Human annotation calibration — measure inter-annotator agreement (Ch. 12) | 1h | All |
| 28 | Optimize for cost — tiered evaluation, sampling strategy (Ch. 13) | 1h | All |
| 29 | Create monitoring dashboard + automated eval runs | 2h | Engineer |
| 30 | Document eval suite, present to stakeholders, plan maintenance | 2h | All |
Real lessons from implementing complete eval pipelines in production:
On Building Judges (Chapters 4, 10)
-
LLM-as-Judge is powerful but needs guardrails - Without proper validation, a judge can confidently give wrong answers. Always validate against ground truth.
-
You must test evaluators against ground truth - A judge that seems reasonable but has TNR=22% is actively harmful — it misses most real failures.
-
Train/Dev/Test splits enable confidence - Without them, you're fooling yourself about your judge's quality. This is non-negotiable.
-
Iterating on judge prompts is crucial - The first prompt is never good enough. Plan for 3-5 iterations minimum. See Appendix E for techniques.
-
Explanation-before-verdict is the #1 technique - Asking the judge to reason before labeling improves accuracy more than any other single change.
On Process & Methodology (Chapters 3, 11, 12)
-
Error analysis is the real work - Fancy tools don't matter if you haven't sat down and looked at your failures. Open coding → axial coding → prioritization is the workflow that works.
-
Human annotators disagree more than you think - Measure inter-annotator agreement (Cohen's kappa) before trusting your ground truth. If humans can't agree, the judge won't either.
-
Closing the loop is what separates good teams from great ones - Running evals is only half the job. The other half is systematically turning failures into improvements and preventing regressions.
On Production & Scale (Chapters 9, 13)
-
Safety evals are not optional - Prompt injection, PII leakage, and jailbreak detection should be running before you worry about quality evals.
-
Start expensive, then optimize - Use GPT-4o/Claude Sonnet to establish your performance ceiling, then test whether a cheaper model can match it. Often it can.
-
Sampling beats exhaustive evaluation - Evaluating 10% of traces with statistical rigor gives you a better answer than evaluating 100% with a bad judge.
-
Good observability tools make the workflow 10x faster - Integrated tracing, evaluation, datasets, and experiments in one platform (LangWatch, Langfuse, etc.) saves enormous time vs. stitching together custom scripts.
On Platform Choice
-
LangWatch for speed, Langfuse for depth - LangWatch gets you results in hours with built-in evaluators. Langfuse gives maximum control for complex custom logic. Many teams use both.
-
Built-in evaluators save weeks of dev time - LangWatch's 40+ built-in evaluators cover most common use cases. If you're reinventing safety checks or RAG metrics, you're wasting time.
-
Community matters for long-term success - Langfuse's larger community means more integrations, more examples, more support. LangWatch's simpler API means faster onboarding.
AI evals are not just "testing" — they're a product development methodology that touches engineering, product management, and quality assurance.
Key takeaways:
- Everyone needs evals — Not just big companies. If your AI app touches users, you need systematic evaluation.
- Start with error analysis — Sit down and look at your failures before building anything automated (Chapter 3).
- PMs and QAs must lead — Error analysis and criteria definition are product/quality work, not just engineering tasks.
- Build incrementally — Start with code-based evals, then add LLM judges, then add safety evals. Don't try to do everything at once.
- Measure what matters — Application-specific criteria, not generic "helpfulness" scores.
- Both TPR and TNR — A judge that catches failures but also false-alarms is harmful. Measure both.
- Split your data — Train/Dev/Test is mandatory. Without it, you're overfitting your judge.
- Correct for bias — Use statistical correction (Chapter 10) for honest metrics.
- Close the loop — Evals that don't lead to improvements are wasted effort (Chapter 11).
- Plan for scale — Start with the best model, then optimize for cost (Chapter 13).
Your action plan (see Appendix G for details):
- Week 1: Set up observability (LangWatch or Langfuse), do error analysis
- Week 2: Build 2-3 core code-based evals
- Week 3: Build and validate an LLM judge with proper train/dev/test splits
- Week 4: Advanced topics — RAG evals, multi-turn evals, safety evals, automation
- Ongoing: 30 minutes per week maintenance + regression testing
Platform decision:
- Choose LangWatch if you want to start fast (<30 min setup) and use built-in evaluators
- Choose Langfuse if you need maximum flexibility and have complex custom workflows
- Use both if you want the best of both worlds (many teams do this)
Remember: The teams shipping the best AI products are the ones with the best evals. Not the fanciest models. Not the biggest teams. The ones who systematically measure and improve.
Start today. Your future self will thank you.
- LangWatch Docs: docs.langwatch.ai
- LangWatch GitHub: github.com/langwatch/langwatch
- Langfuse Docs: langfuse.com/docs
- Langfuse GitHub: github.com/langfuse/langfuse
- Maven Course (AI Evals for Engineers & PMs): maven.com/parlance-labs/evals
- HuggingFace Evaluation Guidebook: github.com/huggingface/evaluation-guidebook
- OpenAI Evals Platform: evals.openai.com
- OpenAI Cookbook (practical examples & guides): cookbook.openai.com
- OpenAI Research: openai.com/research
- OpenAI Docs (Evals): platform.openai.com/docs/guides/evals
- Anthropic Research: anthropic.com/research
- METR (Model Evaluation & Threat Research): metr.org
- Eugene Yan on eval process: eugeneyan.com/writing/eval-process
- Hamel Husain's Blog: hamel.dev — Applied AI engineering, LLM evals deep-dives
- Shreya Shankar's Site: sh-reya.com — LLM data systems research, eval methodology
- Maxim AI Articles: getmaxim.ai/articles — Agentic evaluation patterns
| Tool | Focus | License | Links |
|---|---|---|---|
| LangWatch | Observability & built-in evals | Apache 2.0 | GitHub · Docs |
| Langfuse | Custom pipelines & tracing | MIT | GitHub · Docs |
| RAGAS | RAG-specific evaluation | Apache 2.0 | GitHub · Docs |
| Comet Opik | LLM tracing & evaluation | Apache 2.0 | GitHub · Site |
| judgy | Statistical bias correction | Open | GitHub |
| Braintrust | Experimentation & logging | Partial | Docs |
| Galileo | Hallucination detection | Proprietary | Site |
| Maxim | Agentic system evaluation | Proprietary | Site |
| Company | Focus | Open Source | Best For | Unique Strength |
|---|---|---|---|---|
| LangWatch | Observability + Built-in Evals | Yes (Apache 2.0) | Fast setup, analytics | 40+ built-in evaluators, auto-analytics |
| Langfuse | Custom Pipelines | Yes (MIT) | Data sovereignty, flexibility | Self-hostable, full control over data |
| Anthropic | Safety / Red Teaming | Partial | Frontier risks | Constitutional classifiers, multi-attempt adversarial testing |
| OpenAI | Preparedness / Business | Evals toolkit | Enterprise context | SME probing, contextual evals |
| RAGAS | RAG-specific | Yes (Apache 2.0) | RAG pipelines | Reference-free metrics, synthetic test data generation |
| Maxim | Agentic Systems | No | Multi-agent apps | Simulation framework, no-code evaluation |
| Braintrust | Experimentation | Partial | Early-stage teams | Collaborative design, fast iteration |
| Galileo | Hallucinations | No | Quality assurance | ChainPoll, real-time monitoring |
| Comet Opik | LLM Tracing & Evals | Yes (Apache 2.0) | End-to-end observability | Framework integrations, online evaluation rules |
| METR | Catastrophic Risk | Research | Policy guidance | Autonomous capability assessment |
- Om Bharatiya: @ombharatiya
This guide was built on the foundation of the following people's work and ideas. Their courses, blogs, and open-source contributions made this guide possible:
- Hamel Husain: @HamelHusain — hamel.dev
- Shreya Shankar: @sh_reya — sh-reya.com
- Eugene Yan: @eugeneyan — eugeneyan.com
This guide was inspired by and builds upon the AI Evals for Engineers & PMs course by Hamel Husain and Shreya Shankar, extended with additional research, production-ready code examples, and multi-platform guides covering LangWatch, Langfuse, and the broader eval tooling ecosystem.
Author: Om Bharatiya | Created: February 2026