Eval Workshop: Building Evals for a Helpdesk Agent

A hands-on workshop covering the eval loop: error analysis, baselines, failure analysis, and prompt iteration.

Time: ~60 minutes

Prerequisites

Before the workshop starts, you need an Anthropic API key (https://console.anthropic.com/settings/keys) or an OpenAI API key (https://platform.openai.com/api-keys).

Option A: GitHub Codespaces (recommended)

Click the Codespaces button in the repo README — a fully configured environment opens in your browser. Then:

cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY (or OPENAI_API_KEY)

# Test the agent
uv run helpdesk-agent "My laptop won't turn on"
# Should show: Department: IT

Option B: VS Code Dev Container

1. Clone the repo and open it in VS Code.
2. When prompted, click "Reopen in Container".
3. Wait for the container to build (the first build takes a few minutes).

cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY (or OPENAI_API_KEY)

# Test the agent
uv run helpdesk-agent "My laptop won't turn on"
# Should show: Department: IT

Option C: Local Setup

Requires Python 3.13+, uv (brew install uv), and Docker.

# 1. Clone and install
git clone <repo-url>
cd eval-workshop-ext
uv sync

# 2. Set up environment
cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY (https://console.anthropic.com/settings/keys)
# Or OPENAI_API_KEY if using OpenAI (https://platform.openai.com/api-keys)

# 3. Start services (pull latest images first)
docker compose pull && docker compose up -d

# 4. Verify services are running
docker compose ps
# Should show cat-cafe as running/healthy

# 5. Test the agent
uv run helpdesk-agent "My laptop won't turn on"
# Should show: Department: IT

The Problem

You're building an AI assistant for a company's internal helpdesk. Employees send requests like:

"My laptop won't connect to WiFi"
"How do I submit an expense report?"
"The conference room projector is broken"

Your agent needs to:

Route requests to the right department (IT, HR, Facilities, Finance, Legal, Security)
Answer questions when possible, using company policy documents
Escalate to humans when it can't help
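
One way to picture these three responsibilities is as a small data model. The sketch below is hypothetical: the Department values mirror the list above, but Action, AgentResult, and their fields are illustrative names and not the workshop repo's actual types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Department(Enum):
    """The six departments named in the routing requirement."""
    IT = "IT"
    HR = "HR"
    FACILITIES = "Facilities"
    FINANCE = "Finance"
    LEGAL = "Legal"
    SECURITY = "Security"


class Action(Enum):
    """Hypothetical outcome labels for a single request."""
    ANSWER = "answer"      # the agent replied using policy documents
    ESCALATE = "escalate"  # the agent handed the request to a human team


@dataclass(frozen=True)
class AgentResult:
    """Hypothetical record of what the agent did with one request."""
    department: Department
    action: Action
    answer: Optional[str] = None  # expected only when action is ANSWER

    def __post_init__(self) -> None:
        # An ANSWER result without any text would be meaningless.
        if self.action is Action.ANSWER and not self.answer:
            raise ValueError("an ANSWER result needs answer text")


result = AgentResult(Department.IT, Action.ESCALATE)
print(result.department.value, result.action.value)
```

Keeping the department and the action as separate fields means routing can be scored on its own, independently of whether the agent answered or escalated.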

The Architecture

                         ┌─────────────────┐
                         │  User Request   │
                         └────────┬────────┘
                                  │
                         ┌────────▼────────┐
                         │    Concierge    │
                         │  (front-line)   │
                         └────────┬────────┘
                                  │
              ┌───────────────────┼───────────────────┐
              │                   │                   │
     ┌────────▼────────┐ ┌───────▼───────┐ ┌────────▼────────┐
     │  HR Specialist  │ │ IT Specialist │ │    Escalate     │
     │  (has KB)       │ │ (future)      │ │    to Human     │
     └────────┬────────┘ └───────────────┘ └─────────────────┘
              │
     ┌────────▼────────┐
     │  Return answer  │
     │  to concierge   │
     └─────────────────┘

The concierge is like a customer service rep: it talks to the employee, consults internal specialists for domain-specific answers, and decides whether to relay the answer or escalate to a human team.
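
The flow in the diagram can be sketched in a few lines. Everything here is a stand-in: classify, hr_specialist, and the keyword rules are hypothetical placeholders for what would be LLM calls in the real agent, and only the HR path has a specialist, as in the diagram.

```python
from typing import Callable, Optional

# Hypothetical specialist signature: returns an answer, or None if it cannot help.
Specialist = Callable[[str], Optional[str]]


def classify(request: str) -> str:
    """Toy keyword router standing in for the concierge's routing decision."""
    text = request.lower()
    if "laptop" in text or "wifi" in text:
        return "IT"
    if "vacation" in text or "benefits" in text:
        return "HR"
    return "Facilities"  # arbitrary default, chosen only for this sketch


def hr_specialist(request: str) -> Optional[str]:
    """Toy knowledge-base lookup; a real specialist would search policy documents."""
    kb = {"vacation": "Placeholder policy text about vacation requests."}
    for keyword, answer in kb.items():
        if keyword in request.lower():
            return answer
    return None


# Only HR has a specialist; the IT specialist is marked as future work.
SPECIALISTS: dict[str, Specialist] = {"HR": hr_specialist}


def concierge(request: str) -> dict:
    """Route the request, consult a specialist if one exists, else escalate."""
    department = classify(request)
    specialist = SPECIALISTS.get(department)
    answer = specialist(request) if specialist else None
    if answer is not None:
        return {"department": department, "action": "answer", "answer": answer}
    return {"department": department, "action": "escalate", "answer": None}


print(concierge("How do I request vacation days?"))
print(concierge("My laptop won't turn on"))
```

Note that escalation is the fallback in two cases: when no specialist exists for the department, and when the specialist has no answer.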

The Challenge

How do you know if your agent is working? You need evals — a systematic way to:

Find failures before users do
Measure whether changes actually help
Catch regressions when you update prompts
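
As a rough illustration of these three goals, the sketch below scores a router against a few labeled requests, reports accuracy, and flags cases that a new version breaks. The labels, both route_* functions, and the helper names are invented for this example; the workshop's own tooling may look quite different.

```python
from typing import Callable

Router = Callable[[str], str]

# Hypothetical labeled cases: (request, expected department).
# The expected labels are illustrative guesses, not workshop data.
CASES = [
    ("My laptop won't turn on", "IT"),
    ("My laptop won't connect to WiFi", "IT"),
    ("The conference room projector is broken", "Facilities"),
]


def route_v1(request: str) -> str:
    """Toy baseline: anything mentioning a laptop goes to IT, the rest to Facilities."""
    return "IT" if "laptop" in request.lower() else "Facilities"


def route_v2(request: str) -> str:
    """Toy 'improved' version that accidentally sends WiFi issues to Security."""
    if "wifi" in request.lower():
        return "Security"
    return route_v1(request)


def run_eval(router: Router) -> dict[str, bool]:
    """Return pass/fail for each labeled request."""
    return {request: router(request) == expected for request, expected in CASES}


def accuracy(results: dict[str, bool]) -> float:
    """Fraction of requests routed to the expected department."""
    return sum(results.values()) / len(results)


def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Requests that passed before the change and fail after it."""
    return [request for request, ok in before.items() if ok and not after[request]]


baseline = run_eval(route_v1)
candidate = run_eval(route_v2)
print(f"baseline accuracy:  {accuracy(baseline):.2f}")
print(f"candidate accuracy: {accuracy(candidate):.2f}")
print("regressions:", regressions(baseline, candidate))
```

Comparing per-request results, not just the two accuracy numbers, is what surfaces a regression that an overall score could hide.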

Today we'll build evals for this helpdesk agent, from error analysis to LLM-as-judge.

Part A: Routing

The capability: Can our agent route requests to the correct department?

This is the first capability we're building. Before the agent can answer questions or take actions, it needs to understand what kind of request it's dealing with and send it to the right place.
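
To check routing programmatically, the department has to be pulled out of the agent's output. The sketch below assumes the CLI prints a line of the form "Department: IT", as in the setup check above; the rest of the output format is unknown, and parse_department and route_via_cli are hypothetical helper names.

```python
import re
import subprocess
from typing import Optional

# Assumption: the agent prints a line such as "Department: IT".
DEPARTMENT_LINE = re.compile(r"^\s*Department:\s*(\w+)", re.MULTILINE)


def parse_department(output: str) -> Optional[str]:
    """Extract the department name from CLI output, or None if absent."""
    match = DEPARTMENT_LINE.search(output)
    return match.group(1) if match else None


def route_via_cli(request: str) -> Optional[str]:
    """Run the agent CLI and return its department (needs the workshop setup)."""
    completed = subprocess.run(
        ["uv", "run", "helpdesk-agent", request],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_department(completed.stdout)


# Made-up sample output, used here so the parser can be tried without the agent.
sample_output = "Some other output\nDepartment: IT\n"
print(parse_department(sample_output))  # IT
```

Returning None when no department line is found keeps a malformed response distinguishable from a wrong routing decision.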

Let's find out if it works.

Part 1: Error Analysis

Goal: Understand how the agent behaves before measuring it.

Run the Agent

We start with the baseline config — routing-only (no specialists). Requests get classified and escalated but not answered.

Try these requests and predict the department before seeing the result:

uv run helpdesk-agent -c configs/baseline.yaml "My laptop won't turn on"