
Building Evaluation Harnesses for LLM Agents

Workshop at Big Data Conference Europe 2025
πŸ“ Vilnius, Lithuania | πŸ“… November 18, 2025


🎯 What You'll Build

In this hands-on workshop, you'll build an evaluation-driven RAG system that:

  • Extracts financial data from Wikipedia
  • Calculates Return on Investment (ROI) for films
  • Classifies performance (Blockbuster, Profitable, Break-even, Flop)
  • Evaluates against ground truth data
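The ROI calculation and classification steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the thresholds below are assumptions for demonstration, not the workshop's official cutoffs.

```python
# Sketch: ROI calculation and performance classification.
# NOTE: threshold values are illustrative assumptions.

def calculate_roi(box_office: float, budget: float) -> float:
    """Return on Investment as a percentage of the production budget."""
    return (box_office - budget) / budget * 100

def classify_performance(roi: float) -> str:
    """Map an ROI percentage to a performance label (assumed thresholds)."""
    if roi >= 400:
        return "Blockbuster"
    if roi > 0:
        return "Profitable"
    if roi >= -10:
        return "Break-even"
    return "Flop"

roi = calculate_roi(box_office=700_000_000, budget=200_000_000)
print(f"{roi:.0f}% -> {classify_performance(roi)}")  # 250% -> Profitable
```

Deterministic helpers like these are easy to test against the ground-truth CSV, which is exactly what makes them good targets for the evaluation harness.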

You'll learn to systematically improve agent performance through:

  • Baseline measurement
  • Error analysis
  • Iterative improvements
  • Metrics tracking (pass rate, cost, latency)
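The metrics-tracking loop can be sketched as below. `run_agent` here is a hypothetical stand-in for the workshop's RAG agent; the point is the shape of the harness, which records pass rate, total cost, and average latency per evaluation run.

```python
# Minimal eval-loop sketch tracking pass rate, cost, and latency.
import time

def run_agent(question: str):
    """Hypothetical stand-in for the real RAG agent.
    Returns (answer, estimated_cost_usd)."""
    return "Profitable", 0.0004

def evaluate(cases: list) -> dict:
    passed, total_cost, latencies = 0, 0.0, []
    for case in cases:
        start = time.perf_counter()
        answer, cost = run_agent(case["question"])
        latencies.append(time.perf_counter() - start)
        total_cost += cost
        if answer == case["expected"]:
            passed += 1
    return {
        "pass_rate": passed / len(cases),
        "total_cost_usd": total_cost,
        "avg_latency_s": sum(latencies) / len(latencies),
    }

report = evaluate([{"question": "Classify Inception's ROI", "expected": "Profitable"}])
print(report)
```

Re-running this after each change gives the baseline-then-iterate measurements the workshop is built around.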

🚀 Quick Start

Prerequisites

Setup

  1. Clone or download this repository

  2. Run the setup script:

    chmod +x setup.sh
    ./setup.sh
  3. Add your API key:

    • Edit .env file
    • Replace your-api-key-here with your actual Cohere API key
  4. Start Jupyter:

    source venv/bin/activate
    jupyter notebook
  5. Open the notebook:

    • Navigate to notebooks/02_rag_with_eval.ipynb
    • Follow along with the instructor

πŸ“ Repository Structure

.
├── README.md                          # This file
├── agenda.md                          # Workshop schedule
├── setup.sh                           # Automated setup script
├── requirements.txt                   # Python dependencies
├── .env.example                       # API key template
├── data/
│   └── ground_truth/
│       └── film_box_office_ground_truth.csv  # Evaluation dataset
└── notebooks/
    └── 02_rag_with_eval.ipynb        # Main workshop notebook

📚 Workshop Agenda

See agenda.md for the full schedule.

Highlights:

  • Block 1 (09:20-11:15): Foundations - Build baseline agent & eval harness
  • Block 2 (11:40-13:30): Iteration - Error analysis & systematic improvements
  • Block 3 (14:15-15:35): Applied Lab - Choose your own use case
  • Block 4 (16:00-17:00): Demos & Q&A

🎓 Learning Objectives

By the end of this workshop, you'll be able to:

✅ Define task success criteria for agent workflows
✅ Build gold-standard evaluation datasets
✅ Implement deterministic and LLM-as-judge evaluators
✅ Run error analysis to identify failure patterns
✅ Iterate systematically to improve pass rates
✅ Track quality metrics (pass rate) and operational metrics (cost, latency)
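Of the two evaluator styles named above, the deterministic kind can be just a few lines. The sketch below checks a predicted ROI against ground truth within a relative tolerance (the 1% default is an illustrative assumption); an LLM-as-judge evaluator would instead prompt a model to grade free-form answers where exact matching is impossible.

```python
# Sketch of a deterministic evaluator: pass if the predicted ROI is
# within a relative tolerance of the ground-truth value.
# The 1% default tolerance is an assumption for illustration.

def deterministic_eval(predicted: float, expected: float, tol: float = 0.01) -> bool:
    """Pass if |predicted - expected| is within tol * |expected|."""
    return abs(predicted - expected) <= tol * abs(expected)

assert deterministic_eval(249.5, 250.0)      # within 1% -> pass
assert not deterministic_eval(200.0, 250.0)  # off by 20% -> fail
```

Deterministic checks are cheap and reproducible, so they make a good first layer before reaching for an LLM judge.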


🔑 Getting Your Cohere API Key

  1. Visit dashboard.cohere.com
  2. Sign up (free tier available)
  3. Generate an API key
  4. Add it to your .env file
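Inside the notebook, the key is read from `.env` into the environment. The notebook most likely uses python-dotenv's `load_dotenv()` for this; the dependency-free sketch below shows the same idea with only the standard library.

```python
# Minimal .env loader sketch (python-dotenv's load_dotenv() does the
# same job; this version avoids the extra dependency for illustration).
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Read KEY=value lines from a .env file into os.environ."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if Path(".env").exists():
    load_env()
    key = os.environ.get("COHERE_API_KEY")
    if not key or key == "your-api-key-here":
        raise RuntimeError("Set COHERE_API_KEY in your .env file first.")
```

Failing fast on a missing or placeholder key saves a confusing API error later in the notebook.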

πŸ› οΈ Troubleshooting

Setup script fails

  • Ensure Python 3.8+ is installed: python3 --version
  • Try manually: python3 -m venv venv && source venv/bin/activate && pip install -r requirements.txt

Module not found errors

  • Activate the virtual environment: source venv/bin/activate
  • Reinstall dependencies: pip install -r requirements.txt

API key issues

  • Check .env file exists and contains your key
  • Verify format: COHERE_API_KEY=your-actual-key (no quotes)
  • Get a key at dashboard.cohere.com/api-keys

Jupyter won't start

  • Ensure venv is activated: source venv/bin/activate
  • Reinstall Jupyter: pip install --upgrade jupyter notebook


👤 Instructor

Yann Stoneman
Staff Solutions Architect @ Cohere


πŸ“ License

Workshop materials provided for educational purposes.
© 2025 Big Data Conference Europe


Questions during the workshop? Ask away! 🙋

After the workshop? Connect on LinkedIn or reach out via conference channels.