ai-yann/vilnius-workshop

Building Evaluation Harnesses for LLM Agents

Workshop at Big Data Conference Europe 2025
📍 Vilnius, Lithuania | 📅 November 18, 2025


🎯 What You'll Build

In this hands-on workshop, you'll build an evaluation-driven RAG system that:

  • Extracts financial data from Wikipedia
  • Calculates Return on Investment (ROI) for films
  • Classifies performance (Blockbuster, Profitable, Break-even, Flop)
  • Evaluates against ground truth data
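The ROI calculation and the four performance buckets above can be sketched as follows. The threshold values and function names here are illustrative assumptions, not the workshop's actual cutoffs:

```python
def roi(box_office: float, budget: float) -> float:
    """Return on Investment as a ratio: profit divided by budget."""
    return (box_office - budget) / budget

def classify(r: float) -> str:
    """Map an ROI ratio to a performance label (assumed thresholds)."""
    if r >= 2.5:
        return "Blockbuster"
    if r >= 0.1:
        return "Profitable"
    if r >= -0.1:
        return "Break-even"
    return "Flop"

# Avatar-scale numbers for illustration:
print(classify(roi(2_923_706_026, 237_000_000)))  # → Blockbuster
```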

You'll learn to systematically improve agent performance through:

  • Baseline measurement
  • Error analysis
  • Iterative improvements
  • Metrics tracking (pass rate, cost, latency)
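The improvement loop above hangs off a small metrics record per eval run. A minimal sketch of what such tracking could look like (the `EvalRun` class and its fields are illustrative, not the workshop's code):

```python
from dataclasses import dataclass, field

@dataclass
class EvalRun:
    """Aggregate quality and operational metrics for one evaluation run."""
    results: list = field(default_factory=list)  # (passed, cost_usd, latency_s)

    def record(self, passed: bool, cost_usd: float, latency_s: float) -> None:
        self.results.append((passed, cost_usd, latency_s))

    @property
    def pass_rate(self) -> float:
        return sum(p for p, _, _ in self.results) / len(self.results)

    @property
    def total_cost(self) -> float:
        return sum(c for _, c, _ in self.results)

    @property
    def mean_latency(self) -> float:
        return sum(t for _, _, t in self.results) / len(self.results)

run = EvalRun()
run.record(True, 0.002, 1.4)
run.record(False, 0.003, 2.1)
print(f"pass rate {run.pass_rate:.0%}, cost ${run.total_cost:.3f}")
```

Comparing these three numbers across runs is what makes the iteration systematic rather than anecdotal.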

🚀 Quick Start

Prerequisites

  • Python 3.8 or newer (check with python3 --version)
  • A Cohere API key (free tier available; see "Getting Your Cohere API Key" below)

Setup

  1. Clone or download this repository

  2. Run the setup script:

    chmod +x setup.sh
    ./setup.sh
  3. Add your API key:

    • Edit .env file
    • Replace your-api-key-here with your actual Cohere API key
  4. Start Jupyter:

    source venv/bin/activate
    jupyter notebook
  5. Open the notebook:

    • Navigate to notebooks/02_rag_with_eval.ipynb
    • Follow along with the instructor

📁 Repository Structure

.
├── README.md                          # This file
├── agenda.md                          # Workshop schedule
├── setup.sh                           # Automated setup script
├── requirements.txt                   # Python dependencies
├── .env.example                       # API key template
├── data/
│   └── ground_truth/
│       └── film_box_office_ground_truth.csv  # Evaluation dataset
└── notebooks/
    └── 02_rag_with_eval.ipynb        # Main workshop notebook

📚 Workshop Agenda

See agenda.md for the full schedule.

Highlights:

  • Block 1 (09:20-11:15): Foundations - Build baseline agent & eval harness
  • Block 2 (11:40-13:30): Iteration - Error analysis & systematic improvements
  • Block 3 (14:15-15:35): Applied Lab - Choose your own use case
  • Block 4 (16:00-17:00): Demos & Q&A

🎓 Learning Objectives

By the end of this workshop, you'll be able to:

✅ Define task success criteria for agent workflows
✅ Build gold-standard evaluation datasets
✅ Implement deterministic and LLM-as-judge evaluators
✅ Run error analysis to identify failure patterns
✅ Iterate systematically to improve pass rates
✅ Track quality metrics (pass rate) and operational metrics (cost, latency)
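A deterministic evaluator, in contrast to an LLM-as-judge, compares the agent's answer against ground truth with fixed rules. A minimal sketch for the film-ROI task (the tolerance, field names, and function are assumptions, not the workshop's evaluator):

```python
def eval_roi_answer(predicted_roi: float, expected_roi: float,
                    predicted_label: str, expected_label: str,
                    tol: float = 0.05) -> dict:
    """Deterministic check: numeric ROI within tolerance, label exact match."""
    roi_ok = abs(predicted_roi - expected_roi) <= tol
    label_ok = predicted_label.strip().lower() == expected_label.strip().lower()
    return {"roi_ok": roi_ok, "label_ok": label_ok, "passed": roi_ok and label_ok}

result = eval_roi_answer(2.48, 2.50, "Blockbuster", "Blockbuster")
print(result)
```

Because the output is a plain dict of booleans, failures can be grouped by field during error analysis.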


🔑 Getting Your Cohere API Key

  1. Visit dashboard.cohere.com
  2. Sign up (free tier available)
  3. Generate an API key
  4. Add it to your .env file

🛠️ Troubleshooting

Setup script fails

  • Ensure Python 3.8+ is installed: python3 --version
  • Try manually: python3 -m venv venv && source venv/bin/activate && pip install -r requirements.txt

Module not found errors

  • Activate the virtual environment: source venv/bin/activate
  • Reinstall dependencies: pip install -r requirements.txt

API key issues

  • Check .env file exists and contains your key
  • Verify format: COHERE_API_KEY=your-actual-key (no quotes)
  • Get a key at dashboard.cohere.com/api-keys
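To confirm the key is actually being picked up, a quick stdlib-only check can help (the notebook itself may load the key differently, e.g. via python-dotenv; `load_env_key` is a hypothetical helper):

```python
from pathlib import Path

def load_env_key(path: str = ".env", name: str = "COHERE_API_KEY"):
    """Return the value of `name` from a .env-style file, or None if absent."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line.startswith(name + "="):
            return line.split("=", 1)[1].strip()
    return None

if Path(".env").exists():
    key = load_env_key()
    if not key or key == "your-api-key-here":
        print("COHERE_API_KEY is missing or still the placeholder")
    else:
        print("API key found")
```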

Jupyter won't start

  • Ensure venv is activated: source venv/bin/activate
  • Reinstall Jupyter: pip install --upgrade jupyter notebook

📖 Additional Resources

Cohere Documentation:

Evaluation Resources:


👤 Instructor

Yann Stoneman
Staff Solutions Architect @ Cohere


📝 License

Workshop materials provided for educational purposes.
© 2025 Big Data Conference Europe


Questions during the workshop? Ask away! 🙋

After the workshop? Connect on LinkedIn or reach out via conference channels.
