
Building Evaluation Harnesses for LLM Agents

Workshop at Big Data Conference Europe 2025
πŸ“ Vilnius, Lithuania | πŸ“… November 18, 2025


🎯 What You'll Build

In this hands-on workshop, you'll build an evaluation-driven RAG system that:

  • Extracts financial data from Wikipedia
  • Calculates Return on Investment (ROI) for films
  • Classifies performance (Blockbuster, Profitable, Break-even, Flop)
  • Evaluates against ground truth data
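The ROI calculation and classification steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the thresholds below are assumptions for demonstration, not the workshop's official cutoffs.

```python
# Sketch: ROI calculation and performance classification.
# NOTE: threshold values are illustrative assumptions.

def calculate_roi(box_office: float, budget: float) -> float:
    """Return on Investment as a percentage of the production budget."""
    return (box_office - budget) / budget * 100

def classify_performance(roi: float) -> str:
    """Map an ROI percentage to a performance label (assumed thresholds)."""
    if roi >= 400:
        return "Blockbuster"
    if roi > 0:
        return "Profitable"
    if roi >= -10:
        return "Break-even"
    return "Flop"

roi = calculate_roi(box_office=700_000_000, budget=200_000_000)
print(f"{roi:.0f}% -> {classify_performance(roi)}")  # 250% -> Profitable
```

Deterministic helpers like these are easy to test against the ground-truth CSV, which is exactly what makes them good targets for the evaluation harness.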

You'll learn to systematically improve agent performance through:

  • Baseline measurement
  • Error analysis
  • Iterative improvements
  • Metrics tracking (pass rate, cost, latency)
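The metrics-tracking loop can be sketched as below. `run_agent` here is a hypothetical stand-in for the workshop's RAG agent; the point is the shape of the harness, which records pass rate, total cost, and average latency per evaluation run.

```python
# Minimal eval-loop sketch tracking pass rate, cost, and latency.
import time

def run_agent(question: str):
    """Hypothetical stand-in for the real RAG agent.
    Returns (answer, estimated_cost_usd)."""
    return "Profitable", 0.0004

def evaluate(cases: list) -> dict:
    passed, total_cost, latencies = 0, 0.0, []
    for case in cases:
        start = time.perf_counter()
        answer, cost = run_agent(case["question"])
        latencies.append(time.perf_counter() - start)
        total_cost += cost
        if answer == case["expected"]:
            passed += 1
    return {
        "pass_rate": passed / len(cases),
        "total_cost_usd": total_cost,
        "avg_latency_s": sum(latencies) / len(latencies),
    }

report = evaluate([{"question": "Classify Inception's ROI", "expected": "Profitable"}])
print(report)
```

Re-running this after each change gives the baseline-then-iterate measurements the workshop is built around.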

🚀 Quick Start

Prerequisites

Setup

  1. Clone or download this repository

  2. Run the setup script:

    chmod +x setup.sh
    ./setup.sh
  3. Add your API key:

    • Edit .env file
    • Replace your-api-key-here with your actual Cohere API key
  4. Start Jupyter:

    source venv/bin/activate
    jupyter notebook
  5. Open the notebook:

    • Navigate to notebooks/02_rag_with_eval.ipynb
    • Follow along with the instructor

πŸ“ Repository Structure

.
├── README.md                          # This file
├── agenda.md                          # Workshop schedule
├── setup.sh                           # Automated setup script
├── requirements.txt                   # Python dependencies
├── .env.example                       # API key template
├── data/
│   └── ground_truth/
│       └── film_box_office_ground_truth.csv  # Evaluation dataset
└── notebooks/
    └── 02_rag_with_eval.ipynb        # Main workshop notebook

📚 Workshop Agenda

See agenda.md for the full schedule.

Highlights:

  • Block 1 (09:20-11:15): Foundations - Build baseline agent & eval harness
  • Block 2 (11:40-13:30): Iteration - Error analysis & systematic improvements
  • Block 3 (14:15-15:35): Applied Lab - Choose your own use case
  • Block 4 (16:00-17:00): Demos & Q&A

🎓 Learning Objectives

By the end of this workshop, you'll be able to:

✅ Define task success criteria for agent workflows
✅ Build gold-standard evaluation datasets
✅ Implement deterministic and LLM-as-judge evaluators
✅ Run error analysis to identify failure patterns
✅ Iterate systematically to improve pass rates
✅ Track quality metrics (pass rate) and operational metrics (cost, latency)
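Of the two evaluator styles named above, the deterministic kind can be just a few lines. The sketch below checks a predicted ROI against ground truth within a relative tolerance (the 1% default is an illustrative assumption); an LLM-as-judge evaluator would instead prompt a model to grade free-form answers where exact matching is impossible.

```python
# Sketch of a deterministic evaluator: pass if the predicted ROI is
# within a relative tolerance of the ground-truth value.
# The 1% default tolerance is an assumption for illustration.

def deterministic_eval(predicted: float, expected: float, tol: float = 0.01) -> bool:
    """Pass if |predicted - expected| is within tol * |expected|."""
    return abs(predicted - expected) <= tol * abs(expected)

assert deterministic_eval(249.5, 250.0)      # within 1% -> pass
assert not deterministic_eval(200.0, 250.0)  # off by 20% -> fail
```

Deterministic checks are cheap and reproducible, so they make a good first layer before reaching for an LLM judge.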


🔑 Getting Your Cohere API Key

  1. Visit dashboard.cohere.com
  2. Sign up (free tier available)
  3. Generate an API key
  4. Add it to your .env file
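Inside the notebook, the key is read from `.env` into the environment. The notebook most likely uses python-dotenv's `load_dotenv()` for this; the dependency-free sketch below shows the same idea with only the standard library.

```python
# Minimal .env loader sketch (python-dotenv's load_dotenv() does the
# same job; this version avoids the extra dependency for illustration).
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Read KEY=value lines from a .env file into os.environ."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if Path(".env").exists():
    load_env()
    key = os.environ.get("COHERE_API_KEY")
    if not key or key == "your-api-key-here":
        raise RuntimeError("Set COHERE_API_KEY in your .env file first.")
```

Failing fast on a missing or placeholder key saves a confusing API error later in the notebook.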

πŸ› οΈ Troubleshooting

Setup script fails

  • Ensure Python 3.8+ is installed: python3 --version
  • Try manually: python3 -m venv venv && source venv/bin/activate && pip install -r requirements.txt

Module not found errors

  • Activate the virtual environment: source venv/bin/activate
  • Reinstall dependencies: pip install -r requirements.txt

API key issues

  • Check .env file exists and contains your key
  • Verify format: COHERE_API_KEY=your-actual-key (no quotes)
  • Get a key at dashboard.cohere.com/api-keys

Jupyter won't start

  • Ensure venv is activated: source venv/bin/activate
  • Reinstall Jupyter: pip install --upgrade jupyter notebook


👤 Instructor

Yann Stoneman
Staff Solutions Architect @ Cohere


πŸ“ License

Workshop materials provided for educational purposes.
© 2025 Big Data Conference Europe


Questions during the workshop? Ask away! 🙋

After the workshop? Connect on LinkedIn or reach out via conference channels.