Commit eea839c

feat: Add Synthetic Data Gen and Evals for Agents using W&B Weave + Vertex (#1807)
1 parent 0a3eb4a commit eea839c

File tree

17 files changed: +19760 -3 lines changed


.github/actions/spelling/allow.txt

Lines changed: 5 additions & 0 deletions

@@ -453,6 +453,7 @@ TLDR
 TOKENLIST
 TPU
 TPUs
+TRK
 TSLA
 TSMC
 TSNE
@@ -554,6 +555,7 @@ aextract
 afrom
 agentic
 agg
+aggfunc
 ainit
 ainvoke
 aio
@@ -613,6 +615,7 @@ bnb
 booktitle
 boop
 boundings
+boxplot
 bpa
 bpd
 bqdf
@@ -939,6 +942,7 @@ linestyle
 linkedin
 linted
 linting
+litellm
 llm
 llms
 loghub
@@ -1297,6 +1301,7 @@ vectoral
 vectordb
 veo
 vesselin
+viridis
 vllm
 vnc
 voiceover

.github/actions/spelling/line_forbidden.patterns

Lines changed: 0 additions & 3 deletions

@@ -283,9 +283,6 @@
 # Should be Colab
 \s(?!Colab)Co[Ll][Ll]?abs?\b

-# Should be Kaggle
-\skaggle\b
-
 # Should be TPU or TPUs
 \btpus?\b

Lines changed: 3 additions & 0 deletions

data
evaluation_results
*.png

Lines changed: 170 additions & 0 deletions

# Agent Evaluation Framework

This repository contains a framework for generating, evaluating, and analyzing the performance of LLM-powered agents in customer support scenarios.

## Overview

The framework consists of three main components:

1. **Customer Support Agent** - An LLM-powered agent with tools for handling e-commerce customer queries
2. **Dataset Generator** - A system for creating synthetic evaluation datasets with realistic customer queries
3. **Agent Evaluator** - A comprehensive evaluation system for measuring agent performance

## Customer Support Agent

The customer support agent is built using the `smolagents` framework and provides several tools for handling e-commerce queries (a sketch of one such tool follows the list):

- `ProductSearchTool` - Search the product catalog by name, category, or description
- `OrderStatusTool` - Check order status by order ID
- `CategoryBrowseTool` - Browse products by category
- `PriceCheckTool` - Check a product's price by product ID
- `CustomerOrderHistoryTool` - Get the order history for a customer
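
For illustration, here is a minimal sketch of how one of these tools might be defined with `smolagents`. The field values and the in-memory catalog are hypothetical stand-ins, not the repository's actual implementation:

```python
from smolagents import Tool

# Tiny in-memory stand-in for the real product catalog (illustrative only)
CATALOG = [
    {"id": "P100", "name": "Trail Running Shoes", "category": "footwear", "price": 89.99},
    {"id": "P200", "name": "Waterproof Jacket", "category": "outerwear", "price": 129.00},
]


class ProductSearchTool(Tool):
    # Metadata the agent's LLM reads when deciding which tool to call
    name = "product_search"
    description = "Search the product catalog by name, category, or description."
    inputs = {
        "query": {
            "type": "string",
            "description": "Free-text search over product names and categories.",
        }
    }
    output_type = "string"

    def forward(self, query: str) -> str:
        q = query.lower()
        hits = [p for p in CATALOG if q in p["name"].lower() or q in p["category"]]
        if not hits:
            return f"No products found for '{query}'."
        return "\n".join(f"{p['id']}: {p['name']} (${p['price']:.2f})" for p in hits)
```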

The agent can be configured with different LLM models, including Gemini 1.5 Pro, and supports planning capabilities to handle complex multi-step queries.

```python
agent = create_customer_support_agent(
    model_id="google/gemini-1.5-pro",
    use_weave=True,
    temperature=0.2,
    planning_interval=1,
    max_steps=3,
)
```
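
A query can then be handled end to end. `run` is the standard `smolagents` entry point; the query here is hypothetical, and the exact call pattern used in this repository isn't shown in this README:

```python
# The agent plans, calls its tools as needed, and returns a final answer
answer = agent.run("What's the status of order ORD-1001?")
print(answer)
```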

## Dataset Generator

The dataset generator creates realistic evaluation examples by:

1. Generating diverse e-commerce customer queries
2. Running the agent on these queries and recording its trajectory
3. Evaluating each step and the final response using a judge model
4. Filtering examples based on quality thresholds (sketched after the example below)
5. Saving high-quality examples to a dataset for evaluation

```python
generator = DatasetGenerator(
    agent=agent,
    judge_model="gemini/gemini-1.5-pro",
    thresholds={
        "final_response": 0.7,
        "single_step": 0.7,
        "trajectory": 0.7,
    },
    debug=True,
)

examples = create_customer_support_agent_evaluation_dataset(generator, agent, num_prompts=10)
```
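
To make the filtering step concrete: it can be pictured as a per-dimension score comparison. This is a minimal sketch assuming each generated example carries judge scores in a `scores` dict keyed like the `thresholds` above; the actual schema used by `DatasetGenerator` may differ:

```python
# Same cutoffs as the DatasetGenerator example above
thresholds = {"final_response": 0.7, "single_step": 0.7, "trajectory": 0.7}


def passes_thresholds(example, thresholds):
    """Keep an example only if every judged dimension clears its cutoff."""
    scores = example["scores"]  # assumed shape: {"final_response": 0.82, ...}
    return all(scores.get(dim, 0.0) >= cutoff for dim, cutoff in thresholds.items())


# Illustrative candidates; in the real pipeline these come from the agent and judge
candidates = [
    {"query": "Where is my order?", "scores": {"final_response": 0.9, "single_step": 0.8, "trajectory": 0.75}},
    {"query": "Do you sell socks?", "scores": {"final_response": 0.5, "single_step": 0.9, "trajectory": 0.8}},
]
high_quality = [ex for ex in candidates if passes_thresholds(ex, thresholds)]  # keeps only the first
```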

## Agent Evaluator

The evaluator provides comprehensive metrics for agent performance:

- **Response Correctness** - Accuracy and completeness of the agent's final response
- **Tool Selection** - Appropriate use of the available tools
- **Trajectory Analysis** - Efficiency and effectiveness of the agent's path to a solution
- **Reasoning Quality** - Quality of the agent's reasoning process
- **Coherence** - Consistency and clarity of the agent's communication

The evaluator generates detailed reports, visualizations, and metrics to analyze agent performance (a reporting sketch follows the example below).

```python
evaluator = AgentEvaluator(
    model_name="gemini-1.5-pro",
    temperature=0.1,
    verbosity=2,
    use_weave=True,
)

results = evaluator.run_evaluation(agent, eval_dataset)
```
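
One way to picture the reporting step: if the evaluation results are flattened into one row per example with one column per metric, a short pandas summary reproduces the kind of tables and box plots described above. The row layout here is an assumption for illustration, not the evaluator's documented output format:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumed shape: one dict per evaluated example, one key per metric
rows = [
    {"response_correctness": 0.9, "tool_selection": 0.8, "trajectory": 0.7, "coherence": 0.95},
    {"response_correctness": 0.6, "tool_selection": 1.0, "trajectory": 0.8, "coherence": 0.85},
]
df = pd.DataFrame(rows)

print(df.describe())  # mean / std / quartiles per metric

df.boxplot()  # score distribution per metric, akin to the generated plots
plt.ylabel("judge score")
plt.savefig("metric_distributions.png")
```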

## Getting Started

1. Install dependencies:
```
uv sync
```

2. Set up environment variables (some of these will auto-populate if you run `setup.py`; a loader sketch follows these steps):
```
# Create a .env file with your API keys or Colab secrets
GEMINI_API_KEY
HUGGING_FACE_HUB_TOKEN
VERTEX_PROJECT_ID
VERTEX_LOCATION
VERTEX_MODEL_ID
VERTEX_ENDPOINT_ID
DEEPSEEK_ENDPOINT_ID
```

3. Generate an evaluation dataset:
```python
from dataset_generator import DatasetGenerator, create_customer_support_agent_evaluation_dataset
from customer_support_agent import create_customer_support_agent

agent = create_customer_support_agent()
generator = DatasetGenerator(agent=agent)
examples = create_customer_support_agent_evaluation_dataset(generator, agent)
generator.save_dataset(examples, "evaluation_dataset.json")
```

4. Run the evaluation:
```python
from evaluator import AgentEvaluator, load_dataset

eval_dataset = load_dataset("evaluation_dataset.json")
evaluator = AgentEvaluator()
results = evaluator.run_evaluation(agent, eval_dataset)
```
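
If you keep the step 2 keys in a `.env` file, a loader along the following lines pulls them into the process environment before creating the agent. This sketch assumes `python-dotenv` is available, which this README's dependency list does not confirm:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads key=value pairs from ./.env into os.environ

project_id = os.environ["VERTEX_PROJECT_ID"]  # raises KeyError early if a required key is missing
location = os.getenv("VERTEX_LOCATION", "us-central1")  # fallback value is illustrative only
```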

## Features

- **Realistic Data Generation**: Creates synthetic but realistic customer queries based on e-commerce data
- **Comprehensive Evaluation**: Measures multiple aspects of agent performance
- **Visualization**: Generates plots and tables for analysis
- **Weave Integration**: Tracks experiments and results with Weave
  - Logs agent trajectories and evaluation metrics
  - Enables experiment comparison across different agent configurations
  - Provides interactive dashboards for analyzing agent performance
  - Supports versioning of evaluation datasets and results
  - Facilitates collaboration through shareable experiment links
- **Configurable Thresholds**: Adjustable quality thresholds for dataset generation

## Weave Integration

The framework leverages Weave for experiment tracking and visualization (an initialization sketch follows the example below):

1. **Experiment Tracking**: Each agent run is logged as a Weave experiment with detailed metrics
2. **Trajectory Visualization**: Agent trajectories are visualized step by step for analysis
3. **Comparative Analysis**: Compare performance across different agent configurations and models
4. **Custom Dashboards**: Create custom dashboards to monitor specific metrics
5. **Artifact Management**: Store and version datasets, agent configurations, and evaluation results

```python
# Enable Weave logging in agent creation
agent = create_customer_support_agent(
    model_id="google/gemini-1.5-pro",
    use_weave=True,  # Enable Weave logging
    temperature=0.2,
)

# Enable Weave in the evaluator
evaluator = AgentEvaluator(
    model_name="gemini-1.5-pro",
    use_weave=True,  # Enable Weave logging
    verbosity=2,
)
```
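
Under the hood, `use_weave=True` presumably amounts to initializing a Weave client and tracing calls. A minimal sketch with the public `weave` SDK, reusing the `WEAVE_PROJECT_NAME` value added elsewhere in this commit, might look like this; the traced function is an illustrative example, not code from this repository:

```python
import weave

weave.init("agent_evaluation_workshop")  # matches the WEAVE_PROJECT_NAME constant added in this commit


@weave.op()
def answer_query(query: str) -> str:
    # Calls to this function are now traced and visible in the Weave dashboard
    return agent.run(query)  # `agent` as created in the example above


print(answer_query("What is the status of order ORD-1001?"))
```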

## Requirements

- Python 3.8+
- Vertex AI API access
- [Weights & Biases account](https://wandb.ai)
- Required Python packages (see pyproject.toml)

## Contributors

- [Anish Shah](https://github.com/ash0ts)

Lines changed: 15 additions & 0 deletions

#!/bin/bash

# Find and clean all Python files in the main folder, excluding .venv and other hidden directories
echo "Cleaning Python files..."
find . -name "*.py" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs black
find . -name "*.py" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs isort

# Find and clean all Jupyter notebook files in the main folder, excluding hidden directories
echo "Cleaning Jupyter notebook files..."
find . -name "*.ipynb" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs nbqa black
find . -name "*.ipynb" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs nbqa isort

# Run the nox format session for any remaining files
echo "Running final format check..."
nox -s format

Lines changed: 1 addition & 0 deletions

WEAVE_PROJECT_NAME = "agent_evaluation_workshop"
