
Commit 4bc344e

committed
updates to examples, mcp integration, evaluation, judges, and premium samples
1 parent 49d8029 commit 4bc344e


60 files changed (+6203 / -802 lines)

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@ agent_memory/
 !pyproject.toml
 !package.json
 !tsconfig.json
+!premium-samples/samples.json
 # C extensions
 *.so

examples/evaluation/README.md

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
# Multi-Agent Evaluation Suite

Comprehensive evaluation framework for comparing direct models, single agents, and multi-agent systems.

## Quick Start

### One Script, Two Modes

The `comprehensive-evaluation.py` script has everything integrated:

```bash
# Quick test (3 tasks, 4 configs, ~1 minute) - RECOMMENDED FIRST
python comprehensive-evaluation.py quick

# Full evaluation (10 tasks, 4 configs, ~5-10 minutes)
python comprehensive-evaluation.py full
# or simply:
python comprehensive-evaluation.py
```

**Auto-generates:**
- CSV results with scores + **reasoning** (WHY scores are what they are)
- Visualization charts (performance vs efficiency)
- Summary statistics

**Results:**
- `quick_results/quick_results.csv` + `evaluation_results.png`
- `comprehensive_results/comprehensive_results.csv` + `evaluation_results.png`

## What We're Testing

### Configurations

1. **Direct-Model** - Baseline (no agent wrapper)
2. **Single-Agent-Tools** - Agent with tools (Calculator, DateTime, Think)
3. **Multi-Agent-RoundRobin** - Fixed-order team (Planner → Solver → Reviewer)
4. **Multi-Agent-AI** - Dynamic orchestration (AI selects speakers)

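A minimal sketch of how these four configurations could be described for the evaluation loop; the `EvalConfig` dataclass and its fields below are illustrative assumptions, not the script's actual data structures:

```python
from dataclasses import dataclass

@dataclass
class EvalConfig:
    """Hypothetical description of one configuration under test."""
    name: str           # label that appears in the results CSV and charts
    num_agents: int     # how many cooperating agents (0 = direct model call)
    uses_tools: bool    # Calculator / DateTime / Think tools attached?
    orchestration: str  # "none", "round_robin", or "ai_selected"

# Registry mirroring the four configurations listed above
CONFIGS = [
    EvalConfig("Direct-Model", 0, False, "none"),
    EvalConfig("Single-Agent-Tools", 1, True, "none"),
    EvalConfig("Multi-Agent-RoundRobin", 3, True, "round_robin"),
    EvalConfig("Multi-Agent-AI", 3, True, "ai_selected"),
]
```
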
### Task Categories

**Quick Test (3 tasks):**
- Math word problem
- Calculator usage
- Logic puzzle

**Comprehensive (10 tasks across 4 categories):**
- **Simple Reasoning** (3 tasks) - Math, logic, comprehension
- **Tool-Heavy** (3 tasks) - Real-time data, calculations, date operations
- **Complex Planning** (2 tasks) - Multi-constraint optimization
- **Verification** (2 tasks) - Fact-checking, argument analysis

### Evaluation Metrics

- **Overall Score** (0-10) - Composite quality assessment
- **Accuracy** - Correctness of response
- **Completeness** - Thoroughness of answer
- **Helpfulness** - Practical value
- **Clarity** - Communication quality
- **Tokens** - Resource consumption (input + output)
- **Duration** - Wall-clock time (ms)
- **LLM Calls** - API invocations

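These metrics land in the generated CSVs, so a quick per-configuration summary is a few lines of pandas. The column names below are assumptions about the CSV layout; adjust them to the actual headers:

```python
import pandas as pd

# Path from the Quick Start section
df = pd.read_csv("quick_results/quick_results.csv")

# Assumed columns: "configuration", "overall_score", "tokens", "duration_ms"
summary = df.groupby("configuration").agg(
    mean_score=("overall_score", "mean"),
    total_tokens=("tokens", "sum"),
    mean_duration_ms=("duration_ms", "mean"),
)
print(summary.sort_values("mean_score", ascending=False))
```
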
## Interpreting Results

### Performance vs Efficiency

The key insight: **Multi-agent systems should justify their overhead.**

**Example from quick test:**
```
Configuration    Score    Tokens   Efficiency (pts/1K tok)
Direct-Model     7.4/10   156      47.5
Multi-Agent-RR   7.2/10   2157     3.4
```

**Teaching moment:** Multi-agent uses 14x more tokens but scores lower on simple tasks!

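Efficiency here is overall score per thousand tokens. A tiny helper reproduces the column (approximately, since the table's figures are rounded) and the roughly 14x token ratio:

```python
def efficiency(score: float, tokens: int) -> float:
    """Points per 1K tokens: overall score divided by tokens in thousands."""
    return score / (tokens / 1000)

# Quick-test numbers from the table above (rounded, so results are approximate)
print(round(efficiency(7.4, 156), 1))   # ~47.4
print(round(efficiency(7.2, 2157), 1))  # ~3.3
print(round(2157 / 156, 1))             # ~13.8, i.e. the "14x" token overhead
```
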
### When Multi-Agent Should Win

Multi-agent systems should show advantages on:
- **Complex planning** - Multi-step decomposition
- **Tool-heavy tasks** - Specialized tool usage
- **Verification tasks** - Critique and review cycles
- **Multi-constraint** - Balancing competing requirements

### Task Breakdown Analysis

Look for patterns:
- Which tasks benefit from multi-agent coordination?
- Where does orchestration overhead hurt performance?
- Do specialized agents outperform generalists?

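One way to surface these patterns is a pivot of mean score by task category and configuration. As with the earlier snippet, the column names are assumed rather than taken from the script:

```python
import pandas as pd

df = pd.read_csv("comprehensive_results/comprehensive_results.csv")

# Assumed columns: "task_category", "configuration", "overall_score"
pivot = df.pivot_table(
    index="task_category",
    columns="configuration",
    values="overall_score",
    aggfunc="mean",
)
# Rows where the Multi-Agent columns lead are where coordination pays for its overhead
print(pivot.round(1))
```
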
## Tuning Configurations

### Common Adjustments

**If teams time out:**
```python
# Increase message limits
termination=MaxMessageTermination(max_messages=50)  # was 30

# Increase iterations
max_iterations=15  # was 10
```

**If quality is low:**
```python
# Improve agent instructions
# Add more specific tool guidance
# Adjust evaluation criteria
```

**If costs are too high:**
```python
# Use fewer evaluation runs
# Reduce task suite size
# Skip expensive composite judges
```

## Bug Fix Applied

This evaluation suite discovered and fixed a critical PicoAgents bug:

**Issue:** `LLMEvalJudge` was importing `BaseEvalJudge` from the wrong module
**Fix:** Changed `from .._base import BaseEvalJudge` to `from ._base import BaseEvalJudge`
**Location:** `picoagents/src/picoagents/eval/judges/_llm.py:14`

## Next Steps

1. **Run quick test** - Validate setup and tune parameters
2. **Analyze results** - Look for patterns and insights
3. **Iterate configs** - Adjust based on findings
4. **Run comprehensive** - Full evaluation for book/paper
5. **Update chapter** - Integrate results and visualizations

## File Structure

```
evaluation/
├── README.md                       # This file
├── comprehensive-evaluation.py     # Main script (quick + full modes, auto-viz)
├── agent-evaluation.py             # Original example (educational reference)
├── reference-based-evaluation.py   # Judge type demonstrations
├── quick_results/
│   ├── quick_results.csv           # Scores + reasoning
│   └── evaluation_results.png      # Auto-generated charts
└── comprehensive_results/
    ├── comprehensive_results.csv   # Full dataset + reasoning
    └── evaluation_results.png      # Auto-generated charts
```

## Requirements

- Azure OpenAI credentials (set `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`)
- Optional: Google Search API (set `GOOGLE_API_KEY`, `GOOGLE_CSE_ID`) for web search tasks
- Python packages: `picoagents`, `pandas`, `matplotlib`

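A typical first-run setup might look like the following. The values are placeholders, and the install line assumes `pandas`/`matplotlib` come from PyPI while `picoagents` is installed from this repository (per the path in the bug-fix note above):

```bash
# Required: Azure OpenAI credentials
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
export AZURE_OPENAI_API_KEY="<your-key>"

# Optional: only needed for the web-search tasks
export GOOGLE_API_KEY="<your-key>"
export GOOGLE_CSE_ID="<your-cse-id>"

pip install pandas matplotlib       # plus picoagents from the local package
python comprehensive-evaluation.py quick
```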
