# Multi-Agent Evaluation Suite

Comprehensive evaluation framework for comparing direct models, single agents, and multi-agent systems.

## Quick Start

### One Script, Two Modes

The `comprehensive-evaluation.py` script has everything integrated:

```bash
# Quick test (3 tasks, 4 configs, ~1 minute) - RECOMMENDED FIRST
python comprehensive-evaluation.py quick

# Full evaluation (10 tasks, 4 configs, ~5-10 minutes)
python comprehensive-evaluation.py full
# or simply:
python comprehensive-evaluation.py
```

**Auto-generates:**
- CSV results with scores plus **reasoning** (the judge's explanation of why each score was assigned)
- Visualization charts (performance vs efficiency)
- Summary statistics

**Results:**
- `quick_results/quick_results.csv` + `evaluation_results.png`
- `comprehensive_results/comprehensive_results.csv` + `evaluation_results.png`
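
If you want to dig into the CSV beyond the auto-generated charts, a minimal pandas sketch (the column names `configuration`, `overall_score`, and `tokens` are assumptions; inspect the CSV header for the real ones):

```python
import pandas as pd

# Load the quick-test results and check what columns the script wrote.
df = pd.read_csv("quick_results/quick_results.csv")
print(df.columns.tolist())

# Mean quality and token usage per configuration (assumed column names).
summary = df.groupby("configuration")[["overall_score", "tokens"]].mean()
print(summary.sort_values("overall_score", ascending=False))
```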

## What We're Testing

### Configurations

1. **Direct-Model** - Baseline (no agent wrapper)
2. **Single-Agent-Tools** - Agent with tools (Calculator, DateTime, Think)
3. **Multi-Agent-RoundRobin** - Fixed-order team (Planner → Solver → Reviewer)
4. **Multi-Agent-AI** - Dynamic orchestration (AI selects speakers)
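
Each configuration is just a named way of turning a task prompt into an answer. A minimal sketch of keeping all four behind one interface, in plain Python (the factory bodies are hypothetical placeholders, not picoagents API):

```python
from typing import Callable, Dict

# A runner takes a task prompt and returns a response string. In the
# real script these would wrap a model client, a tool-using agent, or
# a multi-agent team with its orchestration strategy.
Runner = Callable[[str], str]

def make_direct_model() -> Runner:
    return lambda prompt: f"<direct model answer to: {prompt}>"

def make_single_agent_tools() -> Runner:
    return lambda prompt: f"<tool-using agent answer to: {prompt}>"

CONFIGS: Dict[str, Runner] = {
    "Direct-Model": make_direct_model(),
    "Single-Agent-Tools": make_single_agent_tools(),
    # "Multi-Agent-RoundRobin" and "Multi-Agent-AI" would be registered
    # the same way, wrapping teams instead of single agents.
}

for name, run in CONFIGS.items():
    print(name, "->", run("What is 17 * 23?"))
```

The registry shape matters more than the bodies: the evaluation loop can then iterate tasks x configurations without special-casing any setup.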

### Task Categories

**Quick Test (3 tasks):**
- Math word problem
- Calculator usage
- Logic puzzle

**Comprehensive (10 tasks across 4 categories):**
- **Simple Reasoning** (3 tasks) - Math, logic, comprehension
- **Tool-Heavy** (3 tasks) - Real-time data, calculations, date operations
- **Complex Planning** (2 tasks) - Multi-constraint optimization
- **Verification** (2 tasks) - Fact-checking, argument analysis
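
A simple way to represent such a suite is a list of records tagged by category, so quick mode can take a slice and the analysis can group by category. A sketch (prompts and field names are illustrative, not the script's actual tasks):

```python
# Illustrative task records; the real suite lives in
# comprehensive-evaluation.py and may use different fields.
TASKS = [
    {"id": "math-1", "category": "Simple Reasoning",
     "prompt": "A train travels 120 km in 1.5 hours. What is its average speed?"},
    {"id": "tool-1", "category": "Tool-Heavy",
     "prompt": "How many days are there between 2024-03-01 and 2024-07-04?"},
    {"id": "plan-1", "category": "Complex Planning",
     "prompt": "Plan a three-city trip under a fixed budget and time limit."},
]

quick_suite = TASKS[:3]  # quick mode: a small representative slice

by_category: dict[str, list[dict]] = {}
for task in TASKS:
    by_category.setdefault(task["category"], []).append(task)
```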

### Evaluation Metrics

- **Overall Score** (0-10) - Composite quality assessment
- **Accuracy** - Correctness of response
- **Completeness** - Thoroughness of answer
- **Helpfulness** - Practical value
- **Clarity** - Communication quality
- **Tokens** - Resource consumption (input + output)
- **Duration** - Wall-clock time (ms)
- **LLM Calls** - API invocations
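
A sketch of what one result row could carry, with the composite computed as an unweighted mean of the four quality dimensions (that aggregation is an assumption; the actual judge may weight dimensions differently):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    configuration: str
    task_id: str
    accuracy: float       # 0-10, from the judge
    completeness: float   # 0-10
    helpfulness: float    # 0-10
    clarity: float        # 0-10
    tokens: int           # input + output
    duration_ms: float    # wall-clock time
    llm_calls: int        # API invocations

    @property
    def overall(self) -> float:
        # Assumed composite: plain average of the four dimensions.
        return (self.accuracy + self.completeness
                + self.helpfulness + self.clarity) / 4
```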

## Interpreting Results

### Performance vs Efficiency

The key insight: **multi-agent systems should justify their overhead.**

**Example from quick test:**
```
Configuration    Score    Tokens   Efficiency (pts/1K tok)
Direct-Model     7.4/10   156      47.5
Multi-Agent-RR   7.2/10   2157     3.4
```

**Teaching moment:** the multi-agent setup uses roughly 14x more tokens yet scores lower on simple tasks!
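
The efficiency column is simply score per thousand tokens. Reproducing it (small differences from the table are expected, since the printed scores are themselves rounded):

```python
def efficiency(score: float, tokens: int) -> float:
    """Quality points per 1K tokens."""
    return score / (tokens / 1000)

print(round(efficiency(7.4, 156), 1))   # ~47.4 for Direct-Model
print(round(efficiency(7.2, 2157), 1))  # ~3.3 for Multi-Agent-RR
print(round(2157 / 156))                # ~14x the token usage
```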

### When Multi-Agent Should Win

Multi-agent systems should show advantages on:
- **Complex planning** - Multi-step decomposition
- **Tool-heavy tasks** - Specialized tool usage
- **Verification tasks** - Critique and review cycles
- **Multi-constraint** - Balancing competing requirements

### Task Breakdown Analysis

Look for patterns:
- Which tasks benefit from multi-agent coordination?
- Where does orchestration overhead hurt performance?
- Do specialized agents outperform generalists?
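
With the comprehensive CSV loaded, a pivot by category makes these patterns visible at a glance. A sketch, again assuming the column names `category`, `configuration`, and `overall_score`:

```python
import pandas as pd

df = pd.read_csv("comprehensive_results/comprehensive_results.csv")

# Mean score per (category, configuration). Cells where a multi-agent
# column beats Direct-Model mark the tasks that justify the overhead.
pivot = df.pivot_table(index="category", columns="configuration",
                       values="overall_score", aggfunc="mean")
print(pivot.round(2))
```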

## Tuning Configurations

### Common Adjustments

**If teams time out:**
```python
# Increase message limits
termination=MaxMessageTermination(max_messages=50)  # was 30

# Increase iterations
max_iterations=15  # was 10
```

**If quality is low:**
- Improve agent instructions
- Add more specific tool guidance
- Adjust evaluation criteria

**If costs are too high:**
- Use fewer evaluation runs
- Reduce the task suite size
- Skip expensive composite judges

## Bug Fix Applied

This evaluation suite discovered and fixed a critical PicoAgents bug:

- **Issue:** `LLMEvalJudge` was importing `BaseEvalJudge` from the wrong module
- **Fix:** Changed `from .._base import BaseEvalJudge` → `from ._base import BaseEvalJudge`
- **Location:** `picoagents/src/picoagents/eval/judges/_llm.py:14`

## Next Steps

1. **Run quick test** - Validate setup and tune parameters
2. **Analyze results** - Look for patterns and insights
3. **Iterate configs** - Adjust based on findings
4. **Run comprehensive** - Full evaluation for book/paper
5. **Update chapter** - Integrate results and visualizations

## File Structure

```
evaluation/
├── README.md                       # This file
├── comprehensive-evaluation.py     # Main script (quick + full modes, auto-viz)
├── agent-evaluation.py             # Original example (educational reference)
├── reference-based-evaluation.py   # Judge type demonstrations
├── quick_results/
│   ├── quick_results.csv           # Scores + reasoning
│   └── evaluation_results.png      # Auto-generated charts
└── comprehensive_results/
    ├── comprehensive_results.csv   # Full dataset + reasoning
    └── evaluation_results.png      # Auto-generated charts
```

## Requirements

- Azure OpenAI credentials (set `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`)
- Optional: Google Search API (set `GOOGLE_API_KEY`, `GOOGLE_CSE_ID`) for web search tasks
- Python packages: `picoagents`, `pandas`, `matplotlib`
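
For example, a typical shell setup before the first run (all values are placeholders):

```bash
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
export AZURE_OPENAI_API_KEY="<your-key>"

# Optional, only needed for the web-search tasks:
export GOOGLE_API_KEY="<your-key>"
export GOOGLE_CSE_ID="<your-cse-id>"

pip install picoagents pandas matplotlib
python comprehensive-evaluation.py quick
```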