Commit 0643450

AG2 CLI (#2497)
* feat: cli -- project scaffolding
* docs: planned commands for cli
* feat: add cli commands implementation
* feat: add ag2 skills for install command
* feat: add registry and destination for install command
* feat: auto discover env and ide for install command
* tests: test existing cli commands
* feat: add ag2 install commands with full artifacts support capability
* feat: add ag2 run command and test
* tests: add tests for commands
* tests: add cli playground cases
* feat: add ag2 create artifacts command and publish command
* fix: better error message
* docs: update cli commands documentations
* feat: add ag2 arena command for cli toolset
* feat: add ag2 proxy command
* feat: add ag2 replay command
* fix: fix bugs around edge cases, remove docs for unimplemented commands
* lint: fix all precommit errors
* lint: fix new line
* docs: add docs for cli
* fix: fix descriptions for creation
* fix: fix descriptions for creation
* lint: fix pre commit errors
* fix replay branch import and refactor runner/commands
* Handle artifacts with owner included in command
* fix: fix overstrike stripping for proxy command
* tests: add tests for proxy fix
* fix: fix lint issue
* lint: fix pr checks

Co-authored-by: Mark Sze <mark@sze.family>
1 parent d33386f commit 0643450

File tree

134 files changed: +23838 / -1 lines changed


.gitignore

Lines changed: 3 additions & 0 deletions

```diff
@@ -4,6 +4,7 @@ skills-lock.json

 .docusaurus/
 node_modules/
+.vite/

 # Project
 .vs/
@@ -208,3 +209,5 @@ remote-examples

 *CLAUDE.md:
 .cursor/
+.claude/
+.vercel/
```

cli/README.md

Lines changed: 115 additions & 0 deletions

# AG2 CLI

Build, run, test, and deploy multi-agent applications from the terminal.

```
pip install ag2-cli
```

```
██          ██
  ██      ██
██  ██████████  ██
  ████      ████
██    ██
██  ██████████████  ██
██  ████  ██████  ████  ██
  ██████████████████

████  ██████  ██████
██  ██    ██    ██
████████  ██  ████    ████
██    ██    ██    ██  ██
██  ██    ████  ████████

Build, run, test, and deploy multi-agent applications
```
## Commands

| Command | Description | Status |
|---------|-------------|--------|
| `ag2 install skills` | Install AG2 skills into your IDE | ✅ Ready |
| `ag2 install template` | Install project templates from resource hub | ✅ Ready |
| `ag2 install tool` | Install AG2 tools and MCP servers | ✅ Ready |
| `ag2 install dataset` | Install datasets for evaluation | ✅ Ready |
| `ag2 install agent` | Install pre-built Claude Code subagents | ✅ Ready |
| `ag2 install bundle` | Install curated artifact collections | ✅ Ready |
| `ag2 install list` | List available skills, templates, targets | ✅ Ready |
| `ag2 install search` | Search for artifacts across all types | ✅ Ready |
| `ag2 install uninstall` | Remove installed artifacts | ✅ Ready |
| `ag2 run` | Run an agent or team from a file | ✅ Ready |
| `ag2 chat` | Interactive terminal chat with agents | ✅ Ready |
| `ag2 serve` | Expose agents as REST/MCP/A2A endpoints | ✅ Ready |
| `ag2 create` | Scaffold projects, agents, tools, teams | ✅ Ready |
| `ag2 test eval` | Run evaluation suites against agents | ✅ Ready |
| `ag2 test bench` | Standardized benchmarks | 🔜 Coming Soon |
| `ag2 replay` | Replay, debug, and branch conversations | ✅ Ready |
| `ag2 arena` | A/B test agent implementations | ✅ Ready |
| `ag2 proxy` | Wrap CLIs/APIs/modules as AG2 tools | ✅ Ready |
| `ag2 publish` | Publish artifacts to the registry | ✅ Ready |
## Quick Start

```bash
# Install skills into your IDE (auto-detects Cursor, Claude Code, etc.)
ag2 install skills

# Install for a specific target
ag2 install skills --target cursor

# List what's available
ag2 install list skills
ag2 install list targets
```
## Architecture

```
cli/
├── src/ag2_cli/
│   ├── app.py            # Main Typer application
│   ├── commands/         # Command implementations
│   │   ├── install.py    # ag2 install (skills, templates, list, uninstall)
│   │   ├── run.py        # ag2 run, ag2 chat
│   │   ├── create.py     # ag2 create (project, agent, tool, team)
│   │   ├── serve.py      # ag2 serve
│   │   └── test.py       # ag2 test (eval, bench)
│   ├── install/          # Install subsystem
│   │   ├── registry.py   # Content pack loading
│   │   └── targets/      # IDE target implementations
│   │       ├── base.py   # DirectoryTarget, SingleFileTarget
│   │       ├── claude.py # Claude Code target
│   │       └── copilot.py # GitHub Copilot target
│   ├── content/          # Bundled content packs
│   │   └── skills/       # Skills pack (rules, skills, agents, commands)
│   └── ui/               # Rich UI components
│       ├── logo.py       # AG2 banner
│       ├── console.py    # Shared console instances
│       └── theme.py      # Color theme
├── docs/                 # Use case design documents
└── tests/
```
## Tech Stack

- **[Typer](https://typer.tiangolo.com/)** — CLI framework (type-hint driven, built on Click)
- **[Rich](https://rich.readthedocs.io/)** — Terminal formatting (tables, panels, progress bars, syntax highlighting)
- **[questionary](https://github.com/tmbo/questionary)** — Interactive prompts (multi-select, fuzzy search)
## Development

```bash
cd cli
pip install -e ".[dev]"
ag2 --version
```
## Artifacts Repository

Skills, templates, and marketplace packages are hosted at
[github.com/ag2ai/resource-hub](https://github.com/ag2ai/resource-hub).

## License

Apache-2.0

cli/docs/arena.md

Lines changed: 163 additions & 0 deletions

# ag2 arena

> A/B test agent implementations — compare quality, cost, and speed.

## Problem

"Is my agent v2 actually better than v1?" Teams answer this by eyeballing
outputs. There's no systematic way to compare two agent implementations
across the same test cases, models, or scenarios.

## Commands

```bash
# Compare two implementations
ag2 arena agent_v1.py agent_v2.py --eval tests/cases.yaml

# Compare across models
ag2 arena my_agent.py --models gpt-4o,claude-sonnet-4-6 --eval tests/cases.yaml

# Tournament — multiple agents, multiple benchmarks
ag2 arena agents/ --eval tests/ --format table

# Interactive head-to-head
ag2 arena agent_v1.py agent_v2.py --interactive

# Export comparison report
ag2 arena agent_v1.py agent_v2.py --eval tests/ --output report.html
```
## Comparison Modes

### Eval-Based Comparison

```bash
ag2 arena agent_v1.py agent_v2.py --eval tests/cases.yaml
```

```
╭─ AG2 Arena ────────────────────────────────────────────────╮
│  Contenders: agent_v1.py vs agent_v2.py                    │
│  Eval cases: 10 | Model: gpt-4o                            │
╰────────────────────────────────────────────────────────────╯

                     agent_v1    agent_v2    winner
  basic_search       ✓           ✓           tie
  tool_usage         ✓           ✓           tie
  multi_step         ✗ (0.62)    ✓ (0.91)    agent_v2
  error_handling     ✓           ✗           agent_v1
  complex_reasoning  ✓ (0.78)    ✓ (0.95)    agent_v2
  code_generation    ✓           ✓           tie
  long_context       ✗           ✓           agent_v2
  structured_output  ✓           ✓           tie
  adversarial        ✗           ✗           tie
  latency_test       ✓ (2.1s)    ✓ (1.3s)    agent_v2

╭─ Summary ──────────────────────────────────────────────────╮
│                  agent_v1    agent_v2                      │
│  Pass rate       70%         80%   (+14%)                  │
│  Avg quality     0.74        0.88  (+19%)                  │
│  Avg time        3.2s        2.4s  (-25%)                  │
│  Avg cost/case   $0.08       $0.06 (-25%)                  │
│  Total cost      $0.80       $0.60                         │
│                                                            │
│  Winner: agent_v2 (better on 3 cases, worse on 1)          │
╰────────────────────────────────────────────────────────────╯
```
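The relative deltas in the summary panel are plain percentage changes against the agent_v1 baseline. A minimal sketch of that arithmetic, using values from the table (the helper name is an assumption, not the CLI's internal code):

```python
def relative_change(baseline: float, candidate: float) -> int:
    """Percent change of candidate vs. baseline, rounded to a whole percent."""
    return round((candidate - baseline) / baseline * 100)

# Values from the summary panel above
print(relative_change(70, 80))      # → 14   (pass rate, +14%)
print(relative_change(0.74, 0.88))  # → 19   (avg quality, +19%)
print(relative_change(3.2, 2.4))    # → -25  (avg time, -25%)
```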
### Model Comparison

```bash
ag2 arena my_agent.py --models gpt-4o,claude-sonnet-4-6,gemini-2.0-flash --eval tests/
```

Same agent, different backends. Helps you choose the best model for your use case.
### Interactive Mode

```bash
ag2 arena agent_v1.py agent_v2.py --interactive
```

```
╭─ AG2 Arena — Interactive ──────────────────────────╮
│  Send the same message to both agents.             │
│  Pick the winner for each round.                   │
╰────────────────────────────────────────────────────╯

You: Explain the CAP theorem with real-world examples

┌─ Agent A ────────────────────────────────────────┐
│ The CAP theorem states that a distributed...     │
│ (534 tokens, 2.1s, $0.02)                        │
└──────────────────────────────────────────────────┘

┌─ Agent B ────────────────────────────────────────┐
│ CAP theorem (Brewer's theorem) defines three...  │
│ (412 tokens, 1.8s, $0.01)                        │
└──────────────────────────────────────────────────┘

Which is better? [A] Agent A [B] Agent B [T] Tie [S] Skip
> B

Score: Agent A: 0  Agent B: 1  Ties: 0

You: █
```

Agent identities are hidden (A/B) to avoid bias. Revealed at the end.
### Tournament Mode

```bash
ag2 arena agents/ --eval tests/ --format table
```

Runs every agent file in `agents/` against every eval case:

```
╭─ Tournament Results ───────────────────────────────────────╮
│                 case1  case2  case3  case4  case5   Score  │
│  researcher_v1    ✓      ✓      ✗      ✓      ✓      80%   │
│  researcher_v2    ✓      ✓      ✓      ✓      ✗      80%   │
│  researcher_v3    ✓      ✓      ✓      ✓      ✓     100% 🏆│
│  baseline         ✓      ✗      ✗      ✓      ✗      40%   │
╰────────────────────────────────────────────────────────────╯
```
## ELO Rating System

For interactive mode, maintain an ELO rating across sessions:

```bash
ag2 arena --leaderboard
```

```
╭─ Agent Leaderboard ────────────────────────────────╮
│  Rank  Agent           ELO    W/L/T    Last        │
│  1     researcher_v3   1523   12/2/1   Today       │
│  2     researcher_v2   1487   8/4/3    Today       │
│  3     researcher_v1   1445   6/6/3    Yesterday   │
│  4     baseline        1320   2/10/3   Yesterday   │
╰────────────────────────────────────────────────────╯
```
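The standard Elo update after each judged round can be sketched as follows. This is an illustrative sketch, not the CLI's actual implementation; a K-factor of 32 is a common default assumption:

```python
def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update two Elo ratings after one round.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    # Expected score for A from the logistic Elo curve (400-point scale).
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Two evenly matched agents; A wins the round
a, b = elo_update(1500, 1500, 1.0)
print(round(a), round(b))  # → 1516 1484
```

Ties leave evenly matched ratings unchanged, and upsets move more points than expected wins, which is what makes the leaderboard converge across sessions.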
## Implementation Notes

### Parallel Execution
Run both agents concurrently using `asyncio.gather()` for speed.
Ensure they have isolated state (fresh agent instances per case).
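A minimal sketch of that pattern, assuming each contender exposes an async `respond(prompt)` entry point (a placeholder interface, not the real runner API):

```python
import asyncio


async def run_case(prompt: str, agents: list) -> list[str]:
    """Run one eval case against all contenders concurrently."""
    # Fresh agent instances per case keep contender state isolated.
    return await asyncio.gather(*(agent.respond(prompt) for agent in agents))


class EchoAgent:
    """Stand-in contender used only to demonstrate the pattern."""

    def __init__(self, name: str):
        self.name = name

    async def respond(self, prompt: str) -> str:
        await asyncio.sleep(0)  # placeholder for a real model call
        return f"{self.name}: {prompt}"


replies = asyncio.run(run_case("ping", [EchoAgent("v1"), EchoAgent("v2")]))
print(replies)  # → ['v1: ping', 'v2: ping']
```

`gather` preserves input order, so result `i` always belongs to contender `i` regardless of which finishes first.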
### Statistical Significance
With `--runs N`, run each case N times and compute confidence intervals.
Report whether differences are statistically significant (p < 0.05).
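A rough sketch of that check on paired per-run scores, using only the standard library. The p-value here uses a normal approximation to the paired t-test; a real implementation would likely reach for `scipy.stats.ttest_rel` instead:

```python
import math
import statistics


def paired_significance(scores_a: list[float], scores_b: list[float]) -> tuple[float, float]:
    """t statistic and approximate two-sided p-value for paired runs.

    Normal approximation; fine as a sketch for N runs per case.
    """
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean diff
    t = mean / se
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided tail probability
    return t, p


# Illustrative quality scores from 10 runs of the same case
v1 = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.70, 0.71, 0.72, 0.69]
v2 = [0.88, 0.85, 0.90, 0.87, 0.89, 0.86, 0.88, 0.91, 0.87, 0.88]
t, p = paired_significance(v1, v2)
print(f"significant={p < 0.05}")  # → significant=True
```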
### Cost Controls
- `--budget $5.00` — stop when total arena cost exceeds budget
- `--dry-run` — estimate cost before running
- Individual case cost limits from eval YAML
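The `--budget` behavior above amounts to a running-total guard checked after each case. A minimal sketch (the class and its per-case costs are illustrative, not the actual CLI code):

```python
class BudgetGuard:
    """Stops an arena run once accumulated cost exceeds a dollar budget."""

    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def charge(self, cost: float) -> bool:
        """Record a case's cost; return False once the budget is exhausted."""
        self.spent += cost
        return self.spent <= self.budget


guard = BudgetGuard(budget=0.25)
case_costs = [0.08, 0.06, 0.07, 0.09]  # illustrative per-case costs
completed = 0
for cost in case_costs:
    if not guard.charge(cost):
        break  # budget exceeded; stop scheduling further cases
    completed += 1
print(completed)  # → 3
```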
### Integration with ag2 test
Arena builds on the same eval case format as `ag2 test eval`.
The assertion system is shared — arena just runs two agents instead of one.
