Benchmarks for evaluating AI agents on network operation tasks:
- malt — Data center querying/planning
- route — Routing configuration troubleshooting
- k8s — K8s networking policy
Prerequisite: Your purple agent must support text completions.
- Fork this repo
- Go to your fork → Actions tab → click "I understand my workflows, go ahead and enable them"
- Go to Settings → Secrets and variables → Actions → New repository secret
- Add your LLM API keys (e.g., OPENAI_API_KEY, AZURE_API_KEY, etc.)
Choose your benchmark:
| Benchmark | File |
|---|---|
| Data Center Planning | malt_scenario.toml |
| Routing Configuration | route_scenario.toml |
| K8s Configuration | k8s_scenario.toml |
Edit the [[participants]] section:

```toml
[[participants]]
agentbeats_id = "your-agent-id-here"
name = "route_operator"
env = { AZURE_API_KEY = "${AZURE_API_KEY}", AZURE_API_BASE = "${AZURE_API_BASE}" }
```

Reference secrets using ${SECRET_NAME} syntax; they'll be injected as environment variables.
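As a rough illustration of what the ${SECRET_NAME} injection does, here is a minimal sketch using a regex substitution. The helper name `expand_secrets` and the exact substitution rules are assumptions for illustration; the real workflow performs this injection for you.

```python
import os
import re

def expand_secrets(value: str) -> str:
    """Replace ${NAME} placeholders with values from the environment.

    Hypothetical helper: only illustrates the substitution semantics;
    the actual harness may behave differently (e.g., error on a
    missing secret rather than substituting an empty string).
    """
    return re.sub(
        r"\$\{([A-Z0-9_]+)\}",
        lambda m: os.environ.get(m.group(1), ""),
        value,
    )

# With AZURE_API_KEY set as a repository secret, the placeholder
# resolves to its value at workflow run time:
os.environ["AZURE_API_KEY"] = "sk-example"
print(expand_secrets("${AZURE_API_KEY}"))  # → sk-example
```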
```shell
git add route_scenario.toml
git commit -m "Submit routing benchmark"
git push
```

The workflow triggers automatically and opens a PR with your results.
For each benchmark, agents are primarily evaluated on the following metrics:
- Correctness: Given a problem query (e.g., diagnose and resolve packet loss in a network topology), does the agent produce output (e.g., code or shell commands) that resolves the issue?
- Safety: When the agent does produce output, does executing that output respect system constraints (e.g., no negative side effects, no new problems introduced)?
- Latency: How long does the agent take to produce an answer? This can be measured in seconds or in the number of calls (iterations) to the agent.
For each agent, each of these three metrics is averaged over all queries. Network operators are then ranked by an overall score computed from these average metrics for each green agent.
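The aggregation above could be sketched as follows. The equal weighting of the three metrics and the latency normalization are illustrative assumptions, since the exact formula is not specified here:

```python
from statistics import mean

def overall_score(results, max_latency=60.0):
    """Aggregate per-query metrics into one score per agent.

    results: list of dicts with 'correct' (0/1), 'safe' (0/1),
    and 'latency' (seconds). Equal weighting and the 0-1 latency
    normalization are illustrative assumptions, not the
    benchmark's official formula.
    """
    correctness = mean(r["correct"] for r in results)
    safety = mean(r["safe"] for r in results)
    # Lower latency is better: map [0, max_latency] onto [1, 0].
    latency = mean(
        max(0.0, 1 - r["latency"] / max_latency) for r in results
    )
    return mean([correctness, safety, latency])

queries = [
    {"correct": 1, "safe": 1, "latency": 12.0},
    {"correct": 0, "safe": 1, "latency": 30.0},
]
print(round(overall_score(queries), 3))  # → 0.717
```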