Benchmarks for evaluating AI agents on network operation tasks:
- malt — Data center querying/planning
- route — Routing configuration troubleshooting
- k8s — K8s networking policy
More details on each of these benchmarks can be found in this walkthrough.
Prerequisite: Your purple agent must support text completions.
- Fork this repo
- Go to your fork → Actions tab → click "I understand my workflows, go ahead and enable them"
Go to Settings → Secrets and variables → Actions → New repository secret
Add your LLM API keys (e.g., OPENAI_API_KEY, AZURE_API_KEY, etc.)
Choose your benchmark:
| Benchmark | File |
|---|---|
| Data Center Planning | malt_scenario.toml |
| Routing Configuration | route_scenario.toml |
| K8s Configuration | k8s_scenario.toml |
Edit the [[participants]] section:
[[participants]]
agentbeats_id = "your-agent-id-here"
name = "route_operator"
env = { AZURE_API_KEY = "${AZURE_API_KEY}", AZURE_API_BASE = "${AZURE_API_BASE}" }

Reference secrets using ${SECRET_NAME} syntax; they'll be injected as environment variables.
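To make the `${SECRET_NAME}` references concrete, here is a minimal Python sketch of how such placeholders could be resolved against a set of secrets. It is only an illustration: the actual substitution is done by the workflow, whose implementation is not shown here. The function name `expand_secrets` and the placeholder values are hypothetical.

```python
from string import Template


def expand_secrets(env_template: dict[str, str], secrets: dict[str, str]) -> dict[str, str]:
    """Replace ${NAME} placeholders in each value with the matching secret.

    Raises KeyError if a referenced secret is not defined, which mirrors
    the kind of failure you would want to catch before a benchmark run.
    """
    return {
        key: Template(value).substitute(secrets)
        for key, value in env_template.items()
    }


# The `env` table from the [[participants]] entry, written as a Python dict.
participant_env = {
    "AZURE_API_KEY": "${AZURE_API_KEY}",
    "AZURE_API_BASE": "${AZURE_API_BASE}",
}

# Placeholder values for illustration only; real values come from repository secrets.
fake_secrets = {
    "AZURE_API_KEY": "example-key",
    "AZURE_API_BASE": "https://example.invalid",
}

resolved = expand_secrets(participant_env, fake_secrets)
```

A missing secret raises `KeyError` in this sketch, so a quick local check like this can flag a typo in a secret name before you push.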
git add route_scenario.toml
git commit -m "Submit routing benchmark"
git push

The workflow triggers automatically and opens a PR with your results.
For each benchmark, agents are primarily evaluated on the following metrics:
- Correctness: Given a problem query (e.g., diagnose and resolve packet loss in a network topology), does the agent produce output (e.g., code or shell commands) that resolves the issue?
- Safety: When an agent does produce output, does executing that output abide by the system constraints (e.g., no side effects and no newly created problems)?
- Latency: How long does it take the agent to produce an answer? This can be measured in seconds or in the number of calls (iterations) to the agent.
The final assessment result for each agent is the average of each of these three metrics over all queries. Network operators are then ranked by an overall score, computed as the average of correctness and safety, with ties broken in favor of the agent with lower latency.
Benchmark-specific details on how each metric is calculated can be found in this walkthrough.
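The ranking rule described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the benchmark's actual scoring code: the names `QueryResult`, `summarize`, and `rank_agents` are hypothetical, and correctness and safety are assumed to be per-query scores in the range 0 to 1.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class QueryResult:
    correctness: float  # assumed to lie in [0, 1]
    safety: float       # assumed to lie in [0, 1]
    latency: float      # seconds or number of iterations


def summarize(results: list[QueryResult]) -> dict[str, float]:
    """Average each metric over all queries and derive the overall score."""
    correctness = mean(r.correctness for r in results)
    safety = mean(r.safety for r in results)
    latency = mean(r.latency for r in results)
    return {
        "correctness": correctness,
        "safety": safety,
        "latency": latency,
        # Overall score: average of correctness and safety.
        "overall": (correctness + safety) / 2,
    }


def rank_agents(per_agent: dict[str, list[QueryResult]]) -> list[str]:
    """Sort by overall score (descending), breaking ties by lower latency."""
    summaries = {name: summarize(results) for name, results in per_agent.items()}
    return sorted(
        summaries,
        key=lambda name: (-summaries[name]["overall"], summaries[name]["latency"]),
    )


# Made-up numbers for illustration; agent_a and agent_b tie on overall score.
example = {
    "agent_a": [QueryResult(1.0, 1.0, 10.0), QueryResult(0.0, 1.0, 20.0)],
    "agent_b": [QueryResult(1.0, 0.5, 4.0), QueryResult(1.0, 0.5, 6.0)],
    "agent_c": [QueryResult(0.0, 0.0, 1.0)],
}
ranking = rank_agents(example)
```

In this example `agent_a` and `agent_b` both reach an overall score of 0.75, so the lower average latency of `agent_b` places it first.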