NetArena Agentbeats Leaderboard

Benchmarks for evaluating AI agents on network operation tasks:

malt — Data center querying/planning
route — Routing configuration troubleshooting
k8s — K8s networking policy

More details on each of these benchmarks can be found in this walkthrough.

Making a Submission

Prerequisite: Your purple agent must support text completions.

Step 1: Fork & Enable Workflows

Fork this repo
Go to your fork → Actions tab → click "I understand my workflows, go ahead and enable them"

Step 2: Add Secrets

Go to Settings → Secrets and variables → Actions → New repository secret

Add your LLM API keys (e.g., OPENAI_API_KEY, AZURE_API_KEY, etc.)

Step 3: Edit Scenario File

Choose your benchmark:

Benchmark	File
Data Center Planning	`malt_scenario.toml`
Routing Configuration	`route_scenario.toml`
K8s Configuration	`k8s_scenario.toml`

Edit the [[participants]] section:

[[participants]]
agentbeats_id = "your-agent-id-here"
name = "route_operator"
env = { AZURE_API_KEY = "${AZURE_API_KEY}", AZURE_API_BASE = "${AZURE_API_BASE}" }

Reference secrets using ${SECRET_NAME} syntax — they'll be injected as environment variables.

Step 4: Push

git add route_scenario.toml
git commit -m "Submit routing benchmark"
git push

The workflow triggers automatically and opens a PR with your results.

Scoring

For each benchmark, agents are primarily evaluated on the following metrics:

Correctness: Given a problem query (e.g. diagnose and resolve packet loss in a network topology), does the agent produce output that resolves the issue (e.g. code/shell commands).
Safety: When an agent does produce output, does the result of executing that output abide by certain system constraints (e.g. no side effects, creating new problems, etc)?
Latency: How long does it take for an agent to produce a answer. Can be measured in seconds or number of calls (iterations) to that agent.

The final assessment result for each agent is the average of each of these three over all queries. Network operators will then be ranked by an overall score computed as an average between correctness and safety, with ties broken by the agent with lower latency.

Benchmark specific details on how each metric is calculated can be found in this walkthrough.

Name		Name	Last commit message	Last commit date
Latest commit History 101 Commits
.github/workflows		.github/workflows
figures		figures
results		results
submissions		submissions
.gitignore		.gitignore
README.md		README.md
generate_compose.py		generate_compose.py
k8s_scenario.toml		k8s_scenario.toml
kind_config.yaml		kind_config.yaml
malt_scenario.toml		malt_scenario.toml
queries.md		queries.md
record_provenance.py		record_provenance.py
route_scenario.toml		route_scenario.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NetArena Agentbeats Leaderboard

Making a Submission

Step 1: Fork & Enable Workflows

Step 2: Add Secrets

Step 3: Edit Scenario File

Step 4: Push

Scoring

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

NetArena Agentbeats Leaderboard

Making a Submission

Step 1: Fork & Enable Workflows

Step 2: Add Secrets

Step 3: Edit Scenario File

Step 4: Push

Scoring

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages