# Performance Benchmarks

> **Last updated:** March 2026 · **Toolkit version:** 2.1.0 · **Python:** 3.13 · **OS:** Windows 11 (AMD64)
>
> All benchmarks use `time.perf_counter()` with 10,000 iterations (unless noted).
> Numbers are from a development workstation — CI runs on `ubuntu-latest` GitHub-hosted runners.
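
The toolkit's actual harnesses live under `benchmarks/`; as a rough illustration of the methodology above (`perf_counter`, fixed iteration count, percentile reporting), a minimal sketch might look like this. The `bench` helper and its workload are illustrative assumptions, not the toolkit's API:

```python
import statistics
import time

def bench(fn, iterations=10_000):
    """Time fn() per call with time.perf_counter(); report ms percentiles."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)  # ms per call
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {
        "ops_per_sec": iterations / (sum(samples) / 1000.0),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }

stats = bench(lambda: sum(range(100)), iterations=1_000)
print(stats["p50_ms"] <= stats["p99_ms"])  # → True
```

Reporting percentiles rather than means is what makes the occasional GC pause or scheduler hiccup visible in the p99 column.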

| What you care about | Number |
| --- | --- |
| **Policy evaluation (single rule)** | **0.011 ms** (p50) — 84K ops/sec |
| **Policy evaluation (100 rules)** | **0.030 ms** (p50) — 32K ops/sec |
| **Kernel enforcement (allow path)** | **0.103 ms** (p50) — 9.7K ops/sec |
| **Adapter governance overhead** | **0.005–0.007 ms** (p50) — 135K–190K ops/sec |
| **Circuit breaker check** | **0.0005 ms** (p50) — 1.83M ops/sec |
| **Concurrent throughput (50 agents)** | **46,329 ops/sec** |
| **Concurrent throughput (1,000 agents)** | **47,085 ops/sec** |

**Bottom line:** Policy enforcement adds **< 0.1 ms** per action. At 1,000 concurrent agents, the governance layer sustains **47K ops/sec** with near-linear scaling — your LLM API call is 1,000–10,000× slower.

---

## Policy Evaluation

Measures `PolicyEvaluator.evaluate()` — the core enforcement path every agent action goes through.

| Benchmark | ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
| --- | ---: | ---: | ---: | ---: |
| Single rule evaluation | 84,489 | 0.011 | 0.014 | 0.037 |
| 10-rule policy | 76,406 | 0.012 | 0.017 | 0.049 |
| 100-rule policy | 32,025 | 0.030 | 0.039 | 0.108 |
| SharedPolicy cross-project eval | 116,454 | 0.008 | 0.010 | 0.028 |
| YAML policy load (cold, 10 rules) | 112 | 8.432 | 12.717 | 17.763 |

**Key takeaway:** Rule count scales linearly. Even with 100 rules, p99 is under 0.11 ms. YAML loading is a cold-start cost (once per deployment, not per action).

Source: [`packages/agent-os/benchmarks/bench_policy.py`](packages/agent-os/benchmarks/bench_policy.py)
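
The linear-scaling behavior can be sanity-checked with a stand-in evaluator. The rule shape below is hypothetical — the real `PolicyEvaluator.evaluate()` lives in the package — but it illustrates why per-evaluation cost grows with rule count:

```python
import time

# Hypothetical stand-in: a "rule" is a predicate over an action dict.
# This is NOT the toolkit's rule schema; it only models sequential rule
# application, where cost grows roughly linearly with rule count.
def make_rules(n):
    return [lambda action, i=i: action.get("tool") != f"forbidden_{i}"
            for i in range(n)]

def evaluate(rules, action):
    # Allow only if no rule denies; short-circuits on the first denial.
    return all(rule(action) for rule in rules)

action = {"tool": "search"}
for n in (1, 10, 100):
    rules = make_rules(n)
    start = time.perf_counter()
    for _ in range(10_000):
        evaluate(rules, action)
    per_call_ms = (time.perf_counter() - start) / 10_000 * 1000
    print(f"{n:>3} rules: {per_call_ms:.4f} ms/eval")
```

Note the per-call numbers stay well under the YAML cold-load cost, which is paid once at startup rather than per action.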

## Kernel Enforcement

Measures `StatelessKernel.execute()` — the full enforcement path including policy evaluation.

| Benchmark | ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
| --- | ---: | ---: | ---: | ---: |
| Kernel execute (allow) | 9,668 | 0.103 | 0.198 | 0.347 |
| Kernel execute (deny) | 10,239 | 0.097 | 0.191 | 0.322 |
| Circuit breaker state check | 1,828,845 | 0.001 | 0.001 | 0.001 |

### Concurrent Throughput (Scaling)

| Concurrency | Total ops | Wall time (s) | ops/sec | vs. single-threaded |
| ---: | ---: | ---: | ---: | --- |
| 50 agents × 200 ops | 10,000 | 0.216 | 46,329 | 4.8× |
| 100 agents × 100 ops | 10,000 | 0.209 | 47,920 | 5.0× |
| 500 agents × 100 ops | 50,000 | 1.085 | 46,089 | 4.8× |
| **1,000 agents × 100 ops** | **100,000** | **2.124** | **47,085** | **4.9×** |

**Key takeaway:** Throughput is **stable at ~47K ops/sec** from 50 to 1,000 concurrent agents — no degradation at scale. The deny path is slightly faster than allow (no downstream execution). Circuit breaker overhead is negligible (sub-microsecond).

Source: [`packages/agent-os/benchmarks/bench_kernel.py`](packages/agent-os/benchmarks/bench_kernel.py)
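
The concurrency measurement boils down to concurrency × per-task ops ÷ wall time. A minimal sketch of that shape, with a trivial stand-in workload in place of `kernel.execute(...)` (the real benchmark is `bench_concurrent_kernel` in the file above):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_agent(per_task):
    """Stand-in for one agent's loop of governed actions."""
    for _ in range(per_task):
        sum(range(50))  # placeholder workload, not the real kernel call

def bench_concurrent(concurrency=50, per_task=200):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(run_agent, per_task) for _ in range(concurrency)]
        for f in futures:
            f.result()  # re-raise any worker exception
    wall = time.perf_counter() - start
    total = concurrency * per_task
    return {"total_ops": total, "wall_s": wall, "ops_per_sec": total / wall}

print(bench_concurrent(concurrency=50, per_task=200)["total_ops"])  # → 10000
```

Holding total ops roughly constant while varying concurrency, as the table does, separates scheduler overhead from raw per-op cost.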

## Audit Logging

Measures audit entry creation, querying, and serialization — the observability layer.

| Benchmark | ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
| --- | ---: | ---: | ---: | ---: |
| Audit entry write | 285,202 | 0.002 | 0.006 | 0.008 |
| Audit entry serialization | 343,548 | 0.003 | 0.003 | 0.004 |
| Execution time tracking | 442,206 | 0.002 | 0.002 | 0.003 |
| Audit log query (10K entries) | 1,399 | 0.716 | 0.877 | 1.076 |

**Key takeaway:** Audit writes add ~2 µs per action. Querying 10K entries takes ~0.7 ms (in-memory scan). For production deployments, external append-only stores (e.g., OpenTelemetry export) are recommended for large-scale query workloads.

Source: [`packages/agent-os/benchmarks/bench_audit.py`](packages/agent-os/benchmarks/bench_audit.py)
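
The write/query asymmetry is the usual append-vs-scan trade-off. A toy model (the entry fields here are hypothetical, not the toolkit's audit schema) makes the O(n) query cost concrete:

```python
import time

audit_log = []

def write_entry(agent_id, tool, decision):
    # Append-only write: a single list append is the ~2 µs hot path.
    audit_log.append(
        {"ts": time.time(), "agent": agent_id, "tool": tool, "decision": decision}
    )

def query(agent_id):
    # In-memory linear scan — fast at 10K entries, but O(n), which is why an
    # external indexed store is recommended for large query workloads.
    return [e for e in audit_log if e["agent"] == agent_id]

for i in range(10_000):
    write_entry(f"agent-{i % 100}", "search", "allow")
print(len(query("agent-7")))  # → 100
```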

## Adapter Overhead

Measures the governance check overhead per framework adapter — the cost added to every tool call.

| Adapter | ops/sec | p50 (ms) | p95 (ms) | p99 (ms) |
| --- | ---: | ---: | ---: | ---: |
| GovernancePolicy init (startup) | 134,923 | 0.007 | 0.008 | 0.019 |
| Tool allowed check | 3,745,036 | 0.000 | 0.000 | 0.000 |
| Pattern match (per call) | 135,717 | 0.007 | 0.008 | 0.022 |
| **OpenAI** adapter | 166,363 | 0.005 | 0.007 | 0.017 |
| **LangChain** adapter | 156,591 | 0.006 | 0.007 | 0.019 |
| **Anthropic** adapter | 164,194 | 0.006 | 0.008 | 0.017 |
| **LlamaIndex** adapter | 156,157 | 0.006 | 0.007 | 0.016 |
| **CrewAI** adapter | 190,134 | 0.005 | 0.006 | 0.013 |
| **AutoGen** adapter | 169,358 | 0.005 | 0.007 | 0.018 |
| **Google Gemini** adapter | 180,770 | 0.006 | 0.006 | 0.011 |
| **Mistral** adapter | 182,439 | 0.005 | 0.006 | 0.015 |
| **Semantic Kernel** adapter | 170,930 | 0.005 | 0.007 | 0.014 |

**Key takeaway:** All adapters add **< 0.02 ms** (p99) per tool call. This is 4–5 orders of magnitude below a typical LLM API round-trip (200–2,000 ms). The governance layer is invisible to end users.

Source: [`packages/agent-os/benchmarks/bench_adapters.py`](packages/agent-os/benchmarks/bench_adapters.py)

## SRE Operations

Measures chaos engineering, SLO enforcement, and observability primitives.

| Benchmark | ops/sec | p50 (µs) | p99 (µs) |
| --- | ---: | ---: | ---: |
| Fault injection | 428,253 | 1.20 | 6.60 |
| Chaos template init | 98,889 | 9.10 | 18.50 |
| Chaos schedule eval | 168,380 | 5.30 | 7.60 |
| SLO evaluation | 29,475 | 30.10 | 96.60 |
| Error budget calculation | 29,851 | 31.70 | 111.70 |
| Burn rate alert | 25,543 | 37.10 | 116.20 |
| SLI recording | 284,274 | 2.40 | 11.10 |

**Key takeaway:** SRE operations are sub-120 µs at p99. SLI recording (the hot path for every action) is ~2.4 µs. These can run alongside every agent action without measurable impact.

Source: [`packages/agent-sre/benchmarks/`](packages/agent-sre/benchmarks/)

```bash
cd ../agent-sre
pip install -e ".[dev]"
python benchmarks/bench_chaos.py
python benchmarks/bench_slo.py

# Custom concurrency levels (default: 50 agents × 200 ops)
python -c "
from benchmarks.bench_kernel import bench_concurrent_kernel
import json
result = bench_concurrent_kernel(concurrency=1000, per_task=100)
print(json.dumps(result, indent=2))
"
```

### CI Integration

## Latency in Context

For context, here's where the governance overhead sits relative to typical agent operations:

| Operation | Typical latency |
| --- | --- |
| **Policy evaluation (this toolkit)** | **0.01–0.03 ms** |
| **Full kernel enforcement** | **0.10 ms** |
| **Adapter overhead** | **0.005–0.007 ms** |
| Python function call | 0.001 ms |
| Redis read (local) | 0.1–0.5 ms |
| Database query (simple) | 1–10 ms |
| LLM API call (GPT-4) | 200–2,000 ms |
| LLM API call (Claude Sonnet) | 300–3,000 ms |
The governance layer adds less overhead than a single Redis read and is **10,000× faster than an LLM call**.

## Version History

| Version | Date | Notable changes |
| --- | --- | --- |
| v2.1.0 | March 2026 | Added 1K concurrent agent benchmarks, ~15% faster policy eval vs v1.1.x |
| v1.1.0 | February 2026 | Initial published benchmarks |