feat(orchestrator): add concurrent sandbox creation benchmark #2280

Open

arkamar wants to merge 5 commits into main from experiment/sbx-concurency

Conversation

@arkamar
Contributor

@arkamar arkamar commented Apr 1, 2026

This PR adds a benchmark to measure how many sandboxes can be effectively
resumed in parallel on a single node, to help tune
maxStartingInstancesPerNode for different machine types.

  • BenchmarkConcurrentResume launches N sandboxes simultaneously and
    reports latency percentiles (P50/P95/P99) and wall-clock time
  • Latencies aggregated across all iterations for stable stats at high
    iteration counts (e.g. -benchtime=100x)
  • Includes bench2md.awk helper to convert raw output to markdown tables

arkamar added 4 commits April 1, 2026 16:12
The current maxStartingInstancesPerNode limit of 3 was established from
non-rigorous historical measurements on n1 machines. To find the optimal
limit for n2 machines we need proper data.

This adds BenchmarkConcurrentResume which launches N sandboxes
simultaneously via barrier-synchronized goroutines and measures
per-sandbox creation latency (avg, P50, P95, P99, min, max),
wall-clock time, and success/failure counts.

Concurrency levels default to 1,2,3,4,5,6,7,8,9,10 and can be overridden
via the CONCURRENCY_LEVELS env var. Network and NBD pool sizes are
scaled to the max concurrency level to avoid measuring pool starvation
instead of actual creation overhead.

Replace cgroup.NewNoopManager() with the real cgroup.NewManager() so
sandboxes get proper CPU and memory isolation via cgroupv2, matching
production behavior.

Accumulate per-sandbox latencies and wall-clock times across all
benchmark iterations instead of reporting only the last iteration.
This makes percentiles (P50/P95/P99) statistically meaningful at
high iteration counts (e.g. -benchtime=100x at concurrency-5 gives
500 latency samples instead of 5).
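The aggregation idea can be shown with a small standalone sketch: keep a single slice of samples across all iterations and compute percentiles once over the whole set. The latency values here are synthetic, not real resume timings.

```go
package main

import (
	"fmt"
	"slices"
	"time"
)

// aggregate collects every per-sandbox latency across all iterations into a
// single sorted slice, so percentiles are computed over
// iterations*concurrency samples rather than only the final iteration's.
func aggregate(iterations, concurrency int) []time.Duration {
	var all []time.Duration
	for it := 0; it < iterations; it++ {
		for s := 0; s < concurrency; s++ {
			// synthetic stand-in for a measured ResumeSandbox latency
			all = append(all, time.Duration(10+(it+s)%7)*time.Millisecond)
		}
	}
	slices.Sort(all)

	return all
}

func main() {
	all := aggregate(100, 5) // e.g. -benchtime=100x at concurrency-5
	p := func(q float64) time.Duration {
		return all[int(float64(len(all)-1)*q/100.0)]
	}
	fmt.Printf("samples=%d p50=%v p95=%v p99=%v\n", len(all), p(50), p(95), p(99))
}
```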
@cursor

cursor bot commented Apr 1, 2026

PR Summary

Low Risk
Low risk since changes are limited to a new benchmark test and an auxiliary AWK reporting script, with no production code paths modified.

Overview
Adds BenchmarkConcurrentResume, a root-only Go benchmark that resumes many sandboxes in parallel at configurable concurrency levels and reports aggregated latency percentiles plus wall-clock timing to help tune per-node startup limits. Also includes a small bench2md.awk helper to convert the benchmark output into a Markdown table (including derived throughput).

Written by Cursor Bugbot for commit 5d98b65. This will update automatically on new commits.

@linear

linear bot commented Apr 1, 2026

- Use max/min builtins instead of if-chains in percentile()
- Use WaitGroup.Go instead of manual Add/Done goroutine pattern
- Replace sort.Slice/sort.Ints with slices.Sort
- Add blank line before bare return (nlreturn)
- Rename min/max variables to fastest/slowest to avoid shadowing builtins
@arkamar arkamar marked this pull request as ready for review April 1, 2026 15:15

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d98b658f8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}
}

b.StartTimer()

P2: Keep benchmark timer stopped during result aggregation

runConcurrentResume calls b.StartTimer() right before returning, so the caller’s per-iteration bookkeeping (append, failure logging, and latency accumulation in the for b.Loop() body) is measured as part of the benchmark. This skews ns/op and affects iteration calibration for time-based benchtimes, especially at higher concurrency/failure rates where aggregation work grows, making cross-level comparisons less reliable.
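The suggestion amounts to bracketing the per-iteration bookkeeping with StopTimer/StartTimer inside the loop body. A self-contained sketch of the effect (Go 1.24+ for b.Loop; the sleep stands in for the concurrent resume phase and the counter for the append/logging work):

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// run executes a benchmark via testing.Benchmark. Only the sleep is measured:
// the bookkeeping between StopTimer and StartTimer is excluded from ns/op.
func run() (testing.BenchmarkResult, int) {
	bookkeeping := 0
	r := testing.Benchmark(func(b *testing.B) {
		for b.Loop() {
			time.Sleep(50 * time.Microsecond) // measured: the resume phase

			b.StopTimer() // pause before per-iteration aggregation
			bookkeeping++ // stand-in for appending latencies, logging failures
			b.StartTimer()
		}
	})

	return r, bookkeeping
}

func main() {
	r, n := run()
	fmt.Printf("iterations=%d ns/op=%d bookkeeping=%d\n", r.N, r.NsPerOp(), n)
}
```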



@claude claude bot left a comment


LGTM — benchmark-only addition with no production code impact; the two nit-level issues are captured in inline comments.

Extended reasoning...

Overview

This PR adds BenchmarkConcurrentResume (a new _test.go file) and bench2md.awk (an auxiliary reporting script). No production code paths are modified.

Security risks

None. The benchmark requires root and is not part of any deployed artifact. The AWK script is a local post-processing helper.

Level of scrutiny

Low. This is developer tooling for tuning maxStartingInstancesPerNode. Correctness matters for drawing valid conclusions from the benchmark, but bugs here cannot affect production behaviour.

Other factors

My two prior inline comments (tracerProvider shutdown, misleading ns/op) were both reasonably dismissed by the author with valid explanations and marked resolved. The two new nits — a misleading docstring about pool-size scaling and throughput overcounting on failures — are flagged as inline comments. At the default concurrency levels (1–10) neither issue affects the benchmark results in practice. The overall design (barrier-based concurrent launch, aggregated percentiles, LIFO cleanup) is sound.

Comment on lines +121 to +127
// BenchmarkConcurrentResume benchmarks concurrent sandbox resume at various
// concurrency levels. For each level N, it launches N goroutines that call
// ResumeSandbox simultaneously and measures per-sandbox latency.
//
// Network and NBD pool sizes are scaled to the max concurrency level + headroom
// so the benchmark measures sandbox creation overhead, not pool starvation.
func BenchmarkConcurrentResume(b *testing.B) {

🟡 The docstring claims "Network and NBD pool sizes are scaled to the max concurrency level + headroom" but no such scaling exists — pool sizes are hardcoded constants (network new slots: 32, reused: 100, NBD: 64). If CONCURRENCY_LEVELS is set to values like 50 or 100, the network pool would be exhausted and goroutines would block on pool starvation rather than measuring sandbox creation overhead, directly defeating the stated benchmark purpose. The docstring should be corrected to document the actual fixed pool sizes, or the code should dynamically scale the pools based on max(levels).

Extended reasoning...

The docstring at lines 125-126 of BenchmarkConcurrentResume explicitly states: "Network and NBD pool sizes are scaled to the max concurrency level + headroom so the benchmark measures sandbox creation overhead, not pool starvation." This claim is factually incorrect — no dynamic scaling occurs anywhere in the benchmark setup code.

The actual pool sizes are hardcoded constants. The network pool is created with network.NewSlotsPoolSize (32) and network.ReusedSlotsPoolSize (100). The NBD device pool uses config.NBDPoolSize which defaults to 64 (configurable via NBD_POOL_SIZE env var). None of these values are adjusted based on the parsed concurrency levels.

For the default concurrency levels (1–10), the hardcoded sizes happen to be large enough, so the benchmark works correctly by accident. However, the benchmark explicitly supports user-overridable concurrency via the CONCURRENCY_LEVELS environment variable. If a user sets CONCURRENCY_LEVELS=50,100 — a reasonable thing to do when tuning maxStartingInstancesPerNode on high-core-count machines — the network new-slots pool (size 32) would be exhausted at concurrency > 32. Goroutines would then block in Pool.Get() waiting for a slot to become available, measuring pool starvation latency rather than sandbox creation overhead.

This directly defeats the stated purpose of the benchmark. The docstring describes behavior that was either planned but never implemented, or was removed at some point. The fix is straightforward: either correct the docstring to document the actual fixed sizes and note the effective concurrency ceiling (e.g., "Pool sizes are fixed: network new-slots=32, reused=100, NBD=64; ensure CONCURRENCY_LEVELS stays below these limits"), or implement the described scaling by computing maxConcurrency := levels[len(levels)-1] and using that to size the pools with appropriate headroom.

Step-by-step proof: (1) User sets CONCURRENCY_LEVELS=50. (2) parseConcurrencyLevels() returns [50]. (3) BenchmarkConcurrentResume creates network pool with NewSlotsPoolSize=32. (4) Sub-benchmark concurrency-50 calls runConcurrentResume(..., 50). (5) 50 goroutines are launched, each calling factory.ResumeSandbox(...). (6) Each ResumeSandbox call acquires a network slot from the pool. (7) After 32 goroutines acquire slots, the remaining 18 block in Pool.Get() waiting for a slot to be returned. (8) The benchmark records these wait times as sandbox creation latency, artificially inflating p95/p99 and making the results useless for tuning maxStartingInstancesPerNode.
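One way to implement the scaling the docstring promises is sketched below. The headroom constant and the floor values (32 for network new-slots, 64 for NBD) are illustrative here, matching the sizes quoted in this review rather than anything read from the repo's configuration.

```go
package main

import (
	"fmt"
	"slices"
)

// poolHeadroom is an illustrative safety margin, not a repo constant.
const poolHeadroom = 8

// scaledPools sizes each pool to the largest requested concurrency level plus
// headroom, never shrinking below the current hardcoded defaults.
func scaledPools(levels []int) (newSlots, nbdSlots int) {
	maxConcurrency := slices.Max(levels)
	newSlots = max(32, maxConcurrency+poolHeadroom)
	nbdSlots = max(64, maxConcurrency+poolHeadroom)

	return newSlots, nbdSlots
}

func main() {
	// e.g. CONCURRENCY_LEVELS=1,2,3,50 on a high-core-count machine
	newSlots, nbdSlots := scaledPools([]int{1, 2, 3, 50})
	fmt.Println(newSlots, nbdSlots) // 58 64
}
```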

Comment on lines +42 to +46
# Compute throughput: concurrency / (wall-clock in seconds)
throughput = ""
if (wall + 0 > 0) {
throughput = sprintf("%.1f", (concurrency + 0) / ((wall + 0) / 1000))
}

🟡 The Throughput/s column in the generated markdown table is overstated when any sandboxes fail, because the formula divides the raw concurrency level (goroutines launched) by wall-clock time rather than actual successful completions. The Go benchmark already reports an ok metric (avg successes per iteration), but the AWK script never extracts it; replace (concurrency + 0) with (ok + 0) in the throughput formula to get accurate numbers.

Extended reasoning...

What the bug is and how it manifests

In bench2md.awk lines 42-46, the throughput formula is (concurrency + 0) / ((wall + 0) / 1000). The variable concurrency is parsed from the benchmark sub-test name (e.g. concurrency-10), which is the total number of goroutines launched, not the number that succeeded. When some sandboxes fail, this inflates the reported throughput because failed attempts are counted as if they completed.

The specific code path that triggers it

The field-extraction loop (lines 30-39) reads: avg, fail, max, min, p50, p95, p99, wall-clock-ms. The ok field is never extracted. Meanwhile, the Go benchmark explicitly reports b.ReportMetric(float64(totalOK)/float64(iterations), "ok"). The data needed for an accurate metric is present in the input but silently ignored.

Why existing code does not prevent it

The fail column is included in the table, which lets a careful reader manually adjust. However, the Throughput/s column header implies successful throughput, not attempted throughput. There is no note or comment in the generated table warning that failures are included in the numerator.

Addressing the refutation

The refutation argues that concurrency/wall is a valid metric measuring load applied to the system. That interpretation is reasonable, but the column is labeled Throughput/s, which universally implies completed work per second, not attempted work. If the intent is to show attempted load, the column should be renamed (e.g., Load/s) and the distinction documented. As-is, a reader unfamiliar with the implementation will misread the number as successful sandbox creations per second, which could lead to incorrect tuning of maxStartingInstancesPerNode.

Impact and step-by-step proof

With concurrency=10, wall=5000ms, and 3 failures per iteration: Go reports ok=7.0, fail=3.0, wall-clock-ms=5000. Current script uses concurrency=10 from the name: throughput = 10 / (5000/1000) = 2.0/s (incorrect). Fixed script uses ok=7.0: throughput = 7.0 / (5000/1000) = 1.4/s (correct). That is a 43% overstatement. At higher failure rates the discrepancy grows proportionally.

Fix

Add ok extraction to the loop: else if ($(i) == "ok") ok = $(i-1). Then use it in the throughput formula: throughput = sprintf("%.1f", (ok + 0) / ((wall + 0) / 1000)). Since ok is already the average successful completions per iteration sharing the same time base as wall-clock-ms, this gives accurate sandboxes-per-second throughput without needing to know the iteration count separately.
