feat(orchestrator): add concurrent sandbox creation benchmark #2280

Open

arkamar wants to merge 5 commits into main from experiment/sbx-concurency

Conversation

@arkamar
Contributor

@arkamar arkamar commented Apr 1, 2026

This PR adds a benchmark to measure how many sandboxes can be effectively
resumed in parallel on a single node, to help tune
maxStartingInstancesPerNode for different machine types.

  • BenchmarkConcurrentResume launches N sandboxes simultaneously and
    reports latency percentiles (P50/P95/P99) and wall-clock time
  • Latencies aggregated across all iterations for stable stats at high
    iteration counts (e.g. -benchtime=100x)
  • Includes bench2md.awk helper to convert raw output to markdown tables

arkamar added 4 commits April 1, 2026 16:12
The current maxStartingInstancesPerNode limit of 3 was established from
non-rigorous historical measurements on n1 machines. To find the optimal
limit for n2 machines we need proper data.

This adds BenchmarkConcurrentResume which launches N sandboxes
simultaneously via barrier-synchronized goroutines and measures
per-sandbox creation latency (avg, P50, P95, P99, min, max),
wall-clock time, and success/failure counts.

Concurrency levels default to 1,2,3,4,5,6,7,8,9,10 and can be overridden
via the CONCURRENCY_LEVELS env var. Network and NBD pool sizes are
scaled to the max concurrency level to avoid measuring pool starvation
instead of actual creation overhead.

Replace cgroup.NewNoopManager() with the real cgroup.NewManager() so
sandboxes get proper CPU and memory isolation via cgroupv2, matching
production behavior.

Accumulate per-sandbox latencies and wall-clock times across all
benchmark iterations instead of reporting only the last iteration.
This makes percentiles (P50/P95/P99) statistically meaningful at
high iteration counts (e.g. -benchtime=100x at concurrency-5 gives
500 latency samples instead of 5).
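The aggregation idea can be shown with a small standalone sketch: keep a single slice of samples across all iterations and compute percentiles once over the whole set. The latency values here are synthetic, not real resume timings.

```go
package main

import (
	"fmt"
	"slices"
	"time"
)

// aggregate collects every per-sandbox latency across all iterations into a
// single sorted slice, so percentiles are computed over
// iterations*concurrency samples rather than only the final iteration's.
func aggregate(iterations, concurrency int) []time.Duration {
	var all []time.Duration
	for it := 0; it < iterations; it++ {
		for s := 0; s < concurrency; s++ {
			// synthetic stand-in for a measured ResumeSandbox latency
			all = append(all, time.Duration(10+(it+s)%7)*time.Millisecond)
		}
	}
	slices.Sort(all)

	return all
}

func main() {
	all := aggregate(100, 5) // e.g. -benchtime=100x at concurrency-5
	p := func(q float64) time.Duration {
		return all[int(float64(len(all)-1)*q/100.0)]
	}
	fmt.Printf("samples=%d p50=%v p95=%v p99=%v\n", len(all), p(50), p(95), p(99))
}
```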
@cursor

cursor bot commented Apr 1, 2026

PR Summary

Low Risk
Low risk since changes are limited to a new benchmark test and an auxiliary AWK reporting script, with no production code paths modified.

Overview
Adds BenchmarkConcurrentResume, a root-only Go benchmark that resumes many sandboxes in parallel at configurable concurrency levels and reports aggregated latency percentiles plus wall-clock timing to help tune per-node startup limits. Also includes a small bench2md.awk helper to convert the benchmark output into a Markdown table (including derived throughput).

Written by Cursor Bugbot for commit 5d98b65. This will update automatically on new commits.

@linear

linear bot commented Apr 1, 2026

- Use max/min builtins instead of if-chains in percentile()
- Use WaitGroup.Go instead of manual Add/Done goroutine pattern
- Replace sort.Slice/sort.Ints with slices.Sort
- Add blank line before bare return (nlreturn)
- Rename min/max variables to fastest/slowest to avoid shadowing builtins
@arkamar arkamar marked this pull request as ready for review April 1, 2026 15:15

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d98b658f8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}
}

b.StartTimer()

P2: Keep benchmark timer stopped during result aggregation

runConcurrentResume calls b.StartTimer() right before returning, so the caller’s per-iteration bookkeeping (append, failure logging, and latency accumulation in the for b.Loop() body) is measured as part of the benchmark. This skews ns/op and affects iteration calibration for time-based benchtimes, especially at higher concurrency/failure rates where aggregation work grows, making cross-level comparisons less reliable.
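The suggestion amounts to bracketing the per-iteration bookkeeping with StopTimer/StartTimer inside the loop body. A self-contained sketch of the effect (Go 1.24+ for b.Loop; the sleep stands in for the concurrent resume phase and the counter for the append/logging work):

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// run executes a benchmark via testing.Benchmark. Only the sleep is measured:
// the bookkeeping between StopTimer and StartTimer is excluded from ns/op.
func run() (testing.BenchmarkResult, int) {
	bookkeeping := 0
	r := testing.Benchmark(func(b *testing.B) {
		for b.Loop() {
			time.Sleep(50 * time.Microsecond) // measured: the resume phase

			b.StopTimer() // pause before per-iteration aggregation
			bookkeeping++ // stand-in for appending latencies, logging failures
			b.StartTimer()
		}
	})

	return r, bookkeeping
}

func main() {
	r, n := run()
	fmt.Printf("iterations=%d ns/op=%d bookkeeping=%d\n", r.N, r.NsPerOp(), n)
}
```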



@claude claude bot left a comment


LGTM — benchmark-only addition with no production code impact; the two nit-level issues are captured in inline comments.

Extended reasoning...

Overview

This PR adds BenchmarkConcurrentResume (a new _test.go file) and bench2md.awk (an auxiliary reporting script). No production code paths are modified.

Security risks

None. The benchmark requires root and is not part of any deployed artifact. The AWK script is a local post-processing helper.

Level of scrutiny

Low. This is developer tooling for tuning maxStartingInstancesPerNode. Correctness matters for drawing valid conclusions from the benchmark, but bugs here cannot affect production behaviour.

Other factors

My two prior inline comments (tracerProvider shutdown, misleading ns/op) were both reasonably dismissed by the author with valid explanations and marked resolved. The two new nits — a misleading docstring about pool-size scaling and throughput overcounting on failures — are flagged as inline comments. At the default concurrency levels (1–10) neither issue affects the benchmark results in practice. The overall design (barrier-based concurrent launch, aggregated percentiles, LIFO cleanup) is sound.

Comment on lines +121 to +127
// BenchmarkConcurrentResume benchmarks concurrent sandbox resume at various
// concurrency levels. For each level N, it launches N goroutines that call
// ResumeSandbox simultaneously and measures per-sandbox latency.
//
// Network and NBD pool sizes are scaled to the max concurrency level + headroom
// so the benchmark measures sandbox creation overhead, not pool starvation.
func BenchmarkConcurrentResume(b *testing.B) {

🟡 The docstring claims "Network and NBD pool sizes are scaled to the max concurrency level + headroom" but no such scaling exists — pool sizes are hardcoded constants (network new slots: 32, reused: 100, NBD: 64). If CONCURRENCY_LEVELS is set to values like 50 or 100, the network pool would be exhausted and goroutines would block on pool starvation rather than measuring sandbox creation overhead, directly defeating the stated benchmark purpose. The docstring should be corrected to document the actual fixed pool sizes, or the code should dynamically scale the pools based on max(levels).

Extended reasoning...

The docstring at lines 125-126 of BenchmarkConcurrentResume explicitly states: "Network and NBD pool sizes are scaled to the max concurrency level + headroom so the benchmark measures sandbox creation overhead, not pool starvation." This claim is factually incorrect — no dynamic scaling occurs anywhere in the benchmark setup code.

The actual pool sizes are hardcoded constants. The network pool is created with network.NewSlotsPoolSize (32) and network.ReusedSlotsPoolSize (100). The NBD device pool uses config.NBDPoolSize which defaults to 64 (configurable via NBD_POOL_SIZE env var). None of these values are adjusted based on the parsed concurrency levels.

For the default concurrency levels (1–10), the hardcoded sizes happen to be large enough, so the benchmark works correctly by accident. However, the benchmark explicitly supports user-overridable concurrency via the CONCURRENCY_LEVELS environment variable. If a user sets CONCURRENCY_LEVELS=50,100 — a reasonable thing to do when tuning maxStartingInstancesPerNode on high-core-count machines — the network new-slots pool (size 32) would be exhausted at concurrency > 32. Goroutines would then block in Pool.Get() waiting for a slot to become available, measuring pool starvation latency rather than sandbox creation overhead.

This directly defeats the stated purpose of the benchmark. The docstring describes behavior that was either planned but never implemented, or was removed at some point. The fix is straightforward: either correct the docstring to document the actual fixed sizes and note the effective concurrency ceiling (e.g., "Pool sizes are fixed: network new-slots=32, reused=100, NBD=64; ensure CONCURRENCY_LEVELS stays below these limits"), or implement the described scaling by computing maxConcurrency := levels[len(levels)-1] and using that to size the pools with appropriate headroom.

Step-by-step proof: (1) User sets CONCURRENCY_LEVELS=50. (2) parseConcurrencyLevels() returns [50]. (3) BenchmarkConcurrentResume creates network pool with NewSlotsPoolSize=32. (4) Sub-benchmark concurrency-50 calls runConcurrentResume(..., 50). (5) 50 goroutines are launched, each calling factory.ResumeSandbox(...). (6) Each ResumeSandbox call acquires a network slot from the pool. (7) After 32 goroutines acquire slots, the remaining 18 block in Pool.Get() waiting for a slot to be returned. (8) The benchmark records these wait times as sandbox creation latency, artificially inflating p95/p99 and making the results useless for tuning maxStartingInstancesPerNode.
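One way to implement the scaling the docstring promises is sketched below. The headroom constant and the floor values (32 for network new-slots, 64 for NBD) are illustrative here, matching the sizes quoted in this review rather than anything read from the repo's configuration.

```go
package main

import (
	"fmt"
	"slices"
)

// poolHeadroom is an illustrative safety margin, not a repo constant.
const poolHeadroom = 8

// scaledPools sizes each pool to the largest requested concurrency level plus
// headroom, never shrinking below the current hardcoded defaults.
func scaledPools(levels []int) (newSlots, nbdSlots int) {
	maxConcurrency := slices.Max(levels)
	newSlots = max(32, maxConcurrency+poolHeadroom)
	nbdSlots = max(64, maxConcurrency+poolHeadroom)

	return newSlots, nbdSlots
}

func main() {
	// e.g. CONCURRENCY_LEVELS=1,2,3,50 on a high-core-count machine
	newSlots, nbdSlots := scaledPools([]int{1, 2, 3, 50})
	fmt.Println(newSlots, nbdSlots) // 58 64
}
```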

Comment on lines +42 to +46
# Compute throughput: concurrency / (wall-clock in seconds)
throughput = ""
if (wall + 0 > 0) {
throughput = sprintf("%.1f", (concurrency + 0) / ((wall + 0) / 1000))
}

🟡 The Throughput/s column in the generated markdown table is overstated when any sandboxes fail, because the formula divides the raw concurrency level (goroutines launched) by wall-clock time rather than actual successful completions. The Go benchmark already reports an ok metric (avg successes per iteration), but the AWK script never extracts it; replace (concurrency + 0) with (ok + 0) in the throughput formula to get accurate numbers.

Extended reasoning...

What the bug is and how it manifests

In bench2md.awk lines 42-46, the throughput formula is (concurrency + 0) / ((wall + 0) / 1000). The variable concurrency is parsed from the benchmark sub-test name (e.g. concurrency-10), which is the total number of goroutines launched, not the number that succeeded. When some sandboxes fail, this inflates the reported throughput because failed attempts are counted as if they completed.

The specific code path that triggers it

The field-extraction loop (lines 30-39) reads: avg, fail, max, min, p50, p95, p99, wall-clock-ms. The ok field is never extracted. Meanwhile, the Go benchmark explicitly reports b.ReportMetric(float64(totalOK)/float64(iterations), "ok"). The data needed for an accurate metric is present in the input but silently ignored.

Why existing code does not prevent it

The fail column is included in the table, which lets a careful reader manually adjust. However, the Throughput/s column header implies successful throughput, not attempted throughput. There is no note or comment in the generated table warning that failures are included in the numerator.

Addressing the refutation

The refutation argues that concurrency/wall is a valid metric measuring load applied to the system. That interpretation is reasonable, but the column is labeled Throughput/s, which universally implies completed work per second, not attempted work. If the intent is to show attempted load, the column should be renamed (e.g., Load/s) and the distinction documented. As-is, a reader unfamiliar with the implementation will misread the number as successful sandbox creations per second, which could lead to incorrect tuning of maxStartingInstancesPerNode.

Impact and step-by-step proof

With concurrency=10, wall=5000ms, and 3 failures per iteration: Go reports ok=7.0, fail=3.0, wall-clock-ms=5000. Current script uses concurrency=10 from the name: throughput = 10 / (5000/1000) = 2.0/s (incorrect). Fixed script uses ok=7.0: throughput = 7.0 / (5000/1000) = 1.4/s (correct). That is a 43% overstatement. At higher failure rates the discrepancy grows proportionally.

Fix

Add ok extraction to the loop: else if ($(i) == "ok") ok = $(i-1). Then use it in the throughput formula: throughput = sprintf("%.1f", (ok + 0) / ((wall + 0) / 1000)). Since ok is already the average successful completions per iteration sharing the same time base as wall-clock-ms, this gives accurate sandboxes-per-second throughput without needing to know the iteration count separately.
