This doc is the “round 1” performance plan for bv, focused on needle-moving latency/throughput improvements while staying provably isomorphic (same outputs for the same inputs), with one explicit relaxation:
The user accepts “extremely close numerical stuff” as “the same” within epsilon.
Accordingly, this plan uses a strict workflow:
Baseline → Profile → Hotspot proof → Opportunity matrix → Minimal diff → Guardrails
Primary scope:
- `cmd/bv` (CLI + robot protocol)
- `pkg/analysis` (graph + metrics + scoring)
- `pkg/loader` (JSONL loading)
- `pkg/ui` (TUI; background worker + snapshot builder)
- The dominant hotspot is approximate betweenness: `pkg/analysis.ApproxBetweenness` → `pkg/analysis.singleSourceBetweenness`.
- In alloc profiles for `BenchmarkRobotTriage_Sparse500`, `singleSourceBetweenness` is ~70% of alloc_space, and CPU is dominated by GC + map growth sourced from that routine.
- For “typical” repos this project is already very fast, but large graphs still show gross headroom:
- Real data (570 issues): p50 ~52.6ms; peak RSS ~37.5MB.
- Synthetic large (5000 issues): p95 ~1.52s; peak RSS ~429MB; `--profile-startup` shows Phase 2 dominated by k-core (~263ms) and betweenness (~179ms) plus slack/eigenvector.
We take the best “minimal diff” lever from PLAN_FOR_ADVANCED_OPTIMIZATIONS_ROUND_1__OPUS.md and combine it with the broader scaling view from this doc:
- Change 1A (lowest risk, highest ROI): buffer pooling for Brandes’ inner loop
  - Reuse the 4 per-pivot maps + BFS slices in `singleSourceBetweenness` via `sync.Pool`.
  - This is the smallest diff that attacks the alloc/GC root cause and is easiest to prove isomorphic.
- Change 1B (bigger win if still hot): array-based Brandes + cached adjacency
- Dense node indexing + precomputed sorted adjacency lists, eliminating most remaining per-pivot work (including gonum iterator churn).
- Only after re-profiling: address secondary large-graph costs (k-core graph building / slack reuse) if they become top-3 hotspots post betweenness fix.
- Since `singleSourceBetweenness` is ~70% of alloc_space in triage-like profiles, even a conservative 60–80% reduction in that component should cut overall allocation volume by ~40–55%, collapsing GC time and improving p95/throughput.
- If post-Change 1A profiles show gonum iterator churn becoming dominant, Change 1B (cached adjacency) is the next “clean” lever.
- Do not delete any file or directory unless the user provides the exact command in this session. (Repo policy.)
- Avoid broad refactors; prefer one performance lever per change.
A) Baseline first: tests + representative workload; record p50/p95/p99 latency, throughput, peak memory (exact commands).
B) Profile before proposing: capture CPU + allocation + I/O; identify top 3–5 hotspots by % time/bytes.
C) Equivalence oracle: explicit golden outputs + invariants; for large input spaces, use metamorphic/property tests.
D) Isomorphism proof per change: short proof sketch covering ordering, tie-breaking, floats, RNG, “now”.
E) Opportunity matrix: rank by (Impact × Confidence) / Effort, focus on p95+ and throughput wins.
F) Minimal diffs: one lever per change; include rollback guidance.
G) Regression guardrails: benchmarks/thresholds or CI hooks to prevent backsliding.
- Load issues from `.beads/issues.jsonl` (preferred) or `.beads/beads.jsonl` fallback: `pkg/loader/loader.go` (`GetBeadsDir`, `FindJSONLPath`, `LoadIssues`, `ParseIssuesWithOptions`)
- Build graph + compute metrics via Analyzer: `pkg/analysis/graph.go` (`NewAnalyzer`, `AnalyzeAsync`/`AnalyzeWithProfile`)
- Derive higher-level outputs:
  - unified triage: `pkg/analysis/triage.go` (`ComputeTriageFromAnalyzer`)
  - priority tuning + feedback: `pkg/analysis/priority.go`, `pkg/analysis/feedback.go`
- Presentation:
  - robot JSON: `cmd/bv/main.go` (`--robot-*`)
  - TUI snapshot pipeline: `pkg/ui/background_worker.go` → `pkg/ui/snapshot.go` (build snapshot, run Phase 2, swap pointer)
- Phase 1 (sync, “instant”): degree, topo sort, density (enough for initial render and many filters).
- Phase 2 (async): PageRank, Betweenness, Eigenvector, HITS, Critical Path, Cycles (plus extras: k-core, slack, articulation).
- Size-aware config selection: `pkg/analysis/config.go` (`ConfigForSize`).
- Robot outputs include `status` so downstream consumers can tell computed vs approx vs timeout vs skipped.
Exact commands:

```bash
go version
grep -m1 'model name' /proc/cpuinfo
nproc
free -h | head -n 2
```

Observed:

- Go: `go1.25.5 linux/amd64`
- CPU: `AMD Ryzen Threadripper PRO 5975WX 32-Cores`, nproc: `64`
- RAM: `499Gi`
Exact command:

```bash
BV_NO_BROWSER=1 BV_TEST_MODE=1 go test ./...
```

Observed: PASS (notably tests/e2e took ~132s in this run).
We baseline three “robot triage” datasets to cover realistic + scaling:
- Real project data (this repo): `.beads/issues.jsonl`: 570 issues, 1.5MB
- Synthetic medium: `tests/testdata/benchmark/medium.jsonl`: 1000 issues, 3.4MB
- Synthetic large: `tests/testdata/benchmark/large.jsonl`: 5000 issues, 32MB
For synthetic datasets we copy the JSONL into a temp BEADS_DIR with a single issues.jsonl to avoid loader auto-selecting another file.
Exact command:

```bash
go build -o /tmp/bv_round1 ./cmd/bv
```

Measurement harness (exact shape used):
```bash
python3 - <<'PY'
import math
import os
import statistics
import subprocess
import time

cmd = ['/tmp/bv_round1', '--robot-triage']
env = os.environ.copy()
env.update({'BV_ROBOT': '1', 'BV_NO_BROWSER': '1', 'BV_TEST_MODE': '1'})
# optionally set env['BEADS_DIR'] = ... for synthetic sets

warmup = 5
runs = 50
for _ in range(warmup):
    subprocess.run(
        cmd,
        cwd='/data/projects/beads_viewer',
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=True,
    )

samples = []
for _ in range(runs):
    t0 = time.perf_counter()
    subprocess.run(
        cmd,
        cwd='/data/projects/beads_viewer',
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=True,
    )
    samples.append(time.perf_counter() - t0)

samples.sort()

def pct(p):
    k = (len(samples) - 1) * p / 100.0
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return samples[int(k)]
    d = k - f
    return samples[f] * (1 - d) + samples[c] * d

print('p50_s', pct(50))
print('p95_s', pct(95))
print('p99_s', pct(99))
print('mean_s', statistics.mean(samples))
print('throughput_runs_per_s', runs / sum(samples))
PY
```

Recorded results:
| Dataset | p50 | p95 | p99 | Throughput |
|---|---|---|---|---|
| `.beads/issues.jsonl` (570) | 52.599ms | 56.733ms | 59.216ms | 18.88 runs/s |
| `benchmark/medium.jsonl` (1000) | 153.632ms | 158.724ms | 163.705ms | 6.53 runs/s |
| `benchmark/large.jsonl` (5000) | 1.457s | 1.518s | 1.540s | 0.68 runs/s |
Exact command:

```bash
/usr/bin/time -v env BV_ROBOT=1 BV_NO_BROWSER=1 BV_TEST_MODE=1 /tmp/bv_round1 --robot-triage >/dev/null
```

Recorded “Maximum resident set size”:
| Dataset | Peak RSS |
|---|---|
| `.beads/issues.jsonl` (570) | 38,388 kB (~37.5MB) |
| `benchmark/medium.jsonl` (1000) | 42,928 kB (~41.9MB) |
| `benchmark/large.jsonl` (5000) | 439,660 kB (~429MB) |
Interpretation:
- Typical repos are extremely fast already (p50 ~53ms).
- Scaling to 5k issues shows a major memory spike and >1s latency, meaning there is still “gross” headroom for large projects.
Exact command:

```bash
env BV_ROBOT=1 BV_NO_BROWSER=1 BV_TEST_MODE=1 /tmp/bv_round1 --profile-startup --profile-json
```

Note: In some PTY environments, terminal capability probes may emit escape sequences before the JSON; if you need guaranteed clean JSON, force non-TTY stdout (e.g., pipe/redirect).
Observed (real data, 570 nodes):
- `load_jsonl`: ~12.0ms
- `total_with_load`: ~27.1ms
- `betweenness`: ~9.1ms (≈63% of Phase 2 time)
Observed (synthetic large, 5000 nodes, density≈0.00999):
- `load_jsonl`: ~347.5ms
- `total_with_load`: ~1.123s
- dominant Phase 2 costs: k-core (~262.7ms), betweenness (~178.7ms), slack (~96.4ms), eigenvector (~78.0ms), PageRank (~48.3ms)
This is the “OPUS plan” strength: a wide bench sweep that makes the bottleneck undeniable, and gives us regression guardrails.
Exact command used (targets a representative subset; `-count=3` for stability):

```bash
BV_NO_BROWSER=1 BV_TEST_MODE=1 go test ./pkg/analysis -run '^$' \
  -bench '^(BenchmarkApproxBetweenness_500nodes_(Exact|Sample100|Sample50)|BenchmarkFullAnalysis_Sparse500|BenchmarkGenerateAllSuggestions_Medium|BenchmarkCycles_ManyCycles20|BenchmarkCycles_ManyCycles30|BenchmarkDetectCycleWarnings_Medium|BenchmarkFindCyclesSafe_Large|BenchmarkSuggestLabels_LargeSet|BenchmarkComplete_Betweenness15|BenchmarkComplete_PageRank20)$' \
  -benchmem -count=3
```

Recorded (mean over `-count=3`, rounded):
| Benchmark | ns/op | allocs/op | B/op |
|---|---|---|---|
| ApproxBetweenness_500nodes_Exact | 66,386,288 | 499,560 | 34,110,547 |
| ApproxBetweenness_500nodes_Sample100 | 13,242,222 | 199,551 | 29,575,730 |
| ApproxBetweenness_500nodes_Sample50 | 5,456,221 | 100,756 | 14,841,992 |
| FullAnalysis_Sparse500 | 11,989,257 | 222,834 | 98,705,820 |
| GenerateAllSuggestions_Medium | 6,622,519 | 49,279 | 3,856,843 |
| DetectCycleWarnings_Medium | 209,281 | 1,435 | 140,584 |
| FindCyclesSafe_Large | 140,131 | 1,417 | 103,872 |
| SuggestLabels_LargeSet | 1,582,094 | 11,342 | 607,904 |
| Complete_Betweenness15 | 258,745 | 4,087 | 1,723,107 |
| Complete_PageRank20 | 32,577 | 513 | 248,696 |
| Cycles_ManyCycles20 | 149,214,974 | 294,827 | 120,980,771 |
Pathology note (BenchmarkCycles_ManyCycles30):
- `ns/op` is stable (~500ms) but allocations vary wildly run-to-run (it is intentionally a pathological benchmark):
  - `B/op`: min 1,004,341,976; median 1,998,106,040; max 3,103,384,328
  - `allocs/op`: min 1,904,844; median 3,980,031; max 5,710,612

This is a reminder that some graph problems are inherently “blow up” shaped and must remain guarded by timeouts and caps.
We use benches as the reproducible profiling harness (faster iteration than sampling full CLI).
Robot-like bench used:
`pkg/analysis/bench_test.go`: `BenchmarkRobotTriage_Sparse500`
Exact commands (timestamps omitted for readability):

```bash
BV_NO_BROWSER=1 BV_TEST_MODE=1 go test ./pkg/analysis -run '^$' -bench '^BenchmarkRobotTriage_Sparse500$' \
  -benchtime=3s -cpuprofile /tmp/bv_cpu_triage_sparse500.prof -memprofile /tmp/bv_mem_triage_sparse500.prof
go tool pprof -top -cum -nodecount=25 /tmp/bv_cpu_triage_sparse500.prof
go tool pprof -top -alloc_space -nodecount=25 /tmp/bv_mem_triage_sparse500.prof
```

Observed bench summary (this run):
BenchmarkRobotTriage_Sparse500: ~21.59ms/op, ~27.7MB/op, ~89.5k allocs/op
Alloc-space (from pprof):

- `pkg/analysis.singleSourceBetweenness`: ~70% of alloc_space
- Gonum iterator allocations inside neighbor iteration: `gonum.org/v1/gonum/graph/iterator.*` (several % each)
- Secondary allocs: ranking helpers (`computeFloatRanks`), some analyzer/graph building
CPU (from pprof, interpreted causally):
- `runtime.gcDrain` / `runtime.scanobject` / friends: GC dominates wall time
- `runtime.mapassign_fast64` + `internal/runtime/maps.(*table).grow`/rehash: map traffic dominates CPU
- `analysis.ApproxBetweenness.func2` + `analysis.singleSourceBetweenness`: the algorithmic source of the above GC/map load
Conclusion:
Approximate betweenness (`ApproxBetweenness` → `singleSourceBetweenness`) is the dominant, needle-moving hotspot (time via GC/map churn; space via alloc_space).
In pkg/analysis/betweenness_approx.go, each pivot run of singleSourceBetweenness currently allocates and fills multiple per-node maps:
```go
sigma := make(map[int64]float64) // shortest-path counts
dist := make(map[int64]int)      // BFS distances
delta := make(map[int64]float64) // dependencies
pred := make(map[int64][]int64)  // predecessor lists (dynamic slices)
```

And on top of that, it also allocates repeatedly during traversal:

- `nodes := graph.NodesOf(g.Nodes())` per call (materializes node slice)
- `neighbors` slice per visited node (append + sort) to get deterministic traversal order
- gonum iterator allocations from `g.From(v)` calls
- in `ApproxBetweenness`, each goroutine also allocates `localBC := make(map[int64]float64)` before merging into `partialBC`
This combination creates extreme allocation volume → GC dominates CPU and tail latency (p95+) gets worse under load.
Exact command:

```bash
strace -c -f env BV_ROBOT=1 BV_NO_BROWSER=1 BV_TEST_MODE=1 /tmp/bv_round1 --robot-triage >/dev/null
```

Observed: syscalls dominated by futex/nanosleep; only ~30 read calls; effectively not I/O-bound.
For “same inputs”, we mean:
- same parsed issue set (same JSONL contents after parsing/validation)
- same analysis configuration (dynamic `ConfigForSize` given node/edge count, unless forced)
- same seed for approximate algorithms (approx betweenness uses a seed)
- same “now” when time-based fields are included (triage staleness, etc.)
We treat outputs in two classes:
(1) Structural outputs must be exact
- IDs present and counts
- membership of sets (actionable set, blockers, quick wins, etc.)
- cycle presence and cycle members (when cycles computed)
- JSON schema and field names
- deterministic tie-breaking rules (ID) when not governed by near-ties
(2) Float outputs are equal within epsilon
- centralities and composite scores can differ slightly (parallel reduction, iteration order)
- we accept differences within epsilon (abs and/or relative)
Use both absolute and relative tolerance:
- `absEps`: correct near zero
- `relEps`: correct at large magnitudes
Suggested defaults (align with existing golden tests + new epsilon test):
- PageRank: abs `1e-5`
- Eigenvector/HITS/CriticalPath: abs `1e-6`
- Betweenness:
  - exact: abs `1e-6`
  - approximate: abs `1e-6` + rel `1e-12` (or compare top-k rank stability)
If two scores differ by less than epsilon, ordering swaps are treated as equivalent as long as:
- the set of items is identical, and
- swapped items were within epsilon of each other.
This preserves correctness while preventing flaky tests when floats drift in low bits.
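A minimal sketch of the combined absolute/relative tolerance and the epsilon-aware ordering check described above (helper names `withinEps` and `sameRankingWithinEps` are illustrative, not the project's actual test helpers):

```go
package main

import (
	"fmt"
	"math"
)

// withinEps reports whether a and b are equal under a combined tolerance:
// absEps guards values near zero, relEps guards large magnitudes.
func withinEps(a, b, absEps, relEps float64) bool {
	diff := math.Abs(a - b)
	if diff <= absEps {
		return true
	}
	return diff <= relEps*math.Max(math.Abs(a), math.Abs(b))
}

// sameRankingWithinEps accepts positional swaps between rankings as long as
// both rankings have the same length and each mismatched pair of items has
// scores within epsilon of each other.
func sameRankingWithinEps(ids1, ids2 []string, score map[string]float64, absEps, relEps float64) bool {
	if len(ids1) != len(ids2) {
		return false
	}
	for i := range ids1 {
		if ids1[i] == ids2[i] {
			continue
		}
		// A swap is only equivalent when the two scores are within epsilon.
		if !withinEps(score[ids1[i]], score[ids2[i]], absEps, relEps) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(withinEps(1e-9, 0, 1e-6, 1e-12)) // true: absEps handles near-zero
	s := map[string]float64{"a": 0.5, "b": 0.5000000001, "c": 0.1}
	fmt.Println(sameRankingWithinEps([]string{"b", "a", "c"}, []string{"a", "b", "c"}, s, 1e-6, 1e-12)) // true
}
```

A real version would also assert that both rankings contain the identical set of IDs before comparing positions.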
- Golden numeric expectations: `pkg/analysis/golden_test.go`
- Metamorphic/invariance: `pkg/analysis/invariance_test.go`
- Loader fuzzing: `pkg/loader/fuzz_test.go`
- Robot contract + perf checks: `tests/e2e/*`
- Epsilon stability for approximate betweenness: `pkg/analysis/betweenness_approx_test.go` (`TestApproxBetweenness_DeterministicWithinEpsilon`)
Scoring: (Impact × Confidence) / Effort, where Impact is primarily p95+/throughput and peak RSS on large graphs.
| # | Candidate | Impact | Confidence | Effort | Score | Why it matters |
|---|---|---|---|---|---|---|
| 1A | Buffer pooling for Brandes inner loop (reuse maps + BFS slices) | 0.70 | 0.95 | 0.35 | 1.90 | Minimal diff that attacks ~70% alloc_space root cause |
| 1B | Array-based Brandes + cached adjacency (dense indices, no maps) | 0.90 | 0.85 | 0.60 | 1.28 | Bigger win; also eliminates gonum iterator churn and neighbor sorting |
| 2 | Avoid building a separate undirected graph for k-core/articulation | 0.25 | 0.85 | 0.35 | 0.61 | On 5k nodes, k-core is ~263ms and allocates significantly |
| 3 | Slack computation reuse (reuse topo; avoid per-node prereq slice builds) | 0.15 | 0.90 | 0.30 | 0.45 | On 5k nodes slack ~96ms; easy isomorphic savings |
| 4 | Replace k-core algorithm with linear-time peeling | 0.25 | 0.85 | 0.60 | 0.35 | Protects against worst-case O(maxDeg·V) behavior |
| 5 | Cross-process cache for robot mode keyed by (`data_hash`, `config_hash`) | 0.15 | 0.60 | 0.70 | 0.13 | Useful for repeated invocations; adds invalidation/storage policy risk |
| 6 | Data-hash micro-optimization (same hash, fewer allocs) | 0.10 | 0.60 | 0.30 | 0.20 | Might matter at 5k+ once analysis allocs drop |
| R | Semiring / matrix-form recast (research track) | 0.10 | 0.30 | 0.90 | 0.03 | Interesting for 50k+ nodes; not justified for Round 1 |
Round 1 focus: 1A → re-profile → 1B only if still hot → then consider 2/3 if they rise into the top 3–5 hotspots.
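As a worked check of the matrix arithmetic, the scoring formula reproduces the top two rows:

```go
package main

import "fmt"

// score implements the opportunity-matrix formula (Impact × Confidence) / Effort.
func score(impact, confidence, effort float64) float64 {
	return impact * confidence / effort
}

func main() {
	fmt.Printf("1A: %.2f\n", score(0.70, 0.95, 0.35)) // ≈ 1.90, the table's top score
	fmt.Printf("1B: %.2f\n", score(0.90, 0.85, 0.60)) // ≈ 1.28
}
```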
Target: `pkg/analysis/betweenness_approx.go` (`singleSourceBetweenness`)
Why this is the best first lever (hybrid rationale):
- It attacks the proven bottleneck (≈70% alloc_space) with minimal surface area.
- It is straightforward to prove isomorphic: we perform the same assignments as before, just into reused buffers.
Plan sketch (pattern from OPUS plan; keep signature stable):
```go
type brandesBuffers struct {
	sigma     map[int64]float64
	dist      map[int64]int
	delta     map[int64]float64
	pred      map[int64][]int64
	queue     []int64
	stack     []int64
	neighbors []int64
}

var brandesPool = sync.Pool{ /* New: allocate once, reuse */ }
```

Key reset logic:
- For every nodeID in `nodes`, explicitly set `sigma=0`, `dist=-1`, `delta=0`, and `pred[nodeID]=pred[nodeID][:0]`.
- Use the pooled `neighbors` slice inside the BFS loop (`neighbors = neighbors[:0]`).
- Optional safety: if buffers grew far beyond current graph size, drop/recreate to avoid unbounded memory retention.
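A hedged sketch of how the pool and reset could be wired together (the exact reset is illustrative, not the final diff; the real buffers would also carry `neighbors`):

```go
package main

import (
	"fmt"
	"sync"
)

// Scratch buffers reused across pivot runs, following the plan sketch above.
type brandesBuffers struct {
	sigma map[int64]float64
	dist  map[int64]int
	delta map[int64]float64
	pred  map[int64][]int64
	queue []int64
	stack []int64
}

var brandesPool = sync.Pool{
	New: func() any {
		return &brandesBuffers{
			sigma: make(map[int64]float64),
			dist:  make(map[int64]int),
			delta: make(map[int64]float64),
			pred:  make(map[int64][]int64),
		}
	},
}

// reset makes prior contents unobservable: every node ID the algorithm will
// read is explicitly (re)initialized, so stale entries are irrelevant.
func (b *brandesBuffers) reset(nodes []int64) {
	for _, id := range nodes {
		b.sigma[id] = 0
		b.dist[id] = -1
		b.delta[id] = 0
		b.pred[id] = b.pred[id][:0] // keep capacity, drop length
	}
	b.queue = b.queue[:0]
	b.stack = b.stack[:0]
}

func main() {
	nodes := []int64{1, 2, 3}
	buf := brandesPool.Get().(*brandesBuffers)
	buf.reset(nodes)
	buf.sigma[1] = 42 // simulate a pivot run writing into the buffers
	brandesPool.Put(buf)

	// A later pivot may get the same buffers back; after reset, old values
	// are gone for every node the algorithm will touch.
	buf2 := brandesPool.Get().(*brandesBuffers)
	buf2.reset(nodes)
	fmt.Println(buf2.sigma[1], buf2.dist[2]) // 0 -1
}
```

Each pivot goroutine calls `Get` → `reset` → run → `Put`, so buffers are never shared concurrently; if the pool evicts under GC, `New` simply reallocates, matching current behavior.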
What this does NOT address (so we know what to profile next):
- `nodes := graph.NodesOf(g.Nodes())` still allocates per pivot
- `localBC := make(map[int64]float64)` in `ApproxBetweenness` still allocates per pivot
- gonum iterator allocations from `g.From(v)` still occur (though the pooled neighbor slice reduces some pressure)
Isomorphism proof sketch (more explicit, OPUS-style):
- Initialization equivalence: every value read by the algorithm is assigned before use for all nodes, so prior buffer contents are irrelevant.
- Graph-change safety: stale entries for node IDs not present in `nodes` are never referenced (only iterated/visited nodes matter).
- Predecessor slice equivalence: `make([]T, 0)` and `s[:0]` are identical for algorithm semantics (len==0), differing only in allocation behavior.
- Deterministic traversal: neighbor ordering remains sorted by node ID (same as current semantics).
- Concurrency safety: each pivot goroutine gets its own buffers from the pool; no shared mutation across goroutines.
- Pool eviction safety: `sync.Pool` may drop buffers under GC; worst-case behavior degenerates to current allocation behavior, never to incorrect results.
Caveats:
- `sync.Pool` can be evicted under memory pressure; reductions may vary with workload and GC.
- If we “clear” maps incorrectly (forget to overwrite a key that will be read), correctness breaks; the reset must be exhaustive over node IDs.
Rollback guidance:
- Localized to one file; revert `brandesPool` usage if it misbehaves.
Guardrails:
- Existing golden + invariance tests.
- Add/keep allocation-focused benches and check that `BenchmarkApproxBetweenness_500nodes_Sample100` allocs/op drops sharply.
Expected gains (bench-driven targets, conservative):
- Baseline (mean, §2H): `BenchmarkApproxBetweenness_500nodes_Sample100` allocs/op ≈ 199,551, B/op ≈ 29.6MB.
- Change 1A target (pooling only):
- allocs/op: floor < 80k (≥60% reduction), stretch < 50k (≥75% reduction)
- B/op: floor < ~15MB, stretch < ~10MB
- If we don’t hit these, it’s a signal that remaining allocs are dominated by `graph.NodesOf(g.Nodes())`, `localBC` allocations, or iterator churn → proceed to Change 1B.
Target: `pkg/analysis/betweenness_approx.go` (`ApproxBetweenness`, `singleSourceBetweenness`)
Goal:
- Eliminate most remaining allocations by moving to dense arrays and precomputed adjacency lists.
Strategy:
- Build a stable `ids []int64` (sorted node IDs) and `index map[int64]int` once per `ApproxBetweenness` call.
- Build `adj [][]int` once, where `adj[i]` is sorted neighbor indices (this replaces per-node neighbor slice materialization + sort).
- Rewrite the Brandes inner loop to operate on indices (`[]float64`/`[]int`), reusing slices per worker.
- Keep pivot sampling deterministic (same seed) and accept epsilon-level float drift due to reduction order (approved by user + guarded by epsilon tests).
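A self-contained sketch of the dense-index representation and the index-based Brandes inner loop (toy edge-list input; the real change would source `ids`/`adj` from the gonum graph, and would reuse `pred` per worker instead of allocating it per pivot as shown here):

```go
package main

import (
	"fmt"
	"sort"
)

// buildDense maps sparse int64 node IDs to dense indices and builds sorted
// adjacency lists once, replacing per-pivot neighbor materialization + sort.
func buildDense(nodeIDs []int64, edges [][2]int64) (ids []int64, index map[int64]int, adj [][]int) {
	ids = append([]int64(nil), nodeIDs...)
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	index = make(map[int64]int, len(ids))
	for i, id := range ids {
		index[id] = i
	}
	adj = make([][]int, len(ids))
	for _, e := range edges {
		adj[index[e[0]]] = append(adj[index[e[0]]], index[e[1]])
	}
	for i := range adj {
		sort.Ints(adj[i]) // deterministic traversal order (sorted by node ID)
	}
	return ids, index, adj
}

// brandesFrom runs one pivot of unweighted Brandes over dense indices,
// accumulating dependencies into bc ([]float64 instead of map[int64]float64).
func brandesFrom(s int, adj [][]int, sigma []float64, dist []int, delta []float64, bc []float64) {
	n := len(adj)
	for i := 0; i < n; i++ {
		sigma[i], dist[i], delta[i] = 0, -1, 0
	}
	stack := make([]int, 0, n)
	queue := make([]int, 0, n)
	pred := make([][]int, n) // per-pivot here for brevity; reused per worker in the real diff

	sigma[s], dist[s] = 1, 0
	queue = append(queue, s)
	for len(queue) > 0 {
		v := queue[0]
		queue = queue[1:]
		stack = append(stack, v)
		for _, w := range adj[v] {
			if dist[w] < 0 {
				dist[w] = dist[v] + 1
				queue = append(queue, w)
			}
			if dist[w] == dist[v]+1 {
				sigma[w] += sigma[v]
				pred[w] = append(pred[w], v)
			}
		}
	}
	// Dependency accumulation in reverse BFS order.
	for i := len(stack) - 1; i >= 0; i-- {
		w := stack[i]
		for _, v := range pred[w] {
			delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
		}
		if w != s {
			bc[w] += delta[w]
		}
	}
}

func main() {
	ids, _, adj := buildDense([]int64{10, 20, 30}, [][2]int64{{10, 20}, {20, 30}})
	n := len(ids)
	bc := make([]float64, n)
	sigma, dist, delta := make([]float64, n), make([]int, n), make([]float64, n)
	for s := 0; s < n; s++ {
		brandesFrom(s, adj, sigma, dist, delta, bc)
	}
	fmt.Println(bc) // [0 1 0]: the middle node (20) lies on the 10→30 shortest path
}
```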
Isomorphism proof sketch:
- Same algorithm, same pivots, same adjacency ordering (sorted by ID), same arithmetic; only representation changes.
- Any differences are limited to floating rounding; within epsilon acceptance.
Change 2 (only if post-#1 profiles justify it): Remove undirected-graph construction for k-core/articulation
Target: `pkg/analysis/graph.go` (`computeCoreAndArticulation` + `computeKCore` + `findArticulationPoints`)
Minimal diff idea:
- Build an undirected adjacency view directly from the directed graph edges (treat every directed edge as undirected).
- Run k-core and articulation on that adjacency without allocating `simple.UndirectedGraph` + edges.
Isomorphism proof sketch:
- The current implementation already treats the directed edges as undirected by inserting them into an undirected graph; direct adjacency construction is equivalent.
- Determinism preserved by sorting neighbors (if iteration order matters in DFS).
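A sketch of the direct undirected-adjacency construction (toy edge-list input; the real code would iterate the gonum directed graph's edges instead):

```go
package main

import (
	"fmt"
	"sort"
)

// undirectedAdj treats every directed edge as undirected and builds an
// adjacency map without allocating a simple.UndirectedGraph.
// (A real version would also dedupe reciprocal edges a→b / b→a.)
func undirectedAdj(edges [][2]int64) map[int64][]int64 {
	adj := make(map[int64][]int64)
	for _, e := range edges {
		adj[e[0]] = append(adj[e[0]], e[1])
		adj[e[1]] = append(adj[e[1]], e[0])
	}
	// Sort neighbors so DFS-based articulation detection stays deterministic.
	for id := range adj {
		sort.Slice(adj[id], func(i, j int) bool { return adj[id][i] < adj[id][j] })
	}
	return adj
}

func main() {
	adj := undirectedAdj([][2]int64{{1, 2}, {3, 2}})
	fmt.Println(adj[2]) // [1 3]: both directed edges touch node 2
}
```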
Target: `computeKCore` currently loops `k=1..maxDeg` and rescans nodes; worst case can be large.
Proposed algorithm:
- Batagelj–Zaveršnik linear-time k-core decomposition using bin sorting by degree.
Isomorphism proof sketch:
- Computes the same core number definition; only the algorithmic route changes.
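An illustrative bin-sort peeling sketch (operating on a dense undirected adjacency, not the current gonum graph):

```go
package main

import "fmt"

// coreNumbers computes k-core numbers via Batagelj–Zaveršnik bin-sort
// peeling in O(V+E), given an undirected graph as adjacency lists.
func coreNumbers(adj [][]int) []int {
	n := len(adj)
	deg := make([]int, n)
	maxDeg := 0
	for v := range adj {
		deg[v] = len(adj[v])
		if deg[v] > maxDeg {
			maxDeg = deg[v]
		}
	}
	// bin[d] becomes the start index in vert of vertices with degree d.
	bin := make([]int, maxDeg+1)
	for _, d := range deg {
		bin[d]++
	}
	start := 0
	for d := 0; d <= maxDeg; d++ {
		count := bin[d]
		bin[d] = start
		start += count
	}
	vert := make([]int, n) // vertices sorted by degree
	pos := make([]int, n)  // position of each vertex in vert
	for v := 0; v < n; v++ {
		pos[v] = bin[deg[v]]
		vert[pos[v]] = v
		bin[deg[v]]++
	}
	for d := maxDeg; d > 0; d-- { // restore bin starts
		bin[d] = bin[d-1]
	}
	bin[0] = 0

	core := make([]int, n)
	copy(core, deg)
	// Peel vertices in nondecreasing degree order, lowering neighbors' degrees.
	for i := 0; i < n; i++ {
		v := vert[i]
		for _, u := range adj[v] {
			if core[u] > core[v] {
				du, pu := core[u], pos[u]
				pw := bin[du]
				w := vert[pw]
				if u != w { // swap u to the front of its degree bin
					pos[u], pos[w] = pw, pu
					vert[pu], vert[pw] = w, u
				}
				bin[du]++
				core[u]--
			}
		}
	}
	return core
}

func main() {
	// Triangle 0-1-2 plus a pendant vertex 3 attached to 0.
	adj := [][]int{{1, 2, 3}, {0, 2}, {0, 1}, {0}}
	fmt.Println(coreNumbers(adj)) // [2 2 2 1]
}
```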
This section explicitly maps the user’s optimization checklist to this codebase.
- Index-based lookup vs linear scans: dense-index arrays for graph algorithms.
- Zero-copy / buffer reuse: per-worker scratch buffers for Brandes.
- Bounded queues + backpressure: worker pool rather than per-pivot goroutines (even though sem bounds concurrency, pool enables buffer reuse cleanly).
- Topological sort and DAG-awareness: reuse Phase 1 topo order in Phase 2 helpers (slack).
- Graph traversal with early termination: cycle detection already uses SCC precheck; keep it.
- Heaps/priority queues: not directly needed in current hotspots.
- Memory layout optimization (SoA vs AoS): array-based sigma/dist/delta improves cache locality.
- Short-circuiting: avoid work when metrics are skipped by config or timeouts.
- Serialization format costs: JSON encoding/decoding costs may become significant on 20k issues; revisit once analysis allocations are fixed.
- Cross-process caching: can help repeated `bv --robot-*` on large graphs.
- DP / convex optimization / min-cost flow: no hotspots resemble scheduling/allocation optimization problems.
- Tries, segment trees, spatial indices: not the problem shape here.
- Lock-free data structures: contention is not the limiter; allocations/GC are.
- `pkg/analysis`: `BenchmarkRobotTriage_Sparse500`
- `pkg/analysis`: `BenchmarkFullAnalysis_Sparse1000`
Suggested CI guard (concrete options):
- Benchstat gate: run `./scripts/benchmark.sh compare` and fail if key benches regress beyond a threshold.
- Allocation threshold gate (fast, dependency-free): ensure `BenchmarkApproxBetweenness_500nodes_Sample100` allocs/op stays under a fixed cap after Change 1A.
Example allocation-threshold check:

```bash
go test -run '^$' -bench '^BenchmarkApproxBetweenness_500nodes_Sample100$' -benchmem ./pkg/analysis -count=1 | tee bench.txt
ALLOCS=$(awk '/Sample100/ {print $7}' bench.txt)
if [ "${ALLOCS:-999999999}" -gt 80000 ]; then
  echo "Allocation regression: Sample100 allocs/op=$ALLOCS (cap=80000)"
  exit 1
fi
```

Epsilon/tolerance guardrails:

- `pkg/analysis/golden_test.go` (metric tolerances)
- `pkg/analysis/betweenness_approx_test.go` (epsilon stability)
For Change 1A/1B (betweenness optimization), “success” is defined as all of:
- Meaningful reduction in alloc_space attributed to betweenness (target: from ~70% to “not dominant”).
- CPU profile shows GC time collapsing (runtime.gcDrain no longer dominates).
- For large graphs (5k issues):
- peak RSS falls drastically from ~429MB
- p95 latency and/or throughput improves (re-measure p50/p95/p99 as in §2E)
- All tests remain green under `BV_NO_BROWSER=1 BV_TEST_MODE=1`.