Performance Optimization Plan: Round 1

Executive Summary

This document captures a rigorous performance analysis of beads_viewer following a data-driven methodology. The analysis identified a single dominant hotspot responsible for 71% of memory allocations and 49% of CPU time, with a clear optimization path that is provably isomorphic (outputs unchanged).

Key Finding: Buffer pooling for singleSourceBetweenness can eliminate 60-80% of allocations from the dominant hotspot, with potential for 90%+ when combined with additional caching.

Epsilon Relaxation: Floating-point outputs are considered "equal" within epsilon tolerances (user-approved).

0) Hard Constraints

Repo Invariants (Absolute)

Do not delete any file or directory unless explicitly requested with exact command.
Avoid broad refactors; prefer one performance lever per change.

Methodology Invariants (A→G)

A) Baseline First: Run benchmarks with -benchmem -count=3; record p50/p95/p99 latency, throughput, peak memory
B) Profile Before Proposing: Capture CPU + allocation profiles; identify top 3-5 hotspots
C) Equivalence Oracle: Define golden outputs + invariants; use property tests for large input spaces
D) Isomorphism Proof: Every proposed change includes proof that outputs cannot change
E) Opportunity Matrix: Rank by (Impact × Confidence) / Effort
F) Minimal Diffs: One performance lever per change; include rollback guidance
G) Regression Guardrails: Add benchmark thresholds and CI hooks

1) Architecture Snapshot

1A) Data Plane (robot + TUI share the same engine)

Load issues from .beads/issues.jsonl (preferred) or .beads/beads.jsonl fallback:
- pkg/loader/loader.go (GetBeadsDir, FindJSONLPath, LoadIssues)
Build graph + compute metrics via Analyzer:
- pkg/analysis/graph.go (NewAnalyzer, AnalyzeAsync / AnalyzeWithProfile)
Derive higher-level outputs:
- Unified triage: pkg/analysis/triage.go (ComputeTriageFromAnalyzer)
- Priority tuning: pkg/analysis/priority.go, pkg/analysis/feedback.go
Presentation:
- Robot JSON: cmd/bv/main.go (--robot-*)
- TUI snapshot: pkg/ui/background_worker.go → pkg/ui/snapshot.go

1B) Two-Phase Analysis Contract

Phase 1 (sync, "instant"): Degree centrality, topological sort, density — enough for initial render
Phase 2 (async): PageRank, Betweenness, Eigenvector, HITS, Critical Path, Cycles, k-core, slack, articulation
Size-aware config: pkg/analysis/config.go (ConfigForSize)
Status tracking: Robot outputs include status so consumers know computed vs approx vs timeout vs skipped

2) Baseline Metrics

2A) Environment

go version go1.25.5 linux/amd64
cpu: AMD Ryzen Threadripper PRO 5975WX 32-Cores
nproc: 64
RAM: 499Gi

2B) Representative Workloads

Dataset	Issues	Size	Description
`.beads/issues.jsonl`	570	1.5MB	Real project data
`benchmark/medium.jsonl`	1,000	3.4MB	Synthetic medium
`benchmark/large.jsonl`	5,000	32MB	Synthetic large

2C) Latency Distribution (p50/p95/p99)

Measurement harness:

import math, os, subprocess, time

cmd = ['/tmp/bv_round1', '--robot-triage']
env = os.environ.copy()
env.update({'BV_ROBOT': '1', 'BV_NO_BROWSER': '1', 'BV_TEST_MODE': '1'})

warmup, runs = 5, 50
for _ in range(warmup):
    subprocess.run(cmd, cwd='/data/projects/beads_viewer', env=env,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True)

samples = []
for _ in range(runs):
    t0 = time.perf_counter()
    subprocess.run(cmd, cwd='/data/projects/beads_viewer', env=env,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True)
    samples.append(time.perf_counter() - t0)

samples.sort()
def pct(p):
    k = (len(samples) - 1) * p / 100.0
    f, c = math.floor(k), math.ceil(k)
    return samples[int(k)] if f == c else samples[f] * (1 - (k - f)) + samples[c] * (k - f)

print(f'p50={pct(50)*1000:.1f}ms p95={pct(95)*1000:.1f}ms p99={pct(99)*1000:.1f}ms')
print(f'throughput={runs / sum(samples):.2f} runs/s')

Dataset	p50	p95	p99	Throughput
570 issues	52.6ms	56.7ms	59.2ms	18.88 runs/s
1,000 issues	153.6ms	158.7ms	163.7ms	6.53 runs/s
5,000 issues	1.457s	1.518s	1.540s	0.68 runs/s

2D) Peak Memory (RSS)

/usr/bin/time -v env BV_ROBOT=1 BV_NO_BROWSER=1 BV_TEST_MODE=1 /tmp/bv_round1 --robot-triage >/dev/null

Dataset	Peak RSS
570 issues	38 MB
1,000 issues	42 MB
5,000 issues	429 MB

Interpretation: Typical repos are fast (p50 ~53ms). Scaling to 5k issues shows major memory spike and >1s latency — significant headroom for optimization.

2E) Startup Breakdown (`--profile-startup`)

Phase	570 issues	5,000 issues
load_jsonl	12.0ms	347.5ms
betweenness	9.1ms (63% of Phase 2)	178.7ms
k-core	—	262.7ms
slack	—	96.4ms
eigenvector	—	78.0ms
PageRank	—	48.3ms
total_with_load	27.1ms	1.123s

2F) Benchmark Results

Benchmark	ns/op	allocs/op	B/op
ApproxBetweenness_500nodes_Exact	70,263,842	499,557	34,110,194
ApproxBetweenness_500nodes_Sample100	13,586,671	199,548	29,574,627
ApproxBetweenness_500nodes_Sample50	5,568,698	100,756	14,841,955
RobotTriage_Sparse500	21,590,000	89,500	27,700,000
FullAnalysis_Sparse500	14,232,742	82,316	26,428,791
Cycles_ManyCycles30	500,355,711	3,335,574	1,824,125,368

3) Profiling Results

3A) CPU Profile (138.02s total sample time)

Hotspot	Time (s)	%	Category
runtime.gcDrain	44.04	31.9%	GC overhead
runtime.mapassign_fast64	39.38	28.5%	Map operations
singleSourceBetweenness	68.01	49.3%	Core algorithm
runtime.scanobject	18.24	13.2%	GC scanning
internal/runtime/maps.table.grow	26.31	19.1%	Map growth

Root Cause: ~45% GC (gcDrain + scanobject) + ~48% map operations. These overlap since map allocations trigger GC.

3B) Memory Profile (41.34GB total allocations)

Allocator	Memory (MB)	%
singleSourceBetweenness (direct)	29,463	71.3%
gonum iterators (inside singleSourceBetweenness)	~8,000	~19%
Other	~4,000	~10%

Note: The 71.3% represents DIRECT allocations (4 maps). The gonum iterator allocations (~19%) occur during g.From(v) calls INSIDE singleSourceBetweenness. Total during execution: ~90%.

Root Cause: singleSourceBetweenness creates 4 fresh maps per call:

sigma := make(map[int64]float64)  // N entries
dist := make(map[int64]int)       // N entries
delta := make(map[int64]float64)  // N entries
pred := make(map[int64][]int64)   // N entries + dynamic slices

For a 500-node graph with 100 samples: 400 maps created (4 per sample), totaling ~200K entries.

3C) I/O Profile Sanity Check

strace -c -f env BV_ROBOT=1 BV_NO_BROWSER=1 BV_TEST_MODE=1 /tmp/bv_round1 --robot-triage >/dev/null

Observed: Syscalls dominated by futex/nanosleep; only ~30 read calls. Not I/O-bound.

4) Equivalence Oracle

4A) Input Definition

For "same inputs", we require:

Same parsed issue set (same JSONL contents after parsing/validation)
Same analysis configuration (ConfigForSize given node/edge count)
Same seed for approximate algorithms (approx betweenness uses a seed)
Same "now" when time-based fields are included (triage staleness)

4B) Output Definition

(1) Structural outputs must be exact:

IDs present and counts
Membership of sets (actionable, blockers, quick wins)
Cycle presence and members (when computed)
JSON schema and field names
Deterministic tie-breaking (by ID)

(2) Float outputs are equal within epsilon:

Centralities and composite scores may differ slightly (parallel reduction, iteration order)
Differences within epsilon are accepted

4C) Epsilon Policy

Metric	Absolute Tolerance	Notes
PageRank	1e-5	Iterative convergence variance
Betweenness (exact)	1e-6	Exact algorithm
Betweenness (approx)	1e-6 abs + 1e-12 rel	Compare top-k rank stability
Eigenvector	1e-6
HITS (Hubs/Authorities)	1e-6
CriticalPathScore	1e-6

4D) Ordering Under Epsilon

If two scores differ by less than epsilon, ordering swaps are treated as equivalent as long as:

The set of items is identical
Swapped items were within epsilon of each other

This prevents flaky tests when floats drift in low bits.

4E) Existing Guardrails

Test File	Coverage
`pkg/analysis/golden_test.go`	Numeric tolerances for all metrics
`pkg/analysis/invariance_test.go`	Metamorphic/invariance properties
`pkg/analysis/betweenness_approx_test.go`	Epsilon stability for approx betweenness
`pkg/loader/fuzz_test.go`	Loader fuzzing
`tests/e2e/*`	Robot contract + perf checks

4F) Invariants

Betweenness bounds: 0 ≤ BC(v) ≤ (n-1)(n-2) for directed graphs
PageRank sum: Σ PR(v) = 1.0
Deterministic ordering: Sorted by value descending, then by ID ascending
Map completeness: All node IDs present in output maps

5) Opportunity Matrix

Scoring: (Impact × Confidence) / Effort

#	Candidate	Impact	Confidence	Effort	Score	Notes
1	Buffer pooling for Brandes	0.70	0.95	0.40	1.66	Dominant alloc_space (~71%)
2	Array-based indexing (no maps)	0.50	0.90	0.50	0.90	Cache-friendly
3	Remove undirected graph for k-core	0.25	0.85	0.35	0.61	On 5k nodes, k-core ~263ms
4	Cached adjacency lists	0.30	0.70	0.60	0.35	Avoid gonum iterator overhead
5	Slack computation reuse	0.15	0.90	0.30	0.45	Reuse topo order
6	Cross-process cache (robot mode)	0.15	0.60	0.70	0.13	Keyed by data_hash + config_hash

Round 1 Focus: #1 (buffer pooling). Re-profile after, then evaluate #2-5.

6) Proposed Changes

Change 1 (Primary): Buffer Pooling for Brandes' Algorithm

Location: pkg/analysis/betweenness_approx.go:167-241

Current Implementation:

func singleSourceBetweenness(g *simple.DirectedGraph, source graph.Node, bc map[int64]float64) {
    sourceID := source.ID()
    nodes := graph.NodesOf(g.Nodes())  // Still allocates (not pooled)

    // ALLOCATES 4 FRESH MAPS PER CALL - THIS IS THE PROBLEM
    sigma := make(map[int64]float64)
    dist := make(map[int64]int)
    delta := make(map[int64]float64)
    pred := make(map[int64][]int64)

    // ... rest of algorithm
}

Proposed Implementation:

// brandesBuffers holds reusable data structures for Brandes' algorithm.
type brandesBuffers struct {
    sigma     map[int64]float64
    dist      map[int64]int
    delta     map[int64]float64
    pred      map[int64][]int64
    queue     []int64
    stack     []int64
    neighbors []int64
}

var brandesPool = sync.Pool{
    New: func() interface{} {
        return &brandesBuffers{
            sigma:     make(map[int64]float64, 256),
            dist:      make(map[int64]int, 256),
            delta:     make(map[int64]float64, 256),
            pred:      make(map[int64][]int64, 256),
            queue:     make([]int64, 0, 256),
            stack:     make([]int64, 0, 256),
            neighbors: make([]int64, 0, 32),
        }
    },
}

// reset clears all values but retains map/slice capacity.
func (b *brandesBuffers) reset(nodes []graph.Node) {
    // Clear maps if they've grown too large (prevents unbounded memory growth)
    if len(b.sigma) > len(nodes)*2 {
        clear(b.sigma)
        clear(b.dist)
        clear(b.delta)
        clear(b.pred)
    }

    for _, n := range nodes {
        nid := n.ID()
        b.sigma[nid] = 0
        b.dist[nid] = -1
        b.delta[nid] = 0
        // Reuse slice backing array, clear length
        if existing, ok := b.pred[nid]; ok {
            b.pred[nid] = existing[:0]
        } else {
            b.pred[nid] = make([]int64, 0, 4)  // First-call allocation unavoidable
        }
    }
    b.queue = b.queue[:0]
    b.stack = b.stack[:0]
    b.neighbors = b.neighbors[:0]
}

func singleSourceBetweenness(g *simple.DirectedGraph, source graph.Node, bc map[int64]float64) {
    sourceID := source.ID()
    nodes := graph.NodesOf(g.Nodes())  // Still allocates

    // Get buffer from pool
    buf := brandesPool.Get().(*brandesBuffers)
    defer brandesPool.Put(buf)

    buf.reset(nodes)

    // Use buf.sigma, buf.dist, buf.delta, buf.pred instead of local maps

    // BFS phase - use pooled neighbors slice:
    // buf.neighbors = buf.neighbors[:0]
    // for to.Next() {
    //     buf.neighbors = append(buf.neighbors, to.Node().ID())
    // }

    // ... rest of algorithm unchanged
}

What This Does NOT Address

// Line 169 - STILL ALLOCATES (not pooled)
nodes := graph.NodesOf(g.Nodes())

// Line 116 in ApproxBetweenness - STILL ALLOCATES per goroutine
localBC := make(map[int64]float64)

// gonum iterator overhead from g.From(v) calls

To achieve 90%+ reduction, also pool nodes slice and localBC maps (requires Analyzer-level caching).

7) Isomorphism Proof

Theorem: Buffer reuse produces identical outputs to fresh allocation.

Proof:

Initialization Equivalence:
- Current: sigma[nid] = 0, dist[nid] = -1, delta[nid] = 0
- Proposed: Same assignments in reset()
- Since we iterate over ALL nodes and explicitly set values, prior state is irrelevant.
Graph Change Safety:
- If node IDs differ between calls, stale entries don't affect output
- BFS only visits reachable nodes from source
- All visited nodes are explicitly initialized before use
Predecessor Slice Safety:
- Current: pred[nid] = make([]int64, 0)
- Proposed: pred[nid] = pred[nid][:0] (reuses backing array)
- Equivalence: Empty slice has len=0. Both produce identical behavior.
Floating-Point Determinism:
- All arithmetic operations (σ_v/σ_w) × (1+δ_w) are identical
- IEEE-754 floating-point is deterministic for same inputs
- Ordering guaranteed by sorted node IDs in BFS/accumulation
Concurrency Safety:
- Each goroutine in ApproxBetweenness gets its own buffer from pool
- Buffers are returned AFTER results are written to localBC (before merge to partialBC)
- Results live in localBC, not in the buffer, so merge is safe after buffer return
- sync.Pool guarantees no concurrent access to same buffer
Pool Eviction Safety:
- If buffer is evicted during GC and recreated, behavior is identical to fresh allocation

QED ∎

8) Follow-Up Changes (Post-#1 Profiling)

Change 2: Remove Undirected Graph for k-core/Articulation

Target: pkg/analysis/graph.go (computeCoreAndArticulation, computeKCore, findArticulationPoints)

Rationale: On 5k nodes, k-core takes ~263ms and allocates significantly by constructing simple.UndirectedGraph.

Minimal diff:

Build undirected adjacency view directly from directed edges
Run k-core and articulation on that adjacency without allocating new graph

Isomorphism: Current implementation treats directed edges as undirected by inserting into undirected graph; direct adjacency construction is equivalent.

Change 3: Linear-Time k-core Algorithm

Target: computeKCore loops k=1..maxDeg and rescans nodes; worst-case O(maxDeg·V).

Proposed: Batagelj–Zaveršnik linear-time k-core decomposition using bin-sort peeling.

Isomorphism: Computes same core number definition; only algorithmic route changes.

9) What We Considered

Explicit mapping of optimization techniques to this codebase:

Clearly Applicable (Supported by Profiles)

Technique	Application
Index-based lookup	Dense-index arrays for graph algorithms
Zero-copy / buffer reuse	Per-worker scratch buffers for Brandes
Bounded queues + backpressure	Worker pool rather than per-pivot goroutines
Topological sort reuse	Reuse Phase 1 topo order in Phase 2 helpers (slack)
Memory layout (SoA vs AoS)	Array-based sigma/dist/delta improves cache locality
Short-circuiting	Avoid work when metrics skipped by config/timeouts

Possibly Useful Later

Technique	Notes
Serialization format	JSON encoding may become significant at 20k+ issues
Cross-process caching	Helps repeated `bv --robot-*` on large graphs

Not Currently Justified

Technique	Reason
DP / convex optimization	No hotspots resemble scheduling/allocation problems
Tries / segment trees	Not the problem shape here
Lock-free data structures	Contention is not the limiter; allocations/GC are

10) Expected Gains

Before Optimization

Metric	Value
Allocations/op (Sample100)	199,548
Bytes/op (Sample100)	29.5 MB
GC CPU overhead	~45%

After Optimization - Conservative (Buffer Pooling Only)

Metric	Value	Reduction
Allocations/op (Sample100)	~40,000	80%
Bytes/op (Sample100)	~8 MB	73%
GC CPU overhead	~15%	67%
Throughput	1.5-2x	—

After Optimization - With Additional Caching

If combined with node slice caching and localBC pooling:

Metric	Value	Reduction
Allocations/op (Sample100)	~5,000	97%
Bytes/op (Sample100)	~3 MB	90%
GC CPU overhead	~5%	89%
Throughput	3-5x	—

11) Minimal Diff Summary

The change is isolated to a single file:

pkg/analysis/betweenness_approx.go
  - Add brandesBuffers struct (~25 lines)
  - Add brandesPool with sync.Pool (~10 lines)
  - Add reset() method (~20 lines)
  - Modify singleSourceBetweenness to use pool (~10 line changes)

Total: ~65 lines added/modified

No API changes: Function signature unchanged, output unchanged.

Rollback Guidance

If issues arise:

Remove brandesPool and brandesBuffers
Restore original map allocations in singleSourceBetweenness
No external code changes required

12) Regression Guardrails

12A) Benchmarks to Watch

pkg/analysis/BenchmarkRobotTriage_Sparse500
pkg/analysis/BenchmarkApproxBetweenness_500nodes_Sample100
pkg/analysis/BenchmarkFullAnalysis_Sparse1000

12B) Allocation Threshold Benchmark

func BenchmarkBrandesAllocationThreshold(b *testing.B) {
    issues := generateSparseGraph(500)
    g := buildGraph(issues)
    nodes := graph.NodesOf(g.Nodes())

    b.ReportAllocs()
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        bc := make(map[int64]float64)
        singleSourceBetweenness(g, nodes[0], bc)
    }
    // After optimization: expect < 100 allocs/op (down from ~1000)
}

12C) Property Test

func TestBetweennessOutputEquivalence(t *testing.T) {
    issues := generateSparseGraph(100)

    baseline := computeBetweennessBaseline(issues)
    optimized := computeBetweennessOptimized(issues)

    for id, baseVal := range baseline {
        optVal := optimized[id]
        if math.Abs(baseVal-optVal) > 1e-10 {
            t.Errorf("Mismatch for %s: baseline=%v, optimized=%v", id, baseVal, optVal)
        }
    }
}

12D) CI Integration

- name: Check allocation regression
  run: |
    go test -bench=BenchmarkApproxBetweenness_500nodes_Sample100 \
      -benchmem ./pkg/analysis/... | tee bench.txt

    # Field positions: 1=name 2=iterations 3=ns/op 4="ns/op" 5=B/op 6="B/op" 7=allocs/op
    ALLOCS=$(grep 'Sample100' bench.txt | awk '{print $7}')
    if [ "$ALLOCS" -gt 50000 ]; then
      echo "Allocation regression: $ALLOCS > 50000"
      exit 1
    fi

13) Success Criteria

For Change #1 (buffer pooling), "success" is defined as:

Criterion	Target
alloc_space from betweenness	From ~71% to "not dominant" (<30%)
GC CPU time	runtime.gcDrain no longer dominates
Peak RSS (5k issues)	Drastic reduction from ~429MB
p95 latency (5k issues)	Measurable improvement
All tests	Green under `BV_NO_BROWSER=1 BV_TEST_MODE=1`

Appendix: Raw Profiling Commands

# Run benchmarks with memory stats
go test -bench=. -benchmem -count=3 ./pkg/analysis/... 2>&1 | tee bench_baseline.txt

# Generate CPU profile
go test -run=NONE -bench="BenchmarkApproxBetweenness" \
  -cpuprofile=cpu.prof -benchtime=3s ./pkg/analysis/...

# Generate memory profile
go test -run=NONE -bench="BenchmarkApproxBetweenness" \
  -memprofile=mem.prof -benchtime=3s ./pkg/analysis/...

# Analyze CPU profile
go tool pprof -top cpu.prof | head -40

# Analyze memory profile
go tool pprof -top mem.prof | head -40

# I/O profile
strace -c -f env BV_ROBOT=1 BV_NO_BROWSER=1 BV_TEST_MODE=1 /tmp/bv_round1 --robot-triage >/dev/null

# Interactive profile exploration
go tool pprof -http=:8080 cpu.prof

# Latency distribution (use Python harness from §2C)

Generated: 2026-01-09 Author: Claude Code (performance analysis session) Hybrid: Best elements from OPUS and GPT plans

FilesExpand file tree

PLAN_FOR_ADVANCED_OPTIMIZATIONS_ROUND_1__OPUS.md

Latest commit

History