docs(propose): init/increment perf — bulk graph writes + cached ignore by HumanBean17 · Pull Request #338 · HumanBean17/java-codebase-rag

HumanBean17 · 2026-06-21T18:35:57Z

What

Adds propose/active/INIT-INCREMENT-PERF-PROPOSE.md — a proposal-only design for the init/increment performance program. No production code changed.

Why now

Profiling java-codebase-rag init on a medium Java corpus (Shopizer: 1210 files → 1167 indexed, 3879 chunks, ~32k graph edges, 395s total) showed that graph writes, not the vectors stage, account for ~81% of init:

phase	time	share
LadybugDB graph write (per-row MERGE/CREATE)	~321s	~81%
cocoindex vectors	~68s	~17%
optimize	~5s	~1%

Init/increment latency is the project's stated pain point, so this redirects effort to the real bottleneck.

Highlights

Three PRs under one proposal:

PR-1 — Bulk in-memory-pyarrow COPY FROM for the full rebuild path (init/reprocess). Replaces ~21 per-row conn.execute sites (build_ast_graph.py:3096/3250-3398) with staged bulk loads via COPY <table> FROM $param. A micro-benchmark on the real Symbol schema measured ~300× (5.6 ms/row → 0.018 ms/row). Projected ~321s graph-write → tens of seconds; init ~395s → ~120s. Staging invariants spelled out: REL FROM/TO column rule, CALLS dedup + callee_declaring_role materialized at staging, node-before-edge order. Carries a mandatory equivalence harness (old per-row build vs new bulk build → identical counts + meta + full edge property rows + query results).
PR-2 — Same primitive extended to the incremental path (preserves the Route-MERGE dedup at build_ast_graph.py:3819-3821). Depends on PR-1.
PR-3 — Hoist LayeredIgnore(project_root) to a flow-lifespan cocoindex ContextKey and memoize is_ignored — ~25s → ~0s. Independent of PR-1/2.
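
The staging invariants listed for PR-1 can be illustrated with a minimal sketch in plain Python, so it runs without the database. The row shape (dicts with `src` and `dst`), the `(src, dst)` dedup key, and the helper names `stage_calls_edges` and `load_order` are assumptions made for illustration; they are not the code in build_ast_graph.py. The sketch places the endpoint columns first, drops duplicate CALLS rows before the load, fills in callee_declaring_role at staging time, and orders node tables ahead of edge tables.

```python
def stage_calls_edges(raw_edges, role_by_symbol):
    """Return column-oriented CALLS rows ready for a single bulk load.

    Assumptions (hypothetical, not from the proposal): each edge is a dict
    with "src" and "dst" keys, and the dedup key is the (src, dst) pair.
    """
    unique = {}
    for edge in raw_edges:
        # Keep the first occurrence of each (src, dst) pair.
        unique.setdefault((edge["src"], edge["dst"]), edge)

    # Endpoint columns come first, property columns after them.
    columns = {"from": [], "to": [], "callee_declaring_role": []}
    for src, dst in unique:
        columns["from"].append(src)
        columns["to"].append(dst)
        # Materialize the callee's role now rather than after the load.
        columns["callee_declaring_role"].append(role_by_symbol.get(dst, ""))
    return columns


def load_order(node_tables, edge_tables):
    """Nodes must exist before the edges that reference them."""
    return list(node_tables) + list(edge_tables)
```

Columns staged this way could then be wrapped in a pyarrow table and loaded with one COPY <table> FROM $param per table, in place of one conn.execute call per row.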

No ontology bump (graph contents identical; proven by the harness). All PRs re-index-free — only the write mechanism / a cache change.
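
As a rough sketch of what the equivalence check compares, assume each build can be summarized as a snapshot dict with `counts`, `meta`, `edge_rows`, and `query_results` keys. That snapshot shape and the name `snapshot_diff` are hypothetical, and the real harness may be organized differently. Edge rows are compared as multisets, so write order is ignored while a dropped or duplicated row is still caught.

```python
from collections import Counter


def _row_key(row):
    # Order-independent, hashable form of one edge property row.
    # Assumes property values are hashable (strings, numbers, None).
    return tuple(sorted(row.items()))


def snapshot_diff(old, new):
    """Return the names of the sections in which two snapshots disagree."""
    differing = []
    for section in ("counts", "meta", "query_results"):
        if old.get(section) != new.get(section):
            differing.append(section)

    # Multiset comparison: row order may differ, duplicates may not.
    old_edges = Counter(_row_key(r) for r in old.get("edge_rows", []))
    new_edges = Counter(_row_key(r) for r in new.get("edge_rows", []))
    if old_edges != new_edges:
        differing.append("edge_rows")
    return differing
```

An empty list means the per-row build and the bulk build agree on every compared section; any non-empty result would block the merge.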

Reviewed

A 5-lens subagent review (39 agents, ~1.3M tokens) empirically validated the load-bearing claims (kuzu 0.11.3 COPY FROM into REL tables works; ~300× speedup; line citations spot-checked). Its main catch: an earlier PR-4 (default embedding device to MPS) was dropped — its premise was false. The flow already auto-selects MPS (SBERT_DEVICE unset → device=None → cuda→mps→cpu), so the profiled init ran on MPS (~16s), not CPU; there was no CPU→MPS win to recover. That rejection is recorded in the proposal's Out of scope.
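
The fallback order that invalidated PR-4 can be written out as a small illustration. The function `resolve_device` and its availability flags are hypothetical stand-ins; in the real flow the selection happens inside the embedding library once device=None is passed, and this sketch only mirrors the order described above.

```python
import os


def resolve_device(env=None, cuda_available=False, mps_available=False):
    """Mirror the described order: explicit SBERT_DEVICE, else cuda, mps, cpu."""
    env = os.environ if env is None else env
    explicit = env.get("SBERT_DEVICE")
    if explicit:
        # An explicit setting always wins over auto-selection.
        return explicit
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

On an Apple Silicon machine with SBERT_DEVICE unset, this order already yields "mps", which is why a change to the default had nothing to recover.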

Tests

Docs-only; baseline unchanged.

Out of scope

Implementation of any PR (PR-1…PR-3 follow as separate PRs).
MPS device default — already auto-selected (see Reviewed).
ANN vector index — parked, perf: add ANN vector index if/when flat-scan query latency becomes a bottleneck (parked) #337.
watch live mode — feat: watch mode — keep the index live as files change #336.

References

Proposal: propose/active/INIT-INCREMENT-PERF-PROPOSE.md
Related: feat: watch mode — keep the index live as files change #336 (watch), perf: add ANN vector index if/when flat-scan query latency becomes a bottleneck (parked) #337 (ANN, parked)

… ignore, mps)

Proposal-only. Profiles init at ~395s on a medium Java corpus and sequences three measured, independent levers as four PRs:
- PR-1: bulk COPY FROM for the full rebuild path (init/reprocess), the ~81% graph-write lever; init projected ~395s -> ~120s.
- PR-2: same primitive extended to the incremental path.
- PR-3: hoist LayeredIgnore to a flow-lifespan ContextKey; ~25s -> ~0s.
- PR-4: default embedding device cuda -> mps -> cpu; ~28s -> ~16s on Apple Silicon.

No ontology bump; PR-1/2/3 re-index-free; PR-4 optional re-index callout. ANN index (parked, #337) and watch mode (#336) explicitly out of scope.

Co-Authored-By: Claude <noreply@anthropic.com>

…4, align format)

5-lens subagent review of the proposal found:
- PR-4 (MPS device default) was built on a false premise: the flow already auto-selects MPS (SBERT_DEVICE unset -> device=None -> cuda->mps->cpu), so the profiled init embedded on MPS (~16s), not CPU. Dropped; rationale moved to Out of scope.
- PR-1 mechanism corrected to in-memory pyarrow COPY FROM $param (not Parquet file); staging invariants made explicit (REL FROM/TO column rule, CALLS dedup + callee_declaring_role materialization at staging, node-before-edge order); atomicity note added.
- PR-3 broadened: also memoize is_ignored, not just hoist the constructor.
- Citations fixed: full-rebuild node writer is _write_nodes at :3096 (not the incremental MERGE path at 824-825); ~21 per-row sites in write fns (not 44); _CREATE_SYMBOL/_MERGE_SYMBOL at :3007-3026.

Also aligned the doc to the repo's current propose format (matches LADYBUG-DB-MIGRATE-PROPOSE): natural-English H1, Scope with In/Out subsections, no TL;DR, no PR-body-template section, no edit-history narration.

Co-Authored-By: Claude <noreply@anthropic.com>

)

* docs(plans): execution plan for init/increment perf (PR-P1..PR-P3)

Adds plans/active/PLAN-INIT-INCREMENT-PERF.md and the companion plans/AGENT-PROMPTS-INIT-INCREMENT-PERF.md implementing the approved proposal propose/active/INIT-INCREMENT-PERF-PROPOSE.md. Three PRs:
- PR-P1: bulk in-memory-pyarrow COPY FROM for the full rebuild path; equivalence harness is the merge gate.
- PR-P2: same primitive for the incremental path (Route-MERGE dedup retained).
- PR-P3: lifespan-cached LayeredIgnore (ContextKey) + is_ignored _mega memo.

No production code. Stacks behind proposal PR #338.

Co-Authored-By: Claude <noreply@anthropic.com>

* docs(plans): apply review feedback to init/increment perf plan

5-lens subagent review of the plan found the PR-P1/P2 boundary was architecturally wrong: the graph write helpers are SHARED between the full and incremental paths, so a "full-path-only" split is impossible.
- Verified call graph: _write_edges/_write_routes_and_exposes/_write_nodes_impl/_write_meta are each called by BOTH paths; _write_clients_producers_and_calls is incremental-only (global pass5/6).
- Re-split by write-FUNCTION: PR-P1 = _bulk_copy + _write_edges (the ~250s prize, accelerates both paths); PR-P2 = _write_nodes_impl + _write_routes_and_exposes + _write_clients_producers_and_calls; PR-P3 = ignore cache (independent).
- GraphMeta (_write_meta) left on MERGE (shared, one row); reverses Open Q1.
- Fixed all binding sentinel greps: PR-P1 zeros the edge _CREATE_* only; PR-P2 zeros node/route/client constants + _MERGE_SYMBOL only after both routes functions convert; PR-P3 sentinel narrowed to LayeredIgnore(project_root).is_ignored (the bare-constructor grep wrongly matched once-per-run sites :177/:569, which are correctly left alone).
- Load-order §1f corrected (UnresolvedCallSite before UNRESOLVED_AT; Route/Client/Producer before their edges). Test files qualified (test_brownfield_routes / test_mcp_v2_compose / test_vectors_progress / test_path_filtering).

PR-P2 tests placed in TestIncrementalOrchestrator. Baseline flagged as equivalence anchor, not production invariant. PR-P1 DoD lists the four test names.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>

) All of the init/increment-perf work has landed — the original plan (PR-P1..P3: #340 cached ignore, #341 _write_edges bulk, #342 nodes/routes bulk) and the post-review follow-ups (PR-P4 #343 dependent refresh + DECLARES dedup, PR-P5 #344 annotation-scope fix + route bulk + overrides invariant), plus its proposal (#338). Relocate the plan, agent-prompts, and proposal from active/ to completed/, matching the Ladybug/INDEX-OUTPUT close-out convention (pure rename, no content edits). Co-authored-by: Claude <noreply@anthropic.com>

HumanBean17 and others added 2 commits June 21, 2026 21:35

HumanBean17 changed the title ~~docs(propose): init/increment perf program (bulk graph writes, cached ignore, mps)~~ docs(propose): init/increment perf — bulk graph writes + cached ignore Jun 21, 2026

HumanBean17 mentioned this pull request Jun 21, 2026

docs(plans): execution plan for init/increment perf (PR-P1..PR-P3) #339

Merged

HumanBean17 merged commit 0396492 into master Jun 22, 2026
1 check passed

HumanBean17 mentioned this pull request Jun 22, 2026

docs(plans): move init/increment-perf plan + proposal to completed #345

Merged

HumanBean17 merged 2 commits into master from plan/init-increment-perf-propose
