Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
94 changes: 94 additions & 0 deletions propose/active/INIT-INCREMENT-PERF-PROPOSE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
# Faster init/increment — bulk graph writes + cached ignore

## Status
Proposal — not yet implemented. Design-only; no production code in this PR.

Scope agreed with maintainer: PR-1 rewrites the **full rebuild path only** (init / reprocess); the incremental path and the ignore-cache fix follow as separate PRs under this same proposal.

## Problem Statement

`init` / `increment` wall-clock is the project's stated pain point. A profiled `java-codebase-rag init` on a medium Java corpus (Shopizer: 1210 files → 1167 indexed, 3879 chunks, ~32k graph edges, total **395s**) breaks down as:

| phase | time | share |
|---|---|---|
| **LadybugDB graph write** (edges ~250s + nodes ~62s + routes ~4s write + passes ~5s analysis) | **~321s** | **~81%** |
| cocoindex vectors (embed ~16s on MPS + LayeredIgnore-per-file ~25s + parse/enrich ~3s + LanceDB/orchestration residual) | ~68s | ~17% |
| optimize | ~5s | ~1% |

Two independent root causes account for almost all of it:

1. **Per-row graph writes (the ~81% lever).** `build_ast_graph.py` writes every node and edge one statement at a time via `conn.execute(query, row)` inside loops — ~21 per-row `conn.execute` sites across the full-rebuild write functions (44 is the file-wide total). Nodes via `_write_nodes` (`build_ast_graph.py:3096`, impl `_write_nodes_impl:3029` using `_CREATE_SYMBOL`/`_MERGE_SYMBOL`, strings at `:3007-3026`), called from `write_ladybug:3893`; edges at `3250-3398`. Measured ~7.8 ms/edge → ~250s for ~32k edges. A micro-benchmark on the real Symbol schema measured ~300×: ~5.6 ms/row per-row vs ~0.018 ms/row bulk `COPY FROM`.
2. **`LayeredIgnore` rebuilt per file (a ~25s waste inside the vectors phase).** `process_java_file` / `process_sql_file` / `process_yaml_file` in `java_index_flow_lancedb.py` each construct `LayeredIgnore(project_root)` once per file *and* re-run `is_ignored`'s `_mega(spec)` merge per file. On 1167 files that is ~25s of work for an object + spec that are identical for the whole flow run.

A third candidate — defaulting the embedding device to MPS — was investigated and rejected (see Out of scope): the flow already auto-selects MPS, so the profiled embed ran on MPS (~16s), not CPU.

## Proposed Solution

Three PRs. The graph write is by far the largest lever, so PR-1 is the priority; PR-2 and PR-3 are independent of each other and can land in any order.

### PR-1 — Bulk `COPY FROM` for the full rebuild path

The full build assembles the entire graph in memory (`GraphTables`, fully populated by pass1–pass6 before any `_write_*` call, `build_ast_graph.py:3914`) and then writes it. That makes bulk insert a clean swap: instead of per-row `conn.execute` loops, stage the assembled rows and load them with kuzu `COPY FROM`.

**Mechanism — in-memory pyarrow.** The `ladybug` wrapper's first-class bulk path is `COPY <table> FROM $param` with an in-memory pyarrow table (`conn.execute` forwards `COPY FROM` verbatim and accepts a pyarrow param). Build the `pa.table` from the existing in-memory `*_rows` lists — zero disk I/O, native to the wrapper. Parquet-file staging is the fallback only.

**Staging invariants (must hold for byte-equivalence):**
- **REL-table column rule.** kuzu `COPY FROM` into a REL table requires the first two columns to be the FROM/TO node primary keys (the node `id`). The staging shape per table must match the `_SCHEMA_*` constants exactly.
- **Materialize write-time work at staging.** Several rows are computed *during* the current per-row writes and must be produced before staging: the CALLS dedup (`seen_calls`, `build_ast_graph.py:3282-3288`), the `callee_declaring_role` lookup, and the UNRESOLVED dedup (`seen_ucs`, `3317-3321`). Apply these to the in-memory lists, then stage the result.
- **Node-before-edge ordering.** Stage and load all node tables before REL tables (kuzu enforces endpoint existence). The current `_write_*` call order already does this; preserve it.

**Atomicity.** The current per-row path is not atomic (per-statement autocommit; a crash mid-build leaves a partial graph). `COPY FROM` raises this to per-table atomicity — an improvement, not a regression; the overall "rebuild in place" crash-safety story is unchanged.

**Equivalence.** The rewritten build must produce a byte-equivalent graph, proven by an equivalence harness (see Tests): node/edge counts, GraphMeta counters, full edge property rows (incl. `source_file`, CALLS `callee_declaring_role`), and a battery of Cypher queries identical between old and new.

Expected: ~321s graph-write phase → tens of seconds; overall `init` on this corpus from ~395s toward ~120s (projected; measured in PR-1).

### PR-2 — Bulk write for the incremental path (follow-up)

Refactor a shared stage→`COPY FROM` primitive out of PR-1 and apply it to the incremental `_delete_file_scope` → re-emit flow, preserving the pass5/6 `MERGE (r:Route)` dedup semantics (`build_ast_graph.py:3819-3821`).

### PR-3 — Cached `LayeredIgnore` (+ `is_ignored` memo) as a `ContextKey`

Hoist the three per-file `LayeredIgnore(project_root)` constructions (`java_index_flow_lancedb.py:351/423/471`) into a single instance built once per flow run via a cocoindex `ContextKey` (lifespan-scoped — `PROJECT_ROOT`, `EMBEDDER`, `LANCE_DB` are already `ContextKey`s in `coco_lifespan`, `:60-72`/`:287-306`). Additionally memoize `is_ignored`'s `_mega(spec)` merge (cache the merged spec, or LRU by relative path) — the per-file cost is in `is_ignored`, not just the constructor, so construction hoisting alone will not reach ~0s. Keeps the cocoindex dependency inside `java_index_flow_lancedb.py` (AGENTS.md compliant). Expected ~25s → ~0s.

## Scope

### In scope

- **PR-1:** replace per-row node/edge writes in the full rebuild path (`write_ladybug:3893` → `_write_nodes:3096` → `_write_nodes_impl:3029`; edge emit `3250-3398`) with bulk in-memory-pyarrow `COPY FROM`; add equivalence harness + benchmark.
- **PR-2:** shared bulk primitive applied to the incremental path (preserve Route-MERGE dedup).
- **PR-3:** hoist `LayeredIgnore` to a flow-lifespan `ContextKey` and memoize `is_ignored`.

### Out of scope

- **MPS embedding default — not needed.** The init flow already auto-selects MPS on Apple Silicon: `SBERT_DEVICE` unset → `config.py:280` omits it from the subprocess env → flow constructs `SentenceTransformerEmbedder(device=None)` (`java_index_flow_lancedb.py:291`) → `SentenceTransformer(device=None)` auto-detects `cuda → mps → cpu`. On this machine `torch.backends.mps.is_available()` is true, so the profiled init ran on MPS (~16s embed). There is no CPU→MPS win to recover; an override already exists (`SBERT_DEVICE` / `--embedding-device` / YAML `embedding.device`).
- ANN vector index — parked (issue #337); query latency is fine today and ANN would tax indexing.
- `watch` live mode — issue #336.
- Replacing or restructuring the cocoindex flow.
- Changing the embedding model or dimension.
- Parallelizing the graph analysis passes (pass1–pass6).
- Converting the incremental write path in PR-1 (it is PR-2).

## Schema / Ontology / Re-index impact

- **Ontology bump:** not required. No node/edge kinds, properties, or enrichment semantics change. `ontology_version` stays 17.
- **Re-index required:** no. PR-1/2 change only the write mechanism; PR-3 changes only a cache. The graph contents are identical (proven by the equivalence harness), so users pick up the faster path on their next `init` / `reprocess` / `increment` naturally — no migration.
- **Config / tool surface:** none new.

## Tests / Validation

1. **PR-1 equivalence harness (mandatory).** Build the same source tree old-way (per-row) and new-way (`COPY FROM`); assert identical: node count, per-type edge counts, `GraphMeta` counters (via `java-codebase-rag meta` / `GraphMetaOutput`), full property rows for a sample of N edges per type (including `source_file` and CALLS `callee_declaring_role` — proving the staging dedup/materialization is correct), and a battery of representative Cypher queries (`neighbors`, `find`, `describe`) returning identical rows. Run on `tests/bank-chat-system`, the call-graph smoke fixture, and one larger corpus.
2. **PR-1 benchmark.** Capture `init` wall-clock before/after on the medium corpus; report the graph-write phase delta.
3. **PR-2 incremental equivalence.** `increment` after a single-file change yields the same graph as a full rebuild of that state (reuse the harness).
4. **PR-3.** Assert the ignore object is constructed once per flow run (not per file) and `is_ignored` is memoized; existing flow tests unchanged; micro-benchmark confirms the ~25s drop.

## Open Questions ([TBD])

1. Should the `GraphMeta` single-row MERGE (`build_ast_graph.py:3472-3473`) also move to bulk in PR-1, or stay per-row? — Recommended: **fold it into PR-1** (it is in the full-rebuild path; one extra small staging set).
2. PR-3 cache vehicle — cocoindex `ContextKey` vs a module-global? — Recommended: **`ContextKey`** (cocoindex-native, lifespan-scoped, keeps the dependency in the flow module).

## Sequencing / Follow-ups

- **PR-1** — bulk in-memory-pyarrow `COPY FROM` for the full rebuild path + equivalence harness + benchmark. Biggest win (~81% phase).
- **PR-2** — shared bulk primitive applied to the incremental path. Depends on PR-1's primitive.
- **PR-3** — `LayeredIgnore` + `is_ignored` → flow-lifespan `ContextKey`. Independent of PR-1/2.
Loading