perf(graph): bulk COPY FROM for nodes, routes, clients/producers (PR-P2) #342

Merged
HumanBean17 merged 2 commits into master from perf/bulk-graph-writes-p2
Jun 22, 2026
@HumanBean17

Owner

Scope

This PR implements PR-P2 from plans/active/PLAN-INIT-INCREMENT-PERF.md § PR-P2, converting the remaining graph write helpers to bulk COPY FROM:

  1. _write_nodes_impl (shared workhorse) — stage Symbol rows then _bulk_copy(conn, "Symbol", _NODE_COLUMNS, rows)
  2. _write_routes_and_exposes (shared) — per-table staging + _bulk_copy for Route/EXPOSES/Client/Producer/DECLARES_*/HTTP_CALLS/ASYNC_CALLS
  3. _write_clients_producers_and_calls (incremental-only) — Client/Producer/edges bulk; MERGE (r:Route) retained + commented
  4. _write_meta — left UNTOUCHED (on MERGE as planned)

Deletes _CREATE_SYMBOL, _MERGE_SYMBOL, _CREATE_ROUTE, _CREATE_EXPOSES, and 6 shared _CREATE_* constants (CLIENT/PRODUCER/DECLARES_*/HTTP_CALL/ASYNC_CALL).

Referential integrity handling

Applies the lesson from PR-P1's CI failure: bulk COPY FROM enforces referential integrity. This PR:

  • Generalizes _existing_symbol_ids to _existing_node_ids (queries all node types: Symbol, Route, Client, Producer)
  • Filters edge row endpoints (FROM/TO) against _existing_node_ids(conn) for EXPOSES, DECLARES_CLIENT, DECLARES_PRODUCER, HTTP_CALLS, ASYNC_CALLS
  • Bulk-loads Route/Client/Producer NODES before edges that reference them (re-queries valid_ids after node bulk-load)
  • Filters existing nodes in _write_nodes_impl to avoid duplicate primary key errors in incremental path
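
The filtering steps above could look roughly like the following. This is a minimal sketch, not the code in build_ast_graph.py: the row shapes (dicts with `id`, `from_id`, `to_id` keys) and the helper names `drop_existing_nodes` and `filter_edge_rows` are hypothetical, and `existing` stands in for whatever `_existing_node_ids(conn)` returns.

```python
def drop_existing_nodes(rows, existing_ids):
    """Skip node rows whose id is already in the graph (avoids duplicate PK errors)."""
    return [r for r in rows if r["id"] not in existing_ids]


def filter_edge_rows(rows, valid_ids):
    """Keep only edges whose FROM and TO endpoints both exist as nodes."""
    return [r for r in rows if r["from_id"] in valid_ids and r["to_id"] in valid_ids]


# Hypothetical ids, for illustration only.
existing = {"sym:A"}
new_nodes = drop_existing_nodes([{"id": "sym:A"}, {"id": "route:R"}], existing)

# Re-compute the valid id set AFTER the node rows are loaded, then stage edges.
valid_ids = existing | {r["id"] for r in new_nodes}
edges = filter_edge_rows(
    [
        {"from_id": "sym:A", "to_id": "route:R"},
        {"from_id": "sym:A", "to_id": "route:missing"},  # dangling endpoint
    ],
    valid_ids,
)
```

The ordering is the point of the sketch: the id set used for edges is built only after the node rows are in, so edges to freshly loaded nodes are not dropped.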

Column ordering fix

Enhanced _bulk_copy to select columns in exact schema order via tbl.select(columns). Without this, pyarrow's Table.from_pylist orders columns by the key order of the staged row dicts rather than by LadybugDB's schema order, and the mismatch causes type coercion errors.
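
To illustrate why a fixed column order matters, here is a stdlib-only sketch. It deliberately does not use pyarrow; `project_rows` and the three-column `COLUMNS` tuple are hypothetical stand-ins for the `_NODE_COLUMNS`-style constants. It projects staged dicts with differing key orders onto one explicit column order, which is the guarantee that `tbl.select(columns)` is meant to provide inside `_bulk_copy`.

```python
COLUMNS = ("id", "kind", "name")  # hypothetical schema order


def project_rows(rows, columns):
    """Return one tuple per row, with values in exactly `columns` order."""
    missing = sorted({c for r in rows for c in columns if c not in r})
    if missing:
        raise KeyError(f"staged row lacks column(s): {missing}")
    return [tuple(r[c] for c in columns) for r in rows]


staged = [
    {"name": "Foo", "id": "sym:1", "kind": "type"},    # keys not in schema order
    {"id": "sym:2", "kind": "member", "name": "bar"},
]
ordered = project_rows(staged, COLUMNS)
```

Whatever order the staging code happens to build its dicts in, the loader only ever sees values positioned by the column constant.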

Manual evidence

Full rebuild of tests/bank-chat-system with bulk writes:

rm -rf /tmp/p2-evidence && .venv/bin/python build_ast_graph.py \
  --source-root tests/bank-chat-system \
  --ladybug-path /tmp/p2-evidence/code_graph.lbug --verbose
.venv/bin/java-codebase-rag meta --source-root tests/bank-chat-system --index-dir /tmp/p2-evidence

Graph-write phase timing (from JCIRAG_PROGRESS lines):

  • nodes written in 0.04s
  • edges written in 0.15s
  • routes/exposes written in 0.16s

Meta counts (from counts_json):

  • Symbol nodes: 29 packages + 130 files + 140 types + 606 members = 905 total
  • Routes: 29 (12 spring_mvc, 3 kafka, 14 builtin/layer_c_source)
  • Clients: 8, Producers: 9
  • Edges: EXTENDS 18, IMPLEMENTS 21, INJECTS 94, DECLARES 606, OVERRIDES 38, CALLS 678, EXPOSES 15, DECLARES_CLIENT 8, DECLARES_PRODUCER 9, HTTP_CALLS 8, ASYNC_CALLS 9

Tests

All tests pass:

  • tests/test_incremental_graph.py — 30 tests (including 2 new: test_incremental_bulk_write_equivalent_to_full_rebuild, test_incremental_route_merge_dedup_preserved)
  • tests/test_ast_graph_build.py — 28 tests
  • tests/test_schema_consistency.py — 8 tests

Sentinel greps (all pass)

Must return zero:

grep -nE "_CREATE_(SYMBOL|ROUTE|EXPOSES|CLIENT|PRODUCER|DECLARES_CLIENT|DECLARES_PRODUCER|HTTP_CALL|ASYNC_CALL)\b" build_ast_graph.py
grep -nE "_MERGE_SYMBOL\b" build_ast_graph.py

Must be non-zero:

grep -n "MERGE (r:Route" build_ast_graph.py     # retained dedup (1 hit)
grep -n "MERGE (m:GraphMeta" build_ast_graph.py  # retained (1 hit)
grep -nE "COPY .*FROM .rows" build_ast_graph.py # bulk present (23 _bulk_copy calls)
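
The same sentinel checks could also be scripted in Python, for example inside a test. A rough sketch, assuming the file content is passed in as a string: `check_sentinels` is a hypothetical helper that is not part of the repository, and the `sample` text is made up for illustration.

```python
import re

# Patterns that must NOT appear (deleted per-row constants).
FORBIDDEN = [
    re.compile(
        r"_CREATE_(SYMBOL|ROUTE|EXPOSES|CLIENT|PRODUCER|DECLARES_CLIENT"
        r"|DECLARES_PRODUCER|HTTP_CALL|ASYNC_CALL)\b"
    ),
    re.compile(r"_MERGE_SYMBOL\b"),
]
# Patterns that MUST appear (retained MERGEs and the bulk path).
REQUIRED = [
    re.compile(re.escape("MERGE (r:Route")),
    re.compile(re.escape("MERGE (m:GraphMeta")),
    re.compile(r"COPY .*FROM .rows"),
]


def check_sentinels(source):
    """Return a list of violations; an empty list means every check passes."""
    lines = source.splitlines()
    problems = []
    for pat in FORBIDDEN:
        hits = [i for i, line in enumerate(lines, 1) if pat.search(line)]
        if hits:
            problems.append(f"forbidden {pat.pattern!r} on lines {hits}")
    for pat in REQUIRED:
        if not any(pat.search(line) for line in lines):
            problems.append(f"required {pat.pattern!r} not found")
    return problems


# Made-up source text, for illustration only.
sample = (
    'conn.execute("MERGE (r:Route {id: $id})")\n'
    'conn.execute("MERGE (m:GraphMeta {id: 1})")\n'
    'q = "COPY Symbol FROM $rows"\n'
)
```

Calling `check_sentinels(open("build_ast_graph.py").read())` would then mirror the five greps in one pass.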

Files changed

  • build_ast_graph.py — bulk conversions + column constants + _existing_node_ids helper
  • tests/test_incremental_graph.py — 2 new tests in TestIncrementalOrchestrator

HumanBean17 and others added 2 commits June 22, 2026 14:51
- Convert _write_nodes_impl (shared workhorse) to bulk COPY FROM for Symbol nodes
- Convert _write_routes_and_exposes (shared) to bulk COPY FROM for Route/EXPOSES/Client/Producer/DECLARES_*/HTTP_CALLS/ASYNC_CALLS
- Convert _write_clients_producers_and_calls (incremental-only) to bulk COPY FROM for Client/Producer/edges
- Delete _CREATE_SYMBOL, _MERGE_SYMBOL, _CREATE_ROUTE, _CREATE_EXPOSES, and 6 shared _CREATE_* constants
- Retain MERGE (r:Route) dedup in _write_clients_producers_and_calls with comment
- Add _existing_node_ids helper to generalize referential integrity filtering for all node types
- Add column constants: _ROUTE_COLUMNS, _CLIENT_COLUMNS, _PRODUCER_COLUMNS, _REL_*_COLUMNS for routes/clients/producers
- Filter existing nodes in incremental path to avoid duplicate primary key errors
- Re-fetch valid_ids after bulk-loading Route/Client/Producer nodes before edge staging
- Add test_incremental_bulk_write_equivalent_to_full_rebuild and test_incremental_route_merge_dedup_preserved to TestIncrementalOrchestrator

Co-Authored-By: Claude <noreply@anthropic.com>

PR-P2's _write_clients_producers_and_calls (incremental global pass 5-6)
applied the edge-endpoint filter to NODE writes: it filtered
client/producer node rows against the existing-id set. But the caller
DETACH-DELETEs every Client/Producer node immediately before invoking
this, so the pre-load id set is empty by construction — the filter
dropped ALL Client and Producer nodes (and their DECLARES_*/HTTP_CALLS/
ASYNC_CALLS edges) on every increment. Repro on http_caller_smoke:
Client=5, Producer=3 -> 0, 0 after a single increment.

Load Client/Producer nodes unconditionally (matching _write_routes_and_exposes
and the old per-row path); keep the endpoint filter on EDGES only, against
the post-load id set. The bug was latent: test_incremental_with_http_clients_
does_not_fall_back used a client-bearing corpus but only asserted
mode=="incremental" — strengthen it to assert Client/Producer nodes survive
the increment. Also drop a redundant unused module-level write_ladybug
import that tripped ruff F401/F811.

Co-Authored-By: Claude <noreply@anthropic.com>
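
The failure mode fixed by the second commit can be reproduced in miniature. The sketch below is illustrative only: `FakeGraph`, `write_buggy`, and `write_fixed` are hypothetical stand-ins, not code from build_ast_graph.py, and the graph is modelled as a plain set of node ids plus a list of edge pairs.

```python
class FakeGraph:
    """Toy stand-in for the graph: node ids plus (from_id, to_id) edge pairs."""

    def __init__(self, nodes=()):
        self.nodes = set(nodes)
        self.edges = []


def write_buggy(graph, node_ids, edges):
    # Bug: the edge-endpoint filter is applied to NODE rows. After the caller's
    # DETACH DELETE no Client/Producer id is in the graph, so all rows are dropped.
    existing = set(graph.nodes)
    graph.nodes |= {n for n in node_ids if n in existing}
    valid = set(graph.nodes)
    graph.edges += [(a, b) for a, b in edges if a in valid and b in valid]


def write_fixed(graph, node_ids, edges):
    # Fix: load nodes unconditionally, then filter EDGES only,
    # against the id set as it stands after the node load.
    graph.nodes |= set(node_ids)
    valid = set(graph.nodes)
    graph.edges += [(a, b) for a, b in edges if a in valid and b in valid]


# Hypothetical ids, for illustration only.
clients = ["client:1", "client:2"]
calls = [("sym:caller", "client:1"), ("sym:caller", "client:gone")]

buggy = FakeGraph(nodes={"sym:caller"})
write_buggy(buggy, clients, calls)

fixed = FakeGraph(nodes={"sym:caller"})
write_fixed(fixed, clients, calls)
```

In the buggy variant both client nodes and both edges vanish; in the fixed variant the clients survive and only the edge with a dangling endpoint is filtered out.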
HumanBean17 merged commit 100c666 into master on Jun 22, 2026
1 check passed
HumanBean17 deleted the perf/bulk-graph-writes-p2 branch on June 22, 2026 12:30
HumanBean17 added a commit that referenced this pull request Jun 22, 2026
)

All of the init/increment-perf work has landed — the original plan
(PR-P1..P3: #340 cached ignore, #341 _write_edges bulk, #342 nodes/routes
bulk) and the post-review follow-ups (PR-P4 #343 dependent refresh +
DECLARES dedup, PR-P5 #344 annotation-scope fix + route bulk + overrides
invariant), plus its proposal (#338). Relocate the plan, agent-prompts,
and proposal from active/ to completed/, matching the Ladybug/INDEX-OUTPUT
close-out convention (pure rename, no content edits).

Co-authored-by: Claude <noreply@anthropic.com>
HumanBean17 added a commit that referenced this pull request Jun 22, 2026
Performance release: faster init / reprocess / increment with no graph,
schema, or CLI changes. Bulk COPY FROM graph writes (#341-#342) and the
lifespan-cached LayeredIgnore (#340) take init from ~395s toward ~140s on
the profiled medium corpus; the graph-write phase drops ~316s -> ~0.4s.

No re-index required -- the bulk path is byte-equivalent to 0.6.4 (verified
node-for-node and edge-for-edge, all properties + GraphMeta counters).
Ontology version unchanged (17).

Co-authored-by: Claude <noreply@anthropic.com>