perf(graph): bulk COPY FROM for _write_edges (PR-P1) #341
Merged
Implement bulk COPY FROM for _write_edges, accelerating both full and
incremental graph writes (an estimated ~250s saving against the ~321s
init baseline).
Changes:
- Add _bulk_copy(conn, table_name, columns, rows) helper using in-memory
pyarrow COPY FROM (step-1 spike result in docstring).
- Add column-order constants (_REL_*_COLUMNS, _UNRESOLVED_CALL_SITE_COLUMNS)
matching _SCHEMA_* declarations; REL tables list FROM/TO first.
- Convert _write_edges to per-type row staging with bulk load:
- Apply CALLS dedup (src, dst, argc, line) at staging.
- Materialize callee_declaring_role at staging via _callee_declaring_role_at_write.
- Load UnresolvedCallSite nodes before UNRESOLVED_AT edges.
- One _bulk_copy call per edge type (EXTENDS, IMPLEMENTS, INJECTS,
DECLARES, OVERRIDES, CALLS, UNRESOLVED_AT).
- Delete dead _CREATE_EXT/IMPL/INJ/DECL/OVERRIDES/CALL constants and local
_CREATE_UNRESOLVED/_CREATE_UNRESOLVED_AT (removed with rewrite).
- Add graph_baseline_bank_chat.json fixture (node/edge counts, GraphMeta,
sampled edge properties) from bulk build.
- Add four tests verifying equivalence, determinism, CALLS dedup preservation,
and empty-REL-table no-op.
Scope: PR-P1 converts ONLY _write_edges + adds _bulk_copy primitive.
Nodes, routes, clients/producers, and GraphMeta remain unchanged (PR-P2).
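The staging step described above can be illustrated with a minimal sketch. It is not the PR's code: the `CallEdge` type, the `stage_calls_rows` name, the `CALLS_COLUMNS` tuple, and the "first occurrence wins" rule are all assumptions made for illustration. Only the dedup key (src, dst, argc, line) and the FROM/TO-first column order come from the commit message.

```python
from typing import Iterable, NamedTuple

# Hypothetical column order. The commit message only states that REL tables
# list FROM/TO first; the remaining names here are illustrative.
CALLS_COLUMNS = ("from", "to", "argc", "line", "callee_declaring_role")


class CallEdge(NamedTuple):
    """Illustrative in-memory edge; not the project's real type."""
    src: str
    dst: str
    argc: int
    line: int
    role: str


def stage_calls_rows(edges: Iterable[CallEdge]) -> list[tuple]:
    """Stage CALLS rows, keeping one row per (src, dst, argc, line).

    Assumption: the first occurrence of a key wins.
    """
    seen: set[tuple[str, str, int, int]] = set()
    rows: list[tuple] = []
    for edge in edges:
        key = (edge.src, edge.dst, edge.argc, edge.line)
        if key in seen:
            continue  # duplicate call site: skip instead of writing twice
        seen.add(key)
        # Row layout follows CALLS_COLUMNS, with FROM/TO first.
        rows.append((edge.src, edge.dst, edge.argc, edge.line, edge.role))
    return rows
```

Staging rows as plain tuples in a fixed column order means the whole list can be handed to a single bulk load per edge type, rather than issuing one statement per edge.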
Tests:
- New tests: test_bulk_write_edges_match_per_row_baseline,
test_bulk_write_is_deterministic_double_build,
test_bulk_write_preserves_calls_dedup_and_callee_declaring_role,
test_bulk_write_empty_rel_table_is_noop.
- Full test_ast_graph_build.py + test_incremental_graph.py pass unchanged
(56 passed; 2 pre-existing incremental failures in PR-P2 scope).
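One way the equivalence and determinism checks could work is an order-insensitive comparison of two graph snapshots. The sketch below is an assumption, not the actual test code: `edge_snapshot` and `snapshots_equal` are hypothetical helpers, and the input shape (edge type mapped to a list of row tuples) is chosen only for illustration.

```python
def edge_snapshot(edges_by_type: dict[str, list[tuple]]) -> dict[str, list[tuple]]:
    """Return a canonical view of the edges: rows sorted within each type.

    Sorting by repr() avoids comparison errors when rows mix strings,
    integers, and None values.
    """
    return {
        edge_type: sorted(rows, key=repr)
        for edge_type, rows in edges_by_type.items()
    }


def snapshots_equal(a: dict[str, list[tuple]], b: dict[str, list[tuple]]) -> bool:
    """True when both builds contain the same rows per edge type, in any order."""
    return edge_snapshot(a) == edge_snapshot(b)
```

Comparing canonical snapshots rather than raw query results keeps the check independent of insertion order, which may legitimately differ between a per-row write and a bulk load.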
Sentinel greps (all pass):
- Must return zero: _CREATE_(EXT|IMPL|INJ|DECL|OVERRIDES|CALL|UNRESOLVED|UNRESOLVED_AT)
- Must be non-zero: _MERGE_SYMBOL, _CREATE_CLIENT, MERGE (r:Route, COPY FROM
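The same sentinel checks can be expressed in Python for use in a script or test. This is a hedged sketch: `check_sentinels` is a hypothetical helper, and the file or files the plan actually greps are not stated here, so the function takes the source text as a string. The patterns are copied from the two lines above.

```python
import re

# Pattern for the deleted constants; it must match nothing.
DEAD_CONSTANTS = re.compile(
    r"_CREATE_(EXT|IMPL|INJ|DECL|OVERRIDES|CALL|UNRESOLVED|UNRESOLVED_AT)"
)

# Literal strings that must still appear (retained for PR-P2, plus COPY FROM).
RETAINED = ("_MERGE_SYMBOL", "_CREATE_CLIENT", "MERGE (r:Route", "COPY FROM")


def check_sentinels(source: str) -> dict[str, bool]:
    """Map each sentinel to True when its expectation holds for `source`."""
    results = {"dead_constants_absent": DEAD_CONSTANTS.search(source) is None}
    for needle in RETAINED:
        results[needle] = needle in source  # plain substring, like grep -F
    return results
```

Note that `_CREATE_CLIENT` does not trip the dead-constant pattern, because none of the alternatives is a prefix of `CLIENT`.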
Manual evidence (bank-chat-system):
- Graph-write phase: nodes 5.11s, edges 0.20s (bulk), routes/exposes 0.44s
- Node count: 959, edge counts: EXTENDS 18, IMPLEMENTS 21, INJECTS 94,
DECLARES 606, OVERRIDES 38, CALLS 678, UNRESOLVED_AT 227
Co-Authored-By: Claude <noreply@anthropic.com>
PR-P1 bulk _write_edges crashed on CI ("Unable to find primary key
value") for corpora where some edges reference Symbol nodes that
_write_nodes never materialized. The legacy per-row MERGE silently
dropped these (MATCH on a missing endpoint creates nothing); bulk
COPY FROM enforces referential integrity and hard-fails, cascading
into fixture-build errors and incremental full_fallback.
Add _existing_symbol_ids(conn) — a live DB query (not just `tables`),
which stays correct for the incremental path whose edges legitimately
reference nodes written in prior runs. Filter every edge table's rows
to endpoints that exist, exactly reproducing MERGE's silent-drop, so
the bulk graph is identical to the per-row baseline. Also repoint
test_calls_edge_has_callee_declaring_role_column at the bulk write
path, since the "$callee_declaring_role" SQL param it checked was
deleted.
Co-Authored-By: Claude <noreply@anthropic.com>
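The endpoint filter described in this fix can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `filter_rows_to_existing` is a hypothetical name, and `existing_ids` stands in for whatever `_existing_symbol_ids(conn)` returns (assumed here to be a set of node ids). The sketch relies on the stated convention that REL rows list FROM/TO first.

```python
def filter_rows_to_existing(
    rows: list[tuple], existing_ids: set[str]
) -> list[tuple]:
    """Keep only rows whose FROM and TO endpoints both exist.

    Assumes row[0] is the FROM id and row[1] is the TO id. Rows with a
    missing endpoint are dropped silently, mirroring the way a per-row
    MATCH on a missing node creates nothing.
    """
    return [
        row
        for row in rows
        if row[0] in existing_ids and row[1] in existing_ids
    ]
```

For example, with `existing_ids = {"A", "B"}`, a row `("A", "C", 1, 10)` is dropped because `"C"` was never written, while `("A", "B", 1, 10)` is kept. Filtering before the bulk load avoids the hard failure on a missing primary key.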
HumanBean17 added a commit that referenced this pull request Jun 22, 2026
All of the init/increment-perf work has landed: the original plan (PR-P1..P3: #340 cached ignore, #341 _write_edges bulk, #342 nodes/routes bulk) and the post-review follow-ups (PR-P4 #343 dependent refresh + DECLARES dedup, PR-P5 #344 annotation-scope fix + route bulk + overrides invariant), plus its proposal (#338). Relocate the plan, agent-prompts, and proposal from active/ to completed/, matching the Ladybug/INDEX-OUTPUT close-out convention (pure rename, no content edits).
Co-authored-by: Claude <noreply@anthropic.com>
HumanBean17 added a commit that referenced this pull request Jun 22, 2026
Performance release: faster init / reprocess / increment with no graph, schema, or CLI changes. Bulk COPY FROM graph writes (#341-#342) and the lifespan-cached LayeredIgnore (#340) take init from ~395s toward ~140s on the profiled medium corpus; the graph-write phase drops ~316s -> ~0.4s. No re-index required -- the bulk path is byte-equivalent to 0.6.4 (verified node-for-node and edge-for-edge, all properties + GraphMeta counters). Ontology version unchanged (17).
Co-authored-by: Claude <noreply@anthropic.com>
Scope
Implements PR-P1 from plans/active/PLAN-INIT-INCREMENT-PERF.md: bulk COPY FROM for _write_edges (shared helper → accelerates both full and incremental paths).
What Changed
- Added _bulk_copy(conn, table_name, columns, rows) helper using in-memory pyarrow COPY FROM (step-1 spike result in docstring).
- Added column-order constants (_REL_*_COLUMNS, _UNRESOLVED_CALL_SITE_COLUMNS) matching _SCHEMA_* declarations; REL tables list FROM/TO first.
- Converted _write_edges to per-type row staging with bulk load:
  - CALLS dedup (src, dst, argc, line) applied at staging.
  - callee_declaring_role materialized at staging via _callee_declaring_role_at_write.
  - UnresolvedCallSite nodes loaded before UNRESOLVED_AT edges.
  - One _bulk_copy call per edge type (EXTENDS, IMPLEMENTS, INJECTS, DECLARES, OVERRIDES, CALLS, UNRESOLVED_AT).
- Deleted dead _CREATE_EXT/_CREATE_IMPL/_CREATE_INJ/_CREATE_DECL/_CREATE_OVERRIDES/_CREATE_CALL constants and local _CREATE_UNRESOLVED/_CREATE_UNRESOLVED_AT (removed with rewrite).
- Added tests/fixtures/graph_baseline_bank_chat.json fixture (node/edge counts, GraphMeta, sampled edge properties).
Out of Scope (PR-P2)
Nodes (_write_nodes_impl), routes/clients/producers (_write_routes_and_exposes, _write_clients_producers_and_calls), and GraphMeta (_write_meta) remain unchanged.
Tests
New tests (all pass):
- test_bulk_write_edges_match_per_row_baseline: bulk write matches baseline fixture.
- test_bulk_write_is_deterministic_double_build: two builds produce identical graphs.
- test_bulk_write_preserves_calls_dedup_and_callee_declaring_role: CALLS dedup and role preserved.
- test_bulk_write_empty_rel_table_is_noop: empty edge list is a no-op.
Regression (56 passed):
- test_ast_graph_build.py suite passes unchanged.
- test_incremental_graph.py suite passes (2 pre-existing failures in PR-P2 scope).
- test_bank_chat_brownfield_integration.py + test_call_edges_e2e.py pass unchanged.
Ruff:
- .venv/bin/ruff check . is clean.
Sentinel Greps
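All sentinel greps pass (from plan):
- Must return zero (dead constants deleted): _CREATE_(EXT|IMPL|INJ|DECL|OVERRIDES|CALL|UNRESOLVED|UNRESOLVED_AT)
- Must be non-zero (retained for PR-P2): _MERGE_SYMBOL, _CREATE_CLIENT, MERGE (r:Route, COPY FROM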
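Manual Evidence
Build of tests/bank-chat-system (graph-write phase timing): nodes 5.11s, edges 0.20s (bulk), routes/exposes 0.44s.
Counts (from java-codebase-rag meta): 959 nodes; edges: EXTENDS 18, IMPLEMENTS 21, INJECTS 94, DECLARES 606, OVERRIDES 38, CALLS 678, UNRESOLVED_AT 227.
The bulk "edges written in 0.20s" (vs. ~250s baseline from per-row writes in the plan) demonstrates the speedup. Counts match the baseline fixture exactly.
Co-Authored-By: Claude <noreply@anthropic.com>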