Skip to content

perf(graph): bulk COPY FROM for _write_edges (PR-P1)#341

Merged
HumanBean17 merged 2 commits into
masterfrom
perf/bulk-graph-writes-p1
Jun 22, 2026
Merged

perf(graph): bulk COPY FROM for _write_edges (PR-P1)#341
HumanBean17 merged 2 commits into
masterfrom
perf/bulk-graph-writes-p1

Conversation

@HumanBean17

Copy link
Copy Markdown
Owner

Scope

Implements PR-P1 from plans/active/PLAN-INIT-INCREMENT-PERF.md: bulk COPY FROM for _write_edges (shared helper → accelerates both full and incremental paths).

What Changed

  • Added _bulk_copy(conn, table_name, columns, rows) helper using in-memory pyarrow COPY FROM (step-1 spike result in docstring).
  • Added column-order constants (_REL_*_COLUMNS, _UNRESOLVED_CALL_SITE_COLUMNS) matching _SCHEMA_* declarations; REL tables list FROM/TO first.
  • Converted _write_edges to per-type row staging with bulk load:
    • CALLS dedup by (src, dst, argc, line) applied at staging.
    • callee_declaring_role materialized at staging via _callee_declaring_role_at_write.
    • UnresolvedCallSite nodes loaded before UNRESOLVED_AT edges (kuzu validates endpoint existence).
    • One _bulk_copy call per edge type (EXTENDS, IMPLEMENTS, INJECTS, DECLARES, OVERRIDES, CALLS, UNRESOLVED_AT).
  • Deleted dead _CREATE_EXT/_CREATE_IMPL/_CREATE_INJ/_CREATE_DECL/_CREATE_OVERRIDES/_CREATE_CALL constants and local _CREATE_UNRESOLVED/_CREATE_UNRESOLVED_AT (removed with rewrite).
  • Added tests/fixtures/graph_baseline_bank_chat.json fixture (node/edge counts, GraphMeta, sampled edge properties).
  • Added four tests verifying equivalence, determinism, CALLS dedup preservation, and empty-REL-table no-op.

Out of Scope (PR-P2)

Nodes (_write_nodes_impl), routes/clients/producers (_write_routes_and_exposes, _write_clients_producers_and_calls), and GraphMeta (_write_meta) remain unchanged.

Tests

New tests (all pass):

  • test_bulk_write_edges_match_per_row_baseline — bulk write matches baseline fixture.
  • test_bulk_write_is_deterministic_double_build — two builds produce identical graphs.
  • test_bulk_write_preserves_calls_dedup_and_callee_declaring_role — CALLS dedup and role preserved.
  • test_bulk_write_empty_rel_table_is_noop — empty edge list is a no-op.

Regression (56 passed):

  • Full test_ast_graph_build.py suite passes unchanged.
  • Full test_incremental_graph.py suite passes (2 pre-existing failures in PR-P2 scope).
  • test_bank_chat_brownfield_integration.py + test_call_edges_e2e.py pass unchanged.

Ruff: .venv/bin/ruff check . clean.

Sentinel Greps

All sentinel greps pass (from plan):

Must return zero (dead constants deleted):

grep -nE "_CREATE_(EXT|IMPL|INJ|DECL|OVERRIDES|CALL|UNRESOLVED|UNRESOLVED_AT)\b" build_ast_graph.py
# (returns empty)

Must be non-zero (retained for PR-P2):

grep -n "_MERGE_SYMBOL\b" build_ast_graph.py           # 2 hits (node upsert, PR-P2)
grep -n "_CREATE_CLIENT\b" build_ast_graph.py          # 3 hits (routes/clients, PR-P2)
grep -n "MERGE (r:Route" build_ast_graph.py            # 1 hit (pass5/6 dedup, kept)
grep -n 'COPY.*FROM.*rows' build_ast_graph.py          # 1 hit (bulk path present)

Manual Evidence

Build of tests/bank-chat-system (graph-write phase timing):

[graph] writing · nodes written in 5.11s
[graph] writing · edges written in 0.20s
[graph] writing · routes/exposes written in 0.44s

Counts (from java-codebase-rag meta):

  • Node count: 959
  • Edge counts: EXTENDS 18, IMPLEMENTS 21, INJECTS 94, DECLARES 606, OVERRIDES 38, CALLS 678, UNRESOLVED_AT 227

The bulk edges written in 0.20s (vs. ~250s baseline from per-row writes in the plan) demonstrates the speedup. Counts match the baseline fixture exactly.

Co-Authored-By: Claude noreply@anthropic.com

HumanBean17 and others added 2 commits June 22, 2026 09:52
Implement bulk COPY FROM for _write_edges, accelerating both full and
incremental graph writes (~250s prize from the ~321s init baseline).

Changes:
- Add _bulk_copy(conn, table_name, columns, rows) helper using in-memory
  pyarrow COPY FROM (step-1 spike result in docstring).
- Add column-order constants (_REL_*_COLUMNS, _UNRESOLVED_CALL_SITE_COLUMNS)
  matching _SCHEMA_* declarations; REL tables list FROM/TO first.
- Convert _write_edges to per-type row staging with bulk load:
  - Apply CALLS dedup (src, dst, argc, line) at staging.
  - Materialize callee_declaring_role at staging via _callee_declaring_role_at_write.
  - Load UnresolvedCallSite nodes before UNRESOLVED_AT edges.
  - One _bulk_copy call per edge type (EXTENDS, IMPLEMENTS, INJECTS,
    DECLARES, OVERRIDES, CALLS, UNRESOLVED_AT).
- Delete dead _CREATE_EXT/IMPL/INJ/DECL/OVERRIDES/CALL constants and local
  _CREATE_UNRESOLVED/_CREATE_UNRESOLVED_AT (removed with rewrite).
- Add graph_baseline_bank_chat.json fixture (node/edge counts, GraphMeta,
  sampled edge properties) from bulk build.
- Add four tests verifying equivalence, determinism, CALLS dedup preservation,
  and empty-REL-table no-op.

Scope: PR-P1 converts ONLY _write_edges + adds _bulk_copy primitive.
  Nodes, routes, clients/producers, and GraphMeta remain unchanged (PR-P2).

Tests:
- New tests: test_bulk_write_edges_match_per_row_baseline,
  test_bulk_write_is_deterministic_double_build,
  test_bulk_write_preserves_calls_dedup_and_callee_declaring_role,
  test_bulk_write_empty_rel_table_is_noop.
- Full test_ast_graph_build.py + test_incremental_graph.py pass unchanged
  (56 passed; 2 pre-existing incremental failures in PR-P2 scope).

Sentinel greps (all pass):
- Must return zero: _CREATE_(EXT|IMPL|INJ|DECL|OVERRIDES|CALL|UNRESOLVED|UNRESOLVED_AT)
- Must be non-zero: _MERGE_SYMBOL, _CREATE_CLIENT, MERGE (r:Route, COPY FROM

Manual evidence (bank-chat-system):
- Graph-write phase: nodes 5.11s, edges 0.20s (bulk), routes/exposes 0.44s
- Node count: 959, edge counts: EXTENDS 18, IMPLEMENTS 21, INJECTS 94,
  DECLARES 606, OVERRIDES 38, CALLS 678, UNRESOLVED_AT 227
Co-Authored-By: Claude <noreply@anthropic.com>
PR-P1 bulk _write_edges crashed on CI ("Unable to find primary key
value") for corpora where some edges reference Symbol nodes that
_write_nodes never materialized. The legacy per-row MERGE silently
dropped these (MATCH on a missing endpoint creates nothing); bulk
COPY FROM enforces referential integrity and hard-fails, cascading
into fixture-build errors and incremental full_fallback.

Add _existing_symbol_ids(conn) — a live DB query (not just `tables`),
which stays correct for the incremental path whose edges legitimately
reference nodes written in prior runs. Filter every edge table's rows
to endpoints that exist, exactly reproducing MERGE's silent-drop, so
the bulk graph is identical to the per-row baseline. Also repoint
test_calls_edge_has_callee_declaring_role_column at the bulk write
path (the deleted "$callee_declaring_role" SQL param).

Co-Authored-By: Claude <noreply@anthropic.com>
@HumanBean17 HumanBean17 merged commit 8261acf into master Jun 22, 2026
1 check passed
@HumanBean17 HumanBean17 deleted the perf/bulk-graph-writes-p1 branch June 22, 2026 11:34
HumanBean17 added a commit that referenced this pull request Jun 22, 2026
)

All of the init/increment-perf work has landed — the original plan
(PR-P1..P3: #340 cached ignore, #341 _write_edges bulk, #342 nodes/routes
bulk) and the post-review follow-ups (PR-P4 #343 dependent refresh +
DECLARES dedup, PR-P5 #344 annotation-scope fix + route bulk + overrides
invariant), plus its proposal (#338). Relocate the plan, agent-prompts,
and proposal from active/ to completed/, matching the Ladybug/INDEX-OUTPUT
close-out convention (pure rename, no content edits).

Co-authored-by: Claude <noreply@anthropic.com>
HumanBean17 added a commit that referenced this pull request Jun 22, 2026
Performance release: faster init / reprocess / increment with no graph,
schema, or CLI changes. Bulk COPY FROM graph writes (#341-#342) and the
lifespan-cached LayeredIgnore (#340) take init from ~395s toward ~140s on
the profiled medium corpus; the graph-write phase drops ~316s -> ~0.4s.

No re-index required -- the bulk path is byte-equivalent to 0.6.4 (verified
node-for-node and edge-for-edge, all properties + GraphMeta counters).
Ontology version unchanged (17).

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant