Add external node table by andyfengHKU · Pull Request #4105 · kuzudb/kuzu

andyfengHKU · 2024-08-19T20:58:26Z

Description

This is our initial PR to support directly execute cypher on relational database. It contains logic for node table only. Major changes including

Nested catalog entry.

We add a new catalog entry type ExternalNodeTable which has a nested entry structure. At parent level, it maintains the logic view of properties which aligns with the columns in relational tables. At child level, it maintains another catalog entry which contains the primary key property only. This child entry aligns with our physical storage.

In the current design, we still need to materialize primary key and use it as join condition when we try to read a property that does not exist in storage.

Scan external table

When we run MATCH (a:label) where label points to an external relational table, we need to scan external relational table and the primary key column materialized in our storage and then perform a join on primary key.

Some sanity benchmark numbers are

Setup
LDBC10 Comment table storing in DuckDB database. 8 Threads.

DuckDB native scanning: 0.3s.
Kuzu scanning DuckDB: 2s.
Scanning external database: copy primary key (6s) + join (3s) = 9s.

Slower than DuckDB is expected as we need to first materialize DuckDB's result and then re-scan it. The major overhead is in us scanning DuckDB result which @acquamarin should see if we can further optimize this.

Another bottleneck is the copy of primary key. I'm fairly confident I can bring the time to ~2s with some optimization.

Fixes # (issue)

Contributor agreement

I have read and agree to the Contributor Agreement.

github-actions · 2024-08-19T21:32:15Z

Benchmark Result

Master commit hash: 87de7f9128ff9946b95e62bad69bd8d8053a176c
Branch commit hash: fa217e70eddcc0e885602cc88af766a210af9ff6

Query Group	Query Name	Mean Time - Commit (ms)	Mean Time - Master (ms)	Diff
aggregation	q24	674.13	664.18	9.95 (1.50%)
aggregation	q28	11628.45	11765.74	-137.29 (-1.17%)
filter	q14	153.00	143.50	9.50 (6.62%)
filter	q15	153.21	141.37	11.84 (8.37%)
filter	q16	329.48	320.61	8.87 (2.77%)
filter	q17	473.39	466.17	7.21 (1.55%)
filter	q18	1953.89	1970.60	-16.70 (-0.85%)
fixed_size_expr_evaluator	q07	564.07	556.76	7.31 (1.31%)
fixed_size_expr_evaluator	q08	772.76	765.63	7.13 (0.93%)
fixed_size_expr_evaluator	q09	773.93	764.40	9.53 (1.25%)
fixed_size_expr_evaluator	q10	266.35	255.68	10.67 (4.17%)
fixed_size_expr_evaluator	q11	260.79	250.99	9.81 (3.91%)
fixed_size_expr_evaluator	q12	259.71	250.45	9.27 (3.70%)
fixed_size_expr_evaluator	q13	1502.11	1478.49	23.62 (1.60%)
fixed_size_seq_scan	q23	145.55	135.71	9.84 (7.25%)
join	q31	11.61	12.63	-1.01 (-8.03%)
ldbc_snb_ic	q35	775.55	772.53	3.02 (0.39%)
ldbc_snb_ic	q36	51.92	48.53	3.39 (6.99%)
ldbc_snb_is	q32	8.66	10.00	-1.34 (-13.42%)
ldbc_snb_is	q33	16.73	19.10	-2.37 (-12.41%)
ldbc_snb_is	q34	8.12	8.78	-0.66 (-7.47%)
multi-rel	multi-rel-large-scan	2795.10	3118.05	-322.95 (-10.36%)
multi-rel	multi-rel-lookup	74.58	45.70	28.88 (63.19%)
multi-rel	multi-rel-small-scan	53.76	56.26	-2.50 (-4.44%)
order_by	q25	158.70	149.55	9.15 (6.12%)
order_by	q26	478.05	464.24	13.81 (2.98%)
order_by	q27	1430.05	1408.37	21.68 (1.54%)
scan_after_filter	q01	197.39	191.00	6.39 (3.35%)
scan_after_filter	q02	189.61	178.79	10.82 (6.05%)
shortest_path_ldbc100	q39	112.64	114.59	-1.95 (-1.70%)
var_size_expr_evaluator	q03	2098.64	2059.91	38.73 (1.88%)
var_size_expr_evaluator	q04	2246.39	2271.61	-25.22 (-1.11%)
var_size_expr_evaluator	q05	2646.17	2709.45	-63.28 (-2.34%)
var_size_expr_evaluator	q06	1395.17	1398.01	-2.84 (-0.20%)
var_size_seq_scan	q19	1492.84	1463.68	29.16 (1.99%)
var_size_seq_scan	q20	3149.41	3098.81	50.60 (1.63%)
var_size_seq_scan	q21	2459.69	2384.55	75.15 (3.15%)
var_size_seq_scan	q22	135.84	129.81	6.03 (4.65%)

ray6080 · 2024-08-21T18:21:11Z

src/binder/bind/bind_ddl.cpp

+    // Bind physical create node table info
+    auto pkDefinition = getDefinition(propertyDefinitions, extraInfo.pkName);
+    std::vector<PropertyDefinition> physicalPropertyDefinitions;
+    physicalPropertyDefinitions.push_back(pkDefinition.copy());


Do you need to copy here? getDefinition returns a copied one already.

ray6080 · 2024-08-21T18:33:32Z

src/binder/bind/bind_graph_pattern.cpp

 static std::vector<std::string> getPropertyNames(const std::vector<TableCatalogEntry*>& entries) {
-    std::vector<std::string> result;
-    std::unordered_set<std::string> propertyNamesSet;
+    auto distinctVector = DistinctVector<std::string>();


The DistinctVector usage seems unnecessarily complicated here, why not just use a set and copy to a vector at the end?

ray6080 · 2024-08-21T18:36:31Z

src/planner/plan/plan_join_order.cpp

    newSubgraph.addQueryNode(nodePos);
    auto plan = std::make_unique<LogicalPlan>();
-    appendScanNodeTable(node->getInternalID(), node->getTableIDs(), {}, *plan);
+    appendScanNodeTable(node->getInternalID(), {}, node->getEntries(), *plan);


Better add an inline comment to {}.

ray6080 · 2024-08-21T18:36:44Z

src/planner/plan/plan_join_order.cpp

        auto [boundNode, nbrNode] = getBoundAndNbrNodes(*rel, direction);
        const auto extendDirection = getExtendDirection(*rel, *boundNode);
-        appendScanNodeTable(boundNode->getInternalID(), boundNode->getTableIDs(), {}, *plan);
+        appendScanNodeTable(boundNode->getInternalID(), {}, boundNode->getEntries(), *plan);


Ditto on adding an inline comment to {}.

github-actions · 2024-08-26T19:33:34Z

Benchmark Result

Master commit hash: cf1330d6d9eb64cfca1e2e1c3f38ec21b1e2113a
Branch commit hash: 7317ae29bd073138c87524aa121ea447528ad94d

Query Group	Query Name	Mean Time - Commit (ms)	Mean Time - Master (ms)	Diff
aggregation	q24	671.30	664.38	6.93 (1.04%)
aggregation	q28	11091.24	11131.84	-40.61 (-0.36%)
filter	q14	152.07	141.89	10.18 (7.18%)
filter	q15	152.12	143.51	8.61 (6.00%)
filter	q16	339.24	329.23	10.01 (3.04%)
filter	q17	475.77	464.14	11.63 (2.51%)
filter	q18	1978.25	1989.96	-11.71 (-0.59%)
fixed_size_expr_evaluator	q07	569.22	560.67	8.56 (1.53%)
fixed_size_expr_evaluator	q08	777.93	779.56	-1.63 (-0.21%)
fixed_size_expr_evaluator	q09	782.76	774.03	8.73 (1.13%)
fixed_size_expr_evaluator	q10	264.53	259.92	4.60 (1.77%)
fixed_size_expr_evaluator	q11	259.49	254.62	4.87 (1.91%)
fixed_size_expr_evaluator	q12	257.90	253.28	4.62 (1.82%)
fixed_size_expr_evaluator	q13	1493.24	1495.33	-2.09 (-0.14%)
fixed_size_seq_scan	q23	147.46	136.59	10.87 (7.96%)
join	q31	13.25	13.17	0.08 (0.61%)
ldbc_snb_ic	q35	743.04	755.55	-12.52 (-1.66%)
ldbc_snb_ic	q36	47.80	40.92	6.89 (16.83%)
ldbc_snb_is	q32	10.46	9.89	0.57 (5.80%)
ldbc_snb_is	q33	18.10	18.51	-0.41 (-2.24%)
ldbc_snb_is	q34	8.82	9.05	-0.23 (-2.54%)
multi-rel	multi-rel-large-scan	2732.98	2920.13	-187.16 (-6.41%)
multi-rel	multi-rel-lookup	38.29	66.73	-28.44 (-42.62%)
multi-rel	multi-rel-small-scan	50.63	48.80	1.83 (3.74%)
order_by	q25	157.29	147.10	10.19 (6.93%)
order_by	q26	476.13	466.22	9.91 (2.12%)
order_by	q27	1438.58	1431.95	6.63 (0.46%)
scan_after_filter	q01	197.72	191.78	5.94 (3.10%)
scan_after_filter	q02	186.20	181.47	4.74 (2.61%)
shortest_path_ldbc100	q39	74.16	75.70	-1.54 (-2.04%)
var_size_expr_evaluator	q03	2103.01	2079.95	23.06 (1.11%)
var_size_expr_evaluator	q04	2278.73	2295.57	-16.84 (-0.73%)
var_size_expr_evaluator	q05	2581.27	2615.85	-34.58 (-1.32%)
var_size_expr_evaluator	q06	1418.66	1385.06	33.59 (2.43%)
var_size_seq_scan	q19	1508.82	1480.47	28.34 (1.91%)
var_size_seq_scan	q20	3253.31	3216.86	36.45 (1.13%)
var_size_seq_scan	q21	2468.25	2449.41	18.84 (0.77%)
var_size_seq_scan	q22	135.77	132.76	3.01 (2.27%)

andyfengHKU force-pushed the external-table-base branch from 4e36d27 to e2e77a5 Compare August 19, 2024 21:11

ray6080 approved these changes Aug 21, 2024

View reviewed changes

andyfengHKU added 2 commits August 26, 2024 10:56

Add external node table

0345fc1

rebase

b386c41

andyfengHKU force-pushed the external-table-base branch from 3a97fc6 to b386c41 Compare August 26, 2024 19:01

Run clang-format

02f83f8

mewim requested a review from benjaminwinger as a code owner August 13, 2025 14:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add external node table#4105

Add external node table#4105
andyfengHKU wants to merge 3 commits intomasterfrom
external-table-base

andyfengHKU commented Aug 19, 2024 •

edited

Loading

Uh oh!

github-actions bot commented Aug 19, 2024

Uh oh!

ray6080 Aug 21, 2024

Uh oh!

ray6080 Aug 21, 2024

Uh oh!

ray6080 Aug 21, 2024

Uh oh!

ray6080 Aug 21, 2024

Uh oh!

github-actions bot commented Aug 26, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andyfengHKU commented Aug 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Nested catalog entry.

Scan external table

Contributor agreement

Uh oh!

github-actions bot commented Aug 19, 2024

Benchmark Result

Uh oh!

ray6080 Aug 21, 2024

Choose a reason for hiding this comment

Uh oh!

ray6080 Aug 21, 2024

Choose a reason for hiding this comment

Uh oh!

ray6080 Aug 21, 2024

Choose a reason for hiding this comment

Uh oh!

ray6080 Aug 21, 2024

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Aug 26, 2024

Benchmark Result

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

andyfengHKU commented Aug 19, 2024 •

edited

Loading