docs/docs/ops/storages.md (88 additions & 60 deletions)
@@ -95,16 +95,20 @@ In CocoIndex, you can export data to property graph databases.
95
95
This usually involves more than one collector, and you export them to different types of graph elements (nodes and relationships).
96
96
In particular,
97
97
98
-
1. You can export rows from some collectors to nodes in the graph.
99
-
2. You can export rows from some other collectors to relationships in the graph.
100
-
3. Some nodes referenced by relationships exported in 2 may not exist as nodes exported in 1.
101
-
CocoIndex will automatically create and keep these nodes, as long as they're still referenced by at least one relationship.
102
-
This guarantees that all relationships exported in 2 are valid.
98
+
1. You can export rows from some collectors to nodes in the graph. Each row is mapped to a node.
99
+
2. You can export rows from some other collectors to relationships in the graph. Each row is mapped to a relationship, together with the source and target nodes of the relationship.
100
+
3. Since relationships exported in 2 may share the same nodes (with nodes exported in 1, and other relationships exported in 2),
101
+
CocoIndex will match and deduplicate nodes, using their primary key as each node's identifier.
103
102
104
-
We provide common types `NodeMapping`, `RelationshipMapping`, and `ReferencedNode`, to configure for each situation.
105
-
They're agnostic to specific graph databases.
103
+
This is what you need to provide to export data to a property graph database:
106
104
107
-
#### Nodes
105
+
* Provide `Nodes` to export nodes with each label (part 1).
106
+
* For labels that will appear as source/target nodes of relationships (part 2) without being exported as nodes (part 1), you need to declare these labels to provide label-level configuration like primary keys.
107
+
* Provide `Relationships` to export relationships with each type (part 2).
108
+
109
+
CocoIndex takes care of matching and deduplicating nodes, without additional configuration.
110
+
111
+
#### Nodes to Export
108
112
109
113
Here's how CocoIndex data elements map to nodes in the graph:
110
114
@@ -114,9 +118,9 @@ Here's how CocoIndex data elements map to nodes in the graph:
114
118
| a collected row | a node |
115
119
| a field | a property of node |
116
120
117
-
Note that the label used in different `NodeMapping`s should be unique.
121
+
Note that the labels used in different `Nodes` specs should be unique.
118
122
119
-
`cocoindex.storages.NodeMapping` is to describe mapping to nodes. It has the following fields:
123
+
`cocoindex.storages.Nodes` describes the mapping to nodes. It has the following fields:
If a node label needs to appear as the source or target of a relationship but isn't exported as a node, you need to [declare](../core/flow_def#target-declarations) the label with the necessary configuration.
177
+
178
+
The dataclass to describe the declaration is specific to each target storage (e.g. `cocoindex.storages.Neo4jDeclarations`),
179
+
while they share the following common fields:
180
+
181
+
* `nodes_label` (required): The label of the node.
182
+
* Options for [storage indexes](../core/flow_def#storage-indexes).
183
+
    * `primary_key_fields` (required)
184
+
    * `vector_indexes` (optional)
185
+
186
+
We continue with the same example as above.
187
+
Suppose we want to extract relationships from `Document` to `Place` later (i.e. a document mentions a place), but the `Place` label isn't exported as a node. In that case, we need to declare it:
188
+
189
+
```python
flow_builder.declare(
    cocoindex.storages.Neo4jDeclarations(
        connection=...,
        nodes_label="Place",
        primary_key_fields=["name"],
    ),
)
```
198
+
199
+
#### Relationships to Export
171
200
172
201
Here's how CocoIndex data elements map to relationships in the graph:
173
202
@@ -177,12 +206,12 @@ Here's how CocoIndex data elements map to relationships in the graph:
177
206
| a collected row | a relationship |
178
207
| a field | a property of relationship, or a property of source/target node, based on configuration |
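
To make the row-to-relationship mapping in the table concrete, here is a minimal sketch in plain Python. It does not use the CocoIndex API: `Node`, `Relationship`, `row_to_relationship`, the `MENTION` type, and the sample field names are all hypothetical, and the rule that unmapped fields become relationship properties is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Node:
    # Hypothetical stand-in for a graph node (not a CocoIndex type).
    label: str
    properties: dict[str, Any]


@dataclass
class Relationship:
    # Hypothetical stand-in for a graph relationship (not a CocoIndex type).
    rel_type: str
    source: Node
    target: Node
    properties: dict[str, Any]


def row_to_relationship(
    row: dict[str, Any],
    rel_type: str,
    source_label: str,
    source_fields: dict[str, str],
    target_label: str,
    target_fields: dict[str, str],
) -> Relationship:
    """Split one collected row into a relationship plus its two endpoint nodes.

    `source_fields` / `target_fields` map a row field name to a node property name.
    Assumption of this sketch: fields claimed by neither endpoint stay on the relationship.
    """
    source = Node(source_label, {prop: row[field] for field, prop in source_fields.items()})
    target = Node(target_label, {prop: row[field] for field, prop in target_fields.items()})
    claimed = set(source_fields) | set(target_fields)
    rel_props = {name: value for name, value in row.items() if name not in claimed}
    return Relationship(rel_type, source, target, rel_props)


# One collected row: "document a.md mentions the place Paris".
row = {"id": "rel-1", "doc_filename": "a.md", "place_name": "Paris"}
rel = row_to_relationship(
    row,
    rel_type="MENTION",
    source_label="Document",
    source_fields={"doc_filename": "filename"},
    target_label="Place",
    target_fields={"place_name": "name"},
)
```

Here the single row yields a `Document` node, a `Place` node, and a `MENTION` relationship between them that carries only the `id` field.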
179
208
180
-
Note that the type used in different `RelationshipMapping`s should be unique.
209
+
Note that the types used in different `Relationships` specs should be unique.
181
210
182
-
`cocoindex.storages.RelationshipMapping` is to describe mapping to relationships. It has the following fields:
211
+
`cocoindex.storages.Relationships` describes the mapping to relationships. It has the following fields:
183
212
184
213
* `rel_type` (type: `str`): The type of the relationship.
185
-
* `source`/`target` (type: `cocoindex.storages.NodeReferenceMapping`): Specify how to extract source/target node information from the collected row. It has the following fields:
214
+
* `source`/`target` (type: `cocoindex.storages.NodeFromFields`): Specify how to extract source/target node information from specific fields in the collected row. It has the following fields:
186
215
    * `label` (type: `str`): The label of the node.
187
216
    * `fields` (type: `Sequence[cocoindex.storages.TargetFieldMapping]`): Specify field mappings from the collected rows to node properties, with the following fields:
188
217
        * `source` (type: `str`): The name of the field in the collected row.
The nodes and relationships we got above are discrete elements.
333
+
To fit them into a connected property graph, CocoIndex will match and deduplicate nodes:
286
334
287
-
If a node appears as source or target of a relationship, but not exported using `NodeMapping`, CocoIndex will automatically create and keep these nodes until they're no longer referenced by any relationships.
335
+
* Match nodes based on their primary key values. Nodes with the same primary key values are considered the same node.
336
+
* For non-primary-key fields (a.k.a. value fields), CocoIndex will pick the values from an arbitrary one of the matched nodes.
337
+
If multiple nodes (before deduplication) with the same primary key provide value fields, an arbitrary one will be picked.
288
338
289
-
:::note Merge of node values
339
+
:::note
290
340
291
-
If the same node (as identified by primary key values) appears multiple times (e.g. they're referenced by different relationships),
292
-
CocoIndex uses value fields provided by an arbitrary one of them.
293
341
The best practice is to make the value fields consistent across different appearances of the same node, to avoid non-determinism in the exported graph.
294
342
295
343
:::
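
The matching rule above can be illustrated with a minimal sketch in plain Python. It is not CocoIndex's implementation: `deduplicate` and the sample data are hypothetical, nodes are assumed to be identified by label plus primary key values, and "first appearance that provides value fields wins" is just this sketch's stand-in for the arbitrary choice.

```python
from typing import Any

# One node appearance before deduplication:
# (label, primary key values, value fields).
Appearance = tuple[str, tuple[Any, ...], dict[str, Any]]


def deduplicate(
    appearances: list[Appearance],
) -> dict[tuple[str, tuple[Any, ...]], dict[str, Any]]:
    """Merge appearances that share the same label and primary key values."""
    merged: dict[tuple[str, tuple[Any, ...]], dict[str, Any]] = {}
    for label, key, values in appearances:
        node_id = (label, key)
        if node_id not in merged:
            # First time we see this node: keep whatever value fields it has.
            merged[node_id] = dict(values)
        elif not merged[node_id] and values:
            # Earlier appearances had no value fields; take these instead.
            merged[node_id] = dict(values)
        # Otherwise keep the value fields already chosen.
    return merged


appearances: list[Appearance] = [
    ("Place", ("Paris",), {}),                     # referenced by a relationship, no value fields
    ("Place", ("Paris",), {"country": "France"}),  # same node, with value fields
    ("Place", ("Paris",), {"country": "FR"}),      # inconsistent value fields for the same node
    ("Document", ("a.md",), {"title": "A"}),
]
graph_nodes = deduplicate(appearances)
```

The three `Place` appearances collapse into one node. Which `country` value survives depends on the order of appearances, which is why consistent value fields are recommended.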
296
344
297
-
If a node's label specified in `NodeReferenceMapping` doesn't exist in any `NodeMapping`, you need to [declare](../core/flow_def#target-declarations) a `ReferencedNode` to configure [storage indexes](../core/flow_def#storage-indexes) for nodes with this label.
298
-
The following options are supported:
299
-
300
-
* `primary_key_fields` (required)
301
-
* `vector_indexes` (optional)
302
-
303
-
Using the same example above.
304
-
After combining exported nodes and relationships, we get the knowledge graph with all information:
345
+
After matching and deduplication, we get the final graph:
Nodes with `Place` label in the example aren't exported explicitly using `NodeMapping`, so CocoIndex will automatically create them as long as they're still referenced by any relationship.
@@ -388,6 +413,9 @@ The `Neo4j` storage exports each row as a relationship to Neo4j Knowledge Graph.
388
413
Neo4j also provides a declaration spec `Neo4jDeclaration`, to configure indexing options for nodes only referenced by relationships. It has the following fields:
389
414
390
415
* `connection` (type: auth reference to `Neo4jConnectionSpec`)