docs/docs/ops/storages.md
Lines changed: 98 additions & 64 deletions
@@ -87,24 +87,34 @@ You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoin
87
87
88
88
## Property Graph Targets
89
89
90
-
Property graph is a graph data model where both nodes and relationships can have properties.
90
+
A property graph is a widely adopted model for knowledge graphs, in which both nodes and relationships can have properties.
91
+
[Graph database concepts](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/) has a good introduction to basic concepts of property graphs.
92
+
93
+
The sections below use the following concepts:
* [Node label](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-labels), which represents a type of node.
96
+
* [Relationship](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-relationship), which describes a connection between two nodes.
* [Properties](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-properties), which are key-value pairs associated with nodes and relationships.
91
99
92
100
### Data Mapping
93
101
94
-
In CocoIndex, you can export data to property graph databases.
95
-
This usually involves more than one collectors, and you export them to different types of graph elements (nodes and relationships).
96
-
In particular,
102
+
Data from collectors is mapped to graph elements of different types:
103
+
104
+
1. Rows from collectors → Nodes in the graph
105
+
2. Rows from collectors → Relationships in the graph (including source and target nodes of the relationship)
97
106
98
-
1. You can export rows from some collectors to nodes in the graph.
99
-
2. You can export rows from some other collectors to relationships in the graph.
100
-
3. Some nodes referenced by relationships exported in 2 may not exist as nodes exported in 1.
101
-
CocoIndex will automatically create and keep these nodes, as long as they're still referenced by at least one relationship.
102
-
This guarantees that all relationships exported in 2 are valid.
107
+
This is what you need to provide to define these mappings:
103
108
104
-
We provide common types `NodeMapping`, `RelationshipMapping`, and `ReferencedNode`, to configure for each situation.
105
-
They're agnostic to specific graph databases.
109
+
* Specify [nodes to export](#nodes-to-export).
110
+
* [Declare extra node labels](#declare-extra-node-labels), for labels that appear as source/target nodes of relationships but are not exported as nodes.
111
+
* Specify [relationships to export](#relationships-to-export).
106
112
107
-
#### Nodes
113
+
In addition, the same node may appear multiple times, from exported nodes and various relationships.
114
+
They should appear as the same node in the target graph database.
115
+
CocoIndex automatically [matches and deduplicates nodes](#nodes-matching-and-deduplicating) based on their primary key values.
116
+
117
+
#### Nodes to Export
108
118
109
119
Here's how CocoIndex data elements map to nodes in the graph:
110
120
@@ -114,9 +124,9 @@ Here's how CocoIndex data elements map to nodes in the graph:
114
124
| a collected row | a node |
115
125
| a field | a property of node |
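
As a rough illustration of the table above, the sketch below turns one collected row into a node. The `Node` class, the `row_to_node` helper, and the sample field names are hypothetical stand-ins written for this example. They are not part of the CocoIndex API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a graph node (not a CocoIndex type)."""

    label: str
    properties: dict[str, Any]


def row_to_node(label: str, row: dict[str, Any]) -> Node:
    """Map one collected row to one node: every field becomes a node property."""
    return Node(label=label, properties=dict(row))


# Sample row with made-up field names.
node = row_to_node("Document", {"filename": "chapter1.md", "title": "Overview"})
print(node.label, node.properties)
```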
116
126
117
-
Note that the label used in different `NodeMapping`s should be unique.
127
+
Note that each `Nodes` must use a unique label.
118
128
119
-
`cocoindex.storages.NodeMapping` is to describe mapping to nodes. It has the following fields:
129
+
`cocoindex.storages.Nodes` describes the mapping to nodes. It has the following fields:
If a node label needs to appear as the source or target of a relationship but is not exported as a node, you need to [declare](../core/flow_def#target-declarations) the label with the necessary configuration.
183
+
184
+
The dataclass to describe the declaration is specific to each target storage (e.g. `cocoindex.storages.Neo4jDeclarations`),
185
+
while they share the following common fields:
186
+
187
+
* `nodes_label` (required): The label of the node.
188
+
* Options for [storage indexes](../core/flow_def#storage-indexes).
189
+
    * `primary_key_fields` (required)
190
+
    * `vector_indexes` (optional)
191
+
192
+
We continue with the same example as above.
193
+
Suppose we want to extract relationships from `Document` to `Place` later (i.e. a document mentions a place), but the `Place` label isn't exported as a node. In that case, we need to declare it:
194
+
195
+
```python
flow_builder.declare(
    cocoindex.storages.Neo4jDeclarations(
        connection=...,
        nodes_label="Place",
        primary_key_fields=["name"],
    ),
)
```
204
+
205
+
#### Relationships to Export
171
206
172
207
Here's how CocoIndex data elements map to relationships in the graph:
173
208
@@ -177,12 +212,12 @@ Here's how CocoIndex data elements map to relationships in the graph:
177
212
| a collected row | a relationship |
178
213
| a field | a property of relationship, or a property of source/target node, based on configuration |
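
To make the second row of the table concrete, here is a minimal sketch of how the fields of one collected row could be split between the source node, the target node, and the relationship itself. The `row_to_relationship` helper, its parameters, and the sample field names are hypothetical and are not CocoIndex APIs. The rule that unclaimed fields stay on the relationship is an assumption of this sketch.

```python
from typing import Any


def row_to_relationship(
    row: dict[str, Any],
    rel_type: str,
    source_label: str,
    source_fields: dict[str, str],
    target_label: str,
    target_fields: dict[str, str],
) -> dict[str, Any]:
    """Split one collected row into a source node, a target node and a relationship.

    `source_fields` and `target_fields` map a field name in the row to a
    property name on the node. Assumption of this sketch: fields not claimed
    by either node become properties of the relationship.
    """
    claimed = set(source_fields) | set(target_fields)
    return {
        "type": rel_type,
        "source": {
            "label": source_label,
            "properties": {prop: row[f] for f, prop in source_fields.items()},
        },
        "target": {
            "label": target_label,
            "properties": {prop: row[f] for f, prop in target_fields.items()},
        },
        "properties": {f: v for f, v in row.items() if f not in claimed},
    }


# Sample row with made-up field names and a made-up relationship type.
rel = row_to_relationship(
    {"doc_filename": "chapter1.md", "place_name": "Paris", "confidence": 0.9},
    rel_type="MENTION",
    source_label="Document",
    source_fields={"doc_filename": "filename"},
    target_label="Place",
    target_fields={"place_name": "name"},
)
print(rel)
```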
179
214
180
-
Note that the type used in different `RelationshipMapping`s should be unique.
215
+
Note that each `Relationships` must use a unique type.
181
216
182
-
`cocoindex.storages.RelationshipMapping` is to describe mapping to relationships. It has the following fields:
217
+
`cocoindex.storages.Relationships` describes the mapping to relationships. It has the following fields:
183
218
184
219
* `rel_type` (type: `str`): The type of the relationship.
185
-
* `source`/`target` (type: `cocoindex.storages.NodeReferenceMapping`): Specify how to extract source/target node information from the collected row. It has the following fields:
220
+
* `source`/`target` (type: `cocoindex.storages.NodeFromFields`): Specify how to extract source/target node information from specific fields in the collected row. It has the following fields:
186
221
    * `label` (type: `str`): The label of the node.
187
222
    * `fields` (type: `Sequence[cocoindex.storages.TargetFieldMapping]`): Specify field mappings from the collected rows to node properties, with the following fields:
188
223
        * `source` (type: `str`): The name of the field in the collected row.
The nodes and relationships we got above are discrete elements.
339
+
To fit them into a connected property graph, CocoIndex will match and deduplicate nodes automatically:
286
340
287
-
If a node appears as source or target of a relationship, but not exported using `NodeMapping`, CocoIndex will automatically create and keep these nodes until they're no longer referenced by any relationships.
341
+
* Match nodes based on their primary key values. Nodes with the same primary key values are considered the same node.
342
+
* For non-primary-key fields (a.k.a. value fields), CocoIndex picks the values from an arbitrary appearance of the node.
343
+
If multiple nodes (before deduplication) with the same primary key provide value fields, an arbitrary one will be picked.
288
344
289
-
:::note Merge of node values
345
+
:::note
290
346
291
-
If the same node (as identified by primary key values) appears multiple times (e.g. they're referenced by different relationships),
292
-
CocoIndex uses value fields provided by an arbitrary one of them.
293
347
The best practice is to make the value fields consistent across different appearances of the same node, to avoid non-determinism in the exported graph.
294
348
295
349
:::
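
The matching and deduplication rule can be sketched in a few lines of plain Python. This is an illustration only: `deduplicate_nodes` and the dict-based node representation are hypothetical, and keeping the first appearance that carries value fields is just one possible "arbitrary" choice, not necessarily what CocoIndex does internally.

```python
from typing import Any


def deduplicate_nodes(
    appearances: list[dict[str, Any]],
    primary_key_fields: dict[str, list[str]],
) -> list[dict[str, Any]]:
    """Merge node appearances that share a label and primary key values."""
    merged: dict[tuple, dict[str, Any]] = {}
    for node in appearances:
        pk = primary_key_fields[node["label"]]
        key = (node["label"], tuple(node["properties"][f] for f in pk))
        kept = merged.get(key)
        # Keep the first appearance, unless it carries only key fields and
        # a later appearance may provide value fields.
        if kept is None or len(kept["properties"]) == len(pk):
            merged[key] = node
    return list(merged.values())


# Sample appearances with made-up value fields. The two "Paris" entries with
# different `country` values show why inconsistent value fields are risky.
appearances = [
    {"label": "Place", "properties": {"name": "Paris"}},
    {"label": "Place", "properties": {"name": "Paris", "country": "France"}},
    {"label": "Place", "properties": {"name": "Paris", "country": "FR"}},
    {"label": "Document", "properties": {"filename": "chapter1.md"}},
]
nodes = deduplicate_nodes(
    appearances, {"Place": ["name"], "Document": ["filename"]}
)
print(nodes)
```

Which `country` value survives depends entirely on the order of appearances in this sketch, which is the kind of non-determinism the note above recommends avoiding.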
296
350
297
-
If a node's label specified in `NodeReferenceMapping` doesn't exist in any `NodeMapping`, you need to [declare](../core/flow_def#target-declarations) a `ReferencedNode` to configure [storage indexes](../core/flow_def#storage-indexes) for nodes with this label.
298
-
The following options are supported:
299
-
300
-
* `primary_key_fields` (required)
301
-
* `vector_indexes` (optional)
302
-
303
-
Using the same example above.
304
-
After combining exported nodes and relationships, we get the knowledge graph with all information:
351
+
After matching and deduplication, we get the final graph:
Nodes with `Place` label in the example aren't exported explicitly using `NodeMapping`, so CocoIndex will automatically create them as long as they're still referenced by any relationship.
@@ -388,6 +419,9 @@ The `Neo4j` storage exports each row as a relationship to Neo4j Knowledge Graph.
388
419
Neo4j also provides a declaration spec `Neo4jDeclaration`, to configure indexing options for nodes only referenced by relationships. It has the following fields:
389
420
390
421
* `connection` (type: auth reference to `Neo4jConnectionSpec`)