Commit 6fb7e70

docs(kg): update graph docs to clarify and reflect latest change
1 parent 2d74302 commit 6fb7e70

1 file changed

+88
-60
lines changed

docs/docs/ops/storages.md

Lines changed: 88 additions & 60 deletions
@@ -95,16 +95,20 @@ In CocoIndex, you can export data to property graph databases.
9595
This usually involves more than one collector, and you export them to different types of graph elements (nodes and relationships).
9696
In particular,
9797

98-
1. You can export rows from some collectors to nodes in the graph.
99-
2. You can export rows from some other collectors to relationships in the graph.
100-
3. Some nodes referenced by relationships exported in 2 may not exist as nodes exported in 1.
101-
CocoIndex will automatically create and keep these nodes, as long as they're still referenced by at least one relationship.
102-
This guarantees that all relationships exported in 2 are valid.
98+
1. You can export rows from some collectors to nodes in the graph. Each row is mapped to a node.
99+
2. You can export rows from some other collectors to relationships in the graph. Each row is mapped to a relationship, together with the source and target nodes of the relationship.
100+
3. Since relationships exported in 2 may share the same nodes (with nodes exported in 1, and other relationships exported in 2),
101+
CocoIndex will match and deduplicate these nodes, using their primary key as each node's identifier.
103102

104-
We provide common types `NodeMapping`, `RelationshipMapping`, and `ReferencedNode`, to configure for each situation.
105-
They're agnostic to specific graph databases.
103+
This is what you need to provide to export data to a property graph database:
106104

107-
#### Nodes
105+
* Provide `Nodes` to export nodes with each label (part 1).
106+
* For labels that will appear as source/target nodes of relationships (part 2) without being exported as nodes (part 1), you need to declare these labels to provide label-level configuration like primary keys.
107+
* Provide `Relationships` to export relationships with each type (part 2).
108+
109+
CocoIndex takes care of matching and deduplicating nodes, without additional configuration.
110+
111+
#### Nodes to Export
108112

109113
Here's how CocoIndex data elements map to nodes in the graph:
110114

@@ -114,9 +118,9 @@ Here's how CocoIndex data elements map to nodes in the graph:
114118
| a collected row | a node |
115119
| a field | a property of node |
116120

117-
Note that the label used in different `NodeMapping`s should be unique.
121+
Note that each `Nodes` spec must use a distinct label.
118122

119-
`cocoindex.storages.NodeMapping` is to describe mapping to nodes. It has the following fields:
123+
`cocoindex.storages.Nodes` describes the mapping to nodes. It has the following fields:
120124

121125
* `label` (type: `str`): The label of the node.
122126

@@ -138,7 +142,7 @@ document_collector.export(
138142
...
139143
cocoindex.storages.Neo4j(
140144
...
141-
mapping=cocoindex.storages.NodeMapping(label="Document"),
145+
mapping=cocoindex.storages.Nodes(label="Document"),
142146
),
143147
primary_key_fields=["filename"],
144148
)
@@ -167,7 +171,32 @@ graph TD
167171
classDef node font-size:8pt,text-align:left,stroke-width:2;
168172
```
169173

170-
#### Relationships
174+
#### Nodes to Declare
175+
176+
If a node label needs to appear as the source or target of a relationship, but is not exported as a node, you need to [declare](../core/flow_def#target-declarations) the label with the necessary configuration.
177+
178+
The dataclass to describe the declaration is specific to each target storage (e.g. `cocoindex.storages.Neo4jDeclarations`),
179+
while they share the following common fields:
180+
181+
* `nodes_label` (required): The label of the node.
182+
* Options for [storage indexes](../core/flow_def#storage-indexes).
183+
* `primary_key_fields` (required)
184+
* `vector_indexes` (optional)
185+
186+
Continuing the example above:
187+
Suppose we want to extract relationships from `Document` to `Place` later (i.e. a document mentions a place). Since the `Place` label isn't exported as a node, we need to declare it:
188+
189+
```python
190+
flow_builder.declare(
191+
cocoindex.storages.Neo4jDeclarations(
192+
connection = ...,
193+
nodes_label="Place",
194+
primary_key_fields=["name"],
195+
),
196+
)
197+
```
198+
199+
#### Relationships to Export
171200

172201
Here's how CocoIndex data elements map to relationships in the graph:
173202

@@ -177,12 +206,12 @@ Here's how CocoIndex data elements map to relationships in the graph:
177206
| a collected row | a relationship |
178207
| a field | a property of relationship, or a property of source/target node, based on configuration |
179208

180-
Note that the type used in different `RelationshipMapping`s should be unique.
209+
Note that each `Relationships` spec must use a distinct type.
181210

182-
`cocoindex.storages.RelationshipMapping` is to describe mapping to relationships. It has the following fields:
211+
`cocoindex.storages.Relationships` describes the mapping to relationships. It has the following fields:
183212

184213
* `rel_type` (type: `str`): The type of the relationship.
185-
* `source`/`target` (type: `cocoindex.storages.NodeReferenceMapping`): Specify how to extract source/target node information from the collected row. It has the following fields:
214+
* `source`/`target` (type: `cocoindex.storages.NodeFromFields`): Specify how to extract source/target node information from specific fields in the collected row. It has the following fields:
186215
* `label` (type: `str`): The label of the node.
187216
* `fields` (type: `Sequence[cocoindex.storages.TargetFieldMapping]`): Specify field mappings from the collected rows to node properties, with the following fields:
188217
* `source` (type: `str`): The name of the field in the collected row.
@@ -218,13 +247,13 @@ doc_place_collector.export(
218247
...
219248
cocoindex.storages.Neo4j(
220249
...
221-
mapping=cocoindex.storages.RelationshipMapping(
250+
mapping=cocoindex.storages.Relationships(
222251
rel_type="MENTION",
223-
source=cocoindex.storages.NodeReferenceMapping(
252+
source=cocoindex.storages.NodeFromFields(
224253
label="Document",
225254
fields=[cocoindex.storages.TargetFieldMapping(source="doc_filename", target="filename")],
226255
),
227-
target=cocoindex.storages.NodeReferenceMapping(
256+
target=cocoindex.storages.NodeFromFields(
228257
label="Place",
229258
fields=[
230259
cocoindex.storages.TargetFieldMapping(source="place_name", target="name"),
@@ -250,58 +279,70 @@ graph TD
250279
classDef: nodeRef
251280
}
252281
253-
Doc_Chapter2@{
282+
Doc_Chapter2_a@{
254283
shape: rounded
255284
label: "**[Document]**
256285
**filename\\*: chapter2.md**"
257286
classDef: nodeRef
258287
}
259288
260-
Place_CrystalPalace@{
289+
Doc_Chapter2_b@{
290+
shape: rounded
291+
label: "**[Document]**
292+
**filename\\*: chapter2.md**"
293+
classDef: nodeRef
294+
}
295+
296+
Place_CrystalPalace_a@{
261297
shape: rounded
262298
label: "**[Place]**
263299
**name\\*: Crystal Palace**
264300
embedding: [0.1, 0.5, ...]"
265-
classDef: nodeRef
301+
classDef: node
266302
}
267303
268304
Place_MagicForest@{
269305
shape: rounded
270306
label: "**[Place]**
271307
**name\\*: Magic Forest**
272308
embedding: [0.4, 0.2, ...]"
273-
classDef: nodeRef
309+
classDef: node
310+
}
311+
312+
Place_CrystalPalace_b@{
313+
shape: rounded
314+
label: "**[Place]**
315+
**name\\*: Crystal Palace**
316+
embedding: [0.1, 0.5, ...]"
317+
classDef: node
274318
}
275319
276-
Doc_Chapter1:::nodeRef -- **[MENTION]**{location:12} --> Place_CrystalPalace:::nodeRef
277-
Doc_Chapter2:::nodeRef -- **[MENTION]**{location:23} --> Place_MagicForest:::nodeRef
278-
Doc_Chapter2:::nodeRef -- **[MENTION]**{location:56} --> Place_CrystalPalace:::nodeRef
320+
321+
Doc_Chapter1:::nodeRef -- **[MENTION]**{location:12} --> Place_CrystalPalace_a:::node
322+
Doc_Chapter2_a:::nodeRef -- **[MENTION]**{location:23} --> Place_MagicForest:::node
323+
Doc_Chapter2_b:::nodeRef -- **[MENTION]**{location:56} --> Place_CrystalPalace_b:::node
279324
280325
classDef nodeRef font-size:8pt,text-align:left,fill:transparent,stroke-width:1,stroke-dasharray:5 5;
326+
classDef node font-size:8pt,text-align:left,stroke-width:2;
281327
282328
```
283329

330+
#### Nodes Matching and Deduplicating
284331

285-
#### Nodes only referenced by relationships
332+
The nodes and relationships we got above are discrete elements.
333+
To fit them into a connected property graph, CocoIndex will match and deduplicate nodes:
286334

287-
If a node appears as source or target of a relationship, but not exported using `NodeMapping`, CocoIndex will automatically create and keep these nodes until they're no longer referenced by any relationships.
335+
* Match nodes based on their primary key values. Nodes with the same primary key values are considered as the same node.
336+
* For non-primary-key fields (a.k.a. value fields), CocoIndex will pick the values from an arbitrary one of the matched nodes.
337+
If multiple nodes (before deduplication) with the same primary key provide value fields, an arbitrary one will be picked.
288338

289-
:::note Merge of node values
339+
:::note
290340

291-
If the same node (as identified by primary key values) appears multiple times (e.g. they're referenced by different relationships),
292-
CocoIndex uses value fields provided by an arbitrary one of them.
293341
The best practice is to make the value fields consistent across different appearances of the same node, to avoid non-determinism in the exported graph.
294342

295343
:::
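
Editor's illustration (not part of the commit): the matching rule described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not CocoIndex's implementation. The function name `dedup_nodes`, the dict-based node shape, and the "first occurrence wins" tie-break are all hypothetical; the docs only say an arbitrary occurrence is used for value fields.

```python
from typing import Any


def dedup_nodes(
    nodes: list[dict[str, Any]],
    primary_key_fields: dict[str, list[str]],
) -> list[dict[str, Any]]:
    """Collapse nodes that share a label and primary key values.

    Hypothetical helper for illustration only. Each node is a dict with
    a "label" and a "props" mapping.
    """
    merged: dict[tuple, dict[str, Any]] = {}
    for node in nodes:
        label = node["label"]
        key_values = tuple(node["props"][f] for f in primary_key_fields[label])
        # Assumption: keep the first occurrence. The documented behavior is
        # only that an arbitrary occurrence supplies the value fields.
        merged.setdefault((label, key_values), node)
    return list(merged.values())


# Source/target nodes of the three MENTION relationships in the example.
discrete_nodes = [
    {"label": "Document", "props": {"filename": "chapter1.md"}},
    {"label": "Document", "props": {"filename": "chapter2.md"}},
    {"label": "Document", "props": {"filename": "chapter2.md"}},
    {"label": "Place", "props": {"name": "Crystal Palace", "embedding": [0.1, 0.5]}},
    {"label": "Place", "props": {"name": "Magic Forest", "embedding": [0.4, 0.2]}},
    {"label": "Place", "props": {"name": "Crystal Palace", "embedding": [0.1, 0.5]}},
]

graph_nodes = dedup_nodes(
    discrete_nodes,
    primary_key_fields={"Document": ["filename"], "Place": ["name"]},
)
```

Because the value fields here are identical across duplicates, the result does not depend on which occurrence is kept, which is the consistency the note above recommends.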
296344

297-
If a node's label specified in `NodeReferenceMapping` doesn't exist in any `NodeMapping`, you need to [declare](../core/flow_def#target-declarations) a `ReferencedNode` to configure [storage indexes](../core/flow_def#storage-indexes) for nodes with this label.
298-
The following options are supported:
299-
300-
* `primary_key_fields` (required)
301-
* `vector_indexes` (optional)
302-
303-
Using the same example above.
304-
After combining exported nodes and relationships, we get the knowledge graph with all information:
345+
After matching and deduplication, we get the final graph:
305346

306347
```mermaid
307348
graph TD
@@ -326,38 +367,22 @@ graph TD
326367
label: "**[Place]**
327368
**name\\*: Crystal Palace**
328369
embedding: [0.1, 0.5, ...]"
329-
classDef: nodeRef
370+
classDef: node
330371
}
331372
332373
Place_MagicForest@{
333374
shape: rounded
334375
label: "**[Place]**
335376
**name\\*: Magic Forest**
336377
embedding: [0.4, 0.2, ...]"
337-
classDef: nodeRef
378+
classDef: node
338379
}
339380
340-
Doc_Chapter1:::node -- **[MENTION]**{location:12} --> Place_CrystalPalace:::nodeRef
341-
Doc_Chapter2:::node -- **[MENTION]**{location:23} --> Place_MagicForest:::nodeRef
342-
Doc_Chapter2:::node -- **[MENTION]**{location:56} --> Place_CrystalPalace:::nodeRef
381+
Doc_Chapter1:::node -- **[MENTION]**{location:12} --> Place_CrystalPalace:::node
382+
Doc_Chapter2:::node -- **[MENTION]**{location:23} --> Place_MagicForest:::node
383+
Doc_Chapter2:::node -- **[MENTION]**{location:56} --> Place_CrystalPalace:::node
343384
344385
classDef node font-size:8pt,text-align:left,stroke-width:2;
345-
classDef nodeRef font-size:8pt,text-align:left,fill:transparent,stroke-width:1,stroke-dasharray:5 5;
346-
347-
```
348-
349-
Nodes with `Place` label in the example aren't exported explicitly using `NodeMapping`, so CocoIndex will automatically create them as long as they're still referenced by any relationship.
350-
You need to declare a `ReferencedNode`:
351-
352-
```python
353-
flow_builder.declare(
354-
cocoindex.storages.Neo4jDeclarations(
355-
...
356-
referenced_nodes=[
357-
cocoindex.storages.ReferencedNode(label="Place", primary_key_fields=["name"]),
358-
],
359-
),
360-
)
361386
```
362387

363388
### Neo4j
@@ -388,6 +413,9 @@ The `Neo4j` storage exports each row as a relationship to Neo4j Knowledge Graph.
388413
Neo4j also provides a declaration spec `Neo4jDeclarations`, to configure indexing options for nodes only referenced by relationships. It has the following fields:
389414

390415
* `connection` (type: auth reference to `Neo4jConnectionSpec`)
391-
* `relationships` (type: `Sequence[ReferencedNode]`)
416+
* Fields for [nodes to declare](#nodes-to-declare), including
417+
* `nodes_label` (required)
418+
* `primary_key_fields` (required)
419+
* `vector_indexes` (optional)
392420

393421
You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/docs_to_knowledge_graph).
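
Editor's illustration (not part of the commit): the row-to-relationship mapping described under "Relationships to Export" can be sketched in plain Python. This is a rough sketch under stated assumptions, not the library's code. `FieldMapping` and `NodeRef` are hypothetical stand-ins for `TargetFieldMapping` and `NodeFromFields`. The `place_embedding` field name, the default of reusing the source name when no target is given, and the rule that unmapped fields become relationship properties are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class FieldMapping:
    """Hypothetical stand-in for TargetFieldMapping."""
    source: str
    target: Optional[str] = None  # assumption: defaults to the source name


@dataclass
class NodeRef:
    """Hypothetical stand-in for NodeFromFields."""
    label: str
    fields: Sequence[FieldMapping]


def split_row(
    row: dict[str, Any], rel_type: str, source: NodeRef, target: NodeRef
) -> dict[str, Any]:
    """Split one collected row into a relationship plus its two end nodes."""
    used: set[str] = set()

    def build(ref: NodeRef) -> dict[str, Any]:
        props = {}
        for m in ref.fields:
            props[m.target or m.source] = row[m.source]
            used.add(m.source)
        return {"label": ref.label, "props": props}

    src, tgt = build(source), build(target)
    # Assumption: fields not mapped to either node stay on the relationship.
    rel_props = {k: v for k, v in row.items() if k not in used}
    return {"type": rel_type, "source": src, "target": tgt, "props": rel_props}


row = {
    "doc_filename": "chapter1.md",
    "place_name": "Crystal Palace",
    "place_embedding": [0.1, 0.5],  # hypothetical field name
    "location": 12,
}
rel = split_row(
    row,
    rel_type="MENTION",
    source=NodeRef("Document", [FieldMapping("doc_filename", "filename")]),
    target=NodeRef(
        "Place",
        [
            FieldMapping("place_name", "name"),
            FieldMapping("place_embedding", "embedding"),
        ],
    ),
)
```

Each row yields one relationship together with its source and target nodes; those nodes are discrete until matched by primary key.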
