Commit d107a00

feat(kg): make the way to map data to KG more clear (#409)
* feat(kg): rename spec type names
* docs(kg): update graph docs to clarify and reflect latest change
* docs(kg): update to make it simpler
1 parent a456d0b commit d107a00

File tree

5 files changed

+166
-133
lines changed

docs/docs/ops/storages.md

Lines changed: 98 additions & 64 deletions
Original file line numberDiff line numberDiff line change
@@ -87,24 +87,34 @@ You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoin
8787

8888
## Property Graph Targets
8989

90-
Property graph is a graph data model where both nodes and relationships can have properties.
90+
A property graph is a widely adopted model for knowledge graphs, in which both nodes and relationships can have properties.
91+
[Graph database concepts](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/) has a good introduction to basic concepts of property graphs.
92+
93+
The sections below use the following concepts:
94+
* [Node](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-node)
95+
* [Node label](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-labels), which represents a type of node.
96+
* [Relationship](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-relationship), which describes a connection between two nodes.
97+
* [Relationship type](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-relationship-type)
98+
* [Properties](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-properties), which are key-value pairs associated with nodes and relationships.
9199

92100
### Data Mapping
93101

94-
In CocoIndex, you can export data to property graph databases.
95-
This usually involves more than one collectors, and you export them to different types of graph elements (nodes and relationships).
96-
In particular,
102+
Data from collectors is mapped to two types of graph elements:
103+
104+
1. Rows from collectors → Nodes in the graph
105+
2. Rows from collectors → Relationships in the graph (including source and target nodes of the relationship)
97106

98-
1. You can export rows from some collectors to nodes in the graph.
99-
2. You can export rows from some other collectors to relationships in the graph.
100-
3. Some nodes referenced by relationships exported in 2 may not exist as nodes exported in 1.
101-
CocoIndex will automatically create and keep these nodes, as long as they're still referenced by at least one relationship.
102-
This guarantees that all relationships exported in 2 are valid.
107+
This is what you need to provide to define these mappings:
103108

104-
We provide common types `NodeMapping`, `RelationshipMapping`, and `ReferencedNode`, to configure for each situation.
105-
They're agnostic to specific graph databases.
109+
* Specify [nodes to export](#nodes-to-export).
110+
* [Declare extra node labels](#declare-extra-node-labels), for labels that appear as source/target nodes of relationships but are not exported as nodes.
111+
* Specify [relationships to export](#relationships-to-export).
106112

107-
#### Nodes
113+
In addition, the same node may appear multiple times, from exported nodes and various relationships.
114+
They should appear as the same node in the target graph database.
115+
CocoIndex automatically [matches and deduplicates nodes](#nodes-matching-and-deduplicating) based on their primary key values.
116+
117+
#### Nodes to Export
108118

109119
Here's how CocoIndex data elements map to nodes in the graph:
110120

@@ -114,9 +124,9 @@ Here's how CocoIndex data elements map to nodes in the graph:
114124
| a collected row | a node |
115125
| a field | a property of node |
116126

117-
Note that the label used in different `NodeMapping`s should be unique.
127+
Note that the label used in each `Nodes` spec should be unique.
118128

119-
`cocoindex.storages.NodeMapping` is to describe mapping to nodes. It has the following fields:
129+
`cocoindex.storages.Nodes` describes the mapping to nodes. It has the following fields:
120130

121131
* `label` (type: `str`): The label of the node.
122132

@@ -138,7 +148,7 @@ document_collector.export(
138148
...
139149
cocoindex.storages.Neo4j(
140150
...
141-
mapping=cocoindex.storages.NodeMapping(label="Document"),
151+
mapping=cocoindex.storages.Nodes(label="Document"),
142152
),
143153
primary_key_fields=["filename"],
144154
)
@@ -167,7 +177,32 @@ graph TD
167177
classDef node font-size:8pt,text-align:left,stroke-width:2;
168178
```
169179

170-
#### Relationships
180+
#### Declare Extra Node Labels
181+
182+
If a node label needs to appear as the source or target of a relationship, but is not exported as a node, you need to [declare](../core/flow_def#target-declarations) the label with the necessary configuration.
183+
184+
The dataclass that describes the declaration is specific to each target storage (e.g. `cocoindex.storages.Neo4jDeclaration`),
185+
while they share the following common fields:
186+
187+
* `nodes_label` (required): The label of the node.
188+
* Options for [storage indexes](../core/flow_def#storage-indexes).
189+
* `primary_key_fields` (required)
190+
* `vector_indexes` (optional)
191+
192+
Continuing the example above:
193+
Suppose we want to extract relationships from `Document` to `Place` later (i.e. a document mentions a place), but the `Place` label isn't exported as a node. In that case we need to declare it:
194+
195+
```python
196+
flow_builder.declare(
197+
cocoindex.storages.Neo4jDeclaration(
198+
connection=...,
199+
nodes_label="Place",
200+
primary_key_fields=["name"],
201+
),
202+
)
203+
```
204+
205+
#### Relationships to Export
171206

172207
Here's how CocoIndex data elements map to relationships in the graph:
173208

@@ -177,12 +212,12 @@ Here's how CocoIndex data elements map to relationships in the graph:
177212
| a collected row | a relationship |
178213
| a field | a property of relationship, or a property of source/target node, based on configuration |
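
The row-to-relationship mapping in the table above can be sketched in plain Python. This is only an illustration, not CocoIndex's implementation: `FieldMapping` is a hypothetical stand-in for `cocoindex.storages.TargetFieldMapping`, `split_row` is a made-up helper, and the rule that unmapped fields become relationship properties is an assumption drawn from the `location` property in the example diagrams.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FieldMapping:
    """Hypothetical stand-in for cocoindex.storages.TargetFieldMapping."""
    source: str                   # field name in the collected row
    target: Optional[str] = None  # node property name; assumed to default to `source`


def split_row(
    row: dict[str, Any],
    source_fields: list[FieldMapping],
    target_fields: list[FieldMapping],
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
    """Split one collected row into source-node, target-node and relationship properties."""

    def pick(mappings: list[FieldMapping]) -> dict[str, Any]:
        return {m.target or m.source: row[m.source] for m in mappings}

    # Assumption: fields not consumed by either node become relationship properties.
    used = {m.source for m in source_fields + target_fields}
    rel_props = {k: v for k, v in row.items() if k not in used}
    return pick(source_fields), pick(target_fields), rel_props


row = {"doc_filename": "chapter1.md", "place_name": "Crystal Palace", "location": 12}
src, tgt, rel = split_row(
    row,
    [FieldMapping(source="doc_filename", target="filename")],
    [FieldMapping(source="place_name", target="name")],
)
```

Here `src` holds the `Document` node's properties, `tgt` the `Place` node's properties, and `rel` whatever is left for the `MENTION` relationship.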
179214

180-
Note that the type used in different `RelationshipMapping`s should be unique.
215+
Note that the type used in each `Relationships` spec should be unique.
181216

182-
`cocoindex.storages.RelationshipMapping` is to describe mapping to relationships. It has the following fields:
217+
`cocoindex.storages.Relationships` describes the mapping to relationships. It has the following fields:
183218

184219
* `rel_type` (type: `str`): The type of the relationship.
185-
* `source`/`target` (type: `cocoindex.storages.NodeReferenceMapping`): Specify how to extract source/target node information from the collected row. It has the following fields:
220+
* `source`/`target` (type: `cocoindex.storages.NodeFromFields`): Specify how to extract source/target node information from specific fields in the collected row. It has the following fields:
186221
* `label` (type: `str`): The label of the node.
187222
* `fields` (type: `Sequence[cocoindex.storages.TargetFieldMapping]`): Specify field mappings from the collected rows to node properties, with the following fields:
188223
* `source` (type: `str`): The name of the field in the collected row.
@@ -218,13 +253,13 @@ doc_place_collector.export(
218253
...
219254
cocoindex.storages.Neo4j(
220255
...
221-
mapping=cocoindex.storages.RelationshipMapping(
256+
mapping=cocoindex.storages.Relationships(
222257
rel_type="MENTION",
223-
source=cocoindex.storages.NodeReferenceMapping(
258+
source=cocoindex.storages.NodeFromFields(
224259
label="Document",
225260
fields=[cocoindex.storages.TargetFieldMapping(source="doc_filename", target="filename")],
226261
),
227-
target=cocoindex.storages.NodeReferenceMapping(
262+
target=cocoindex.storages.NodeFromFields(
228263
label="Place",
229264
fields=[
230265
cocoindex.storages.TargetFieldMapping(source="place_name", target="name"),
@@ -250,58 +285,70 @@ graph TD
250285
classDef: nodeRef
251286
}
252287
253-
Doc_Chapter2@{
288+
Doc_Chapter2_a@{
254289
shape: rounded
255290
label: "**[Document]**
256291
**filename\\*: chapter2.md**"
257292
classDef: nodeRef
258293
}
259294
260-
Place_CrystalPalace@{
295+
Doc_Chapter2_b@{
296+
shape: rounded
297+
label: "**[Document]**
298+
**filename\\*: chapter2.md**"
299+
classDef: nodeRef
300+
}
301+
302+
Place_CrystalPalace_a@{
261303
shape: rounded
262304
label: "**[Place]**
263305
**name\\*: Crystal Palace**
264306
embedding: [0.1, 0.5, ...]"
265-
classDef: nodeRef
307+
classDef: node
266308
}
267309
268310
Place_MagicForest@{
269311
shape: rounded
270312
label: "**[Place]**
271313
**name\\*: Magic Forest**
272314
embedding: [0.4, 0.2, ...]"
273-
classDef: nodeRef
315+
classDef: node
316+
}
317+
318+
Place_CrystalPalace_b@{
319+
shape: rounded
320+
label: "**[Place]**
321+
**name\\*: Crystal Palace**
322+
embedding: [0.1, 0.5, ...]"
323+
classDef: node
274324
}
275325
276-
Doc_Chapter1:::nodeRef -- **[MENTION]**{location:12} --> Place_CrystalPalace:::nodeRef
277-
Doc_Chapter2:::nodeRef -- **[MENTION]**{location:23} --> Place_MagicForest:::nodeRef
278-
Doc_Chapter2:::nodeRef -- **[MENTION]**{location:56} --> Place_CrystalPalace:::nodeRef
326+
327+
Doc_Chapter1:::nodeRef -- **:MENTION** (location:12) --> Place_CrystalPalace_a:::node
328+
Doc_Chapter2_a:::nodeRef -- **:MENTION** (location:23) --> Place_MagicForest:::node
329+
Doc_Chapter2_b:::nodeRef -- **:MENTION** (location:56) --> Place_CrystalPalace_b:::node
279330
280331
classDef nodeRef font-size:8pt,text-align:left,fill:transparent,stroke-width:1,stroke-dasharray:5 5;
332+
classDef node font-size:8pt,text-align:left,stroke-width:2;
281333
282334
```
283335

336+
#### Nodes Matching and Deduplicating
284337

285-
#### Nodes only referenced by relationships
338+
The nodes and relationships we got above are discrete elements.
339+
To fit them into a connected property graph, CocoIndex will match and deduplicate nodes automatically:
286340

287-
If a node appears as source or target of a relationship, but not exported using `NodeMapping`, CocoIndex will automatically create and keep these nodes until they're no longer referenced by any relationships.
341+
* Match nodes based on their primary key values. Nodes with the same primary key values are considered the same node.
342+
* For non-primary-key fields (a.k.a. value fields), CocoIndex picks the values from an arbitrary one of the matched nodes.
343+
If multiple nodes (before deduplication) with the same primary key provide value fields, an arbitrary one will be picked.
288344

289-
:::note Merge of node values
345+
:::note
290346

291-
If the same node (as identified by primary key values) appears multiple times (e.g. they're referenced by different relationships),
292-
CocoIndex uses value fields provided by an arbitrary one of them.
293347
The best practice is to make the value fields consistent across different appearances of the same node, to avoid non-determinism in the exported graph.
294348

295349
:::
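
The matching rule described above can be sketched as a dictionary keyed by label and primary key values. This is a rough illustration under stated assumptions, not CocoIndex's actual logic: `deduplicate_nodes` and the node dictionaries are hypothetical, and keeping the first appearance is just this sketch's way of picking "an arbitrary one".

```python
from typing import Any


def deduplicate_nodes(
    nodes: list[dict[str, Any]],
    primary_key_fields: dict[str, list[str]],
) -> list[dict[str, Any]]:
    """Collapse node appearances that share a label and primary key values."""
    merged: dict[tuple, dict[str, Any]] = {}
    for node in nodes:
        label = node["label"]
        key = (label, tuple(node["props"][f] for f in primary_key_fields[label]))
        # Keep the first appearance seen; which appearance wins is arbitrary,
        # so value fields should be consistent across appearances.
        merged.setdefault(key, node)
    return list(merged.values())


# Three appearances of `Place` nodes, as produced by the three MENTION relationships.
appearances = [
    {"label": "Place", "props": {"name": "Crystal Palace", "embedding": [0.1, 0.5]}},
    {"label": "Place", "props": {"name": "Magic Forest", "embedding": [0.4, 0.2]}},
    {"label": "Place", "props": {"name": "Crystal Palace", "embedding": [0.1, 0.5]}},
]
places = deduplicate_nodes(appearances, {"Place": ["name"]})
```

After deduplication, `places` contains one node per distinct `name`.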
296350

297-
If a node's label specified in `NodeReferenceMapping` doesn't exist in any `NodeMapping`, you need to [declare](../core/flow_def#target-declarations) a `ReferencedNode` to configure [storage indexes](../core/flow_def#storage-indexes) for nodes with this label.
298-
The following options are supported:
299-
300-
* `primary_key_fields` (required)
301-
* `vector_indexes` (optional)
302-
303-
Using the same example above.
304-
After combining exported nodes and relationships, we get the knowledge graph with all information:
351+
After matching and deduplication, we get the final graph:
305352

306353
```mermaid
307354
graph TD
@@ -326,38 +373,22 @@ graph TD
326373
label: "**[Place]**
327374
**name\\*: Crystal Palace**
328375
embedding: [0.1, 0.5, ...]"
329-
classDef: nodeRef
376+
classDef: node
330377
}
331378
332379
Place_MagicForest@{
333380
shape: rounded
334381
label: "**[Place]**
335382
**name\\*: Magic Forest**
336383
embedding: [0.4, 0.2, ...]"
337-
classDef: nodeRef
384+
classDef: node
338385
}
339386
340-
Doc_Chapter1:::node -- **[MENTION]**{location:12} --> Place_CrystalPalace:::nodeRef
341-
Doc_Chapter2:::node -- **[MENTION]**{location:23} --> Place_MagicForest:::nodeRef
342-
Doc_Chapter2:::node -- **[MENTION]**{location:56} --> Place_CrystalPalace:::nodeRef
387+
Doc_Chapter1:::node -- **:MENTION** (location:12) --> Place_CrystalPalace:::node
388+
Doc_Chapter2:::node -- **:MENTION** (location:23) --> Place_MagicForest:::node
389+
Doc_Chapter2:::node -- **:MENTION** (location:56) --> Place_CrystalPalace:::node
343390
344391
classDef node font-size:8pt,text-align:left,stroke-width:2;
345-
classDef nodeRef font-size:8pt,text-align:left,fill:transparent,stroke-width:1,stroke-dasharray:5 5;
346-
347-
```
348-
349-
Nodes with `Place` label in the example aren't exported explicitly using `NodeMapping`, so CocoIndex will automatically create them as long as they're still referenced by any relationship.
350-
You need to declare a `ReferencedNode`:
351-
352-
```python
353-
flow_builder.declare(
354-
cocoindex.storages.Neo4jDeclarations(
355-
...
356-
referenced_nodes=[
357-
cocoindex.storages.ReferencedNode(label="Place", primary_key_fields=["name"]),
358-
],
359-
),
360-
)
361392
```
362393

363394
### Neo4j
@@ -388,6 +419,9 @@ The `Neo4j` storage exports each row as a relationship to Neo4j Knowledge Graph.
388419
Neo4j also provides a declaration spec `Neo4jDeclaration`, to configure indexing options for nodes only referenced by relationships. It has the following fields:
389420

390421
* `connection` (type: auth reference to `Neo4jConnectionSpec`)
391-
* `relationships` (type: `Sequence[ReferencedNode]`)
422+
* Fields for [nodes to declare](#declare-extra-node-labels), including
423+
* `nodes_label` (required)
424+
* `primary_key_fields` (required)
425+
* `vector_indexes` (optional)
392426

393427
You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/docs_to_knowledge_graph).

examples/docs_to_knowledge_graph/main.py

Lines changed: 10 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -93,35 +93,31 @@ def docs_to_kg_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.D
9393
"document_node",
9494
cocoindex.storages.Neo4j(
9595
connection=conn_spec,
96-
mapping=cocoindex.storages.NodeMapping(label="Document")),
96+
mapping=cocoindex.storages.Nodes(label="Document")),
9797
primary_key_fields=["filename"],
9898
)
9999
# Declare reference Node to reference entity node in a relationship
100100
flow_builder.declare(
101-
cocoindex.storages.Neo4jDeclarations(
101+
cocoindex.storages.Neo4jDeclaration(
102102
connection=conn_spec,
103-
referenced_nodes=[
104-
cocoindex.storages.ReferencedNode(
105-
label="Entity",
106-
primary_key_fields=["value"],
107-
)
108-
]
103+
nodes_label="Entity",
104+
primary_key_fields=["value"],
109105
)
110106
)
111107
entity_relationship.export(
112108
"entity_relationship",
113109
cocoindex.storages.Neo4j(
114110
connection=conn_spec,
115-
mapping=cocoindex.storages.RelationshipMapping(
111+
mapping=cocoindex.storages.Relationships(
116112
rel_type="RELATIONSHIP",
117-
source=cocoindex.storages.NodeReferenceMapping(
113+
source=cocoindex.storages.NodeFromFields(
118114
label="Entity",
119115
fields=[
120116
cocoindex.storages.TargetFieldMapping(
121117
source="subject", target="value"),
122118
]
123119
),
124-
target=cocoindex.storages.NodeReferenceMapping(
120+
target=cocoindex.storages.NodeFromFields(
125121
label="Entity",
126122
fields=[
127123
cocoindex.storages.TargetFieldMapping(
@@ -136,13 +132,13 @@ def docs_to_kg_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.D
136132
"entity_mention",
137133
cocoindex.storages.Neo4j(
138134
connection=conn_spec,
139-
mapping=cocoindex.storages.RelationshipMapping(
135+
mapping=cocoindex.storages.Relationships(
140136
rel_type="MENTION",
141-
source=cocoindex.storages.NodeReferenceMapping(
137+
source=cocoindex.storages.NodeFromFields(
142138
label="Document",
143139
fields=[cocoindex.storages.TargetFieldMapping("filename")],
144140
),
145-
target=cocoindex.storages.NodeReferenceMapping(
141+
target=cocoindex.storages.NodeFromFields(
146142
label="Entity",
147143
fields=[cocoindex.storages.TargetFieldMapping(
148144
source="entity", target="value")],

python/cocoindex/storages.py

Lines changed: 15 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -37,7 +37,7 @@ class TargetFieldMapping:
3737
target: str | None = None
3838

3939
@dataclass
40-
class NodeReferenceMapping:
40+
class NodeFromFields:
4141
"""Spec for a referenced graph node, usually as part of a relationship."""
4242
label: str
4343
fields: list[TargetFieldMapping]
@@ -50,30 +50,37 @@ class ReferencedNode:
5050
vector_indexes: Sequence[index.VectorIndexDef] = ()
5151

5252
@dataclass
53-
class NodeMapping:
53+
class Nodes:
5454
"""Spec to map a row to a graph node."""
5555
kind = "Node"
5656

5757
label: str
5858

5959
@dataclass
60-
class RelationshipMapping:
60+
class Relationships:
6161
"""Spec to map a row to a graph relationship."""
6262
kind = "Relationship"
6363

6464
rel_type: str
65-
source: NodeReferenceMapping
66-
target: NodeReferenceMapping
65+
source: NodeFromFields
66+
target: NodeFromFields
67+
68+
# For backwards compatibility only
69+
NodeMapping = Nodes
70+
RelationshipMapping = Relationships
71+
NodeReferenceMapping = NodeFromFields
6772

6873
class Neo4j(op.StorageSpec):
6974
"""Graph storage powered by Neo4j."""
7075

7176
connection: AuthEntryReference
72-
mapping: NodeMapping | RelationshipMapping
77+
mapping: Nodes | Relationships
7378

74-
class Neo4jDeclarations(op.DeclarationSpec):
79+
class Neo4jDeclaration(op.DeclarationSpec):
7580
"""Declarations for Neo4j."""
7681

7782
kind = "Neo4j"
7883
connection: AuthEntryReference
79-
referenced_nodes: Sequence[ReferencedNode] = ()
84+
nodes_label: str
85+
primary_key_fields: Sequence[str]
86+
vector_indexes: Sequence[index.VectorIndexDef] = ()
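
The backward-compatibility aliases added in this file work because a plain assignment binds the old name to the same class object. A minimal sketch, using a stand-in `Nodes` dataclass that mirrors only the `label` field shown in the diff (the real class lives in `cocoindex.storages`):

```python
from dataclasses import dataclass


@dataclass
class Nodes:
    """Stand-in for the renamed spec; only the `label` field is mirrored here."""
    label: str


# Old name kept as a plain alias, following the pattern in the diff.
NodeMapping = Nodes

old_style = NodeMapping(label="Document")
new_style = Nodes(label="Document")
```

Code written against the old name keeps constructing the same class, so `old_style` and `new_style` compare equal. As far as this diff shows, aliases exist only for `NodeMapping`, `RelationshipMapping`, and `NodeReferenceMapping`; no alias appears for the old `Neo4jDeclarations` name.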
