
Commit 5700a16

docs(kg): make mapping data to knowledge graph more clear (#378)

1 parent: 9b755ad

File tree: 4 files changed, +211 −78 lines changed

.vscode/settings.json

Lines changed: 7 additions & 0 deletions

@@ -0,0 +1,7 @@
+{
+  "cSpell.words": [
+    "cocoindex",
+    "reindexing",
+    "timedelta"
+  ]
+}

docs/docs/core/flow_def.mdx

Lines changed: 70 additions & 54 deletions
@@ -49,12 +49,55 @@ See [Flow Running](/docs/core/flow_methods) for more details on it.
 </TabItem>
 </Tabs>
 
-## Flow Builder
+## Data Scope
+
+A **data scope** represents data for a certain unit, e.g. the top-level scope (involving all data for a flow), for a document, or for a chunk.
+A data scope has a bunch of fields and collectors, and users can add new fields and collectors to it.
+
+### Get or Add a Field
+
+You can get or add a field of a data scope (which is a data slice).
+
+:::note
 
-The `FlowBuilder` object is the starting point to construct a flow.
+You cannot override an existing field.
+
+:::
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+Getting and setting a field of a data scope is done by the `[]` operator with a field name:
+
+```python
+@cocoindex.flow_def(name="DemoFlow")
+def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
+
+    # Add "documents" to the top-level data scope.
+    data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))
+
+    # Each row of "documents" is a child scope.
+    with data_scope["documents"].row() as document:
+
+        # Get "content" from the document scope, transform it, and add "summary" to the scope.
+        document["summary"] = document["content"].transform(DemoFunctionSpec(...))
+```
+
+</TabItem>
+</Tabs>
+
+### Add a Collector
+
+See [Data Collector](#data-collector) below for more details.
+
+## Data Slice
+
+A **data slice** references a subset of data belonging to a data scope, e.g. a specific field from a data scope.
+A data slice has a certain data type, and it's the input for most operations.
 
 ### Import from source
 
+To get the initial data slice, we need to start by importing data from a source.
 `FlowBuilder` provides an `add_source()` method to import data from external sources.
 A *source spec* needs to be provided for any import operation, to describe the source and parameters related to the source.
 Import must happen at the top level, and the field created by import must be in the top-level struct.
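The "you cannot override an existing field" rule added in this hunk can be modeled with a short, purely illustrative sketch (plain Python, not the actual cocoindex implementation): a data scope behaves like a write-once mapping from field names to data slices.

```python
class WriteOnceScope:
    """Illustrative model of a data scope: fields can be added and read,
    but an existing field can never be overridden."""

    def __init__(self):
        self._fields = {}

    def __getitem__(self, name):
        return self._fields[name]

    def __setitem__(self, name, value):
        if name in self._fields:
            raise ValueError(f"field {name!r} already exists and cannot be overridden")
        self._fields[name] = value


scope = WriteOnceScope()
scope["documents"] = "<data slice>"   # adding a new field is fine
# scope["documents"] = "<other>"      # a second assignment would raise ValueError
```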
@@ -72,10 +115,6 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
 </TabItem>
 </Tabs>
 
-`add_source()` returns a `DataSlice`. Once external data sources are imported, you can further transform them using methods exposed by these data objects, as discussed in the following sections.
-
-We'll describe different data objects in next few sections.
-
 :::note
 
 The actual value of data is not available at the time when we define the flow: it's only available at runtime.
@@ -111,51 +150,6 @@ and only perform transformations on changed source keys.
 
 :::
 
-## Data Scope
-
-A **data scope** represents data for a certain unit, e.g. the top level scope (involving all data for a flow), for a document, or for a chunk.
-A data scope has a bunch of fields and collectors, and users can add new fields and collectors to it.
-
-### Get or Add a Field
-
-You can get or add a field of a data scope (which is a data slice).
-
-:::note
-
-You cannot override an existing field.
-
-:::
-
-<Tabs>
-<TabItem value="python" label="Python" default>
-
-Getting and setting a field of a data scope is done by the `[]` operator with a field name:
-
-```python
-@cocoindex.flow_def(name="DemoFlow")
-def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
-
-    # Add "documents" to the top-level data scope.
-    data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))
-
-    # Each row of "documents" is a child scope.
-    with data_scope["documents"].row() as document:
-
-        # Get "content" from the document scope, transform, and add "summary" to scope.
-        document["summary"] = field1_row["content"].transform(DemoFunctionSpec(...))
-```
-
-</TabItem>
-</Tabs>
-
-### Add a collector
-
-See [Data Collector](#data-collector) below for more details.
-
-## Data Slice
-
-A **data slice** references a subset of data belonging to a data scope, e.g. a specific field from a data scope.
-A data slice has a certain data type, and it's the input for most operations.
 
 ### Transform
@@ -164,7 +158,7 @@ A *function spec* needs to be provided for any transform operation, to describe
 
 The function takes one or multiple data arguments.
 The first argument is the data slice to be transformed, and the `transform()` method is applied from it.
-Other arguments can be passed in as positional arguments or keyword arguments, aftert the function spec.
+Other arguments can be passed in as positional arguments or keyword arguments, after the function spec.
 
 <Tabs>
 <TabItem value="python" label="Python" default>
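The argument-passing convention in the corrected line above ("the first argument is the data slice itself; extra positional and keyword arguments follow the function spec") can be sketched with a hypothetical stand-in for a data slice. `MiniSlice` and `split_text` are illustrative assumptions, not the cocoindex API.

```python
class MiniSlice:
    """Illustrative stand-in for a data slice holding a value."""

    def __init__(self, value):
        self.value = value

    def transform(self, fn, *args, **kwargs):
        # The slice itself is always the first data argument; extra
        # positional and keyword arguments follow the function spec.
        return MiniSlice(fn(self.value, *args, **kwargs))


def split_text(text, chunk_size, overlap=0):
    """Toy chunker: fixed-size windows with optional overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# chunk_size passed positionally, overlap as a keyword argument.
chunks = MiniSlice("abcdef").transform(split_text, 4, overlap=2)
# chunks.value == ["abcd", "cdef", "ef"]
```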
@@ -300,6 +294,29 @@ CocoIndex provides a common way to configure indexes for various storages.
 
 ## Miscellaneous
 
+### Target Declarations
+
+Most of the time, a target storage is created by calling the `export()` method on a collector, and this `export()` call carries the configurations needed for the target storage, e.g. options for storage indexes.
+Occasionally, you may need to specify some configurations for a target storage outside the context of any specific data collector.
+
+For example, for graph database targets like `Neo4j`, you may have a data collector exporting data to Neo4j relationships, which in turn creates nodes referenced by these relationships.
+These nodes don't directly come from any specific data collector (relationships from different data collectors may share the same nodes).
+To specify configurations for these nodes, you can *declare* a spec for the related node labels.
+
+`FlowBuilder` provides a `declare()` method for this purpose, which takes the spec to declare, as provided by various target types.
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+flow_builder.declare(
+    cocoindex.storages.Neo4jDeclarations(...)
+)
+```
+
+</TabItem>
+</Tabs>
+
 ### Auth Registry
 
 CocoIndex manages an auth registry. It's an in-memory key-value store, mainly to store authentication information for a backend.
@@ -310,11 +327,10 @@ Operation spec is the default way to configure a backend. But it has the followi
 * Once an operation is removed after a flow definition code change, the spec is also gone.
   But we still need to be able to drop the backend (e.g. a table) by `cocoindex setup` or `cocoindex drop`.
 
-
 Auth registry is introduced to solve the problems above. It works as follows:
 
 * You can create a new **auth entry** by a key and a value.
-* You can reference the entry by the key, and pass it as part of the spec for certain operations, e.g. `Neo4jRelationship` takes a `connection` field in the form of an auth entry reference.
+* You can reference the entry by the key, and pass it as part of the spec for certain operations, e.g. `Neo4j` takes a `connection` field in the form of an auth entry reference.
 
 <Tabs>
 <TabItem value="python" label="Python" default>
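The auth-registry behavior described in this hunk — entries created by key, and specs holding *references* to keys rather than the credential values themselves — can be sketched with a minimal in-memory model. The class and method names here are illustrative assumptions, not the actual cocoindex API.

```python
class AuthEntryReference:
    """Illustrative reference to an auth entry: carries only the key."""

    def __init__(self, key):
        self.key = key


class AuthRegistry:
    """Illustrative in-memory key-value auth registry."""

    def __init__(self):
        self._entries = {}

    def add_entry(self, key, value):
        # Store the value and hand back a reference, so specs can carry
        # the reference instead of the credentials themselves.
        self._entries[key] = value
        return AuthEntryReference(key)

    def resolve(self, ref):
        return self._entries[ref.key]


registry = AuthRegistry()
conn_ref = registry.add_entry(
    "neo4j_conn", {"uri": "bolt://localhost:7687", "user": "neo4j"}
)
# A target spec would hold `conn_ref` in its `connection` field;
# the backend resolves it to the stored value at runtime.
```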

docs/docs/core/flow_methods.mdx

Lines changed: 1 addition & 1 deletion

@@ -62,7 +62,7 @@ This action has two modes:
 :::info
 
 For both modes, CocoIndex is performing *incremental processing*,
-i.e. we only performs computations and storage mutations on source data that are changed, or the flow has changed.
+i.e. we only perform computations and storage mutations on source data that has changed, or when the flow has changed.
 This is to achieve best efficiency.
 
 :::
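The incremental-processing idea in this hunk — recompute only the source keys whose input changed, reuse cached results for the rest — can be sketched as follows. `incremental_update` is a hypothetical helper for illustration, not a cocoindex function.

```python
def incremental_update(prev_outputs, source, transform):
    """Recompute only keys whose source value changed; reuse the rest.

    prev_outputs maps key -> (source_value, transformed_value).
    Keys removed from the source are dropped from the output.
    """
    outputs = {}
    recomputed = []
    for key, value in source.items():
        cached = prev_outputs.get(key)
        if cached is not None and cached[0] == value:
            outputs[key] = cached  # input unchanged: reuse prior result
        else:
            outputs[key] = (value, transform(value))
            recomputed.append(key)
    return outputs, recomputed


# First run: everything is new, so everything is computed.
v1 = {"a.txt": "hello", "b.txt": "world"}
outputs, recomputed = incremental_update({}, v1, str.upper)

# Second run: only b.txt changed, so only b.txt is recomputed.
v2 = {"a.txt": "hello", "b.txt": "world!"}
outputs, recomputed = incremental_update(outputs, v2, str.upper)
```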

docs/docs/ops/storages.md

Lines changed: 133 additions & 23 deletions

@@ -1,24 +1,61 @@
 ---
 title: Storages
 description: CocoIndex Built-in Storages
+toc_max_heading_level: 4
 ---
 
 # CocoIndex Built-in Storages
 
-## Postgres
+For each target storage, data is exported from a data collector, containing multiple entries, each with multiple fields.
+The way data from a data collector maps to a target storage depends on the data model of the target storage.
+
+## Entry-Oriented Targets
+
+Entry-oriented storage organizes data into independent entries, such as rows, key-value pairs, or documents.
+Each entry is self-contained and does not explicitly link to others.
+There is usually a straightforward mapping from data collector rows to entries.
+
+### Postgres
 
 Exports data to a Postgres database (with the pgvector extension).
 
+#### Data Mapping
+
+Here's how CocoIndex data elements map to Postgres elements during export:
+
+| CocoIndex Element | Postgres Element |
+|-------------------|------------------|
+| an export target  | a unique table   |
+| a collected row   | a row            |
+| a field           | a column         |
+
+For example, if you have a data collector that collects rows with fields `id`, `title`, and `embedding`, it will be exported to a Postgres table with corresponding columns.
+It should be a unique table, meaning that no other export target should export to the same table.
+
+#### Spec
+
 The spec takes the following fields:
 
 * `database_url` (type: `str`, optional): The URL of the Postgres database to use as the internal storage, e.g. `postgres://cocoindex:cocoindex@localhost/cocoindex`. If unspecified, will use the same database as the [internal storage](/docs/core/basics#internal-storage).
 
 * `table_name` (type: `str`, optional): The name of the table to store to. If unspecified, will generate one automatically. We recommend specifying a name explicitly if you want to directly query the table. It can be omitted if you want to use CocoIndex's query handlers to query the table.
 
-## Qdrant
+### Qdrant
 
 Exports data to a [Qdrant](https://qdrant.tech/) collection.
 
+#### Data Mapping
+
+Here's how CocoIndex data elements map to Qdrant elements during export:
+
+| CocoIndex Element | Qdrant Element |
+|-------------------|----------------|
+| an export target  | a unique collection |
+| a collected row   | a point        |
+| a field           | a named vector (for fields with vector type); a field within payload (otherwise) |
+
+#### Spec
+
 The spec takes the following fields:
 
 * `collection_name` (type: `str`, required): The name of the collection to export the data to.
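The row-to-table mapping described in the Postgres section above can be made concrete with a small sketch that derives a pgvector-style `CREATE TABLE` statement from collector field types. `pg_table_ddl` and the `("vector", dim)` type encoding are illustrative assumptions, not cocoindex code; the actual exporter generates its own schema.

```python
def pg_table_ddl(table_name, fields, primary_key):
    """Sketch of the collector-row -> Postgres-table mapping:
    each collected field becomes a column; vector-typed fields
    become pgvector `vector(dim)` columns."""
    cols = []
    for name, dtype in fields.items():
        if isinstance(dtype, tuple) and dtype[0] == "vector":
            cols.append(f"{name} vector({dtype[1]})")  # pgvector column
        else:
            cols.append(f"{name} {dtype}")
    cols.append(f"PRIMARY KEY ({', '.join(primary_key)})")
    return f"CREATE TABLE {table_name} ({', '.join(cols)});"


# A collector with fields `id`, `title`, `embedding` maps to one table.
ddl = pg_table_ddl(
    "doc_embeddings",
    {"id": "text", "title": "text", "embedding": ("vector", 384)},
    primary_key=["id"],
)
```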
@@ -46,9 +83,97 @@ doc_embeddings.export(
 
 You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_qdrant).
 
-## Neo4j
+## Property Graph Targets
+
+Property graph is a graph data model where both nodes and relationships can have properties.
+
+### Data Mapping
+
+In CocoIndex, you can export data to property graph databases.
+This usually involves more than one collector, and you export them to different types of graph elements (nodes and relationships).
+In particular,
+
+1. You can export rows from some collectors to nodes in the graph.
+2. You can export rows from some other collectors to relationships in the graph.
+3. Some nodes referenced by relationships exported in 2 may not exist as nodes exported in 1.
+   CocoIndex will automatically create and keep these nodes, as long as they're still referenced by at least one relationship.
+   This guarantees that all relationships exported in 2 are valid.
+
+We provide common types `NodeMapping`, `RelationshipMapping`, and `ReferencedNode` to configure each situation.
+They're agnostic to specific graph databases.
+
+#### Nodes
+
+Here's how CocoIndex data elements map to nodes in the graph:
+
+| CocoIndex Element | Graph Element |
+|-------------------|---------------|
+| an export target  | nodes with a unique label |
+| a collected row   | a node        |
+| a field           | a property of the node |
+
+Note that the label used in different `NodeMapping`s should be unique.
+
+`cocoindex.storages.NodeMapping` describes the mapping to nodes. It has the following fields:
+
+* `label` (type: `str`): The label of the node.
+
+For example, if you have a data collector that collects rows with fields `id`, `name`, and `gender`, it can be exported to nodes with label `Person` and properties `id`, `name`, and `gender`.
 
-If you don't have a Postgres database, you can start a Postgres SQL database for cocoindex using our docker compose config:
+#### Relationships
+
+Here's how CocoIndex data elements map to relationships in the graph:
+
+| CocoIndex Element | Graph Element |
+|-------------------|---------------|
+| an export target  | relationships with a unique type |
+| a collected row   | a relationship |
+| a field           | a property of the relationship, or a property of the source/target node, based on configuration |
+
+Note that the type used in different `RelationshipMapping`s should be unique.
+
+`cocoindex.storages.RelationshipMapping` describes the mapping to relationships. It has the following fields:
+
+* `rel_type` (type: `str`): The type of the relationship.
+* `source`/`target` (type: `cocoindex.storages.NodeReferenceMapping`): Specify how to extract source/target node information from the collected row. It has the following fields:
+  * `label` (type: `str`): The label of the node.
+  * `fields` (type: `Sequence[cocoindex.storages.TargetFieldMapping]`): Specify field mappings from the collected rows to node properties, with the following fields:
+    * `source` (type: `str`): The name of the field in the collected row.
+    * `target` (type: `str`, optional): The name of the field to use as the node field. If unspecified, will use the same as `source`.
+
+:::note Map necessary fields for nodes of relationships
+
+You need to map the following fields for nodes of each relationship:
+
+* Make sure all primary key fields for the label are mapped.
+* Optionally, you can also map non-key fields. If you do so, please make sure all value fields are mapped.
+
+:::
+
+All fields in the collector that are not used in mappings for source or target node fields will be mapped to relationship properties.
+
+#### Nodes only referenced by relationships
+
+If a node appears as the source or target of a relationship but is not exported using `NodeMapping`, CocoIndex will automatically create and keep these nodes until they're no longer referenced by any relationship.
+
+:::note Merge of node values
+
+If the same node (as identified by primary key values) appears multiple times (e.g. it's referenced by different relationships),
+CocoIndex uses the value fields provided by an arbitrary one of them.
+The best practice is to keep value fields consistent across different appearances of the same node, to avoid non-determinism in the exported graph.
+
+:::
+
+If a node's label specified in `NodeReferenceMapping` doesn't exist in any `NodeMapping`, you need to [declare](../core/flow_def#target-declarations) a `ReferencedNode` to configure [storage indexes](../core/flow_def#storage-indexes) for nodes with this label.
+The following options are supported:
+
+* `primary_key_fields` (required)
+* `vector_indexes` (optional)
+
+### Neo4j
+
+If you don't have a Neo4j database, you can start one using our docker compose config:
 
 ```bash
 docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/neo4j.yaml) up -d
@@ -69,25 +194,10 @@ The `Neo4j` storage exports each row as a relationship to Neo4j Knowledge Graph.
 * `user` (type: `str`): Username for the Neo4j database.
 * `password` (type: `str`): Password for the Neo4j database.
 * `db` (type: `str`, optional): The name of the Neo4j database to use as the internal storage, e.g. `neo4j`.
-* `mapping`: The mapping from collected row to nodes or relationships of the graph. 2 variations are supported:
-  * `cocoindex.storages.NodeMapping`: Each collected row is mapped to a node in the graph. It has the following fields:
-    * `label`: The label of the node.
-  * `cocoindex.storages.RelationshipMapping`: Each collected row is mapped to a relationship in the graph, with the following fields:
-    * `rel_type` (type: `str`): The type of the relationship.
-    * `source`/`target` (type: `cocoindex.storages.NodeReferenceMapping`): The source/target node of the relationship, with the following fields:
-      * `label` (type: `str`): The label of the node.
-      * `fields` (type: `Sequence[cocoindex.storages.TargetFieldMapping]`): Map fields from the collector to nodes in Neo4j, with the following fields:
-        * `source` (type: `str`): The name of the field in the collected row.
-        * `target` (type: `str`, optional): The name of the field to use as the node field. If unspecified, will use the same as `source`.
-
-:::info
+* `mapping` (type: `NodeMapping | RelationshipMapping`): The mapping from collected row to nodes or relationships of the graph. 2 variations are supported:
 
-All fields specified in `fields.source` will be mapped to properties of source/target nodes. All remaining fields will be mapped to relationship properties by default.
+Neo4j also provides a declaration spec `Neo4jDeclaration`, to configure indexing options for nodes only referenced by relationships. It has the following fields:
 
-:::
+* `connection` (type: auth reference to `Neo4jConnectionSpec`)
+* `relationships` (type: `Sequence[ReferencedNode]`)
 
-* `nodes_storage_spec` (type: `dict[str, cocoindex.storages.NodeStorageSpec]`): This configures indexes for different node labels. Key is the node label. The value type `NodeStorageSpec` has the following fields to configure [storage indexes](../core/flow_def#storage-indexes) for the node.
-  * `primary_key_fields` is required.
-  * `vector_indexes` is also supported and optional.
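The `RelationshipMapping` semantics added in this commit — fields mapped via source/target `TargetFieldMapping`s become node properties (optionally renamed), and every remaining collected field becomes a relationship property — can be sketched in plain Python. `split_relationship_row` and its mapping encoding are illustrative assumptions, not cocoindex code.

```python
def split_relationship_row(row, source_mappings, target_mappings):
    """Split one collected row into source-node properties, target-node
    properties, and relationship properties.

    source_mappings / target_mappings map a collected field name to the
    node property name to use (None means: keep the same name).
    """
    def pick(mappings):
        return {target or src: row[src] for src, target in mappings.items()}

    source_props = pick(source_mappings)
    target_props = pick(target_mappings)
    # All fields not used for source/target node mappings become
    # relationship properties.
    used = set(source_mappings) | set(target_mappings)
    rel_props = {k: v for k, v in row.items() if k not in used}
    return source_props, target_props, rel_props


row = {"subject": "CocoIndex", "object": "Postgres", "predicate": "exports to"}
src, tgt, rel = split_relationship_row(
    row,
    source_mappings={"subject": "name"},  # rename "subject" -> node "name"
    target_mappings={"object": "name"},   # rename "object"  -> node "name"
)
```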
