81 changes: 76 additions & 5 deletions docs/docs/core/flow_def.mdx
@@ -259,14 +259,11 @@ Export must happen at the top level of a flow, i.e. not within any child scopes

* `name`: the name to identify the export target.
* `target_spec`: the storage spec as the export target.
* `primary_key_fields` (`Sequence[str]`): the fields to be used as primary key. Types of the fields must be supported as key fields. See [Key Types](data_types#key-types) for more details.
* `vector_indexes` (`Sequence[VectorIndexDef]`, optional): the fields to create vector index. `VectorIndexDef` has the following fields:
* `field_name`: the field to create vector index.
* `metric`: the similarity metric to use. See [Vector Type](data_types#vector-type) for more details about supported similarity metrics.
* `setup_by_user` (optional):
whether the export target is set up by the user.
By default, CocoIndex manages the target setup (surfaced by the `cocoindex setup` CLI subcommand), e.g. creating related tables/collections/etc. with a compatible schema, and updating them upon change.
If `True`, the export target is managed by users, and users are responsible for creating the target and updating it upon change.
* Fields to configure [storage indexes](#storage-indexes). `primary_key_fields` is required, and all others are optional.

<Tabs>
<TabItem value="python" label="Python" default>
@@ -280,7 +277,7 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
    demo_collector.export(
        "demo_storage", DemoStorageSpec(...),
        primary_key_fields=["field1"],
        vector_index=[("field2", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
        vector_indexes=[cocoindex.VectorIndexDef("field2", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```

</TabItem>
@@ -289,3 +286,77 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
The target storage is managed by CocoIndex, i.e. it'll be created by [CocoIndex CLI](/docs/core/cli) when you run `cocoindex setup`, and the data will be automatically updated (including stale data removal) when updating the index.
The `name` for the same storage should remain stable across different runs.
If it changes, CocoIndex will treat it as an old storage removed and a new one created, and perform setup changes and reindexing accordingly.

#### Storage Indexes

Many storages support indexes to boost efficiency in retrieving data.
CocoIndex provides a common way to configure indexes for various storages, as shown in the sketch after the list below.

* *Primary key*. `primary_key_fields` (`Sequence[str]`): the fields to be used as primary key. Types of the fields must be supported as key fields. See [Key Types](data_types#key-types) for more details.
* *Vector index*. `vector_indexes` (`Sequence[VectorIndexDef]`): the fields to create vector index. `VectorIndexDef` has the following fields:
* `field_name`: the field to create vector index.
* `metric`: the similarity metric to use. See [Vector Type](data_types#vector-type) for more details about supported similarity metrics.
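
For illustration, a minimal sketch of passing both index configurations to `export()`; the target spec `DemoStorageSpec`, the collector, and the field names are hypothetical placeholders:

```python
demo_collector.export(
    "demo_storage", DemoStorageSpec(...),
    # Primary key: field types must be supported as key fields.
    primary_key_fields=["field1"],
    # Vector index on the embedding field, using cosine similarity.
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="field2",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
```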


## Miscellaneous

### Auth Registry

CocoIndex manages an auth registry. It's an in-memory key-value store, mainly to store authentication information for a backend.

An operation spec is the default way to configure a backend, but it has the following limitations:

* The spec isn't supposed to contain secret information, and it's frequently shown in various places, e.g. `cocoindex show`.
* Once an operation is removed after a flow definition code change, the spec is also gone.
But we still need to be able to drop the backend (e.g. a table) via `cocoindex setup` or `cocoindex drop`.


The auth registry is introduced to solve the problems above. It works as follows:

* You can create a new **auth entry** with a key and a value.
* You can reference the entry by the key, and pass it as part of the spec for certain operations, e.g. `Neo4jRelationship` takes a `connection` field in the form of an auth entry reference.

<Tabs>
<TabItem value="python" label="Python" default>

You can add an auth entry with the `cocoindex.add_auth_entry()` function, which returns a `cocoindex.AuthEntryReference`:

```python
my_graph_conn = cocoindex.add_auth_entry(
    "my_graph_conn",
    cocoindex.storages.Neo4jConnectionSpec(
        uri="bolt://localhost:7687",
        user="neo4j",
        password="cocoindex",
    ))
```

Then reference it when building a spec that takes an auth entry:

* You can either reference by the `AuthEntryReference` object directly:

```python
demo_collector.export(
    "MyGraph",
    cocoindex.storages.Neo4jRelationship(connection=my_graph_conn, ...)
)
```

* You can also reference it by the key string, using the `cocoindex.ref_auth_entry()` function:

```python
demo_collector.export(
    "MyGraph",
    cocoindex.storages.Neo4jRelationship(connection=cocoindex.ref_auth_entry("my_graph_conn"), ...))
```

</TabItem>
</Tabs>
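
Since the point of the auth registry is to keep secrets out of operation specs, you may also want to avoid hard-coding them in flow code. A minimal sketch, assuming the same `Neo4jConnectionSpec` fields as above and a hypothetical `NEO4J_PASSWORD` environment variable:

```python
import os

my_graph_conn = cocoindex.add_auth_entry(
    "my_graph_conn",
    cocoindex.storages.Neo4jConnectionSpec(
        uri="bolt://localhost:7687",
        user="neo4j",
        # Read the secret from the environment instead of hard-coding it.
        password=os.environ["NEO4J_PASSWORD"],
    ))
```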

Note that CocoIndex backends use the key of an auth entry to identify the backend.

* Keep the key stable.
If the key doesn't change, it's considered to be the same backend (even if the underlying way to connect/authenticate changes).

* If a key is no longer referenced in any operation spec, keep it until the next `cocoindex setup` or `cocoindex drop`,
so that CocoIndex is still able to perform the necessary cleanups.
36 changes: 36 additions & 0 deletions docs/docs/ops/storages.md
@@ -45,3 +45,39 @@ doc_embeddings.export(
```

You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_qdrant).

## Neo4j

### Setup

If you don't have a Neo4j database, you can start one for CocoIndex using our Docker Compose config:

```bash
docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/neo4j.yaml) up -d
```

### Neo4jRelationship

The `Neo4jRelationship` storage exports each row as a relationship to a Neo4j knowledge graph.
When you collect rows for `Neo4jRelationship`, fields are mapped to a relationship and to the relationship's source/target nodes:

* You can explicitly specify fields mapped to source/target nodes.
* All remaining fields will be mapped to relationship properties by default.


The spec takes the following fields (see the sketch after this list for how they fit together):

* `connection` (type: [auth reference](../core/flow_def#auth-registry) to `Neo4jConnectionSpec`): The connection to the Neo4j database. `Neo4jConnectionSpec` has the following fields:
* `uri` (type: `str`): The URI of the Neo4j database to use as the internal storage, e.g. `bolt://localhost:7687`.
* `user` (type: `str`): Username for the Neo4j database.
* `password` (type: `str`): Password for the Neo4j database.
* `db` (type: `str`, optional): The name of the Neo4j database to use as the internal storage, e.g. `neo4j`.
* `rel_type` (type: `str`): The type of the relationship.
* `source`/`target` (type: `Neo4jRelationshipEndSpec`): The source/target node of the relationship, with the following fields:
* `label` (type: `str`): The label of the node.
* `fields` (type: `list[Neo4jFieldMapping]`): Map fields from the collector to nodes in Neo4j, with the following fields:
* `field_name` (type: `str`): The name of the field in the collected row.
* `node_field_name` (type: `str`, optional): The name of the field on the Neo4j node. If unspecified, defaults to `field_name`.
* `nodes` (type: `dict[str, Neo4jRelationshipNodeSpec]`): This configures indexes for different node labels. The key is the node label. The value `Neo4jRelationshipNodeSpec` has the following fields to configure [storage indexes](../core/flow_def#storage-indexes) for the node:
* `primary_key_fields` is required.
* `vector_indexes` is also supported and optional.
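
For illustration only, here's a sketch of how these fields might fit together. It assumes a collector whose rows carry `subject` and `object` fields (any remaining fields become relationship properties), and that the spec classes above are exposed under `cocoindex.storages`; the collector, export name, labels, and field names are hypothetical:

```python
demo_collector.export(
    "entity_relationships",
    cocoindex.storages.Neo4jRelationship(
        # Auth entry reference created earlier via cocoindex.add_auth_entry(...).
        connection=cocoindex.ref_auth_entry("my_graph_conn"),
        rel_type="RELATES_TO",
        # Map each row's `subject` field to the source node's `name` property.
        source=cocoindex.storages.Neo4jRelationshipEndSpec(
            label="Entity",
            fields=[cocoindex.storages.Neo4jFieldMapping(
                field_name="subject", node_field_name="name")],
        ),
        # Map each row's `object` field to the target node's `name` property.
        target=cocoindex.storages.Neo4jRelationshipEndSpec(
            label="Entity",
            fields=[cocoindex.storages.Neo4jFieldMapping(
                field_name="object", node_field_name="name")],
        ),
        # Storage indexes for nodes with the `Entity` label.
        nodes={
            "Entity": cocoindex.storages.Neo4jRelationshipNodeSpec(
                primary_key_fields=["name"],
            ),
        },
    ),
)
```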