Commit dff1a9c

Merge branch 'main' into np-array
2 parents 87043d1 + 1702504 commit dff1a9c

21 files changed: +1220 -741 lines changed

docs/docs/getting_started/overview.md

Lines changed: 24 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -5,10 +5,30 @@ slug: /
55

66
# Welcome to CocoIndex
77

8-
Prepare high quality data that is tailored for the purpose is essential for a successful AI application in production.
8+
CocoIndex is an ultra-performant real-time data transformation framework for AI, with incremental processing.
99

10-
CocoIndex is a data indexing platform for AI use cases - semantic search, RAG, agentic workflow on top of embedding / knowledge graph etc. CocoIndex aims to be the best in class scalable data indexing infrastructure with built in observability and lineage.
10+
As a data framework, CocoIndex takes data freshness to the next level. **Incremental processing** is one of the core values provided by CocoIndex.
1111

12-
CocoIndex can help you connecting to all the data sources, identify the best indexing strategy and setup the most robust pipeline - chunking, embedding model, deduping/reconciling, vector stores, knowledge graph etc. And then providing standard API to access the index.
12+
## Programming Model
13+
CocoIndex follows the [dataflow programming](https://en.wikipedia.org/wiki/Dataflow_programming) model. Each transformation creates a new field based solely on input fields, without hidden state or value mutation. All data before and after each transformation is observable, with lineage out of the box.
14+
15+
The gist of an example data transformation:
16+
```python
17+
# import
18+
data['content'] = flow_builder.add_source(...)
19+
20+
# transform
21+
data['out'] = data['content']
22+
.transform(...)
23+
.transform(...)
24+
25+
# collect data
26+
collector.collect(...)
27+
28+
# export to db, vector db, graph db ...
29+
collector.export(...)
30+
```
31+
32+
Get Started:
33+
- [Quick Start](https://cocoindex.io/docs/getting_started/quickstart)
1334

14-
CocoIndex does all the heavy lifting work and plumbing for the data, so you can focus on your business logic and build your AI application on top of robust data indices.
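
The dataflow idea introduced in the overview change above (each transformation derives a new field only from input fields, with no hidden state and no mutation) can be illustrated with a small framework-free sketch. This is plain Python, not the CocoIndex API: `transform`, the lambdas, and the field names are hypothetical and exist only for illustration.

```python
from types import MappingProxyType
from typing import Callable, Mapping


def transform(
    row: Mapping[str, object],
    out_field: str,
    fn: Callable[..., object],
    *in_fields: str,
) -> Mapping[str, object]:
    """Return a NEW read-only row with `out_field` computed only from `in_fields`."""
    if out_field in row:
        # Existing fields are never overwritten, so every intermediate stays observable.
        raise ValueError(f"field {out_field!r} already exists")
    value = fn(*(row[name] for name in in_fields))
    return MappingProxyType({**row, out_field: value})


# "import": a single source field
source = MappingProxyType({"content": "Hello  CocoIndex"})

# "transform": each step adds one field derived from earlier fields
normalized = transform(source, "normalized", lambda s: " ".join(s.split()), "content")
counted = transform(normalized, "word_count", lambda s: len(s.split()), "normalized")
```

Because every step returns a new row and leaves its input untouched, the value of any field can be traced back to the fields it was computed from, which is the property the overview refers to as lineage.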

docs/docs/ops/functions.md

Lines changed: 32 additions & 3 deletions
@@ -26,11 +26,40 @@ Input data:
2626

2727
* `text` (type: `str`, required): The text to split.
2828
* `chunk_size` (type: `int`, required): The maximum size of each chunk, in bytes.
29+
* `min_chunk_size` (type: `int`, optional): The minimum size of each chunk, in bytes. If not provided, defaults to `chunk_size / 2`.
30+
31+
:::note
32+
33+
`SplitRecursively` will do its best to make the output chunks sized between `min_chunk_size` and `chunk_size`.
34+
However, it's possible that some chunks are smaller than `min_chunk_size` or larger than `chunk_size` in rare cases, e.g. too short input text, or non-splittable large text.
35+
36+
Please avoid setting `min_chunk_size` to a value too close to `chunk_size`, to leave more room for the function to plan the optimal chunking.
37+
38+
:::
39+
2940
* `chunk_overlap` (type: `int`, optional): The maximum overlap size between adjacent chunks, in bytes.
3041
* `language` (type: `str`, optional): The language of the document.
31-
Can be a langauge name (e.g. `Python`, `Javascript`, `Markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).
32-
To see all supported language names and extensions, see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
33-
If it's unspecified or the specified language is not supported, it will be treated as plain text.
42+
Can be a language name (e.g. `Python`, `Javascript`, `Markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).
43+
44+
* `custom_languages` (type: `list[CustomLanguageSpec]`, optional): Customizes how specific languages are chunked, using regular expressions. Each `CustomLanguageSpec` is a dict with the following fields:
45+
* `language_name` (type: `str`, required): Name of the language.
46+
* `aliases` (type: `list[str]`, optional): A list of aliases for the language.
47+
It's an error if any language name or alias is duplicated.
48+
49+
* `separators_regex` (type: `list[str]`, required): A list of regex patterns to split the text.
50+
Higher-level boundaries should come first, and lower-level ones later, e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
51+
See [regex Syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
52+
53+
:::note
54+
55+
We use the `language` field to determine how to split the input text, following these rules:
56+
57+
* We'll match the input `language` field against the `language_name` or `aliases` of each custom language specification, and use the matched one. If the value of `language` is null, it'll be treated as an empty string when matching `language_name` or `aliases`.
58+
* If no match is found, we'll match the `language` field against the builtin language configurations.
59+
For all supported builtin language names and aliases (extensions), see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
60+
* If no match is found, the input will be treated as plain text.
61+
62+
:::
3463

3564
Return type: [KTable](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:
3665
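
The lookup order described in the `functions.md` note above (custom language specs first, then builtin languages, then plain text) can be sketched as follows. This is a minimal illustration, not the library's implementation: `resolve_language` and `BUILTIN_LANGUAGES` are hypothetical, the builtin table is a tiny stand-in, and exact string matching is an assumption because the note does not specify case handling.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CustomLanguageSpec:
    language_name: str
    separators_regex: list[str]
    aliases: list[str] = field(default_factory=list)


# Hypothetical stand-in for the builtin language table (names and extensions).
BUILTIN_LANGUAGES = {"Python": "Python", ".py": "Python", "Markdown": "Markdown", ".md": "Markdown"}


def resolve_language(
    language: Optional[str], custom_languages: list[CustomLanguageSpec]
) -> tuple[str, Optional[str]]:
    """Return (kind, name), where kind is 'custom', 'builtin' or 'plain_text'."""
    # A null language is treated as an empty string when matching.
    key = language if language is not None else ""

    # It is an error if any language name or alias is duplicated.
    seen: set[str] = set()
    for spec in custom_languages:
        for name in (spec.language_name, *spec.aliases):
            if name in seen:
                raise ValueError(f"duplicated language name or alias: {name!r}")
            seen.add(name)

    # Rule 1: custom specs take precedence.
    for spec in custom_languages:
        if key in (spec.language_name, *spec.aliases):
            return ("custom", spec.language_name)

    # Rule 2: fall back to builtin languages.
    if key in BUILTIN_LANGUAGES:
        return ("builtin", BUILTIN_LANGUAGES[key])

    # Rule 3: everything else is plain text.
    return ("plain_text", None)


my_md = CustomLanguageSpec("MyMarkdown", [r"\n# ", r"\n\n"], aliases=[".mymd"])
fallback = CustomLanguageSpec("Fallback", [r"\n\n"], aliases=[""])
```

With these two specs, `resolve_language(".mymd", [my_md])` selects the custom spec, `.py` falls through to the builtin table, and a null `language` matches `fallback` only because that spec declares the empty string as an alias.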

docs/docs/ops/storages.md

Lines changed: 52 additions & 40 deletions
@@ -54,34 +54,21 @@ Here's how CocoIndex data elements map to Qdrant elements during export:
5454
|-------------------|------------------|
5555
| an export target | a unique collection |
5656
| a collected row | a point |
57-
| a field | a named vector (for fields with vector type); a field within payload (otherwise) |
57+
| a field | a named vector, if it fits into a Qdrant vector; otherwise a field within the payload |
58+
59+
A vector with element type `Float32`, `Float64` or `Int64` and a fixed dimension fits into a Qdrant vector.
5860

5961
#### Spec
6062

6163
The spec takes the following fields:
6264

63-
* `collection_name` (type: `str`, required): The name of the collection to export the data to.
64-
65-
* `grpc_url` (type: `str`, optional): The [gRPC URL](https://qdrant.tech/documentation/interfaces/#grpc-interface) of the Qdrant instance. Defaults to `http://localhost:6334/`.
66-
67-
* `api_key` (type: `str`, optional). API key to authenticate requests with.
65+
* `connection` (type: [auth reference](../core/flow_def#auth-registry) to `QdrantConnection`, optional): The connection to the Qdrant instance. `QdrantConnection` has the following fields:
66+
* `grpc_url` (type: `str`): The [gRPC URL](https://qdrant.tech/documentation/interfaces/#grpc-interface) of the Qdrant instance, e.g. `http://localhost:6334/`.
67+
* `api_key` (type: `str`, optional): API key to authenticate requests with.
6868

69-
Before exporting, you must create a collection with a [vector name](https://qdrant.tech/documentation/concepts/vectors/#named-vectors) that matches the vector field name in CocoIndex, and set `setup_by_user=True` during export.
69+
If `connection` is not provided, the local Qdrant instance at `http://localhost:6334/` is used by default.
7070

71-
Example:
72-
73-
```python
74-
doc_embeddings.export(
75-
"doc_embeddings",
76-
cocoindex.storages.Qdrant(
77-
collection_name="cocoindex",
78-
grpc_url="https://xyz-example.cloud-region.cloud-provider.cloud.qdrant.io:6334/",
79-
api_key="<your-api-key-here>",
80-
),
81-
primary_key_fields=["id_field"],
82-
setup_by_user=True,
83-
)
84-
```
71+
* `collection_name` (type: `str`, required): The name of the collection to export the data to.
8572

8673
You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_qdrant).
8774

@@ -399,19 +386,7 @@ You can find end-to-end examples fitting into any of supported property graphs i
399386

400387
### Neo4j
401388

402-
If you don't have a Neo4j database, you can start a Neo4j database using our docker compose config:
403-
404-
```bash
405-
docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/neo4j.yaml) up -d
406-
```
407-
408-
:::warning
409-
410-
The docker compose config above will start a Neo4j Enterprise instance under the [Evaluation License](https://neo4j.com/terms/enterprise_us/),
411-
with 30 days trial period.
412-
Please read and agree the license before starting the instance.
413-
414-
:::
389+
#### Spec
415390

416391
The `Neo4j` target spec takes the following fields:
417392

@@ -430,17 +405,32 @@ Neo4j also provides a declaration spec `Neo4jDeclaration`, to configure indexing
430405
* `primary_key_fields` (required)
431406
* `vector_indexes` (optional)
432407

433-
### Kuzu
408+
#### Neo4j dev instance
434409

435-
CocoIndex supports talking to Kuzu through its [API server](https://github.com/kuzudb/api-server).
436-
You can bring up a Kuzu API server locally by running:
410+
If you don't have a Neo4j database, you can start a Neo4j database using our docker compose config:
437411

438412
```bash
439-
KUZU_DB_DIR=$HOME/.kuzudb
440-
KUZU_PORT=8123
441-
docker run -d --name kuzu -p ${KUZU_PORT}:8000 -v ${KUZU_DB_DIR}:/database kuzudb/api-server:latest
413+
docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/neo4j.yaml) up -d
442414
```
443415

416+
It will bring up a Neo4j instance, which can be accessed with username `neo4j` and password `cocoindex`.
417+
You can access the Neo4j browser at [http://localhost:7474](http://localhost:7474).
418+
419+
:::warning
420+
421+
The docker compose config above will start a Neo4j Enterprise instance under the [Evaluation License](https://neo4j.com/terms/enterprise_us/),
422+
with a 30-day trial period.
423+
Please read and agree to the license before starting the instance.
424+
425+
:::
426+
427+
428+
### Kuzu
429+
430+
#### Spec
431+
432+
CocoIndex supports talking to Kuzu through its [API server](https://github.com/kuzudb/api-server).
433+
444434
The `Kuzu` target spec takes the following fields:
445435

446436
* `connection` (type: [auth reference](../core/flow_def#auth-registry) to `KuzuConnectionSpec`): The connection to the Kuzu database. `KuzuConnectionSpec` has the following fields:
@@ -453,3 +443,25 @@ Kuzu also provides a declaration spec `KuzuDeclaration`, to configure indexing o
453443
* Fields for [nodes to declare](#declare-extra-node-labels), including
454444
* `nodes_label` (required)
455445
* `primary_key_fields` (required)
446+
447+
#### Kuzu dev instance
448+
449+
If you don't have a Kuzu instance yet, you can bring up a Kuzu API server locally by running:
450+
451+
```bash
452+
KUZU_DB_DIR=$HOME/.kuzudb
453+
KUZU_PORT=8123
454+
docker run -d --name kuzu -p ${KUZU_PORT}:8000 -v ${KUZU_DB_DIR}:/database kuzudb/api-server:latest
455+
```
456+
457+
To explore the graph you built with Kuzu, you can use the [Kuzu Explorer](https://github.com/kuzudb/explorer).
458+
Currently, the Kuzu API server and the explorer cannot run at the same time, so you need to stop the API server before running the explorer.
459+
460+
To start the instance of the explorer, run:
461+
462+
```bash
463+
KUZU_EXPLORER_PORT=8124
464+
docker run -d --name kuzu-explorer -p ${KUZU_EXPLORER_PORT}:8000 -v ${KUZU_DB_DIR}:/database -e MODE=READ_ONLY kuzudb/explorer:latest
465+
```
466+
467+
You can then access the explorer at [http://localhost:8124](http://localhost:8124).
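
Returning to the Qdrant mapping rule in the `storages.md` change above: a field becomes a named vector only when it is a vector of `Float32`, `Float64` or `Int64` with a fixed dimension, and otherwise lands in the payload. The sketch below restates that rule as a predicate. `VectorType` and `qdrant_slot` are hypothetical names for illustration and do not mirror CocoIndex's internal type representation.

```python
from dataclasses import dataclass
from typing import Optional

# Element types that the changed docs list as fitting into a Qdrant vector.
QDRANT_VECTOR_ELEMENT_TYPES = {"Float32", "Float64", "Int64"}


@dataclass(frozen=True)
class VectorType:
    """Hypothetical description of a vector field type."""

    element_type: str
    dimension: Optional[int]  # None means the dimension is not fixed


def qdrant_slot(field_type: object) -> str:
    """Return 'named_vector' if the type fits into a Qdrant vector, else 'payload'."""
    if (
        isinstance(field_type, VectorType)
        and field_type.element_type in QDRANT_VECTOR_ELEMENT_TYPES
        and field_type.dimension is not None
    ):
        return "named_vector"
    return "payload"
```

Under this reading, a 384-dimensional `Float32` embedding maps to a named vector, while a variable-length vector, a vector of strings, or a plain `str` field is stored in the payload.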

docs/docusaurus.config.ts

Lines changed: 5 additions & 0 deletions
@@ -86,6 +86,11 @@ const config: Config = {
8686
// Replace with your project's social card
8787
image: 'img/social-card.jpg',
8888
metadata: [{ name: 'description', content: 'Official documentation for CocoIndex - Learn how to use CocoIndex to build robust data indexing pipelines for AI applications. Comprehensive guides, API references, and best practices for implementing efficient data processing workflows.' }],
89+
colorMode: {
90+
defaultMode: 'light',
91+
disableSwitch: false,
92+
respectPrefersColorScheme: true,
93+
},
8994
navbar: {
9095
title: 'CocoIndex',
9196
logo: {

examples/amazon_s3_embedding/main.py

Lines changed: 14 additions & 13 deletions
@@ -98,19 +98,20 @@ def search(pool: ConnectionPool, query: str, top_k: int = 5):
9898
def _main():
9999
# Initialize the database connection pool.
100100
pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
101-
# Run queries in a loop to demonstrate the query capabilities.
102-
while True:
103-
query = input("Enter search query (or Enter to quit): ")
104-
if query == "":
105-
break
106-
# Run the query function with the database connection pool and the query.
107-
results = search(pool, query)
108-
print("\nSearch results:")
109-
for result in results:
110-
print(f"[{result['score']:.3f}] {result['filename']}")
111-
print(f" {result['text']}")
112-
print("---")
113-
print()
101+
with cocoindex.FlowLiveUpdater(amazon_s3_text_embedding_flow):
102+
# Run queries in a loop to demonstrate the query capabilities.
103+
while True:
104+
query = input("Enter search query (or Enter to quit): ")
105+
if query == "":
106+
break
107+
# Run the query function with the database connection pool and the query.
108+
results = search(pool, query)
109+
print("\nSearch results:")
110+
for result in results:
111+
print(f"[{result['score']:.3f}] {result['filename']}")
112+
print(f" {result['text']}")
113+
print("---")
114+
print()
114115

115116

116117
if __name__ == "__main__":
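
The change above moves the query loop inside `with cocoindex.FlowLiveUpdater(...)`, so queries run while the updater is active. The sketch below is not CocoIndex's implementation. It only illustrates the general shape of such a pattern, assuming a context manager that runs refresh work on a background thread for the duration of the block; `BackgroundUpdater` and `refresh_index` are hypothetical.

```python
import threading
import time
from typing import Callable


class BackgroundUpdater:
    """Hypothetical helper: calls `update_once` repeatedly on a worker thread."""

    def __init__(self, update_once: Callable[[], None], interval_s: float = 1.0) -> None:
        self._update_once = update_once
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            self._update_once()
            # wait() returns early as soon as a stop is requested.
            self._stop.wait(self._interval_s)

    @property
    def running(self) -> bool:
        return self._thread.is_alive()

    def __enter__(self) -> "BackgroundUpdater":
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Leaving the block stops the worker and waits for it to finish.
        self._stop.set()
        self._thread.join()


update_count = 0


def refresh_index() -> None:
    global update_count
    update_count += 1


with BackgroundUpdater(refresh_index, interval_s=0.01) as updater:
    time.sleep(0.05)  # stands in for the interactive query loop
    was_running = updater.running
```

The design point is that the foreground loop needs no knowledge of the refresh work: entering the block starts it, and leaving the block (including via `break` or an exception) stops it.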

examples/code_embedding/main.py

Lines changed: 2 additions & 1 deletion
@@ -27,7 +27,7 @@ def code_to_embedding(
2727
@cocoindex.flow_def(name="CodeEmbedding")
2828
def code_embedding_flow(
2929
flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
30-
):
30+
) -> None:
3131
"""
3232
Define an example flow that embeds files into a vector database.
3333
"""
@@ -46,6 +46,7 @@ def code_embedding_flow(
4646
cocoindex.functions.SplitRecursively(),
4747
language=file["extension"],
4848
chunk_size=1000,
49+
min_chunk_size=300,
4950
chunk_overlap=300,
5051
)
5152
with file["chunks"].row() as chunk:

examples/docs_to_knowledge_graph/README.md

Lines changed: 7 additions & 10 deletions
@@ -12,10 +12,10 @@ Please drop [Cocoindex on Github](https://github.com/cocoindex-io/cocoindex) a s
1212

1313
![example-explanation](https://github.com/user-attachments/assets/07ddbd60-106f-427f-b7cc-16b73b142d27)
1414

15-
1615
## Prerequisite
1716
* [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
18-
* [Install Neo4j](https://cocoindex.io/docs/ops/storages#neo4j) if you don't have one.
17+
* Install [Neo4j](https://cocoindex.io/docs/ops/storages#neo4j-dev-instance) or [Kuzu](https://cocoindex.io/docs/ops/storages#kuzu-dev-instance) if you don't have one.
18+
* The example uses Neo4j by default for now. If you want to use Kuzu, find the "SELECT ONE GRAPH DATABASE TO USE" section and switch the active branch.
1919
* [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai).
2020

2121
## Documentation
@@ -45,21 +45,18 @@ cocoindex update main.py
4545

4646
### Browse the knowledge graph
4747

48-
After the knowledge graph is build, you can explore the knowledge graph you built in Neo4j Browser.
48+
After the knowledge graph is built, you can explore the knowledge graph.
4949

50-
For the dev enviroment, you can connect neo4j browser using credentials:
51-
- username: `neo4j`
52-
- password: `cocoindex`
53-
which is pre-configured in the our docker compose [config.yaml](https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/neo4j.yaml).
50+
* If you're using Neo4j, you can open the explorer at [http://localhost:7474](http://localhost:7474), with username `neo4j` and password `cocoindex`.
51+
* If you're using Kuzu, you can start a Kuzu explorer locally. See [Kuzu dev instance](https://cocoindex.io/docs/ops/storages#kuzu-dev-instance) for more details.
5452

55-
You can open it at [http://localhost:7474](http://localhost:7474), and run the following Cypher query to get all relationships:
53+
You can run the following Cypher query to get all relationships:
5654

5755
```cypher
5856
MATCH p=()-->() RETURN p
5957
```
60-
<img width="1366" alt="neo4j-for-coco-docs" src="https://github.com/user-attachments/assets/3c8b6329-6fee-4533-9480-571399b57e57" />
61-
6258

59+
<img width="1366" alt="neo4j-for-coco-docs" src="https://github.com/user-attachments/assets/3c8b6329-6fee-4533-9480-571399b57e57" />
6360

6461
## CocoInsight
6562
I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.

examples/docs_to_knowledge_graph/main.py

Lines changed: 26 additions & 23 deletions
@@ -5,27 +5,6 @@
55
import dataclasses
66
import cocoindex
77

8-
9-
@dataclasses.dataclass
10-
class DocumentSummary:
11-
"""Describe a summary of a document."""
12-
13-
title: str
14-
summary: str
15-
16-
17-
@dataclasses.dataclass
18-
class Relationship:
19-
"""
20-
Describe a relationship between two entities.
21-
Subject and object should be Core CocoIndex concepts only, should be nouns. For example, `CocoIndex`, `Incremental Processing`, `ETL`, `Data` etc.
22-
"""
23-
24-
subject: str
25-
predicate: str
26-
object: str
27-
28-
298
neo4j_conn_spec = cocoindex.add_auth_entry(
309
"Neo4jConnection",
3110
cocoindex.storages.Neo4jConnection(
@@ -41,19 +20,43 @@ class Relationship:
4120
),
4221
)
4322

44-
# Use Neo4j as the graph database
23+
# SELECT ONE GRAPH DATABASE TO USE
24+
# This example can use either Neo4j or Kuzu as the graph database.
25+
# Please make sure only one branch is live and others are commented out.
26+
27+
# Use Neo4j
4528
GraphDbSpec = cocoindex.storages.Neo4j
4629
GraphDbConnection = cocoindex.storages.Neo4jConnection
4730
GraphDbDeclaration = cocoindex.storages.Neo4jDeclaration
4831
conn_spec = neo4j_conn_spec
4932

50-
# Use Kuzu as the graph database
33+
# Use Kuzu
5134
# GraphDbSpec = cocoindex.storages.Kuzu
5235
# GraphDbConnection = cocoindex.storages.KuzuConnection
5336
# GraphDbDeclaration = cocoindex.storages.KuzuDeclaration
5437
# conn_spec = kuzu_conn_spec
5538

5639

40+
@dataclasses.dataclass
41+
class DocumentSummary:
42+
"""Describe a summary of a document."""
43+
44+
title: str
45+
summary: str
46+
47+
48+
@dataclasses.dataclass
49+
class Relationship:
50+
"""
51+
Describe a relationship between two entities.
52+
Subject and object should be Core CocoIndex concepts only, should be nouns. For example, `CocoIndex`, `Incremental Processing`, `ETL`, `Data` etc.
53+
"""
54+
55+
subject: str
56+
predicate: str
57+
object: str
58+
59+
5760
@cocoindex.flow_def(name="DocsToKG")
5861
def docs_to_kg_flow(
5962
flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
