docs/docs/getting_started/overview.md
Lines changed: 24 additions & 4 deletions
@@ -5,10 +5,30 @@ slug: /

 # Welcome to CocoIndex

-Prepare high quality data that is tailored for the purpose is essential for a successful AI application in production.
+CocoIndex is an ultra-performant real-time data transformation framework for AI, with incremental processing.

-CocoIndex is a data indexing platform for AI use cases - semantic search, RAG, agentic workflow on top of embedding / knowledge graph etc. CocoIndex aims to be the best in class scalable data indexing infrastructure with built in observability and lineage.
+As a data framework, CocoIndex takes data freshness to the next level. **Incremental processing** is one of the core values provided by CocoIndex.

-CocoIndex can help you connecting to all the data sources, identify the best indexing strategy and setup the most robust pipeline - chunking, embedding model, deduping/reconciling, vector stores, knowledge graph etc. And then providing standard API to access the index.
+## Programming Model
+CocoIndex follows the [Dataflow programming](https://en.wikipedia.org/wiki/Dataflow_programming) model. Each transformation creates a new field solely based on input fields, without hidden state or value mutation. All data before and after each transformation is observable, with lineage out of the box.
 CocoIndex does all the heavy lifting work and plumbing for the data, so you can focus on your business logic and build your AI application on top of robust data indices.
docs/docs/ops/functions.md
Lines changed: 32 additions & 3 deletions
@@ -26,11 +26,40 @@ Input data:

 * `text` (type: `str`, required): The text to split.
 * `chunk_size` (type: `int`, required): The maximum size of each chunk, in bytes.
+* `min_chunk_size` (type: `int`, optional): The minimum size of each chunk, in bytes. If not provided, defaults to `chunk_size / 2`.
+
+    :::note
+
+    `SplitRecursively` will do its best to make the output chunks sized between `min_chunk_size` and `chunk_size`.
+    However, in rare cases some chunks may be smaller than `min_chunk_size` or larger than `chunk_size`, e.g. when the input text is too short or contains large non-splittable text.
+
+    Please avoid setting `min_chunk_size` to a value too close to `chunk_size`, to leave more room for the function to plan the optimal chunking.
+
+    :::
+
 * `chunk_overlap` (type: `int`, optional): The maximum overlap size between adjacent chunks, in bytes.
 * `language` (type: `str`, optional): The language of the document.
-    Can be a langauge name (e.g. `Python`, `Javascript`, `Markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).
-    To see all supported language names and extensions, see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
-    If it's unspecified or the specified language is not supported, it will be treated as plain text.
+    Can be a language name (e.g. `Python`, `Javascript`, `Markdown`) or a file extension (e.g. `.py`, `.js`, `.md`).
+
+* `custom_languages` (type: `list[CustomLanguageSpec]`, optional): Customizes how specific languages are chunked, using regular expressions. Each `CustomLanguageSpec` is a dict with the following fields:
+    * `language_name` (type: `str`, required): Name of the language.
+    * `aliases` (type: `list[str]`, optional): A list of aliases for the language.
+        It's an error if any language name or alias is duplicated.
+
+    * `separators_regex` (type: `list[str]`, required): A list of regex patterns to split the text.
+        Higher-level boundaries should come first, and lower-level ones later, e.g. `[r"\n# ", r"\n## ", r"\n\n", r"\. "]`.
+        See [regex Syntax](https://docs.rs/regex/latest/regex/#syntax) for supported regular expression syntax.
+
+    :::note
+
+    We use the `language` field to determine how to split the input text, following these rules:
+
+    * We'll match the input `language` field against the `language_name` or `aliases` of each custom language specification, and use the matched one. If the value of `language` is null, it'll be treated as an empty string when matching `language_name` or `aliases`.
+    * If no match is found, we'll match the `language` field against the builtin language configurations.
+        For all supported builtin language names and aliases (extensions), see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
+    * If no match is found, the input will be treated as plain text.
+
+    :::
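
The matching rules in the note above can be sketched as follows. This is a minimal illustration, not CocoIndex's implementation: `resolve_language`, `check_unique_names` and the tiny `BUILTIN_LANGUAGES` table are hypothetical names introduced here, and exact, case-sensitive matching is an assumption.

```python
from typing import Optional

# Hypothetical subset of the builtin table; the real list lives in the CocoIndex source.
BUILTIN_LANGUAGES = {
    "Python": "Python", ".py": "Python",
    "Markdown": "Markdown", ".md": "Markdown",
}


def check_unique_names(custom_languages: list[dict]) -> None:
    """Raise ValueError if any language name or alias appears more than once."""
    seen: set[str] = set()
    for spec in custom_languages:
        for name in [spec["language_name"], *spec.get("aliases", [])]:
            if name in seen:
                raise ValueError(f"duplicated language name or alias: {name!r}")
            seen.add(name)


def resolve_language(
    language: Optional[str], custom_languages: list[dict]
) -> tuple[str, Optional[str]]:
    """Return (source, language_name) following the three rules in order."""
    # A null `language` is treated as an empty string when matching.
    key = language if language is not None else ""
    # Rule 1: custom language specs take precedence.
    for spec in custom_languages:
        if key == spec["language_name"] or key in spec.get("aliases", []):
            return ("custom", spec["language_name"])
    # Rule 2: fall back to the builtin language configurations.
    if key in BUILTIN_LANGUAGES:
        return ("builtin", BUILTIN_LANGUAGES[key])
    # Rule 3: anything else is handled as plain text.
    return ("plain_text", None)


# Example spec shaped like the `CustomLanguageSpec` fields listed above.
custom = [
    {
        "language_name": "MyNotes",
        "aliases": [".notes"],
        "separators_regex": [r"\n# ", r"\n\n", r"\. "],
    }
]
check_unique_names(custom)
print(resolve_language(".notes", custom))
print(resolve_language(".py", custom))
print(resolve_language(None, custom))
```

Checking custom specs first means a custom entry can shadow a builtin name or extension, which is what the rule order in the note implies.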

 Return type: [KTable](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:
docs/docs/ops/storages.md
Lines changed: 52 additions & 40 deletions
@@ -54,34 +54,21 @@ Here's how CocoIndex data elements map to Qdrant elements during export:
 |-------------------|------------------|
 | an export target | a unique collection |
 | a collected row | a point |
-| a field | a named vector (for fields with vector type); a field within payload (otherwise) |
+| a field | a named vector, if it fits into a Qdrant vector; a field within the payload otherwise |
+
+A vector with element type `Float32`, `Float64` or `Int64` and a fixed dimension fits into a Qdrant vector.
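
As a rough illustration of the rule above, the decision between a named vector and a payload field can be sketched like this. `FieldInfo` and `qdrant_slot` are hypothetical names used only for this example and are not part of CocoIndex or the Qdrant client.

```python
from dataclasses import dataclass
from typing import Optional

# Element types that the text above says fit into a Qdrant vector.
QDRANT_VECTOR_ELEMENT_TYPES = {"Float32", "Float64", "Int64"}


@dataclass
class FieldInfo:
    """Hypothetical description of a collected field."""
    name: str
    vector_element_type: Optional[str] = None  # None for non-vector fields
    dimension: Optional[int] = None            # None when the dimension is not fixed


def qdrant_slot(field: FieldInfo) -> str:
    """Return where the field would land in a Qdrant point."""
    fits = (
        field.vector_element_type in QDRANT_VECTOR_ELEMENT_TYPES
        and field.dimension is not None
    )
    return "named_vector" if fits else "payload"


fields = [
    FieldInfo("embedding", "Float32", 384),    # fixed-dimension float vector
    FieldInfo("sparse_ids", "Int64", None),    # vector without a fixed dimension
    FieldInfo("text"),                         # not a vector at all
]
for f in fields:
    print(f.name, "->", qdrant_slot(f))
```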

 #### Spec

 The spec takes the following fields:

-* `collection_name` (type: `str`, required): The name of the collection to export the data to.
-
-* `grpc_url` (type: `str`, optional): The [gRPC URL](https://qdrant.tech/documentation/interfaces/#grpc-interface) of the Qdrant instance. Defaults to `http://localhost:6334/`.
-
-* `api_key` (type: `str`, optional). API key to authenticate requests with.
+* `connection` (type: [auth reference](../core/flow_def#auth-registry) to `QdrantConnection`, optional): The connection to the Qdrant instance. `QdrantConnection` has the following fields:
+    * `grpc_url` (type: `str`): The [gRPC URL](https://qdrant.tech/documentation/interfaces/#grpc-interface) of the Qdrant instance, e.g. `http://localhost:6334/`.
+    * `api_key` (type: `str`, optional): API key to authenticate requests with.

-Before exporting, you must create a collection with a [vector name](https://qdrant.tech/documentation/concepts/vectors/#named-vectors) that matches the vector field name in CocoIndex, and set `setup_by_user=True` during export.
+    If `connection` is not provided, a local Qdrant instance at `http://localhost:6334/` is used by default.
+* `collection_name` (type: `str`, required): The name of the collection to export the data to.
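
The fallback described above for a missing `connection` can be sketched as below. `QdrantConnectionSketch` and `effective_connection` are hypothetical stand-ins, not the CocoIndex classes, and leaving `api_key` unset in the default case is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Default taken from the text above: the local Qdrant gRPC endpoint.
DEFAULT_GRPC_URL = "http://localhost:6334/"


@dataclass
class QdrantConnectionSketch:
    """Hypothetical mirror of the documented `QdrantConnection` fields."""
    grpc_url: str
    api_key: Optional[str] = None


def effective_connection(
    connection: Optional[QdrantConnectionSketch],
) -> QdrantConnectionSketch:
    """Use the given connection, or fall back to the local default."""
    if connection is None:
        return QdrantConnectionSketch(grpc_url=DEFAULT_GRPC_URL)
    return connection


print(effective_connection(None).grpc_url)
print(effective_connection(QdrantConnectionSketch("http://qdrant.internal:6334/", "secret")).grpc_url)
```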

 You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_qdrant).

@@ -399,19 +386,7 @@ You can find end-to-end examples fitting into any of supported property graphs i

 ### Neo4j

-If you don't have a Neo4j database, you can start a Neo4j database using our docker compose config:
-
-```bash
-docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/neo4j.yaml) up -d
-```
-
-:::warning
-
-The docker compose config above will start a Neo4j Enterprise instance under the [Evaluation License](https://neo4j.com/terms/enterprise_us/),
-with 30 days trial period.
-Please read and agree the license before starting the instance.
-
-:::
+#### Spec

 The `Neo4j` target spec takes the following fields:

@@ -430,17 +405,32 @@ Neo4j also provides a declaration spec `Neo4jDeclaration`, to configure indexing
     * `primary_key_fields` (required)
     * `vector_indexes` (optional)

-### Kuzu
+#### Neo4j dev instance

-CocoIndex supports talking to Kuzu through its [API server](https://github.com/kuzudb/api-server).
-You can bring up a Kuzu API server locally by running:
+If you don't have a Neo4j database, you can start a Neo4j database using our docker compose config:

 ```bash
-KUZU_DB_DIR=$HOME/.kuzudb
-KUZU_PORT=8123
-docker run -d --name kuzu -p ${KUZU_PORT}:8000 -v ${KUZU_DB_DIR}:/database kuzudb/api-server:latest
+docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/neo4j.yaml) up -d
 ```

+It will bring up a Neo4j instance, which can be accessed with username `neo4j` and password `cocoindex`.
+You can access the Neo4j browser at [http://localhost:7474](http://localhost:7474).
+
+:::warning
+
+The docker compose config above will start a Neo4j Enterprise instance under the [Evaluation License](https://neo4j.com/terms/enterprise_us/),
+with a 30-day trial period.
+Please read and agree to the license before starting the instance.
+
+:::
+
+
+### Kuzu
+
+#### Spec
+
+CocoIndex supports talking to Kuzu through its [API server](https://github.com/kuzudb/api-server).
+
 The `Kuzu` target spec takes the following fields:

 * `connection` (type: [auth reference](../core/flow_def#auth-registry) to `KuzuConnectionSpec`): The connection to the Kuzu database. `KuzuConnectionSpec` has the following fields:
@@ -453,3 +443,25 @@ Kuzu also provides a declaration spec `KuzuDeclaration`, to configure indexing o
 * Fields for [nodes to declare](#declare-extra-node-labels), including
     * `nodes_label` (required)
     * `primary_key_fields` (required)
+
+#### Kuzu dev instance
+
+If you don't have a Kuzu instance yet, you can bring up a Kuzu API server locally by running:
+
+```bash
+KUZU_DB_DIR=$HOME/.kuzudb
+KUZU_PORT=8123
+docker run -d --name kuzu -p ${KUZU_PORT}:8000 -v ${KUZU_DB_DIR}:/database kuzudb/api-server:latest
+```
+
+To explore the graph you built with Kuzu, you can use the [Kuzu Explorer](https://github.com/kuzudb/explorer).
+Currently, the Kuzu API server and the explorer cannot run at the same time, so you need to stop the API server before running the explorer.
docs/docusaurus.config.ts
Lines changed: 5 additions & 0 deletions
@@ -86,6 +86,11 @@ const config: Config = {
 // Replace with your project's social card
 image: 'img/social-card.jpg',
 metadata: [{name: 'description',content: 'Official documentation for CocoIndex - Learn how to use CocoIndex to build robust data indexing pipelines for AI applications. Comprehensive guides, API references, and best practices for implementing efficient data processing workflows.'}],
 * [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
-* [Install Neo4j](https://cocoindex.io/docs/ops/storages#neo4j) if you don't have one.
+* Install [Neo4j](https://cocoindex.io/docs/ops/storages#neo4j-dev-instance) or [Kuzu](https://cocoindex.io/docs/ops/storages#kuzu-dev-instance) if you don't have one.
+    * The example uses Neo4j by default for now. If you want to use Kuzu, find the "SELECT ONE GRAPH DATABASE TO USE" section and switch the active branch.
 * [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai).

 ## Documentation
@@ -45,21 +45,18 @@ cocoindex update main.py

 ### Browse the knowledge graph

-After the knowledge graph is build, you can explore the knowledge graph you built in Neo4j Browser.
+After the knowledge graph is built, you can explore the knowledge graph.

-For the dev enviroment, you can connect neo4j browser using credentials:
-- username: `neo4j`
-- password: `cocoindex`
-which is pre-configured in the our docker compose [config.yaml](https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/neo4j.yaml).
+* If you're using Neo4j, you can open the explorer at [http://localhost:7474](http://localhost:7474), with username `neo4j` and password `cocoindex`.
+* If you're using Kuzu, you can start a Kuzu explorer locally. See [Kuzu dev instance](https://cocoindex.io/docs/ops/storages#kuzu-dev-instance) for more details.

-You can open it at [http://localhost:7474](http://localhost:7474), and run the following Cypher query to get all relationships:
+You can run the following Cypher query to get all relationships: