docs/docs/contributing/guide.md
[CocoIndex](https://github.com/cocoindex-io/cocoindex) is an open source project. We are respectful, open, and friendly. This guide explains how to get involved and contribute.
Our [Discord server](https://discord.com/invite/zpA9S2DR7s) is always open.
If you are unsure about anything, it is a good place to discuss! We'd love to collaborate and will always be friendly.
1. Extract the paper metadata, including file name, title, author information, abstract, and number of pages.
2. Build vector embeddings for the metadata, such as the title and abstract, for semantic search.
This enables better metadata-driven semantic search results. For example, you can match text queries against titles and abstracts.
3. Build an index of authors and all the file names associated with each author
to answer questions like "Give me all the papers by Jeff Dean."
4. If you want to perform full PDF embedding for the paper, you can extend the flow.
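The author index in step 3 can be sketched independently of CocoIndex as a plain inverted mapping; the file names and author lists below are made-up examples:

```python
from collections import defaultdict

# Hypothetical extracted metadata: (filename, authors) pairs.
papers = [
    ("map_reduce.pdf", ["Jeffrey Dean", "Sanjay Ghemawat"]),
    ("spanner.pdf", ["James Corbett", "Jeffrey Dean"]),
]

# Invert into an author -> file names index.
author_index: dict[str, list[str]] = defaultdict(list)
for filename, authors in papers:
    for author in authors:
        author_index[author].append(filename)

# "Give me all the papers by Jeffrey Dean."
print(author_index["Jeffrey Dean"])  # ['map_reduce.pdf', 'spanner.pdf']
```

In the real flow this lookup lives in the target store; the dict only illustrates the shape of the index.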
After this step, we should have the basic info of each paper.
We will convert the first page to Markdown using Marker. Alternatively, you can easily plug in any PDF parser, such as Docling, using CocoIndex's [custom function](https://cocoindex.io/docs/custom_ops/custom_functions) support.
Define a marker converter function and cache it, since its initialization is resource-intensive.
This ensures that the same converter instance is reused for different input files.
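The caching pattern can be sketched with `functools.cache`; `ExpensiveConverter` below is a hypothetical stand-in for the real Marker converter, whose construction we only want to pay for once:

```python
import functools

class ExpensiveConverter:
    """Stand-in for a resource-intensive converter (e.g., one that loads models)."""
    instances = 0

    def __init__(self):
        ExpensiveConverter.instances += 1  # count constructions

    def convert(self, path: str) -> str:
        return f"# markdown for {path}"

@functools.cache
def get_converter() -> ExpensiveConverter:
    # Cached: the converter is constructed once, then reused for every call.
    return ExpensiveConverter()

get_converter().convert("a.pdf")
get_converter().convert("b.pdf")
print(ExpensiveConverter.instances)  # 1
```

Both files go through the same converter instance, which is exactly the reuse the paragraph above describes.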
In this tutorial, we will build a codebase index. [CocoIndex](https://github.com/cocoindex-io/cocoindex) provides built-in support for codebase chunking, with native Tree-sitter support. It works with large codebases and can be updated in near real time with incremental processing: only what has changed is reprocessed.
## Use Cases
A wide range of applications can be built with an effective codebase index that is always up-to-date.
`apply_setup_change()` applies setup changes to the backend. The previous and current specs are passed as arguments,
and the method is expected to update the backend setup to match the current state.
A `None` spec indicates non-existence, so when `previous` is `None`, we need to create it,
and when `current` is `None`, we need to delete it.
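The `None` semantics can be sketched as follows; `DirSpec` and the directory-backed "target" are hypothetical stand-ins for a real backend, used only to make the create/delete branches concrete:

```python
from __future__ import annotations

import os
import tempfile
from dataclasses import dataclass

@dataclass
class DirSpec:
    """Hypothetical target spec: the backend is just a local directory."""
    directory: str

def apply_setup_change(previous: DirSpec | None, current: DirSpec | None) -> None:
    # `None` means non-existence: create when `previous` is None,
    # delete when `current` is None. Written idempotently, so applying
    # the same change twice leaves the backend in the same state.
    if previous is not None and current is None:
        if os.path.isdir(previous.directory):
            os.rmdir(previous.directory)
    elif current is not None:
        os.makedirs(current.directory, exist_ok=True)

root = tempfile.mkdtemp()
spec = DirSpec(os.path.join(root, "target"))
apply_setup_change(None, spec)   # previous is None: create
assert os.path.isdir(spec.directory)
apply_setup_change(spec, None)   # current is None: delete
assert not os.path.isdir(spec.directory)
```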
```python
def apply_setup_change(previous, current):
    # ... (remainder of the example elided in this excerpt)
    os.rmdir(previous.directory)
```
The `mutate()` method is called by CocoIndex to apply data changes to the target,
batching mutations to potentially multiple targets of the same type.
This allows the target connector flexibility in implementation (e.g., atomic commits, or processing items with dependencies in a specific order).
Each element in the batch corresponds to a specific target and is represented by a tuple containing the target spec and a `dict` of mutations for that target.
```python
class LocalFileTargetValues:
    html: str
```
The value type of the `dict` is `LocalFileTargetValues | None`,
where a non-`None` value means an upsert and `None` value means a delete. Similar to `apply_setup_changes()`,
idempotency is expected here.
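The upsert/delete convention can be sketched with plain dicts; `store` is a hypothetical stand-in for the target backend:

```python
from __future__ import annotations

from dataclasses import dataclass

@dataclass
class LocalFileTargetValues:
    html: str

# The backend being mutated: key -> stored HTML.
store: dict[str, str] = {"old.html": "<p>stale</p>"}

def mutate(mutations: dict[str, LocalFileTargetValues | None]) -> None:
    # A non-`None` value means an upsert; a `None` value means a delete.
    # Idempotent: deleting a missing key is a no-op, and re-applying the
    # same upsert overwrites with the same value.
    for key, value in mutations.items():
        if value is None:
            store.pop(key, None)
        else:
            store[key] = value.html

mutate({"new.html": LocalFileTargetValues(html="<p>hi</p>"), "old.html": None})
print(sorted(store))  # ['new.html']
```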
This keeps your knowledge graph continuously synchronized with your document sources.
Sometimes there may be an internal/homegrown tool or API (e.g. within a company) that's not publicly available.
These can only be connected through custom targets.
### Faster adoption of new export logic
When a new tool, database, or API joins your stack, simply define a Target Spec and Target Connector, and start exporting right away with no pipeline refactoring required.
docs/docs/examples/examples/docs_to_knowledge_graph.md
- CocoIndex can directly map the collected data to Neo4j nodes and relationships.
## Setup
* [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres). CocoIndex uses PostgreSQL internally for incremental processing.
* [Install Neo4j](https://cocoindex.io/docs/ops/targets#neo4j-dev-instance), a graph database.
* [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Alternatively, we have native support for Gemini, Ollama, and LiteLLM. You can choose your favorite LLM provider and work completely on-premises.
### Add documents as source
We will process CocoIndex documentation markdown files (`.md`, `.mdx`) from the `docs/core` directory ([markdown files](https://github.com/cocoindex-io/cocoindex/tree/main/docs/docs/core), [deployed docs](https://cocoindex.io/docs/core/basics)).
```python
@cocoindex.flow_def(name="DocsToKG")
def docs_to_kg_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # ... (flow body elided in this excerpt)
    ...
```
Next, we will use `cocoindex.functions.ExtractByLlm` to extract the relationships.
```python
doc["relationships"] = doc["content"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI,
            model="gpt-4o"
        ),
        output_type=list[Relationship],
        # ... (remaining arguments elided in this excerpt)
    )
)
```
### Build knowledge graph
#### Basic concepts
All nodes for Neo4j need two things:
1. Label: The type of the node. E.g., `Document`, `Entity`.
2. Primary key: the field(s) that uniquely identify a node of that label.
This exports Neo4j nodes with label `Document` from the `document_node` collector.
#### Export `RELATIONSHIP` and `Entity` nodes to Neo4j
238
238
239
-
We don't have explicit collector for `Entity` nodes.
239
+
We don't have an explicit collector for `Entity` nodes.
240
240
They are part of the `entity_relationship` collector, and their fields are collected during relationship extraction.
To export them as Neo4j nodes, we need to first declare `Entity` nodes.
```python
flow_builder.declare(
    # ... (Neo4j node declaration for `Entity` elided in this excerpt)
)
```
In a relationship, there's:
1. A source node and a target node.
2. A relationship connecting the source and target.
290
290
Note that different relationships may share the same source and target nodes.
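A minimal sketch, independent of the Neo4j export, of how distinct relationships share nodes; the triple field names are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relationship:
    """Hypothetical subject/predicate/object triple from extraction."""
    subject: str
    predicate: str
    object: str

rels = [
    Relationship("CocoIndex", "supports", "incremental processing"),
    Relationship("CocoIndex", "exports to", "Neo4j"),
]

# Two relationships, but only three distinct entity nodes:
# the "CocoIndex" source node is shared by both.
nodes = {r.subject for r in rels} | {r.object for r in rels}
print(len(rels), len(nodes))  # 2 3
```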
`NodeFromFields` takes the fields from the `entity_relationship` collector and creates `Entity` nodes.
#### Export the `entity_mention` to Neo4j
```sh
cocoindex update --setup main.py
```
You'll see the index update states in the terminal.
## CocoInsight
I used CocoInsight to troubleshoot the index generation and understand the data lineage of the pipeline. It is in free beta now; you can give it a try.
```sh
cocoindex server -ci main
```
You can then browse the knowledge graph, e.g., with this query in the Neo4j browser:

```
MATCH p=()-->() RETURN p
```
## Kuzu
CocoIndex natively supports Kuzu, a high-performance, embedded, open source graph database.
The GraphDB interface in CocoIndex is standardized: you just need to **switch the configuration**, with no additional code changes. CocoIndex supports exporting to Kuzu through its API server, which you can bring up locally.
CocoIndex supports native integration with ColPali: with just a few lines of code, you can embed and index images with ColPali's late-interaction architecture. We also build a lightweight image search application with FastAPI.
## ColPali
**ColPali (Contextual Late-interaction over Patches)** is a powerful model for multimodal retrieval.
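Late interaction keeps one embedding per image patch and one per query token, then scores a query-image pair by summing, over query tokens, the best-matching patch (MaxSim). A toy sketch with hand-made 2-D vectors, not real ColPali embeddings:

```python
def maxsim_score(query_tokens, patches):
    """Sum over query tokens of the max dot product with any patch embedding."""
    return sum(
        max(sum(q * p for q, p in zip(tok, patch)) for patch in patches)
        for tok in query_tokens
    )

# Toy vectors: two query tokens, one image with three patches.
query = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

# Token 1 best matches patch 1 (0.9), token 2 best matches patch 2 (0.8).
print(round(maxsim_score(query, image_patches), 6))  # 1.7
```

Per-token matching is what lets the model align each query word with the most relevant image region, instead of collapsing everything into a single vector.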