Commit bb7c0ad

cleanup: run pre-commit hooks to cleanup files (#1025)

1 parent: ab918be

23 files changed, +122 −129 lines

docs/docs/contributing/guide.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ description: How to contribute to CocoIndex
 
 [CocoIndex](https://github.com/cocoindex-io/cocoindex) is an open source project. We are respectful, open and friendly. This guide explains how to get involved and contribute to [CocoIndex](https://github.com/cocoindex-io/cocoindex).
 
-Our [Discord server](https://discord.com/invite/zpA9S2DR7s) is constantly open. 
+Our [Discord server](https://discord.com/invite/zpA9S2DR7s) is constantly open.
 If you are unsure about anything, it is a good place to discuss! We'd love to collaborate and will always be friendly.
 
 ## Good First Issues

docs/docs/contributing/setup_dev_environment.md

Lines changed: 1 addition & 1 deletion
@@ -44,4 +44,4 @@ Follow the steps below to get CocoIndex built on the latest codebase locally - i
 - Before running a specific example, set extra environment variables, for exposing extra traces, allowing dev UI, etc.
   ```sh
   . ./.env.lib_debug
-  ```
+  ```

docs/docs/examples/examples/academic_papers_index.md

Lines changed: 11 additions & 11 deletions
@@ -21,10 +21,10 @@ import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/c
 
 1. Extract the paper metadata, including file name, title, author information, abstract, and number of pages.
 
-2. Build vector embeddings for the metadata, such as the title and abstract, for semantic search. 
+2. Build vector embeddings for the metadata, such as the title and abstract, for semantic search.
 This enables better metadata-driven semantic search results. For example, you can match text queries against titles and abstracts.
 
-3. Build an index of authors and all the file names associated with each author 
+3. Build an index of authors and all the file names associated with each author
 to answer questions like "Give me all the papers by Jeff Dean."
 
 4. If you want to perform full PDF embedding for the paper, you can extend the flow.
@@ -108,7 +108,7 @@ After this step, we should have the basic info of each paper.
 
 We will convert the first page to Markdown using Marker. Alternatively, you can easily plug in any PDF parser, such as Docling using CocoIndex's [custom function](https://cocoindex.io/docs/custom_ops/custom_functions).
 
-Define a marker converter function and cache it, since its initialization is resource-intensive. 
+Define a marker converter function and cache it, since its initialization is resource-intensive.
 This ensures that the same converter instance is reused for different input files.
 
 ```python
@@ -137,7 +137,7 @@ def pdf_to_markdown(content: bytes) -> str:
 Pass it to your transform
 
 ```python
-with data_scope["documents"].row() as doc: 
+with data_scope["documents"].row() as doc:
     # ... process
     doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
         pdf_to_markdown
@@ -200,7 +200,7 @@ paper_metadata.collect(
 Just collect anything you need :)
 
 ### Collect `author` to `filename` information
-We’ve already extracted author list. Here we want to collect Author → Papers in a separate table to build a look up functionality. 
+We’ve already extracted author list. Here we want to collect Author → Papers in a separate table to build a look up functionality.
 Simply collect by author.
 
 ```python
@@ -229,8 +229,8 @@ doc["title_embedding"] = doc["metadata"]["title"].transform(
 
 ### Abstract
 
-Split abstract into chunks, embed each chunk and collect their embeddings. 
-Sometimes the abstract could be very long. 
+Split abstract into chunks, embed each chunk and collect their embeddings.
+Sometimes the abstract could be very long.
 
 ```python
 doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
@@ -308,7 +308,7 @@ author_papers.export(
     "author_papers",
     cocoindex.targets.Postgres(),
     primary_key_fields=["author_name", "filename"],
-) 
+)
 metadata_embeddings.export(
     "metadata_embeddings",
     cocoindex.targets.Postgres(),
@@ -328,9 +328,9 @@ In this example we use PGVector as embedding store. With CocoIndex, you can do o
 
 ## Query the index
 
-You can refer to this section of [Text Embeddings](https://cocoindex.io/blogs/text-embeddings-101#3-query-the-index) about 
-how to build query against embeddings. 
-For now CocoIndex doesn't provide additional query interface. We can write SQL or rely on the query engine by the target storage. 
+You can refer to this section of [Text Embeddings](https://cocoindex.io/blogs/text-embeddings-101#3-query-the-index) about
+how to build query against embeddings.
+For now CocoIndex doesn't provide additional query interface. We can write SQL or rely on the query engine by the target storage.
 
 - Many databases already have optimized query implementations with their own best practices
 - The query space has excellent solutions for querying, reranking, and other search-related functionality.
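The diff above mentions defining a marker converter function and caching it so one instance is reused across input files. That pattern can be sketched in plain Python; `functools.cache` and the `ExpensiveConverter` class below are illustrative stand-ins, not the actual Marker API from the commit.

```python
import functools

class ExpensiveConverter:
    """Hypothetical stand-in for a resource-intensive converter like Marker."""
    init_count = 0

    def __init__(self) -> None:
        # In the real example, heavy model loading would happen here.
        ExpensiveConverter.init_count += 1

    def convert(self, content: bytes) -> str:
        return content.decode("utf-8", errors="replace")

@functools.cache
def get_converter() -> ExpensiveConverter:
    # Cached: every call after the first returns the same instance.
    return ExpensiveConverter()

def pdf_to_markdown(content: bytes) -> str:
    return get_converter().convert(content)

pdf_to_markdown(b"# Paper one")
pdf_to_markdown(b"# Paper two")
print(ExpensiveConverter.init_count)  # 1: initialized once despite two calls
```

The caching matters because the converter's initialization is the expensive step; the per-file `convert` call stays cheap.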

docs/docs/examples/examples/codebase_index.md

Lines changed: 8 additions & 8 deletions
@@ -19,7 +19,7 @@ import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/c
 ![Codebase Index](/img/examples/codebase_index/cover.png)
 
 ## Overview
-In this tutorial, we will build codebase index. [CocoIndex](https://github.com/cocoindex-io/cocoindex) provides built-in support for codebase chunking, with native Tree-sitter support. It works with large codebases, and can be updated in near real-time with incremental processing - only reprocess what's changed. 
+In this tutorial, we will build codebase index. [CocoIndex](https://github.com/cocoindex-io/cocoindex) provides built-in support for codebase chunking, with native Tree-sitter support. It works with large codebases, and can be updated in near real-time with incremental processing - only reprocess what's changed.
 
 ## Use Cases
 A wide range of applications can be built with an effective codebase index that is always up-to-date.
@@ -44,14 +44,14 @@ The flow is composed of the following steps:
 - Generate embeddings for each chunk
 - Store in a vector database for retrieval
 
-## Setup 
+## Setup
 - Install Postgres, follow [installation guide](https://cocoindex.io/docs/getting_started/installation#-install-postgres).
 - Install CocoIndex
 ```bash
 pip install -U cocoindex
 ```
 
-## Add the codebase as a source. 
+## Add the codebase as a source.
 We will index the CocoIndex codebase. Here we use the `LocalFile` source to ingest files from the CocoIndex codebase root directory.
 
 ```python
@@ -67,7 +67,7 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
 - Include files with the extensions of `.py`, `.rs`, `.toml`, `.md`, `.mdx`
 - Exclude files and directories starting `.`, `target` in the root and `node_modules` under any directory.
 
-`flow_builder.add_source` will create a table with sub fields (`filename`, `content`). 
+`flow_builder.add_source` will create a table with sub fields (`filename`, `content`).
 <DocumentationButton url="https://cocoindex.io/docs/ops/sources" text="Sources" />
 
 
@@ -96,14 +96,14 @@ with data_scope["files"].row() as file:
     file["extension"] = file["filename"].transform(extract_extension)
     file["chunks"] = file["content"].transform(
         cocoindex.functions.SplitRecursively(),
-        language=file["extension"], chunk_size=1000, chunk_overlap=300) 
+        language=file["extension"], chunk_size=1000, chunk_overlap=300)
 ```
 <DocumentationButton url="https://cocoindex.io/docs/ops/functions#splitrecursively" text="SplitRecursively" margin="0 0 16px 0" />
 
 ![SplitRecursively](/img/examples/codebase_index/chunk.png)
 
 ### Embed the chunks
-We use `SentenceTransformerEmbed` to embed the chunks. 
+We use `SentenceTransformerEmbed` to embed the chunks.
 
 ```python
 @cocoindex.transform_flow()
@@ -141,7 +141,7 @@ code_embeddings.export(
     vector_indexes=[cocoindex.VectorIndex("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
 ```
 
-We use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to measure the similarity between the query and the indexed data. 
+We use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to measure the similarity between the query and the indexed data.
 
 ## Query the index
 We match against user-provided text by a SQL query, reusing the embedding operation in the indexing flow.
@@ -230,4 +230,4 @@ Follow the url from the terminal - `https://cocoindex.io/cocoinsight` to access
 
 SplitRecursively has native support for all major programming languages.
 
-<DocumentationButton url="https://cocoindex.io/docs/ops/functions#supported-languages" text="Supported Languages" margin="0 0 16px 0" />
+<DocumentationButton url="https://cocoindex.io/docs/ops/functions#supported-languages" text="Supported Languages" margin="0 0 16px 0" />
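The diff above notes that cosine similarity measures how close a query embedding is to the indexed vectors. The metric itself reduces to a short formula; this standalone illustration is not part of the commit.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Because the metric depends only on direction, not magnitude, embeddings of different lengths still compare sensibly.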

docs/docs/examples/examples/custom_targets.md

Lines changed: 9 additions & 11 deletions
@@ -35,7 +35,7 @@ flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
         refresh_interval=timedelta(seconds=5),
     )
 ```
-This ingestion creates a table with `filename` and `content` fields. 
+This ingestion creates a table with `filename` and `content` fields.
 <DocumentationButton url="https://cocoindex.io/docs/ops/sources" text="Sources" />
 
 ## Process each file and collect
@@ -92,7 +92,7 @@ class LocalFileTargetConnector:
 
 ```
 
-The `describe()` method returns a human-readable string that describes the target, which is displayed in the CLI logs. 
+The `describe()` method returns a human-readable string that describes the target, which is displayed in the CLI logs.
 For example, it prints:
 
 `Target: Local directory ./data/output`
@@ -104,10 +104,10 @@ def describe(key: str) -> str:
     return f"Local directory {key}"
 ```
 
-`apply_setup_change()` applies setup changes to the backend. The previous and current specs are passed as arguments, 
+`apply_setup_change()` applies setup changes to the backend. The previous and current specs are passed as arguments,
 and the method is expected to update the backend setup to match the current state.
 
-A `None` spec indicates non-existence, so when `previous` is `None`, we need to create it, 
+A `None` spec indicates non-existence, so when `previous` is `None`, we need to create it,
 and when `current` is `None`, we need to delete it.
 
 
@@ -135,8 +135,8 @@ def apply_setup_change(
         os.rmdir(previous.directory)
 ```
 
-The `mutate()` method is called by CocoIndex to apply data changes to the target, 
-batching mutations to potentially multiple targets of the same type. 
+The `mutate()` method is called by CocoIndex to apply data changes to the target,
+batching mutations to potentially multiple targets of the same type.
 This allows the target connector flexibility in implementation (e.g., atomic commits, or processing items with dependencies in a specific order).
 
 Each element in the batch corresponds to a specific target and is represented by a tuple containing:
@@ -151,8 +151,8 @@ class LocalFileTargetValues:
     html: str
 ```
 
-The value type of the `dict` is `LocalFileTargetValues | None`, 
-where a non-`None` value means an upsert and `None` value means a delete. Similar to `apply_setup_changes()`, 
+The value type of the `dict` is `LocalFileTargetValues | None`,
+where a non-`None` value means an upsert and `None` value means a delete. Similar to `apply_setup_changes()`,
 idempotency is expected here.
 
 ```python
@@ -217,7 +217,5 @@ This keeps your knowledge graph continuously synchronized with your document sou
 Sometimes there may be an internal/homegrown tool or API (e.g. within a company) that's not publicly available.
 These can only be connected through custom targets.
 
-### Faster adoption of new export logic 
+### Faster adoption of new export logic
 When a new tool, database, or API joins your stack, simply define a Target Spec and Target Connector — start exporting right away, with no pipeline refactoring required.
-
-
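The upsert/delete contract described in the hunks above (a non-`None` value means upsert, `None` means delete, and re-applying the same batch must be idempotent) can be sketched against an in-memory dict. This illustrates the contract only; it is not the file-writing connector from the commit, and `apply_mutations` is a hypothetical helper name.

```python
from dataclasses import dataclass

@dataclass
class LocalFileTargetValues:
    html: str

def apply_mutations(store: dict[str, str],
                    batch: dict[str, "LocalFileTargetValues | None"]) -> None:
    # Non-None value -> upsert; None -> delete.
    for key, value in batch.items():
        if value is None:
            store.pop(key, None)  # deleting a missing key is a no-op, so re-runs are safe
        else:
            store[key] = value.html

store: dict[str, str] = {}
batch = {"a.html": LocalFileTargetValues("<p>A</p>"), "stale.html": None}
apply_mutations(store, batch)
apply_mutations(store, batch)  # idempotent: second application changes nothing
print(store)  # {'a.html': '<p>A</p>'}
```

Idempotency is what lets the framework safely retry a batch after a partial failure.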

docs/docs/examples/examples/docs_to_knowledge_graph.md

Lines changed: 10 additions & 11 deletions
@@ -36,7 +36,7 @@ and then build a knowledge graph.
 - CocoIndex can direct map the collected data to Neo4j nodes and relationships.
 
 ## Setup
-* [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres). CocoIndex uses PostgreSQL internally for incremental processing. 
+* [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres). CocoIndex uses PostgreSQL internally for incremental processing.
 * [Install Neo4j](https://cocoindex.io/docs/ops/targets#neo4j-dev-instance), a graph database.
 * [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Alternatively, we have native support for Gemini, Ollama, LiteLLM. You can choose your favorite LLM provider and work completely on-premises.
 
@@ -51,7 +51,7 @@ and then build a knowledge graph.
 
 ### Add documents as source
 
-We will process CocoIndex documentation markdown files (`.md`, `.mdx`) from the `docs/core` directory ([markdown files](https://github.com/cocoindex-io/cocoindex/tree/main/docs/docs/core), [deployed docs](https://cocoindex.io/docs/core/basics)). 
+We will process CocoIndex documentation markdown files (`.md`, `.mdx`) from the `docs/core` directory ([markdown files](https://github.com/cocoindex-io/cocoindex/tree/main/docs/docs/core), [deployed docs](https://cocoindex.io/docs/core/basics)).
 
 ```python
 @cocoindex.flow_def(name="DocsToKG")
@@ -141,7 +141,7 @@ Next, we will use `cocoindex.functions.ExtractByLlm` to extract the relationship
 doc["relationships"] = doc["content"].transform(
     cocoindex.functions.ExtractByLlm(
         llm_spec=cocoindex.LlmSpec(
-            api_type=cocoindex.LlmApiType.OPENAI, 
+            api_type=cocoindex.LlmApiType.OPENAI,
             model="gpt-4o"
         ),
         output_type=list[Relationship],
@@ -187,7 +187,7 @@ with doc["relationships"].row() as relationship:
 
 
 ### Build knowledge graph
-
+
 #### Basic concepts
 All nodes for Neo4j need two things:
 1. Label: The type of the node. E.g., `Document`, `Entity`.
@@ -236,10 +236,10 @@ This exports Neo4j nodes with label `Document` from the `document_node` collector
 
 #### Export `RELATIONSHIP` and `Entity` nodes to Neo4j
 
-We don't have explicit collector for `Entity` nodes. 
+We don't have explicit collector for `Entity` nodes.
 They are part of the `entity_relationship` collector and fields are collected during the relationship extraction.
 
-To export them as Neo4j nodes, we need to first declare `Entity` nodes. 
+To export them as Neo4j nodes, we need to first declare `Entity` nodes.
 
 ```python
 flow_builder.declare(
@@ -289,7 +289,7 @@ In a relationship, there's:
 2. A relationship connecting the source and target.
 Note that different relationships may share the same source and target nodes.
 
-`NodeFromFields` takes the fields from the `entity_relationship` collector and creates `Entity` nodes. 
+`NodeFromFields` takes the fields from the `entity_relationship` collector and creates `Entity` nodes.
 
 #### Export the `entity_mention` to Neo4j.
 
@@ -334,7 +334,7 @@ It creates relationships by:
 ```sh
 cocoindex update --setup main.py
 ```
-
+
 You'll see the index updates state in the terminal. For example,
 
 ```
@@ -343,7 +343,7 @@ It creates relationships by:
 
 ## CocoInsight
 
-I used CocoInsight to troubleshoot the index generation and understand the data lineage of the pipeline. It is in free beta now, you can give it a try. 
+I used CocoInsight to troubleshoot the index generation and understand the data lineage of the pipeline. It is in free beta now, you can give it a try.
 
 ```sh
 cocoindex server -ci main
@@ -369,7 +369,7 @@ MATCH p=()-->() RETURN p
 ## Kuzu
 Cocoindex natively supports Kuzu - a high performant, embedded open source graph database.
 
-<DocumentationButton url="https://cocoindex.io/docs/ops/targets#kuzu" text="Kuzu" margin="0 0 16px 0" />
+<DocumentationButton url="https://cocoindex.io/docs/ops/targets#kuzu" text="Kuzu" margin="0 0 16px 0" />
 
 The GraphDB interface in CocoIndex is standardized, you just need to **switch the configuration** without any additional code changes. CocoIndex supports exporting to Kuzu through its API server. You can bring up a Kuzu API server locally by running:
 
@@ -391,4 +391,3 @@ kuzu_conn_spec = cocoindex.add_auth_entry(
 ```
 
 <GitHubButton url="https://github.com/cocoindex-io/cocoindex/blob/30761f8ab674903d742c8ab2e18d4c588df6d46f/examples/docs_to_knowledge_graph/main.py#L33-L37" margin="0 0 16px 0" />
-

docs/docs/examples/examples/document_ai.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ CocoIndex is a flexible ETL framework with incremental processing. We don’t b
 
 ## Set up
 - [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
-- Configure Project and Processor ID for Document AI API 
+- Configure Project and Processor ID for Document AI API
   - [Official Google document AI API](https://cloud.google.com/document-ai/docs/try-docai) with free live demo.
   - Sign in to [Google Cloud Console](https://console.cloud.google.com/), create or open a project, and enable Document AI API.
   - ![image.png](/img/examples/document_ai/document_ai.png)

docs/docs/examples/examples/image_search.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/c
 CocoIndex supports native integration with ColPali - with just a few lines of code, you embed and index images with ColPali’s late-interaction architecture. We also build a light weight image search application with FastAPI.
 
 
-## ColPali 
+## ColPali
 
 **ColPali (Contextual Late-interaction over Patches)** is a powerful model for multimodal retrieval.
 

docs/docs/examples/examples/manual_extraction.md

Lines changed: 1 addition & 2 deletions
@@ -188,7 +188,7 @@ def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
         num_classes=len(module_info.classes),
         num_methods=len(module_info.methods),
     )
-``` 
+```
 
 ### Plug in the function into the flow
 ```python
@@ -249,4 +249,3 @@ SELECT filename, module_info->'title' AS title, module_summary FROM modules_info
 cocoindex server -ci main
 ```
 CocoInsight dashboard is here `https://cocoindex.io/cocoinsight`. It connects to your local CocoIndex server with zero data retention.
-
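The counting logic visible in the first hunk above can be completed into a runnable sketch. The dataclass shapes are assumptions inferred from the field names in the diff (`classes`, `methods`, `num_classes`, `num_methods`), not the example's actual type definitions.

```python
from dataclasses import dataclass

@dataclass
class ModuleInfo:
    classes: list[str]
    methods: list[str]

@dataclass
class ModuleSummary:
    num_classes: int
    num_methods: int

def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    # Mirrors the len()-based counting shown in the hunk.
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )

summary = summarize_module(ModuleInfo(classes=["Parser"], methods=["parse", "dump"]))
print(summary.num_classes, summary.num_methods)  # 1 2
```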
