Commit fa0b70d

chore: undo changes unrelated to this PR
1 parent 79c579e commit fa0b70d

14 files changed (+98, -94 lines changed)

docs/docs/contributing/guide.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ description: How to contribute to CocoIndex
 
 [CocoIndex](https://github.com/cocoindex-io/cocoindex) is an open source project. We are respectful, open and friendly. This guide explains how to get involved and contribute to [CocoIndex](https://github.com/cocoindex-io/cocoindex).
 
-Our [Discord server](https://discord.com/invite/zpA9S2DR7s) is constantly open.
+Our [Discord server](https://discord.com/invite/zpA9S2DR7s) is constantly open.
 If you are unsure about anything, it is a good place to discuss! We'd love to collaborate and will always be friendly.
 
 ## Good First Issues

docs/docs/contributing/setup_dev_environment.md

Lines changed: 1 addition & 1 deletion
@@ -44,4 +44,4 @@ Follow the steps below to get CocoIndex built on the latest codebase locally - i
 - Before running a specific example, set extra environment variables, for exposing extra traces, allowing dev UI, etc.
 ```sh
 . ./.env.lib_debug
-```
+```

docs/docs/examples/examples/academic_papers_index.md

Lines changed: 16 additions & 16 deletions
@@ -19,10 +19,10 @@ import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButto
 
 1. Extract the paper metadata, including file name, title, author information, abstract, and number of pages.
 
-2. Build vector embeddings for the metadata, such as the title and abstract, for semantic search.
+2. Build vector embeddings for the metadata, such as the title and abstract, for semantic search.
 This enables better metadata-driven semantic search results. For example, you can match text queries against titles and abstracts.
 
-3. Build an index of authors and all the file names associated with each author
+3. Build an index of authors and all the file names associated with each author
 to answer questions like "Give me all the papers by Jeff Dean."
 
 4. If you want to perform full PDF embedding for the paper, you can extend the flow.
@@ -31,13 +31,13 @@ to answer questions like "Give me all the papers by Jeff Dean."
 
 - [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres).
 CocoIndex uses PostgreSQL internally for incremental processing.
-- [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai).
+- [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai).
 Alternatively, we have native support for Gemini, Ollama, LiteLLM. Check out the [guide](https://cocoindex.io/docs/ai/llm#ollama).
 You can choose your favorite LLM provider and work completely on-premises.
 
 ## Define Indexing Flow
 
-To better help you navigate what we will walk through, here is a flow diagram:
+To better help you navigate what we will walk through, here is a flow diagram:
 
 1. Import a list of papers in PDF.
 2. For each file:
@@ -65,7 +65,7 @@ def paper_metadata_flow(
 )
 ```
 
-`flow_builder.add_source` will create a table with sub fields (`filename``content`),
+`flow_builder.add_source` will create a table with sub fields (`filename``content`),
 we can refer to the [documentation](https://cocoindex.io/docs/ops/sources) for more details.
 
 ### Extract and collect metadata
@@ -108,10 +108,10 @@ After this step, you should have the basic info of each paper.
 
 ### Parse basic info
 
-We will convert the first page to Markdown using Marker.
+We will convert the first page to Markdown using Marker.
 Alternatively, you can easily plug in your favorite PDF parser, such as Docling.
 
-Define a marker converter function and cache it, since its initialization is resource-intensive.
+Define a marker converter function and cache it, since its initialization is resource-intensive.
 This ensures that the same converter instance is reused for different input files.
 
 ```python
@@ -140,7 +140,7 @@ def pdf_to_markdown(content: bytes) -> str:
 Pass it to your transform
 
 ```python
-with data_scope["documents"].row() as doc:
+with data_scope["documents"].row() as doc:
 doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
 pdf_to_markdown
 )
@@ -201,7 +201,7 @@ After this step, you should have the metadata of each paper.
 Just collect anything you need :)
 
 #### Collect `author` to `filename` information
-We’ve already extracted author list. Here we want to collect Author → Papers in a separate table to build a look up functionality.
+We’ve already extracted author list. Here we want to collect Author → Papers in a separate table to build a look up functionality.
 Simply collect by author.
 
 ```python
@@ -230,8 +230,8 @@ doc["title_embedding"] = doc["metadata"]["title"].transform(
 
 #### Abstract
 
-Split abstract into chunks, embed each chunk and collect their embeddings.
-Sometimes the abstract could be very long.
+Split abstract into chunks, embed each chunk and collect their embeddings.
+Sometimes the abstract could be very long.
 
 ```python
 doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
@@ -305,7 +305,7 @@ author_papers.export(
 "author_papers",
 cocoindex.targets.Postgres(),
 primary_key_fields=["author_name", "filename"],
-)
+)
 metadata_embeddings.export(
 "metadata_embeddings",
 cocoindex.targets.Postgres(),
@@ -325,14 +325,14 @@ We aim to standardize interfaces and make it like assembling building blocks.
 
 ## View in CocoInsight step by step
 
-You can walk through the project step by step in [CocoInsight](https://www.youtube.com/watch?v=MMrpUfUcZPk) to see
+You can walk through the project step by step in [CocoInsight](https://www.youtube.com/watch?v=MMrpUfUcZPk) to see
 exactly how each field is constructed and what happens behind the scenes.
 
 ## Query the index
 
-You can refer to this section of [Text Embeddings](https://cocoindex.io/blogs/text-embeddings-101#3-query-the-index) about
-how to build query against embeddings.
-For now CocoIndex doesn't provide additional query interface. We can write SQL or rely on the query engine by the target storage.
+You can refer to this section of [Text Embeddings](https://cocoindex.io/blogs/text-embeddings-101#3-query-the-index) about
+how to build query against embeddings.
+For now CocoIndex doesn't provide additional query interface. We can write SQL or rely on the query engine by the target storage.
 
 - Many databases already have optimized query implementations with their own best practices
 - The query space has excellent solutions for querying, reranking, and other search-related functionality.

docs/docs/examples/examples/codebase_index.md

Lines changed: 10 additions & 10 deletions
@@ -15,11 +15,11 @@ import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButto
 <GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/code_embedding"/>
 <YouTubeButton url="https://youtu.be/G3WstvhHO24?si=ndYfM0XRs03_hVPR" />
 
-## Setup
+## Setup
 
 If you don't have Postgres installed, please follow [installation guide](https://cocoindex.io/docs/getting_started/installation).
 
-## Add the codebase as a source.
+## Add the codebase as a source.
 
 Ingest files from the CocoIndex codebase root directory.
 
@@ -39,7 +39,7 @@ def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
 - Include files with the extensions of `.py`, `.rs`, `.toml`, `.md`, `.mdx`
 - Exclude files and directories starting `.`, `target` in the root and `node_modules` under any directory.
 
-`flow_builder.add_source` will create a table with sub fields (`filename`, `content`).
+`flow_builder.add_source` will create a table with sub fields (`filename`, `content`).
 See [documentation](https://cocoindex.io/docs/ops/sources) for more details.
 
 
@@ -70,23 +70,23 @@ Here we extract the extension of the filename and store it in the `extension` fi
 
 ### Split the file into chunks
 
-We will chunk the code with Tree-sitter.
-We use the `SplitRecursively` function to split the file into chunks.
+We will chunk the code with Tree-sitter.
+We use the `SplitRecursively` function to split the file into chunks.
 It is integrated with Tree-sitter, so you can pass in the language to the `language` parameter.
 To see all supported language names and extensions, see the documentation [here](https://cocoindex.io/docs/ops/functions#splitrecursively). All the major languages are supported, e.g., Python, Rust, JavaScript, TypeScript, Java, C++, etc. If it's unspecified or the specified language is not supported, it will be treated as plain text.
 
 ```python
 with data_scope["files"].row() as file:
 file["chunks"] = file["content"].transform(
 cocoindex.functions.SplitRecursively(),
-language=file["extension"], chunk_size=1000, chunk_overlap=300)
+language=file["extension"], chunk_size=1000, chunk_overlap=300)
 ```
 
 
 ### Embed the chunks
 
-We use `SentenceTransformerEmbed` to embed the chunks.
-You can refer to the documentation [here](https://cocoindex.io/docs/ops/functions#sentencetransformerembed).
+We use `SentenceTransformerEmbed` to embed the chunks.
+You can refer to the documentation [here](https://cocoindex.io/docs/ops/functions#sentencetransformerembed).
 
 ```python
 @cocoindex.transform_flow()
@@ -101,7 +101,7 @@ def code_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[lis
 
 Then for each chunk, we will embed it using the `code_to_embedding` function. and collect the embeddings to the `code_embeddings` collector.
 
-`@cocoindex.transform_flow()` is needed to share the transformation across indexing and query. We build a vector index and query against it,
+`@cocoindex.transform_flow()` is needed to share the transformation across indexing and query. We build a vector index and query against it,
 the embedding computation needs to be consistent between indexing and querying. See [documentation](https://cocoindex.io/docs/query#transform-flow) for more details.
 
 
@@ -126,7 +126,7 @@ code_embeddings.export(
 vector_indexes=[cocoindex.VectorIndex("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
 ```
 
-We use Consine Similarity to measure the similarity between the query and the indexed data.
+We use Consine Similarity to measure the similarity between the query and the indexed data.
 To learn more about Consine Similarity, see [Wiki](https://en.wikipedia.org/wiki/Cosine_similarity).
 
 ## Query the index

docs/docs/examples/examples/custom_targets.md

Lines changed: 12 additions & 10 deletions
@@ -19,7 +19,7 @@ Let’s walk through a simple example—exporting `.md` files as `.html` using a
 Check out the full [source code](https://github.com/cocoindex-io/cocoindex/tree/main/examples/custom_output_files).
 
 The overall flow is simple:
-This example focuses on
+This example focuses on
 - how to configure your custom target
 - the flow effortless picks up the changes in the source, recomputes only what's changed and export to the target
 
@@ -41,7 +41,7 @@ flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
 refresh_interval=timedelta(seconds=5),
 )
 ```
-This ingestion creates a table with `filename` and `content` fields.
+This ingestion creates a table with `filename` and `content` fields.
 
 
 ## Process each file and collect
@@ -91,7 +91,7 @@ class LocalFileTargetConnector:
 
 ```
 
-The `describe()` method returns a human-readable string that describes the target, which is displayed in the CLI logs.
+The `describe()` method returns a human-readable string that describes the target, which is displayed in the CLI logs.
 For example, it prints:
 
 `Target: Local directory ./data/output`
@@ -103,10 +103,10 @@ def describe(key: str) -> str:
 return f"Local directory {key}"
 ```
 
-`apply_setup_change()` applies setup changes to the backend. The previous and current specs are passed as arguments,
+`apply_setup_change()` applies setup changes to the backend. The previous and current specs are passed as arguments,
 and the method is expected to update the backend setup to match the current state.
 
-A `None` spec indicates non-existence, so when `previous` is `None`, we need to create it,
+A `None` spec indicates non-existence, so when `previous` is `None`, we need to create it,
 and when `current` is `None`, we need to delete it.
 
 
@@ -134,8 +134,8 @@ def apply_setup_change(
 os.rmdir(previous.directory)
 ```
 
-The `mutate()` method is called by CocoIndex to apply data changes to the target,
-batching mutations to potentially multiple targets of the same type.
+The `mutate()` method is called by CocoIndex to apply data changes to the target,
+batching mutations to potentially multiple targets of the same type.
 This allows the target connector flexibility in implementation (e.g., atomic commits, or processing items with dependencies in a specific order).
 
 Each element in the batch corresponds to a specific target and is represented by a tuple containing:
@@ -150,8 +150,8 @@ class LocalFileTargetValues:
 html: str
 ```
 
-The value type of the `dict` is `LocalFileTargetValues | None`,
-where a non-`None` value means an upsert and `None` value means a delete. Similar to `apply_setup_changes()`,
+The value type of the `dict` is `LocalFileTargetValues | None`,
+where a non-`None` value means an upsert and `None` value means a delete. Similar to `apply_setup_changes()`,
 idempotency is expected here.
 
 ```python
@@ -218,5 +218,7 @@ This keeps your knowledge graph continuously synchronized with your document sou
 Sometimes there may be an internal/homegrown tool or API (e.g. within a company) that's not publicly available.
 These can only be connected through custom targets.
 
-### Faster adoption of new export logic
+### Faster adoption of new export logic
 When a new tool, database, or API joins your stack, simply define a Target Spec and Target Connector — start exporting right away, with no pipeline refactoring required.
+
+

docs/docs/examples/examples/docs_to_knowledge_graph.md

Lines changed: 11 additions & 10 deletions
@@ -23,7 +23,7 @@ We will generate two kinds of relationships:
 2. Mentions of entities in a document. E.g., "core/basics.mdx" mentions `CocoIndex` and `Incremental Processing`.
 
 ## Setup
-* [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres). CocoIndex uses PostgreSQL internally for incremental processing.
+* [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres). CocoIndex uses PostgreSQL internally for incremental processing.
 * [Install Neo4j](https://cocoindex.io/docs/ops/storages#Neo4j), a graph database.
 * [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Alternatively, you can switch to Ollama, which runs LLM models locally - [guide](https://cocoindex.io/docs/ai/llm#ollama).
 
@@ -36,7 +36,7 @@ You can read the official CocoIndex Documentation for Property Graph Targets [he
 
 ### Add documents as source
 
-We will process CocoIndex documentation markdown files (`.md`, `.mdx`) from the `docs/core` directory ([markdown files](https://github.com/cocoindex-io/cocoindex/tree/main/docs/docs/core), [deployed docs](https://cocoindex.io/docs/core/basics)).
+We will process CocoIndex documentation markdown files (`.md`, `.mdx`) from the `docs/core` directory ([markdown files](https://github.com/cocoindex-io/cocoindex/tree/main/docs/docs/core), [deployed docs](https://cocoindex.io/docs/core/basics)).
 
 ```python
 @cocoindex.flow_def(name="DocsToKG")
@@ -124,7 +124,7 @@ Next, we will use `cocoindex.functions.ExtractByLlm` to extract the relationship
 doc["relationships"] = doc["content"].transform(
 cocoindex.functions.ExtractByLlm(
 llm_spec=cocoindex.LlmSpec(
-api_type=cocoindex.LlmApiType.OPENAI,
+api_type=cocoindex.LlmApiType.OPENAI,
 model="gpt-4o"
 ),
 output_type=list[Relationship],
@@ -170,7 +170,7 @@ with doc["relationships"].row() as relationship:
 
 
 ### Build knowledge graph
-
+
 #### Basic concepts
 All nodes for Neo4j need two things:
 1. Label: The type of the node. E.g., `Document`, `Entity`.
@@ -216,10 +216,10 @@ This exports Neo4j nodes with label `Document` from the `document_node` collecto
 
 #### Export `RELATIONSHIP` and `Entity` nodes to Neo4j
 
-We don't have explicit collector for `Entity` nodes.
+We don't have explicit collector for `Entity` nodes.
 They are part of the `entity_relationship` collector and fields are collected during the relationship extraction.
 
-To export them as Neo4j nodes, we need to first declare `Entity` nodes.
+To export them as Neo4j nodes, we need to first declare `Entity` nodes.
 
 ```python
 flow_builder.declare(
@@ -268,7 +268,7 @@ In a relationship, there's:
 2. A relationship connecting the source and target.
 Note that different relationships may share the same source and target nodes.
 
-`NodeFromFields` takes the fields from the `entity_relationship` collector and creates `Entity` nodes.
+`NodeFromFields` takes the fields from the `entity_relationship` collector and creates `Entity` nodes.
 
 #### Export the `entity_mention` to Neo4j.
 
@@ -314,14 +314,14 @@ It creates relationships by:
 ```sh
 cocoindex update --setup main.py
 ```
-
+
 You'll see the index updates state in the terminal. For example, you'll see the following output:
 
 ```
 documents: 7 added, 0 removed, 0 updated
 ```
 
-3. (Optional) I used CocoInsight to troubleshoot the index generation and understand the data lineage of the pipeline.
+3. (Optional) I used CocoInsight to troubleshoot the index generation and understand the data lineage of the pipeline.
 It is in free beta now, you can give it a try. Run following command to start CocoInsight:
 
 ```sh
@@ -348,7 +348,8 @@ MATCH p=()-->() RETURN p
 
 
 ## Support us
-We are constantly improving, and more features and examples are coming soon.
+We are constantly improving, and more features and examples are coming soon.
 If you love this article, please give us a star ⭐ at [GitHub repo](https://github.com/cocoindex-io/cocoindex) to help us grow.
 
 Thanks for reading!
+

docs/docs/examples/examples/image_search.md

Lines changed: 1 addition & 1 deletion
@@ -211,4 +211,4 @@ Once connected, CocoIndex continuously watches for changes — new uploads, upda
 ## Support us
 
 We’re constantly adding more examples and improving our runtime.
-If you found this helpful, please ⭐ star [CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex) and share it with others.
+If you found this helpful, please ⭐ star [CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex) and share it with others.