Commit 7ee84de

docs: academic papers example (#917)
1 parent 832ab93 commit 7ee84de

10 files changed

+65
-57
lines changed

docs/docs/examples/examples/academic_papers_index.md

Lines changed: 60 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -10,11 +10,10 @@ sidebar_custom_props:
1010
tags: [vector-index, metadata]
1111
---
1212

13-
import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton';
13+
import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/components/GitHubButton';
1414

1515
<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/paper_metadata"/>
1616

17-
1817
## What we will achieve
1918

2019
1. Extract the paper metadata, including file name, title, author information, abstract, and number of pages.
@@ -27,18 +26,8 @@ to answer questions like "Give me all the papers by Jeff Dean."
2726

2827
4. If you want to perform full PDF embedding for the paper, you can extend the flow.
2928

30-
## Setup
31-
32-
- [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres).
33-
CocoIndex uses PostgreSQL internally for incremental processing.
34-
- [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai).
35-
Alternatively, we have native support for Gemini, Ollama, LiteLLM. Check out the [guide](https://cocoindex.io/docs/ai/llm#ollama).
36-
You can choose your favorite LLM provider and work completely on-premises.
37-
38-
## Define Indexing Flow
39-
40-
To better help you navigate what we will walk through, here is a flow diagram:
41-
29+
## Flow Overview
30+
![Flow Overview](/img/examples/academic_papers_index/flow.png)
4231
1. Import a list of papers in PDF.
4332
2. For each file:
4433
- Extract the first page of the paper.
@@ -50,9 +39,15 @@ To better help you navigate what we will walk through, here is a flow diagram:
5039
- Author-to-paper mapping, for author-based query.
5140
- Embeddings for titles and abstract chunks, for semantic search.
5241

53-
Let’s zoom in on the steps.
42+
## Setup
43+
44+
- [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres).
45+
CocoIndex uses PostgreSQL internally for incremental processing.
46+
- [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Alternatively, we have native support for Gemini, Ollama, LiteLLM. You can choose your favorite LLM provider and work completely on-premises.
5447

55-
### Import the Papers
48+
<DocumentationButton href="https://cocoindex.io/docs/ai/llm" text="LLM" margin="0 0 16px 0" />
49+
50+
## Import the Papers
5651

5752
```python
5853
@cocoindex.flow_def(name="PaperMetadata")
@@ -65,12 +60,12 @@ def paper_metadata_flow(
6560
)
6661
```
6762

68-
`flow_builder.add_source` will create a table with sub fields (`filename``content`),
69-
we can refer to the [documentation](https://cocoindex.io/docs/ops/sources) for more details.
63+
`flow_builder.add_source` will create a table with sub fields (`filename`, `content`).
64+
<DocumentationButton href="https://cocoindex.io/docs/ops/sources" text="Sources" margin="0 0 16px 0" />
7065

71-
### Extract and collect metadata
66+
## Extract and collect metadata
7267

73-
#### Extract first page for basic info
68+
### Extract first page for basic info
7469

7570
Define a custom function to extract the first page and number of pages of the PDF.
7671

@@ -96,20 +91,19 @@ def extract_basic_info(content: bytes) -> PaperBasicInfo:
9691

9792
```
9893

99-
Now, plug this into your flow.
100-
We extract metadata from the first page to minimize processing cost, since the entire PDF can be very large.
94+
Now plug this into the flow. We extract metadata from the first page to minimize processing cost, since the entire PDF can be very large.
10195

10296
```python
10397
with data_scope["documents"].row() as doc:
10498
doc["basic_info"] = doc["content"].transform(extract_basic_info)
10599
```
100+
![Extract basic info](/img/examples/academic_papers_index/basic_info.png)
106101

107-
After this step, you should have the basic info of each paper.
102+
After this step, we should have the basic info of each paper.
108103

109104
### Parse basic info
110105

111-
We will convert the first page to Markdown using Marker.
112-
Alternatively, you can easily plug in your favorite PDF parser, such as Docling.
106+
We will convert the first page to Markdown using Marker. Alternatively, you can easily plug in any PDF parser, such as Docling, using CocoIndex's [custom function](https://cocoindex.io/docs/custom_ops/custom_functions).
113107

114108
Define a marker converter function and cache it, since its initialization is resource-intensive.
115109
This ensures that the same converter instance is reused for different input files.
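As a rough illustration of the caching idea in this hunk (the actual converter code is elided in the diff), here is a minimal sketch using only the standard library. `ExpensiveConverter`, `get_converter`, and `to_text` are hypothetical names, not Marker or CocoIndex APIs.

```python
import functools


class ExpensiveConverter:
    """Hypothetical stand-in for a converter that is costly to construct."""

    instances_created = 0

    def __init__(self) -> None:
        # Count constructions so the effect of caching is observable.
        ExpensiveConverter.instances_created += 1

    def convert(self, content: bytes) -> str:
        # Placeholder conversion: decode bytes instead of parsing a PDF.
        return content.decode("utf-8", errors="replace")


@functools.cache
def get_converter() -> ExpensiveConverter:
    # The body runs once; later calls return the same cached instance.
    return ExpensiveConverter()


def to_text(content: bytes) -> str:
    # Every input file goes through the single shared converter.
    return get_converter().convert(content)
```

Calling `to_text` on several inputs constructs the converter only once, which is the property the paragraph above relies on.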
@@ -140,18 +134,20 @@ def pdf_to_markdown(content: bytes) -> str:
140134
Pass it to your transform
141135

142136
```python
143-
with data_scope["documents"].row() as doc:
137+
with data_scope["documents"].row() as doc:
138+
# ... process
144139
doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
145140
pdf_to_markdown
146141
)
147142
```
143+
![First page in Markdown](/img/examples/academic_papers_index/first_page.png)
148144

149145
After this step, you should have the first page of each paper in Markdown format.
150146

151-
#### Extract basic info with LLM
147+
### Extract basic info with LLM
152148

153149
Define a schema for LLM extraction. CocoIndex natively supports LLM-structured extraction with complex and nested schemas.
154-
If you are interested in learning more about nested schemas, refer to [this article](https://cocoindex.io/blogs/patient-intake-form-extraction-with-llm).
150+
If you are interested in learning more about nested schemas, refer to [this example](https://cocoindex.io/docs/examples/patient_form_extraction).
155151

156152
```python
157153
@dataclasses.dataclass
@@ -163,7 +159,6 @@ class PaperMetadata:
163159
title: str
164160
authors: list[Author]
165161
abstract: str
166-
167162
```
168163

169164
Plug it into the `ExtractByLlm` function. With a dataclass defined, CocoIndex will automatically parse the LLM response into the dataclass.
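To make "parse the LLM response into the dataclass" concrete, here is a minimal standalone sketch of that kind of mapping from a JSON string to nested dataclasses. It is not CocoIndex's implementation; `parse_paper_metadata` is a hypothetical helper, and the `name` field on `Author` is an assumption because the `Author` definition is elided in this diff.

```python
import dataclasses
import json


@dataclasses.dataclass
class Author:
    name: str  # assumed field; the real Author schema is not shown in this diff


@dataclasses.dataclass
class PaperMetadata:
    title: str
    authors: list[Author]
    abstract: str


def parse_paper_metadata(raw_json: str) -> PaperMetadata:
    """Hypothetical helper: build a PaperMetadata from a JSON response string."""
    data = json.loads(raw_json)
    return PaperMetadata(
        title=data["title"],
        # Each author object becomes a nested Author dataclass.
        authors=[Author(**author) for author in data["authors"]],
        abstract=data["abstract"],
    )
```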
@@ -181,26 +176,27 @@ doc["metadata"] = doc["first_page_md"].transform(
181176
```
182177

183178
After this step, you should have the metadata of each paper.
179+
![Metadata](/img/examples/academic_papers_index/metadata.png)
184180

185-
#### Collect paper metadata
181+
### Collect paper metadata
186182

187183
```python
188-
paper_metadata = data_scope.add_collector()
189-
with data_scope["documents"].row() as doc:
190-
# ... process
191-
# Collect metadata
192-
paper_metadata.collect(
193-
filename=doc["filename"],
194-
title=doc["metadata"]["title"],
195-
authors=doc["metadata"]["authors"],
196-
abstract=doc["metadata"]["abstract"],
197-
num_pages=doc["basic_info"]["num_pages"],
198-
)
184+
paper_metadata = data_scope.add_collector()
185+
with data_scope["documents"].row() as doc:
186+
# ... process
187+
# Collect metadata
188+
paper_metadata.collect(
189+
filename=doc["filename"],
190+
title=doc["metadata"]["title"],
191+
authors=doc["metadata"]["authors"],
192+
abstract=doc["metadata"]["abstract"],
193+
num_pages=doc["basic_info"]["num_pages"],
194+
)
199195
```
200196

201197
Just collect anything you need :)
202198

203-
#### Collect `author` to `filename` information
199+
### Collect `author` to `filename` information
204200
We’ve already extracted the author list. Here we want to collect Author → Papers in a separate table to build a lookup functionality.
205201
Simply collect by author.
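The Author → Papers table is essentially an inverted index over the per-paper author lists. A minimal sketch in plain Python, independent of CocoIndex collectors (`author_to_papers` and the sample filenames are hypothetical):

```python
from collections import defaultdict


def author_to_papers(papers: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert a filename -> authors mapping into author -> filenames."""
    index: dict[str, list[str]] = defaultdict(list)
    for filename, authors in papers.items():
        for author in authors:
            # One (author, filename) pair per author of each paper.
            index[author].append(filename)
    return dict(index)


papers = {
    "a.pdf": ["Jeff Dean", "Another Author"],
    "b.pdf": ["Jeff Dean"],
}
lookup = author_to_papers(papers)
```

With this shape, a question like "all the papers by Jeff Dean" becomes a single key lookup.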
206202

@@ -216,9 +212,9 @@ with data_scope["documents"].row() as doc:
216212
```
217213

218214

219-
### Compute and collect embeddings
215+
## Compute and collect embeddings
220216

221-
#### Title
217+
### Title
222218

223219
```python
224220
doc["title_embedding"] = doc["metadata"]["title"].transform(
@@ -228,7 +224,7 @@ doc["title_embedding"] = doc["metadata"]["title"].transform(
228224
)
229225
```
230226

231-
#### Abstract
227+
### Abstract
232228

233229
Split abstract into chunks, embed each chunk and collect their embeddings.
234230
Sometimes the abstract could be very long.
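For intuition about why a long abstract is split before embedding, here is a simple fixed-size chunker with overlap. This is only an illustrative sketch; it is not the splitter used in the flow (that code is elided in this hunk), and `split_into_chunks` and its parameters are hypothetical.

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of embedding some text twice.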
@@ -252,6 +248,8 @@ doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
252248

253249
After this step, you should have the abstract chunks of each paper.
254250

251+
![Abstract chunks](/img/examples/academic_papers_index/abstract_chunks.png)
252+
255253
Embed each chunk and collect their embeddings.
256254

257255
```python
@@ -265,7 +263,9 @@ with doc["abstract_chunks"].row() as chunk:
265263

266264
After this step, you should have the embeddings of the abstract chunks of each paper.
267265

268-
#### Collect embeddings
266+
![Abstract chunks embeddings](/img/examples/academic_papers_index/chunk_embedding.png)
267+
268+
### Collect embeddings
269269

270270
```python
271271
metadata_embeddings = data_scope.add_collector()
@@ -292,7 +292,7 @@ with data_scope["documents"].row() as doc:
292292
)
293293
```
294294

295-
### Export
295+
## Export
296296
Finally, we export the data to Postgres.
297297

298298
```python
@@ -319,14 +319,9 @@ metadata_embeddings.export(
319319
)
320320
```
321321

322-
In this example we use PGVector as embedding stores/
323-
With CocoIndex, you can do one line switch on other supported Vector databases like Qdrant, see this [guide](https://cocoindex.io/docs/ops/targets#entry-oriented-targets) for more details.
324-
We aim to standardize interfaces and make it like assembling building blocks.
322+
In this example we use PGVector as the embedding store. With CocoIndex, you can switch to other supported vector databases with a one-line change.
325323

326-
## View in CocoInsight step by step
327-
328-
You can walk through the project step by step in [CocoInsight](https://www.youtube.com/watch?v=MMrpUfUcZPk) to see
329-
exactly how each field is constructed and what happens behind the scenes.
324+
<DocumentationButton href="https://cocoindex.io/docs/ops/targets#entry-oriented-targets" text="Entry Oriented Targets" margin="0 0 16px 0" />
330325

331326
## Query the index
332327

@@ -338,3 +333,14 @@ For now CocoIndex doesn't provide additional query interface. We can write SQL o
338333
- The query space has excellent solutions for querying, reranking, and other search-related functionality.
339334

340335
If you need assistance with writing the query, please feel free to reach out to us at [Discord](https://discord.com/invite/zpA9S2DR7s).
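Since the query itself is left to the reader, the following sketch shows what a semantic lookup over the exported embeddings computes: rank rows by cosine similarity to a query vector. It is plain Python over in-memory data, not the SQL you would run against Postgres; the function names, filenames, and vectors are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], rows: list[tuple[str, list[float]]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k (filename, score) pairs most similar to the query, best first."""
    scored = [(filename, cosine_similarity(query, emb)) for filename, emb in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


rows = [("a.pdf", [1.0, 0.0]), ("b.pdf", [0.0, 1.0]), ("c.pdf", [1.0, 1.0])]
results = top_k([1.0, 0.0], rows, k=2)
```

In a database with a vector index, the same ranking is done by the store itself rather than by scanning every row in application code.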
336+
337+
## CocoInsight
338+
339+
You can walk through the project step by step in [CocoInsight](https://www.youtube.com/watch?v=MMrpUfUcZPk) to see exactly how each field is constructed and what happens behind the scenes.
340+
341+
342+
```sh
343+
cocoindex server -ci main.py
344+
```
345+
346+
Follow the URL `https://cocoindex.io/cocoinsight`. It connects to your local CocoIndex server, with zero pipeline data retention.

docs/docs/examples/examples/docs_to_knowledge_graph.md

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -35,8 +35,10 @@ and then build a knowledge graph.
3535
## Setup
3636
* [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres). CocoIndex uses PostgreSQL internally for incremental processing.
3737
* [Install Neo4j](https://cocoindex.io/docs/ops/targets#neo4j-dev-instance), a graph database.
38-
* [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Alternatively, you can switch to Ollama, which runs LLM models locally.
39-
<DocumentationButton href="https://cocoindex.io/docs/ai/llm#ollama" text="Ollama" margin="0 0 16px 0" />
38+
* [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Alternatively, we have native support for Gemini, Ollama, LiteLLM. You can choose your favorite LLM provider and work completely on-premises.
39+
40+
<DocumentationButton href="https://cocoindex.io/docs/ai/llm" text="LLM" margin="0 0 16px 0" />
41+
4042

4143
## Documentation
4244
<DocumentationButton href="https://cocoindex.io/docs/ops/targets#property-graph-targets" text="Property Graph Targets" margin="0 0 16px 0" />

examples/paper_metadata/pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ version = "0.1.0"
44
description = "Build index for papers with both metadata and content embeddings"
55
requires-python = ">=3.11"
66
dependencies = [
7-
"cocoindex[embeddings]>=0.1.79",
7+
"cocoindex[embeddings]>=0.1.83",
88
"pypdf>=5.7.0",
99
"marker-pdf>=1.5.2",
1010
]
