Commit 472eb63: examples documentation - custom targets, academic papers (#893)
---
title: Academic Papers Indexing
description: Build a real-time academic papers index. Extract metadata, chunk and embed abstracts, and enable semantic and author-based search over academic PDFs.
sidebar_class_name: hidden
slug: /examples/academic_papers_index
canonicalUrl: '/examples/academic_papers_index'
---

import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton';

<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/paper_metadata"/>
## What we will achieve

1. Extract the paper metadata, including file name, title, author information, abstract, and number of pages.

2. Build vector embeddings for the metadata, such as the title and abstract, for semantic search.
   This enables metadata-driven semantic search; for example, you can match text queries against titles and abstracts.

3. Build an index of authors and all the file names associated with each author,
   to answer questions like "Give me all the papers by Jeff Dean."

4. If you want to embed the full PDF of each paper, you can extend the flow.
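The semantic search described above ultimately reduces to comparing a query embedding against the stored title and abstract embeddings. A minimal sketch of that matching step, with tiny made-up vectors standing in for real model outputs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real title embeddings.
title_embeddings = {
    "Scaling Distributed Systems": [0.9, 0.1, 0.0],
    "Protein Folding with Deep Learning": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "large-scale infrastructure"

best = max(title_embeddings, key=lambda t: cosine_similarity(query, title_embeddings[t]))
print(best)  # -> Scaling Distributed Systems
```

In the real flow, the vectors come from the `all-MiniLM-L6-v2` model, and the comparison runs inside the vector database rather than in Python.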
## Core Components

1. **PDF Preprocessing**
   - Reads PDFs using `pypdf` and extracts:
     - Total number of pages
     - First-page content (used as a proxy for metadata-rich information)

2. **Markdown Conversion**
   - Converts the first page to Markdown using [Marker](https://github.com/datalab-to/marker).

3. **LLM-Powered Metadata Extraction**
   - Sends the first-page Markdown to GPT-4o using CocoIndex's `ExtractByLlm` function.
   - Extracted metadata includes:
     - `title` (string)
     - `authors` (each with name, email, and affiliation)
     - `abstract` (string)

4. **Semantic Embedding**
   - The title is embedded directly using the `all-MiniLM-L6-v2` SentenceTransformer model.
   - Abstracts are chunked based on punctuation and token count, then each chunk is embedded individually.

5. **Relational Data Collection**
   - Authors are unrolled and collected into an `author_papers` relation, enabling queries like:
     - Show all papers by X.
     - Which co-authors worked with Y?
## Setup

- [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres).
  CocoIndex uses PostgreSQL internally for incremental processing.
- [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai).
  Alternatively, CocoIndex has native support for Gemini, Ollama, and LiteLLM; check out the [guide](https://cocoindex.io/docs/ai/llm#ollama).
  You can choose your favorite LLM provider and work completely on-premises.
## Define Indexing Flow

To help you navigate what we will walk through, here is an outline of the flow:

1. Import a list of papers in PDF format.
2. For each file:
   - Extract the first page of the paper.
   - Convert the first page to Markdown.
   - Extract metadata (title, authors, abstract) from the first page.
   - Split the abstract into chunks, and compute embeddings for each chunk.
3. Export to the following tables in Postgres with PGVector:
   - Metadata (title, authors, abstract) for each paper.
   - Author-to-paper mapping, for author-based queries.
   - Embeddings for titles and abstract chunks, for semantic search.

Let’s zoom in on the steps.
### Import the Papers

```python
@cocoindex.flow_def(name="PaperMetadata")
def paper_metadata_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="papers", binary=True),
        refresh_interval=datetime.timedelta(seconds=10),
    )
```

`flow_builder.add_source` creates a table with sub-fields (`filename`, `content`);
refer to the [documentation](https://cocoindex.io/docs/ops/sources) for more details.
### Extract and collect metadata

#### Extract first page for basic info

Define a custom function to extract the first page and the number of pages of the PDF.

```python
@dataclasses.dataclass
class PaperBasicInfo:
    num_pages: int
    first_page: bytes
```
```python
@cocoindex.op.function()
def extract_basic_info(content: bytes) -> PaperBasicInfo:
    """Extract the first page and page count of a PDF."""
    reader = PdfReader(io.BytesIO(content))

    output = io.BytesIO()
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    writer.write(output)

    return PaperBasicInfo(num_pages=len(reader.pages), first_page=output.getvalue())
```

Now, plug this into your flow.
We extract metadata from the first page to minimize processing cost, since the entire PDF can be very large.

```python
with data_scope["documents"].row() as doc:
    doc["basic_info"] = doc["content"].transform(extract_basic_info)
```

After this step, you should have the basic info of each paper.
#### Parse basic info

We will convert the first page to Markdown using Marker.
Alternatively, you can easily plug in your favorite PDF parser, such as Docling.

Define a Marker converter function and cache it, since its initialization is resource-intensive.
This ensures that the same converter instance is reused across input files.
```python
@cache
def get_marker_converter() -> PdfConverter:
    config_parser = ConfigParser({})
    return PdfConverter(
        create_model_dict(), config=config_parser.generate_config_dict()
    )
```
Plug it into a custom function.

```python
@cocoindex.op.function(gpu=True, cache=True, behavior_version=1)
def pdf_to_markdown(content: bytes) -> str:
    """Convert to Markdown."""
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
        temp_file.write(content)
        temp_file.flush()
        text, _, _ = text_from_rendered(get_marker_converter()(temp_file.name))
        return text
```
Pass it to your transform.

```python
with data_scope["documents"].row() as doc:
    doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
        pdf_to_markdown
    )
```

After this step, you should have the first page of each paper in Markdown format.
#### Extract basic info with LLM

Define a schema for LLM extraction. CocoIndex natively supports LLM-based structured extraction with complex and nested schemas.
If you are interested in learning more about nested schemas, refer to [this article](https://cocoindex.io/blogs/patient-intake-form-extraction-with-llm).

```python
@dataclasses.dataclass
class PaperMetadata:
    """
    Metadata for a paper.
    """

    title: str
    authors: list[Author]
    abstract: str
```
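Note that the `Author` type referenced by `PaperMetadata` is not shown in this walkthrough. Based on the author fields mentioned earlier (name, email, affiliation), a plausible definition looks like the following; treating email and affiliation as optional is our assumption, since a first page does not always list them:

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass
class Author:
    """One author entry; email/affiliation are assumed optional here."""
    name: str
    email: Optional[str] = None
    affiliation: Optional[str] = None
```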
Plug it into the `ExtractByLlm` function. With a dataclass defined, CocoIndex will automatically parse the LLM response into the dataclass.

```python
doc["metadata"] = doc["first_page_md"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
        ),
        output_type=PaperMetadata,
        instruction="Please extract the metadata from the first page of the paper.",
    )
)
```

After this step, you should have the metadata of each paper.
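`ExtractByLlm` handles the round trip for you. Conceptually, structured extraction amounts to asking the model for JSON that matches the schema and parsing the response into the dataclass. A rough sketch of that parsing step, with a hand-written `response` standing in for a real LLM reply and a simplified copy of the schema:

```python
import dataclasses
import json

@dataclasses.dataclass
class PaperMetadata:
    title: str
    authors: list[dict]  # simplified: plain dicts instead of a nested Author type
    abstract: str

# Hand-written stand-in for what the LLM might return for one first page.
response = """{
  "title": "MapReduce: Simplified Data Processing on Large Clusters",
  "authors": [{"name": "Jeffrey Dean"}, {"name": "Sanjay Ghemawat"}],
  "abstract": "MapReduce is a programming model for processing large data sets."
}"""

metadata = PaperMetadata(**json.loads(response))
print(metadata.title)  # -> MapReduce: Simplified Data Processing on Large Clusters
```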
#### Collect paper metadata

```python
paper_metadata = data_scope.add_collector()
with data_scope["documents"].row() as doc:
    # ... process
    # Collect metadata
    paper_metadata.collect(
        filename=doc["filename"],
        title=doc["metadata"]["title"],
        authors=doc["metadata"]["authors"],
        abstract=doc["metadata"]["abstract"],
        num_pages=doc["basic_info"]["num_pages"],
    )
```

Just collect anything you need :)
#### Collect `author` to `filename` information

We’ve already extracted the author list. Here we want to collect Author → Papers in a separate table to build a lookup capability.
Simply collect by author.

```python
author_papers = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    with doc["metadata"]["authors"].row() as author:
        author_papers.collect(
            author_name=author["name"],
            filename=doc["filename"],
        )
```
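The nested `.row()` blocks effectively unroll each paper's author list into one row per (author, paper) pair. The same relation built in plain Python over toy data (not the CocoIndex API):

```python
# Toy input: filename -> extracted author names.
papers = {
    "mapreduce.pdf": ["Jeffrey Dean", "Sanjay Ghemawat"],
    "spanner.pdf": ["James Corbett", "Jeffrey Dean"],
}

# Unroll into (author_name, filename) rows, one per pair.
author_papers = [
    (author, filename)
    for filename, authors in papers.items()
    for author in authors
]

# Author-based lookup: all papers by one author.
dean_papers = [f for a, f in author_papers if a == "Jeffrey Dean"]
print(dean_papers)  # -> ['mapreduce.pdf', 'spanner.pdf']
```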
### Compute and collect embeddings

#### Title

```python
doc["title_embedding"] = doc["metadata"]["title"].transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="sentence-transformers/all-MiniLM-L6-v2"
    )
)
```
#### Abstract

Split the abstract into chunks, since abstracts can sometimes be very long.

```python
doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
    cocoindex.functions.SplitRecursively(
        custom_languages=[
            cocoindex.functions.CustomLanguageSpec(
                language_name="abstract",
                separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
            )
        ]
    ),
    language="abstract",
    chunk_size=500,
    min_chunk_size=200,
    chunk_overlap=150,
)
```

After this step, you should have the abstract chunks of each paper.
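The separator list is ordered by priority: sentence boundaries first, then clause, comma, and whitespace boundaries for pieces that are still too large. A simplified pure-Python sketch of that idea (not the actual `SplitRecursively` implementation; it ignores `min_chunk_size` and `chunk_overlap`):

```python
import re

# Same priority order as the separators_regex above.
SEPARATORS = [r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"]

def split_recursively(text: str, chunk_size: int, level: int = 0) -> list[str]:
    """Split on the highest-priority separator; recurse into oversized pieces."""
    if len(text) <= chunk_size or level >= len(SEPARATORS):
        return [text]
    chunks: list[str] = []
    for piece in re.split(SEPARATORS[level], text):
        if len(piece) > chunk_size:
            chunks.extend(split_recursively(piece, chunk_size, level + 1))
        elif piece:
            chunks.append(piece)
    return chunks

chunks = split_recursively("First sentence. Second, much longer sentence here.", 20)
print(chunks)
```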
Embed each chunk and collect their embeddings.

```python
with doc["abstract_chunks"].row() as chunk:
    chunk["embedding"] = chunk["text"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )
```

After this step, you should have the embeddings of the abstract chunks of each paper.
#### Collect embeddings

```python
metadata_embeddings = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    # ... process
    # collect title embedding
    metadata_embeddings.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        location="title",
        text=doc["metadata"]["title"],
        embedding=doc["title_embedding"],
    )
    with doc["abstract_chunks"].row() as chunk:
        # ... process
        # collect abstract chunks embeddings
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="abstract",
            text=chunk["text"],
            embedding=chunk["embedding"],
        )
```
### Export

Finally, we export the data to Postgres.

```python
paper_metadata.export(
    "paper_metadata",
    cocoindex.targets.Postgres(),
    primary_key_fields=["filename"],
)
author_papers.export(
    "author_papers",
    cocoindex.targets.Postgres(),
    primary_key_fields=["author_name", "filename"],
)
metadata_embeddings.export(
    "metadata_embeddings",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
```

In this example we use PGVector as the embedding store.
With CocoIndex, you can switch to another supported vector database, such as Qdrant, with a one-line change; see this [guide](https://cocoindex.io/docs/ops/targets#entry-oriented-targets) for more details.
We aim to standardize the interfaces and make it feel like assembling building blocks.
## View in CocoInsight step by step

You can walk through the project step by step in [CocoInsight](https://www.youtube.com/watch?v=MMrpUfUcZPk) to see
exactly how each field is constructed and what happens behind the scenes.
## Query the index

You can refer to this section of [Text Embeddings](https://cocoindex.io/blogs/text-embeddings-101#3-query-the-index) for
how to build queries against the embeddings.
For now, CocoIndex doesn't provide an additional query interface. You can write SQL or rely on the query engine of the target storage:

- Many databases already have optimized query implementations with their own best practices.
- The query space has excellent solutions for querying, reranking, and other search-related functionality.

If you need assistance with writing queries, feel free to reach out to us on [Discord](https://discord.com/invite/zpA9S2DR7s).
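For example, once the tables are exported, both query patterns from this walkthrough are plain SQL. The sketch below only assembles the statements (table and column names follow the export section; the `%s` placeholders and the pgvector cosine-distance operator `<=>` assume you execute these through a Postgres driver such as psycopg):

```python
# Author-based lookup against the author_papers table.
author_query = """
SELECT filename
FROM author_papers
WHERE author_name = %s
"""

# Semantic search against metadata_embeddings: order by pgvector's
# cosine distance (<=>) between the stored embedding and a query
# embedding supplied as a parameter.
semantic_query = """
SELECT filename, text, embedding <=> %s AS distance
FROM metadata_embeddings
ORDER BY distance
LIMIT 5
"""

print(author_query.strip().splitlines()[0])  # -> SELECT filename
```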
