1 change: 1 addition & 0 deletions README.md
@@ -139,6 +139,7 @@ It defines an index flow like this:
| [FastAPI Server with Docker](examples/fastapi_server_docker) | Run the semantic search server in a Dockerized FastAPI setup |
| [Product Recommendation](examples/product_recommendation) | Build real-time product recommendations with LLM and graph database|
| [Image Search with Vision API](examples/image_search) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend|
| [Paper Metadata](examples/paper_metadata) | Index papers in PDF files, and build metadata tables for each paper |

More coming and stay tuned 👀!

4 changes: 4 additions & 0 deletions examples/paper_metadata/.env.example
@@ -0,0 +1,4 @@
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

OPENAI_API_KEY=
1 change: 1 addition & 0 deletions examples/paper_metadata/.gitignore
@@ -0,0 +1 @@
.env
60 changes: 60 additions & 0 deletions examples/paper_metadata/README.md
@@ -0,0 +1,60 @@
# Build embedding index from PDF files and query with natural language
[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)


In this example, we build a set of tables for papers stored as PDF files, including:

- Metadata (title, authors, abstract) for each paper.
- Author-to-paper mapping, for author-based queries.
- Embeddings for titles and abstract chunks, for semantic search.

We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this example is helpful.

## Steps
### Indexing Flow

1. Ingest a list of papers in PDF format.
2. For each file, we:
- Extract the first page of the paper.
- Convert the first page to Markdown.
- Extract metadata (title, authors, abstract) from the first page.
- Split the abstract into chunks, and compute embeddings for each chunk.
3. Export the following tables to Postgres with pgvector (a condensed sketch of the flow follows this list):
- Metadata (title, authors, abstract) for each paper.
- Author-to-paper mapping, for author-based queries.
- Embeddings for titles and abstract chunks, for semantic search.
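
A condensed sketch of the flow definition (the complete version is in `main.py` of this example):

```python
@cocoindex.flow_def(name="PaperMetadata")
def paper_metadata_flow(flow_builder, data_scope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="papers", binary=True)
    )
    paper_metadata = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        # First page -> Markdown -> LLM-extracted metadata.
        doc["basic_info"] = doc["content"].transform(extract_basic_info)
        doc["first_page_md"] = doc["basic_info"]["first_page"].transform(pdf_to_markdown)
        doc["metadata"] = doc["first_page_md"].transform(
            cocoindex.functions.ExtractByLlm(
                llm_spec=cocoindex.LlmSpec(
                    api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
                ),
                output_type=PaperMetadata,
            )
        )
        paper_metadata.collect(
            filename=doc["filename"],
            title=doc["metadata"]["title"],
            authors=doc["metadata"]["authors"],
            abstract=doc["metadata"]["abstract"],
        )

    paper_metadata.export(
        "paper_metadata",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename"],
    )
```

The same pattern repeats for the author-to-paper mapping and the title/abstract-chunk embeddings; see `main.py` for the full flow.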


## Prerequisites

1. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.

2. Install dependencies:

```bash
pip install -e .
```
3. Create a `.env` file from `.env.example`, and fill in `OPENAI_API_KEY`.

## Run

Update the index, which also sets up the target tables on the first run:

```bash
cocoindex update --setup main.py
```

You can also run the command with `-L`, which will watch for file changes and update the index automatically.

```bash
cocoindex update --setup -L main.py
```
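
Once the flow has run, you can query the exported tables directly. Below is a minimal sketch of an author-based lookup and a semantic search over the title/abstract embeddings. It assumes `psycopg2` and `sentence-transformers` are installed (they are not listed in this example's dependencies) and uses pgvector's `<=>` cosine-distance operator; the table names are placeholders, so check your Postgres schema (or CocoInsight) for the names cocoindex actually created for these targets.

```python
import psycopg2
from sentence_transformers import SentenceTransformer

# Placeholder table names -- replace with the tables cocoindex created
# for the "author_papers" and "metadata_embeddings" targets.
AUTHOR_TABLE = "author_papers"
EMBEDDING_TABLE = "metadata_embeddings"

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
conn = psycopg2.connect("postgres://cocoindex:cocoindex@localhost/cocoindex")

with conn.cursor() as cur:
    # Author-based lookup: which papers did this author write?
    cur.execute(
        f"SELECT filename FROM {AUTHOR_TABLE} WHERE author_name = %s",
        ("Ashish Vaswani",),  # example author name; use one extracted from your papers
    )
    print(cur.fetchall())

    # Semantic search over titles and abstract chunks.
    query_vec = model.encode("attention mechanisms for sequence transduction")
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    cur.execute(
        f"SELECT filename, location, text, embedding <=> %s::vector AS distance "
        f"FROM {EMBEDDING_TABLE} ORDER BY distance LIMIT 5",
        (vec_literal,),
    )
    for filename, location, text, distance in cur.fetchall():
        print(f"{distance:.3f}  {filename} [{location}]  {text[:80]}")
```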

## CocoInsight
I used CocoInsight (currently in free beta) to troubleshoot the index generation and understand the data lineage of the pipeline. It connects to your local CocoIndex server with zero pipeline data retention. Run the following command to start CocoInsight:

```bash
cocoindex server -ci main.py
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
183 changes: 183 additions & 0 deletions examples/paper_metadata/main.py
@@ -0,0 +1,183 @@
import cocoindex
import io
import tempfile
import dataclasses
import datetime

from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
from functools import cache
from pypdf import PdfReader, PdfWriter


@cache
def get_marker_converter() -> PdfConverter:
    config_parser = ConfigParser({})
    return PdfConverter(
        create_model_dict(), config=config_parser.generate_config_dict()
    )


@dataclasses.dataclass
class PaperBasicInfo:
    num_pages: int
    first_page: bytes


@cocoindex.op.function()
def extract_basic_info(content: bytes) -> PaperBasicInfo:
"""Extract the first pages of a PDF."""
reader = PdfReader(io.BytesIO(content))

output = io.BytesIO()
writer = PdfWriter()
writer.add_page(reader.pages[0])
writer.write(output)

return PaperBasicInfo(num_pages=len(reader.pages), first_page=output.getvalue())


@dataclasses.dataclass
class Author:
"""One author of the paper."""

name: str
email: str | None
affiliation: str | None


@dataclasses.dataclass
class PaperMetadata:
"""
Metadata for a paper.
"""

title: str
authors: list[Author]
abstract: str


@cocoindex.op.function(gpu=True, cache=True, behavior_version=2)
def pdf_to_markdown(content: bytes) -> str:
"""Convert to Markdown."""

with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
temp_file.write(content)
temp_file.flush()
text, _, _ = text_from_rendered(get_marker_converter()(temp_file.name))
return text


@cocoindex.transform_flow()
def text_to_embedding(
    text: cocoindex.DataSlice[str],
) -> cocoindex.DataSlice[list[float]]:
    """
    Embed the text using a SentenceTransformer model.
    This is shared logic between indexing and querying, so it is extracted as a function.
    """
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )


@cocoindex.flow_def(name="PaperMetadata")
def paper_metadata_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    """
    Define a flow that extracts paper metadata from PDF files and builds
    metadata, author-to-paper, and embedding tables.
    """
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="papers", binary=True),
        refresh_interval=datetime.timedelta(seconds=10),
    )

    paper_metadata = data_scope.add_collector()
    metadata_embeddings = data_scope.add_collector()
    author_papers = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        doc["basic_info"] = doc["content"].transform(extract_basic_info)
        doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
            pdf_to_markdown
        )
        doc["metadata"] = doc["first_page_md"].transform(
            cocoindex.functions.ExtractByLlm(
                llm_spec=cocoindex.LlmSpec(
                    api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
                ),
                output_type=PaperMetadata,
                instruction="Please extract the metadata from the first page of the paper.",
            )
        )
        doc["title_embedding"] = text_to_embedding(doc["metadata"]["title"])
        doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
            cocoindex.functions.SplitRecursively(
                custom_languages=[
                    cocoindex.functions.CustomLanguageSpec(
                        language_name="abstract",
                        separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
                    )
                ]
            ),
            language="abstract",
            chunk_size=500,
            min_chunk_size=200,
            chunk_overlap=150,
        )

        paper_metadata.collect(
            filename=doc["filename"],
            title=doc["metadata"]["title"],
            authors=doc["metadata"]["authors"],
            abstract=doc["metadata"]["abstract"],
        )
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="title",
            text=doc["metadata"]["title"],
            embedding=doc["title_embedding"],
        )
        with doc["metadata"]["authors"].row() as author:
            author_papers.collect(
                author_name=author["name"],
                filename=doc["filename"],
            )

        with doc["abstract_chunks"].row() as chunk:
            chunk["embedding"] = text_to_embedding(chunk["text"])
            metadata_embeddings.collect(
                id=cocoindex.GeneratedField.UUID,
                filename=doc["filename"],
                location="abstract",
                text=chunk["text"],
                embedding=chunk["embedding"],
            )

    paper_metadata.export(
        "paper_metadata",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename"],
    )
    metadata_embeddings.export(
        "metadata_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["id"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )
    author_papers.export(
        "author_papers",
        cocoindex.targets.Postgres(),
        primary_key_fields=["author_name", "filename"],
    )
Binary file added examples/paper_metadata/papers/1706.03762v7.pdf
Binary file added examples/paper_metadata/papers/1810.04805v2.pdf
Binary file added examples/paper_metadata/papers/2502.06786v3.pdf
Binary file added examples/paper_metadata/papers/2502.20346v1.pdf
13 changes: 13 additions & 0 deletions examples/paper_metadata/pyproject.toml
@@ -0,0 +1,13 @@
[project]
name = "paper-metadata"
version = "0.1.0"
description = "Build index for papers with both metadata and content embeddings"
requires-python = ">=3.11"
dependencies = [
"cocoindex[embeddings]>=0.1.62",
"pypdf>=5.7.0",
"marker-pdf>=1.5.2",
]

[tool.setuptools]
packages = []