Commit 3a8d966

example: add pdf_elements_embedding example (#1180)
1 parent 6002d34 commit 3a8d966

7 files changed: +304 -7 lines changed

README.md

Lines changed: 14 additions & 7 deletions
@@ -22,7 +22,6 @@
 <a href="https://trendshift.io/repositories/13939" target="_blank"><img src="https://trendshift.io/api/badge/repositories/13939" alt="cocoindex-io%2Fcocoindex | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </div>
 
-
 Ultra performant data transformation framework for AI, with core engine written in Rust. Support incremental processing and data lineage out-of-box. Exceptional developer velocity. Production-ready at day 0.
 
 ⭐ Drop a star to help us grow!
@@ -60,9 +59,8 @@ CocoIndex makes it effortless to transform data with AI, and keep source data an
 
 </br>
 
-
-
 ## Exceptional velocity
+
 Just declare transformation in dataflow with ~100 lines of python
 
 ```python
@@ -86,25 +84,30 @@ CocoIndex follows the idea of [Dataflow](https://en.wikipedia.org/wiki/Dataflow_
 **Particularly**, developers don't explicitly mutate data by creating, updating and deleting. They just need to define transformation/formula for a set of source data.
 
 ## Plug-and-Play Building Blocks
+
 Native builtins for different source, targets and transformations. Standardize interface, make it 1-line code switch between different components - as easy as assembling building blocks.
 
 <p align="center">
 <img src="https://cocoindex.io/images/components.svg" alt="CocoIndex Features">
 </p>
 
 ## Data Freshness
+
 CocoIndex keep source data and target in sync effortlessly.
 
 <p align="center">
 <img src="https://github.com/user-attachments/assets/f4eb29b3-84ee-4fa0-a1e2-80eedeeabde6" alt="Incremental Processing" width="700">
 </p>
 
 It has out-of-box support for incremental indexing:
+
 - minimal recomputation on source or logic change.
 - (re-)processing necessary portions; reuse cache when possible
 
-## Quick Start:
+## Quick Start
+
 If you're new to CocoIndex, we recommend checking out
+
 - 📖 [Documentation](https://cocoindex.io/docs)
 - [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart)
 - 🎬 [Quick Start Video Tutorial](https://youtu.be/gv5R8nOXsWU?si=9ioeKYkMEnYevTXT)
@@ -119,7 +122,6 @@ pip install -U cocoindex
 
 2. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. CocoIndex uses it for incremental processing.
 
-
 ## Define data flow
 
 Follow [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart) to define your first indexing flow. An example flow looks like:
@@ -175,6 +177,7 @@ It defines an index flow like this:
 | [Text Embedding](examples/text_embedding) | Index text documents with embeddings for semantic search |
 | [Code Embedding](examples/code_embedding) | Index code embeddings for semantic search |
 | [PDF Embedding](examples/pdf_embedding) | Parse PDF and index text embeddings for semantic search |
+| [PDF Elements Embedding](examples/pdf_elements_embedding) | Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search |
 | [Manuals LLM Extraction](examples/manuals_llm_extraction) | Extract structured information from a manual using LLM |
 | [Amazon S3 Embedding](examples/amazon_s3_embedding) | Index text documents from Amazon S3 |
 | [Azure Blob Storage Embedding](examples/azure_blob_embedding) | Index text documents from Azure Blob Storage |
@@ -191,16 +194,18 @@ It defines an index flow like this:
 | [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |
 | [Patient intake form extraction](examples/patient_intake_extraction) | Use LLM to extract structured data from patient intake forms with different formats |
 
-
 More coming and stay tuned 👀!
 
 ## 📖 Documentation
+
 For detailed documentation, visit [CocoIndex Documentation](https://cocoindex.io/docs), including a [Quickstart guide](https://cocoindex.io/docs/getting_started/quickstart).
 
 ## 🤝 Contributing
+
 We love contributions from our community ❤️. For details on contributing or running the project for development, check out our [contributing guide](https://cocoindex.io/docs/about/contributing).
 
 ## 👥 Community
+
 Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds - whether it's code improvements, documentation updates, issue reports, feature requests, and discussions in our Discord.
 
 Join our community here:
@@ -210,8 +215,10 @@ Join our community here:
 - ▶️ [Subscribe to our YouTube channel](https://www.youtube.com/@cocoindex-io)
 - 📜 [Read our blog posts](https://cocoindex.io/blogs/)
 
-## Support us:
+## Support us
+
 We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ at GitHub repo [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) to stay tuned and help us grow.
 
 ## License
+
 CocoIndex is Apache 2.0 licensed.
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

# Fallback to CPU for operations not supported by MPS on Mac.
# It's no-op for other platforms.
PYTORCH_ENABLE_MPS_FALLBACK=1
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
/source_files
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Extract text and images from PDFs and build multimodal search

[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)

In this example, we extract text and images from PDF pages, embed them with two models, and store them in Qdrant for multimodal search:

- Text: SentenceTransformers `all-MiniLM-L6-v2`
- Images: CLIP `openai/clip-vit-large-patch14` (ViT-L/14, 768-dim)

We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful.

## Steps

### Indexing Flow

1. Ingest PDF files from the `source_files` directory.
2. For each PDF page:
   - Extract page text and images using `pypdf`.
   - Skip very small images (under 16 px in width or height) and downscale the rest to thumbnails of at most 512×512 for consistency.
   - Split text into chunks with `SplitRecursively` (language="text", chunk_size=600, chunk_overlap=100).
   - Embed text chunks with SentenceTransformers (`all-MiniLM-L6-v2`).
   - Embed images with CLIP (`openai/clip-vit-large-patch14`).
3. Save embeddings and metadata in Qdrant:
   - Text collection: `PdfElementsEmbeddingText`
   - Image collection: `PdfElementsEmbeddingImage`
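
The image filtering and thumbnail step can be sketched without any imaging library. This is a minimal illustration, assuming the 16-pixel minimum and the 512×512 bound used in `main.py`. `thumbnail_size` is a hypothetical helper that is not part of the example, and its rounding only approximates what Pillow's `Image.thumbnail` does.

```python
from typing import Optional

MIN_SIDE = 16          # images narrower or shorter than this are skipped
MAX_SIZE = (512, 512)  # bounding box for thumbnails


def thumbnail_size(width: int, height: int) -> Optional[tuple[int, int]]:
    """Return the thumbnail size for an image, or None if it should be skipped."""
    if width < MIN_SIDE or height < MIN_SIDE:
        return None
    # Shrink to fit the bounding box while keeping the aspect ratio.
    # The scale is capped at 1.0, so small images are never enlarged.
    scale = min(MAX_SIZE[0] / width, MAX_SIZE[1] / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For instance, a 1024×512 image would map to 512×256, while a 100×50 image keeps its size and a 10×200 image is dropped.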

## Prerequisite

[Install Qdrant](https://qdrant.tech/documentation/guides/installation/) if you don't have one running locally.

Start Qdrant with Docker (exposes HTTP 6333 and gRPC 6334):

```bash
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

Note: This example connects via gRPC at `http://localhost:6334`.
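
Before running the flow, it can help to confirm that something is listening on the gRPC port. The sketch below uses only the Python standard library; `is_port_open` is a hypothetical helper, and a successful TCP connection only shows that the port is reachable, not that the service behind it is Qdrant.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

Calling `is_port_open("localhost", 6334)` after starting the container should return `True` if the port mapping above is in effect.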

## Input Data Preparation

Download a few sample PDFs (all are board game manuals) and put them into the `source_files` directory by running:

```bash
./fetch_manual_urls.sh
```

You can also put your favorite PDFs into the `source_files` directory.
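
On systems without `wget`, the same download can be done from Python. This is a rough equivalent of `fetch_manual_urls.sh` under the assumption that each file should keep the last path segment of its URL as its name; `target_path` and `fetch_all` are hypothetical helpers, not part of the example.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit
from urllib.request import urlretrieve

# Same URLs as in fetch_manual_urls.sh.
URLS = [
    "https://www.catan.com/sites/default/files/2021-06/catan_base_rules_2020_200707.pdf",
    "https://michalskig.wordpress.com/wp-content/uploads/2010/10/manilaenglishgame_133_gamerules.pdf",
    "https://fgbradleys.com/wp-content/uploads/rules/Carcassonne-rules.pdf",
    "https://cdn.1j1ju.com/medias/2c/f9/7f-ticket-to-ride-rulebook.pdf",
]


def target_path(url: str, output_dir: str = "source_files") -> Path:
    """Map a URL to a local path named after the URL's last path segment."""
    return Path(output_dir) / PurePosixPath(urlsplit(url).path).name


def fetch_all(urls: list[str], output_dir: str = "source_files") -> None:
    """Download every URL into output_dir, creating the directory if needed."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for url in urls:
        print(f"Fetching {url}")
        urlretrieve(url, target_path(url, output_dir))
```

Running `fetch_all(URLS)` from the example directory would populate `source_files`; some hosts may reject the default `urllib` user agent, in which case the shell script remains the simpler route.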

## Run

Install dependencies:

```bash
pip install -e .
```

Update the index, which will also set up the tables the first time:

```bash
cocoindex update --setup main
```
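
Once the index is built, the two collections can be searched with a query embedding. The example does not ship query code, so the following is only a sketch under stated assumptions: it relies on `query_points` from `qdrant-client`, and it assumes the vector is stored as a named vector called `embedding` (the field name used in `main.py`). `search_collection` is a hypothetical helper.

```python
from typing import Any


def search_collection(
    client: Any,
    collection: str,
    query_vector: list[float],
    limit: int = 5,
) -> list[tuple[float, dict]]:
    """Return (score, payload) pairs for the nearest points in a collection.

    `client` is expected to behave like qdrant_client.QdrantClient.
    """
    response = client.query_points(
        collection_name=collection,
        query=query_vector,
        using="embedding",  # assumed name of the stored vector field
        limit=limit,
        with_payload=True,
    )
    return [(point.score, point.payload or {}) for point in response.points]
```

With a real client, such as `QdrantClient(url="http://localhost:6334", prefer_grpc=True)`, an image search could pass `clip_embed_query("a game board")` from `main.py` as the query vector against `PdfElementsEmbeddingImage`. The text collection needs a vector from the same SentenceTransformers model used at indexing time.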

## CocoInsight

I used CocoInsight (free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline. It just connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:

```bash
cocoindex server -ci main
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
#!/bin/bash

URLS=(
  https://www.catan.com/sites/default/files/2021-06/catan_base_rules_2020_200707.pdf
  https://michalskig.wordpress.com/wp-content/uploads/2010/10/manilaenglishgame_133_gamerules.pdf
  https://fgbradleys.com/wp-content/uploads/rules/Carcassonne-rules.pdf
  https://cdn.1j1ju.com/medias/2c/f9/7f-ticket-to-ride-rulebook.pdf
)

OUTPUT_DIR="source_files"
mkdir -p "$OUTPUT_DIR"
for URL in "${URLS[@]}"; do
  echo "Fetching $URL"
  wget -P "$OUTPUT_DIR" "$URL"
done
Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
import cocoindex
import io
import torch
import functools
import PIL.Image

from dataclasses import dataclass
from pypdf import PdfReader
from transformers import CLIPModel, CLIPProcessor
from typing import Literal


QDRANT_GRPC_URL = "http://localhost:6334"
QDRANT_COLLECTION_IMAGE = "PdfElementsEmbeddingImage"
QDRANT_COLLECTION_TEXT = "PdfElementsEmbeddingText"

CLIP_MODEL_NAME = "openai/clip-vit-large-patch14"
CLIP_MODEL_DIMENSION = 768
ClipVectorType = cocoindex.Vector[cocoindex.Float32, Literal[CLIP_MODEL_DIMENSION]]

IMG_THUMBNAIL_SIZE = (512, 512)


@functools.cache
def get_clip_model() -> tuple[CLIPModel, CLIPProcessor]:
    model = CLIPModel.from_pretrained(CLIP_MODEL_NAME)
    processor = CLIPProcessor.from_pretrained(CLIP_MODEL_NAME)
    return model, processor


@cocoindex.op.function(cache=True, behavior_version=1, gpu=True)
def clip_embed_image(img_bytes: bytes) -> ClipVectorType:
    """
    Convert image to embedding using CLIP model.
    """
    model, processor = get_clip_model()
    image = PIL.Image.open(io.BytesIO(img_bytes)).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features[0].tolist()


def clip_embed_query(text: str) -> ClipVectorType:
    """
    Embed the query text using CLIP model.
    """
    model, processor = get_clip_model()
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return features[0].tolist()


@cocoindex.transform_flow()
def embed_text(
    text: cocoindex.DataSlice[str],
) -> cocoindex.DataSlice[cocoindex.Vector[cocoindex.Float32]]:
    """
    Embed the text using a SentenceTransformer model.
    This is a shared logic between indexing and querying, so extract it as a function."""
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )


@dataclass
class PdfImage:
    name: str
    data: bytes


@dataclass
class PdfPage:
    page_number: int
    text: str
    images: list[PdfImage]


@cocoindex.op.function()
def extract_pdf_elements(content: bytes) -> list[PdfPage]:
    """
    Extract texts and images from a PDF file.
    """
    reader = PdfReader(io.BytesIO(content))
    result = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text()
        images = []
        for image in page.images:
            img = image.image
            if img is None:
                continue
            # Skip very small images.
            if img.width < 16 or img.height < 16:
                continue
            thumbnail = io.BytesIO()
            img.thumbnail(IMG_THUMBNAIL_SIZE)
            img.save(thumbnail, img.format or "PNG")
            images.append(PdfImage(name=image.name, data=thumbnail.getvalue()))
        result.append(PdfPage(page_number=i + 1, text=text, images=images))
    return result


qdrant_connection = cocoindex.add_auth_entry(
    "qdrant_connection",
    cocoindex.targets.QdrantConnection(grpc_url=QDRANT_GRPC_URL),
)


@cocoindex.flow_def(name="PdfElementsEmbedding")
def multi_format_indexing_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    """
    Define an example flow that embeds files into a vector database.
    """
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path="source_files", included_patterns=["*.pdf"], binary=True
        )
    )

    text_output = data_scope.add_collector()
    image_output = data_scope.add_collector()
    with data_scope["documents"].row() as doc:
        doc["pages"] = doc["content"].transform(extract_pdf_elements)
        with doc["pages"].row() as page:
            page["chunks"] = page["text"].transform(
                cocoindex.functions.SplitRecursively(
                    custom_languages=[
                        cocoindex.functions.CustomLanguageSpec(
                            language_name="text",
                            separators_regex=[
                                r"\n(\s*\n)+",
                                r"[\.!\?]\s+",
                                r"\n",
                                r"\s+",
                            ],
                        )
                    ]
                ),
                language="text",
                chunk_size=600,
                chunk_overlap=100,
            )
            with page["chunks"].row() as chunk:
                chunk["embedding"] = chunk["text"].call(embed_text)
                text_output.collect(
                    id=cocoindex.GeneratedField.UUID,
                    filename=doc["filename"],
                    page=page["page_number"],
                    text=chunk["text"],
                    embedding=chunk["embedding"],
                )
            with page["images"].row() as image:
                image["embedding"] = image["data"].transform(clip_embed_image)
                image_output.collect(
                    id=cocoindex.GeneratedField.UUID,
                    filename=doc["filename"],
                    page=page["page_number"],
                    image_data=image["data"],
                    embedding=image["embedding"],
                )

    text_output.export(
        "text_embeddings",
        cocoindex.targets.Qdrant(
            connection=qdrant_connection,
            collection_name=QDRANT_COLLECTION_TEXT,
        ),
        primary_key_fields=["id"],
    )
    image_output.export(
        "image_embeddings",
        cocoindex.targets.Qdrant(
            connection=qdrant_connection,
            collection_name=QDRANT_COLLECTION_IMAGE,
        ),
        primary_key_fields=["id"],
    )
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
[project]
name = "pdf-elements-embedding"
version = "0.1.0"
description = "Simple example for cocoindex: extract text and images from PDF files and build vector index."
requires-python = ">=3.11"
dependencies = [
  "cocoindex[embeddings,colpali]>=0.2.8",
  "pypdf>=5.7.0",
  "pillow>=10.0.0",
  "qdrant-client>=1.15.0",
]

[tool.setuptools]
packages = []
