example: add pdf_elements_embedding example (#1180)

georgeh0 · web-flow · commit 3a8d96625d07 · 2025-10-12T23:07:24.000-07:00
diff --git a/README.md b/README.md
@@ -22,7 +22,6 @@
     <a href="https://trendshift.io/repositories/13939" target="_blank"><img src="https://trendshift.io/api/badge/repositories/13939" alt="cocoindex-io%2Fcocoindex | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </div>
 
-
 Ultra performant data transformation framework for AI, with core engine written in Rust. Support incremental processing and data lineage out-of-box.  Exceptional developer velocity. Production-ready at day 0.
 
 ⭐ Drop a star to help us grow!
@@ -60,9 +59,8 @@ CocoIndex makes it effortless to transform data with AI, and keep source data an
 
 </br>
 
-
-
 ## Exceptional velocity
+
 Just declare transformation in dataflow with ~100 lines of python
 
 ```python
@@ -86,25 +84,30 @@ CocoIndex follows the idea of [Dataflow](https://en.wikipedia.org/wiki/Dataflow_
 **Particularly**, developers don't explicitly mutate data by creating, updating and deleting. They just need to define transformation/formula for a set of source data.
 
 ## Plug-and-Play Building Blocks
+
 Native builtins for different source, targets and transformations. Standardize interface, make it 1-line code switch between different components - as easy as assembling building blocks.
 
 <p align="center">
     <img src="https://cocoindex.io/images/components.svg" alt="CocoIndex Features">
 </p>
 
 ## Data Freshness
+
 CocoIndex keep source data and target in sync effortlessly.
 
 <p align="center">
     <img src="https://github.com/user-attachments/assets/f4eb29b3-84ee-4fa0-a1e2-80eedeeabde6" alt="Incremental Processing" width="700">
 </p>
 
 It has out-of-box support for incremental indexing:
+
 - minimal recomputation on source or logic change.
 - (re-)processing necessary portions; reuse cache when possible
 
-## Quick Start:
+## Quick Start
+
 If you're new to CocoIndex, we recommend checking out
+
 - 📖 [Documentation](https://cocoindex.io/docs)
 - ⚡  [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart)
 - 🎬 [Quick Start Video Tutorial](https://youtu.be/gv5R8nOXsWU?si=9ioeKYkMEnYevTXT)
@@ -119,7 +122,6 @@ pip install -U cocoindex
 
 2. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. CocoIndex uses it for incremental processing.
 
-
 ## Define data flow
 
 Follow [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart) to define your first indexing flow. An example flow looks like:
@@ -175,6 +177,7 @@ It defines an index flow like this:
 | [Text Embedding](examples/text_embedding) | Index text documents with embeddings for semantic search |
 | [Code Embedding](examples/code_embedding) | Index code embeddings for semantic search |
 | [PDF Embedding](examples/pdf_embedding) | Parse PDF and index text embeddings for semantic search |
+| [PDF Elements Embedding](examples/pdf_elements_embedding) | Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search |
 | [Manuals LLM Extraction](examples/manuals_llm_extraction) | Extract structured information from a manual using LLM |
 | [Amazon S3 Embedding](examples/amazon_s3_embedding) | Index text documents from Amazon S3 |
 | [Azure Blob Storage Embedding](examples/azure_blob_embedding) | Index text documents from Azure Blob Storage |
@@ -191,16 +194,18 @@ It defines an index flow like this:
 | [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |
 | [Patient intake form extraction](examples/patient_intake_extraction) | Use LLM to extract structured data from patient intake forms with different formats |
 
-
 More coming and stay tuned 👀!
 
 ## 📖 Documentation
+
 For detailed documentation, visit [CocoIndex Documentation](https://cocoindex.io/docs), including a [Quickstart guide](https://cocoindex.io/docs/getting_started/quickstart).
 
 ## 🤝 Contributing
+
 We love contributions from our community ❤️. For details on contributing or running the project for development, check out our [contributing guide](https://cocoindex.io/docs/about/contributing).
 
 ## 👥 Community
+
 Welcome with a huge coconut hug 🥥⋆｡˚🤗. We are super excited for community contributions of all kinds - whether it's code improvements, documentation updates, issue reports, feature requests, and discussions in our Discord.
 
 Join our community here:
@@ -210,8 +215,10 @@ Join our community here:
 - ▶️ [Subscribe to our YouTube channel](https://www.youtube.com/@cocoindex-io)
 - 📜 [Read our blog posts](https://cocoindex.io/blogs/)
 
-## Support us:
+## Support us
+
 We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ at GitHub repo [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) to stay tuned and help us grow.
 
 ## License
+
 CocoIndex is Apache 2.0 licensed.
diff --git a/examples/pdf_elements_embedding/.env b/examples/pdf_elements_embedding/.env
@@ -0,0 +1,6 @@
+# Postgres database address for cocoindex
+COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
+
+# Fall back to CPU for operations not supported by MPS on Mac.
+# It's a no-op on other platforms.
+PYTORCH_ENABLE_MPS_FALLBACK=1
diff --git a/examples/pdf_elements_embedding/.gitignore b/examples/pdf_elements_embedding/.gitignore
@@ -0,0 +1 @@
+/source_files
diff --git a/examples/pdf_elements_embedding/README.md b/examples/pdf_elements_embedding/README.md
@@ -0,0 +1,71 @@
+# Extract text and images from PDFs and build multimodal search
+
+[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
+
+In this example, we extract text and images from PDF pages, embed them with two models, and store them in Qdrant for multimodal search:
+
+- Text: SentenceTransformers `all-MiniLM-L6-v2`
+- Images: CLIP `openai/clip-vit-large-patch14` (ViT-L/14, 768-dim)
+
+We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
+
+## Steps
+
+### Indexing Flow
+
+1. Ingest PDF files from the `source_files` directory.
+2. For each PDF page:
+   - Extract page text and images using `pypdf`.
+   - Skip images smaller than 16 px on either side and create thumbnails up to 512×512 for consistency.
+   - Split text into chunks with `SplitRecursively` (language="text", chunk_size=600, chunk_overlap=100).
+   - Embed text chunks with SentenceTransformers (`all-MiniLM-L6-v2`).
+   - Embed images with CLIP (`openai/clip-vit-large-patch14`).
+3. Save embeddings and metadata in Qdrant:
+   - Text collection: `PdfElementsEmbeddingText`
+   - Image collection: `PdfElementsEmbeddingImage`
+
+## Prerequisite
+
+[Install Qdrant](https://qdrant.tech/documentation/guides/installation/) if you don't have one running locally.
+
+Start Qdrant with Docker (exposes HTTP 6333 and gRPC 6334):
+
+```bash
+docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
+```
+
+Note: This example connects via gRPC at `http://localhost:6334`.
+
+## Input Data Preparation
+
+Download a few sample PDFs (all are board game manuals) and put them into the `source_files` directory by running:
+
+```bash
+./fetch_manual_urls.sh
+```
+
+You can also put your favorite PDFs into the `source_files` directory.
+
+## Run
+
+Install dependencies:
+
+```bash
+pip install -e .
+```
+
+Update the index. On the first run, this also sets up the tables:
+
+```bash
+cocoindex update --setup main
+```
+
+## CocoInsight
+
+I used CocoInsight (free beta now) to troubleshoot index generation and understand the data lineage of the pipeline. It just connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:
+
+```bash
+cocoindex server -ci main
+```
+
+Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
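
Review note: the `chunk_size=600, chunk_overlap=100` settings listed in the indexing steps can be pictured with a fixed-window splitter. The sketch below is only an illustration of how size and overlap interact; it is not how `SplitRecursively` works internally (that operator splits on the separators configured in `main.py`). `sliding_chunks` is a hypothetical helper, and measuring sizes in characters is an assumption.

```python
def sliding_chunks(
    text: str, chunk_size: int = 600, chunk_overlap: int = 100
) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters.

    Hypothetical helper for illustration only; sizes are counted in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, each window starts 500 characters after the previous one, so consecutive chunks share 100 characters of context.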
diff --git a/examples/pdf_elements_embedding/fetch_manual_urls.sh b/examples/pdf_elements_embedding/fetch_manual_urls.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+URLS=(
+    https://www.catan.com/sites/default/files/2021-06/catan_base_rules_2020_200707.pdf
+    https://michalskig.wordpress.com/wp-content/uploads/2010/10/manilaenglishgame_133_gamerules.pdf
+    https://fgbradleys.com/wp-content/uploads/rules/Carcassonne-rules.pdf
+    https://cdn.1j1ju.com/medias/2c/f9/7f-ticket-to-ride-rulebook.pdf
+)
+
+OUTPUT_DIR="source_files"
+mkdir -p "$OUTPUT_DIR"
+for URL in "${URLS[@]}"; do
+    echo "Fetching $URL"
+    wget -P "$OUTPUT_DIR" "$URL"
+done
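
Review note: the script above relies on bash arrays and `wget`. For environments without either, a rough Python equivalent using only the standard library might look like the sketch below. `target_path` and `fetch_all` are hypothetical names, not part of this commit, and some hosts may reject the default `urllib` user agent, so treat it as a starting point rather than a drop-in replacement.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Same URLs as in fetch_manual_urls.sh.
URLS = (
    "https://www.catan.com/sites/default/files/2021-06/catan_base_rules_2020_200707.pdf",
    "https://michalskig.wordpress.com/wp-content/uploads/2010/10/manilaenglishgame_133_gamerules.pdf",
    "https://fgbradleys.com/wp-content/uploads/rules/Carcassonne-rules.pdf",
    "https://cdn.1j1ju.com/medias/2c/f9/7f-ticket-to-ride-rulebook.pdf",
)
OUTPUT_DIR = Path("source_files")


def target_path(url: str, output_dir: Path = OUTPUT_DIR) -> Path:
    """Mirror `wget -P`: keep the last path segment of the URL as the filename."""
    return output_dir / Path(urlparse(url).path).name


def fetch_all(urls: tuple[str, ...] = URLS, output_dir: Path = OUTPUT_DIR) -> None:
    """Download every URL into output_dir, creating the directory if needed."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for url in urls:
        print(f"Fetching {url}")
        urlretrieve(url, target_path(url, output_dir))
```

Calling `fetch_all()` from the example directory would populate `source_files` in the same layout the shell script produces.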
diff --git a/examples/pdf_elements_embedding/main.py b/examples/pdf_elements_embedding/main.py
@@ -0,0 +1,183 @@
+import cocoindex
+import io
+import torch
+import functools
+import PIL.Image
+
+from dataclasses import dataclass
+from pypdf import PdfReader
+from transformers import CLIPModel, CLIPProcessor
+from typing import Literal
+
+
+QDRANT_GRPC_URL = "http://localhost:6334"
+QDRANT_COLLECTION_IMAGE = "PdfElementsEmbeddingImage"
+QDRANT_COLLECTION_TEXT = "PdfElementsEmbeddingText"
+
+CLIP_MODEL_NAME = "openai/clip-vit-large-patch14"
+CLIP_MODEL_DIMENSION = 768
+ClipVectorType = cocoindex.Vector[cocoindex.Float32, Literal[CLIP_MODEL_DIMENSION]]
+
+IMG_THUMBNAIL_SIZE = (512, 512)
+
+
+@functools.cache
+def get_clip_model() -> tuple[CLIPModel, CLIPProcessor]:
+    model = CLIPModel.from_pretrained(CLIP_MODEL_NAME)
+    processor = CLIPProcessor.from_pretrained(CLIP_MODEL_NAME)
+    return model, processor
+
+
+@cocoindex.op.function(cache=True, behavior_version=1, gpu=True)
+def clip_embed_image(img_bytes: bytes) -> ClipVectorType:
+    """
+    Convert an image to an embedding using the CLIP model.
+    """
+    model, processor = get_clip_model()
+    image = PIL.Image.open(io.BytesIO(img_bytes)).convert("RGB")
+    inputs = processor(images=image, return_tensors="pt")
+    with torch.no_grad():
+        features = model.get_image_features(**inputs)
+    return features[0].tolist()
+
+
+def clip_embed_query(text: str) -> ClipVectorType:
+    """
+    Embed the query text using the CLIP model.
+    """
+    model, processor = get_clip_model()
+    inputs = processor(text=[text], return_tensors="pt", padding=True)
+    with torch.no_grad():
+        features = model.get_text_features(**inputs)
+    return features[0].tolist()
+
+
+@cocoindex.transform_flow()
+def embed_text(
+    text: cocoindex.DataSlice[str],
+) -> cocoindex.DataSlice[cocoindex.Vector[cocoindex.Float32]]:
+    """
+    Embed the text using a SentenceTransformer model.
+    This logic is shared between indexing and querying, so it is extracted as a function."""
+    return text.transform(
+        cocoindex.functions.SentenceTransformerEmbed(
+            model="sentence-transformers/all-MiniLM-L6-v2"
+        )
+    )
+
+
+@dataclass
+class PdfImage:
+    name: str
+    data: bytes
+
+
+@dataclass
+class PdfPage:
+    page_number: int
+    text: str
+    images: list[PdfImage]
+
+
+@cocoindex.op.function()
+def extract_pdf_elements(content: bytes) -> list[PdfPage]:
+    """
+    Extract texts and images from a PDF file.
+    """
+    reader = PdfReader(io.BytesIO(content))
+    result = []
+    for i, page in enumerate(reader.pages):
+        text = page.extract_text()
+        images = []
+        for image in page.images:
+            img = image.image
+            if img is None:
+                continue
+            # Skip very small images.
+            if img.width < 16 or img.height < 16:
+                continue
+            thumbnail = io.BytesIO()
+            img.thumbnail(IMG_THUMBNAIL_SIZE)
+            img.save(thumbnail, img.format or "PNG")
+            images.append(PdfImage(name=image.name, data=thumbnail.getvalue()))
+        result.append(PdfPage(page_number=i + 1, text=text, images=images))
+    return result
+
+
+qdrant_connection = cocoindex.add_auth_entry(
+    "qdrant_connection",
+    cocoindex.targets.QdrantConnection(grpc_url=QDRANT_GRPC_URL),
+)
+
+
+@cocoindex.flow_def(name="PdfElementsEmbedding")
+def multi_format_indexing_flow(
+    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
+) -> None:
+    """
+    Define an example flow that embeds files into a vector database.
+    """
+    data_scope["documents"] = flow_builder.add_source(
+        cocoindex.sources.LocalFile(
+            path="source_files", included_patterns=["*.pdf"], binary=True
+        )
+    )
+
+    text_output = data_scope.add_collector()
+    image_output = data_scope.add_collector()
+    with data_scope["documents"].row() as doc:
+        doc["pages"] = doc["content"].transform(extract_pdf_elements)
+        with doc["pages"].row() as page:
+            page["chunks"] = page["text"].transform(
+                cocoindex.functions.SplitRecursively(
+                    custom_languages=[
+                        cocoindex.functions.CustomLanguageSpec(
+                            language_name="text",
+                            separators_regex=[
+                                r"\n(\s*\n)+",
+                                r"[\.!\?]\s+",
+                                r"\n",
+                                r"\s+",
+                            ],
+                        )
+                    ]
+                ),
+                language="text",
+                chunk_size=600,
+                chunk_overlap=100,
+            )
+            with page["chunks"].row() as chunk:
+                chunk["embedding"] = chunk["text"].call(embed_text)
+                text_output.collect(
+                    id=cocoindex.GeneratedField.UUID,
+                    filename=doc["filename"],
+                    page=page["page_number"],
+                    text=chunk["text"],
+                    embedding=chunk["embedding"],
+                )
+            with page["images"].row() as image:
+                image["embedding"] = image["data"].transform(clip_embed_image)
+                image_output.collect(
+                    id=cocoindex.GeneratedField.UUID,
+                    filename=doc["filename"],
+                    page=page["page_number"],
+                    image_data=image["data"],
+                    embedding=image["embedding"],
+                )
+
+    text_output.export(
+        "text_embeddings",
+        cocoindex.targets.Qdrant(
+            connection=qdrant_connection,
+            collection_name=QDRANT_COLLECTION_TEXT,
+        ),
+        primary_key_fields=["id"],
+    )
+    image_output.export(
+        "image_embeddings",
+        cocoindex.targets.Qdrant(
+            connection=qdrant_connection,
+            collection_name=QDRANT_COLLECTION_IMAGE,
+        ),
+        primary_key_fields=["id"],
+    )
diff --git a/examples/pdf_elements_embedding/pyproject.toml b/examples/pdf_elements_embedding/pyproject.toml
@@ -0,0 +1,14 @@
+[project]
+name = "pdf-elements-embedding"
+version = "0.1.0"
+description = "Simple example for cocoindex: extract text and images from PDF files and build vector index."
+requires-python = ">=3.11"
+dependencies = [
+    "cocoindex[embeddings,colpali]>=0.2.8",
+    "pypdf>=5.7.0",
+    "pillow>=10.0.0",
+    "qdrant-client>=1.15.0",
+]
+
+[tool.setuptools]
+packages = []