In this example, we extract text and images from PDF pages, embed them with two models, and store them in Qdrant for multimodal search:
- Text: SentenceTransformers `all-MiniLM-L6-v2`
- Images: CLIP `openai/clip-vit-large-patch14` (ViT-L/14, 768-dim)
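As a quick sanity check on the two models (not part of the example's pipeline code), the sketch below loads both encoders and prints their embedding dimensions. It assumes the `sentence-transformers`, `transformers`, and `Pillow` packages and uses a blank placeholder image:

```python
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# Text encoder: all-MiniLM-L6-v2 produces 384-dim sentence embeddings.
text_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(text_model.encode("How do I set up the board?").shape)  # (384,)

# Image encoder: CLIP ViT-L/14 produces 768-dim image embeddings.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = clip_processor(images=Image.new("RGB", (224, 224)), return_tensors="pt")
print(clip_model.get_image_features(**inputs).shape)  # torch.Size([1, 768])
```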
We'd appreciate a star ⭐ on the CocoIndex GitHub repo if this is helpful.
- Ingest PDF files from the `source_files` directory.
- For each PDF page:
  - Extract page text and images using `pypdf` (a rough sketch of this step follows this list).
  - Skip very small images and create thumbnails up to 512×512 for consistency.
  - Split text into chunks with `SplitRecursively(language="text", chunk_size=600, chunk_overlap=100)`.
  - Embed text chunks with SentenceTransformers (`all-MiniLM-L6-v2`).
  - Embed images with CLIP (`openai/clip-vit-large-patch14`).
- Save embeddings and metadata in Qdrant (see the upsert sketch after this list):
  - Text collection: `PdfElementsEmbeddingText`
  - Image collection: `PdfElementsEmbeddingImage`
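For intuition, here is a rough, self-contained sketch of the per-page extraction step using `pypdf` and Pillow. The 512×512 thumbnail cap comes from the description above; the minimum-size cutoff and the function shape are illustrative assumptions, not the example's exact code:

```python
import io
from pypdf import PdfReader
from PIL import Image

MIN_SIDE = 32     # assumed cutoff for "very small" images (pixels); the example may use a different value
THUMB_MAX = 512   # thumbnails are capped at 512x512, as described above

def extract_page_elements(pdf_path: str):
    """Yield (page_number, text, thumbnails) for each page of a PDF."""
    reader = PdfReader(pdf_path)
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        thumbnails = []
        for image_file in page.images:               # images embedded in this page
            img = Image.open(io.BytesIO(image_file.data)).convert("RGB")
            if min(img.size) < MIN_SIDE:             # skip very small images
                continue
            img.thumbnail((THUMB_MAX, THUMB_MAX))    # downscale in place, preserving aspect ratio
            thumbnails.append(img)
        yield page_num, text, thumbnails
```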
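In the actual flow, CocoIndex's Qdrant target writes the points for you. Purely to illustrate what lands in each collection, here is a hand-rolled sketch with `qdrant-client`; the placeholder vectors and payload field names are assumptions, and the two collections are assumed to already exist (the setup step described below creates them):

```python
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Connect over gRPC on port 6334, matching the example's connection settings.
client = QdrantClient(host="localhost", grpc_port=6334, prefer_grpc=True)

# One text chunk, embedded with all-MiniLM-L6-v2 (384-dim).
client.upsert(
    collection_name="PdfElementsEmbeddingText",
    points=[PointStruct(
        id=str(uuid.uuid4()),
        vector=[0.0] * 384,  # replace with a real text embedding
        payload={"filename": "manual.pdf", "page": 3, "text": "Place the board..."},  # assumed fields
    )],
)

# One page image, embedded with CLIP ViT-L/14 (768-dim).
client.upsert(
    collection_name="PdfElementsEmbeddingImage",
    points=[PointStruct(
        id=str(uuid.uuid4()),
        vector=[0.0] * 768,  # replace with a real image embedding
        payload={"filename": "manual.pdf", "page": 3, "image_index": 0},  # assumed fields
    )],
)
```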
Install Qdrant if you don't have one running locally.
Start Qdrant with Docker (exposes HTTP 6333 and gRPC 6334):
```sh
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

Note: This example connects via gRPC at http://localhost:6334.
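To confirm the gRPC endpoint is reachable before building the index, a one-off check with `qdrant-client` (not required by the example itself) could look like this:

```python
from qdrant_client import QdrantClient

# Connect over gRPC on port 6334, matching the example's connection settings.
client = QdrantClient(host="localhost", grpc_port=6334, prefer_grpc=True)
print(client.get_collections())  # lists no collections until the index has been built
```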
Download a few sample PDFs (all are board game manuals) and put them into the `source_files` directory by running:

```sh
./fetch_manual_urls.sh
```

You can also put your favorite PDFs into the `source_files` directory.
Install dependencies:
```sh
pip install -e .
```

Update the index, which will also set up the Qdrant collections the first time it runs:
```sh
cocoindex update --setup main
```

I used CocoInsight (currently in free beta) to troubleshoot the index generation and understand the data lineage of the pipeline. It simply connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:
```sh
cocoindex server -ci main
```

Then open the CocoInsight UI at https://cocoindex.io/cocoinsight.