PDF2VectorDB

-- Built during EF Hackathon London 2024 --

Ingestion engine for rich PDFs. It has two implementations:

1. Parse PDFs into images using Pillow, then use ColPali (a vision language model) to turn the images into vector embeddings. This captures the context of both the text and the visual elements.
2. Use the Sentence Transformers library with PyMuPDF and PyPDF2 to convert the text into vector embeddings. This captures the context of the text only, but carries more metadata.

The vector embeddings created above can be passed into a Qdrant vector database.
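
As a rough illustration of that hand-off, the sketch below pairs each embedding with its metadata and sends the result to Qdrant's REST API using only the standard library. It is not the code used in this repo, which may well use the qdrant-client package instead. The collection name, the payload fields and the function names are hypothetical. It also assumes one dense vector per point, as produced by the text implementation; ColPali output would need a different collection layout.

```python
import json
import urllib.request

# Assumption: Qdrant is listening on the port published by the Docker command.
QDRANT_URL = "http://localhost:6333"


def build_upsert_body(embeddings, payloads):
    """Pair each embedding with its metadata in the shape the points endpoint expects."""
    if len(embeddings) != len(payloads):
        raise ValueError("embeddings and payloads must have the same length")
    points = [
        {"id": idx, "vector": [float(x) for x in vector], "payload": payload}
        for idx, (vector, payload) in enumerate(zip(embeddings, payloads), start=1)
    ]
    return {"points": points}


def put_json(path, body):
    """Send a JSON PUT request to Qdrant and return the parsed reply."""
    request = urllib.request.Request(
        QDRANT_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())


def store_embeddings(collection, embeddings, payloads):
    """Create the collection, then upsert every point into it.

    Raises urllib.error.HTTPError if the collection already exists.
    """
    size = len(embeddings[0])
    put_json(f"/collections/{collection}", {"vectors": {"size": size, "distance": "Cosine"}})
    body = build_upsert_body(embeddings, payloads)
    return put_json(f"/collections/{collection}/points?wait=true", body)
```

With Qdrant running, something like store_embeddings("pdf_chunks", vectors, [{"file": "a.pdf", "page": 1}]) would create the collection and load the points.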

During the hackathon this was integrated into an agentic system, link here: https://github.com/KenjiPcx/ef-fall-hack

Prerequisite steps:

1. Create a pipenv shell
2. Run pip install -r requirements.txt
3. Store your PDF files inside an input_pdfs folder at the root of the project
4. Get Docker running with Qdrant
   - docker pull qdrant/qdrant:latest
   - docker run -p 6333:6333 -d qdrant/qdrant
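
Before running either ingestion method it can help to confirm that the container is answering. The snippet below is a minimal check, assuming the port mapping from the Docker command above; the function name is hypothetical and not part of this repo.

```python
import urllib.request


def qdrant_is_up(base_url="http://localhost:6333", timeout=2.0):
    """Return True if the collections endpoint at base_url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/collections", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts and HTTP error statuses.
        return False
```

Calling qdrant_is_up() with no arguments checks the default local instance.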

Method #1 of ingestion - using ColPali:

1. Run vlm_impl/pdf_to_image.py to parse the PDFs into images for the ingestion engine
2. Run vlm_impl/store_embeddings.py - remember to set the torch device to match your hardware, e.g. "mps" on Apple Silicon Macs
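
If you would rather not hard-code the device, one option is to detect it at runtime. This is a sketch, not what the scripts in this repo do: both function names are hypothetical, and the preference order (CUDA, then MPS, then CPU) is an assumption. The same idea applies to text_impl/process_pdfs.py.

```python
def pick_device(cuda_available, mps_available):
    """Prefer CUDA, then Apple's MPS backend, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Ask torch which accelerators exist; report "cpu" if torch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())
```

The returned string can be passed wherever the scripts currently set the device by hand.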

Method #2 of ingestion - using Sentence Transformers:

Run text_impl/process_pdfs.py - remember to set the torch device to match your hardware, e.g. "mps" on Apple Silicon Macs

Validation:

Run the cells in query_qdrant.ipynb - remember to update the query with your specific request
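
For a quick check outside the notebook, a similarity search can also be sent straight to Qdrant's REST API. This is a sketch under a few assumptions: the collection name is hypothetical, the collection holds one dense vector per point, and the query vector has already been produced by the same model that created the stored embeddings. The notebook itself may do this differently.

```python
import json
import urllib.request


def build_search_body(query_vector, limit=5):
    """Build the JSON body for a nearest-neighbour search that also returns payloads."""
    return {
        "vector": [float(x) for x in query_vector],
        "limit": limit,
        "with_payload": True,
    }


def search(collection, query_vector, limit=5, base_url="http://localhost:6333"):
    """Return the closest stored points, each with its id, score and payload."""
    request = urllib.request.Request(
        f"{base_url}/collections/{collection}/points/search",
        data=json.dumps(build_search_body(query_vector, limit)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())["result"]
```

Each hit returned by search(...) carries the payload stored at ingestion time, which is where the extra metadata from the text implementation would show up.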
