-- Built during EF Hackathon London 2024 --
Ingestion engine for rich PDFs. It has two implementations:
- Parse PDFs into images using Pillow, then use ColPali (a vision language model) to turn the images into vector embeddings. This captures the context of both the text and the visual elements.
- Use the Sentence Transformers library with PyMuPDF and PyPDF2 to convert the text into vector embeddings. This captures the context of the text only, but carries more metadata.
The vector embeddings created above can be passed into a Qdrant vector database.
During the hackathon this was integrated into an agentic system, link here: https://github.com/KenjiPcx/ef-fall-hack
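As a rough illustration of the hand-off to Qdrant described above, the sketch below pairs embeddings with metadata and uploads them with the `qdrant-client` package. It is not the repository's code: the collection name `pdf_chunks`, the sequential ids, and the helper names are assumptions made for the example. It also assumes one dense vector per item, which suits the text implementation; ColPali usually produces several vectors per page and would need a different collection configuration.

```python
from typing import Any, Dict, List, Sequence


def build_points(
    vectors: Sequence[Sequence[float]],
    payloads: Sequence[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Pair each embedding with its metadata and a sequential integer id."""
    if len(vectors) != len(payloads):
        raise ValueError("vectors and payloads must have the same length")
    return [
        {"id": i, "vector": list(vec), "payload": dict(meta)}
        for i, (vec, meta) in enumerate(zip(vectors, payloads))
    ]


def upsert_points(
    points: List[Dict[str, Any]],
    collection: str = "pdf_chunks",  # hypothetical collection name
    url: str = "http://localhost:6333",
) -> None:
    """Create the collection if it is missing, then upload the points."""
    # Imported lazily so build_points works without qdrant-client installed.
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    client = QdrantClient(url=url)
    if not client.collection_exists(collection):
        client.create_collection(
            collection_name=collection,
            vectors_config=VectorParams(
                size=len(points[0]["vector"]),  # dimension taken from the data
                distance=Distance.COSINE,       # assumed similarity metric
            ),
        )
    client.upsert(
        collection_name=collection,
        points=[PointStruct(**p) for p in points],
    )
```

Calling `upsert_points(build_points(vectors, payloads))` requires a running Qdrant instance on port 6333 and a reasonably recent `qdrant-client` (one that provides `collection_exists`).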
Pre-requisite steps:
- Create a pipenv shell
- Run pip install -r requirements.txt
- Store your PDF files inside an input_pdfs folder at the root of the project
- Get Docker running with Qdrant:
  docker pull qdrant/qdrant:latest
  docker run -p 6333:6333 -d qdrant/qdrant
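Before running either ingestion method, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch using only the standard library; the function names are hypothetical and not part of the repository. It only checks that something accepts connections on port 6333, not that it is actually Qdrant.

```python
import socket
from pathlib import Path
from typing import List


def list_input_pdfs(folder: str = "input_pdfs") -> List[Path]:
    """Return the PDF files directly inside the input folder, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def port_is_open(host: str = "localhost", port: int = 6333,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Run from the project root, `len(list_input_pdfs())` should be greater than zero and `port_is_open()` should return `True` once the Qdrant container is up.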
Method #1 of ingestion - using ColPali:
- Run vlm_impl/pdf_to_image.py to parse the PDFs into images to feed into the ingestion engine
- Run vlm_impl/store_embeddings.py. Change the torch accelerator to match your device first, e.g. "mps" for Apple Silicon macOS devices
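Instead of editing the accelerator string by hand, the device could be chosen at runtime. This is a sketch of one way to do that, not what the scripts currently do; `pick_device` is a hypothetical helper. It prefers CUDA, then Apple's MPS backend, then falls back to the CPU.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate

    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch releases, so look it up safely.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The returned string can be passed wherever the scripts expect a device name, for example `model.to(pick_device())`.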
Method #2 of ingestion - using Sentence Transformers:
- Run text_impl/process_pdfs.py. Change the torch accelerator to match your device first, e.g. "mps" for Apple Silicon macOS devices
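For orientation, the text route can be sketched roughly as follows: extract text page by page with PyMuPDF, split it into chunks, and embed the chunks with Sentence Transformers. This is an illustration under assumptions, not the contents of `text_impl/process_pdfs.py`: the 200-word chunk size, the `all-MiniLM-L6-v2` model, and the metadata keys `source` and `page` are all choices made for the example.

```python
from typing import Any, Dict, List


def chunk_text(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def extract_chunks(pdf_path: str, max_words: int = 200) -> List[Dict[str, Any]]:
    """Read a PDF page by page and return text chunks with source metadata."""
    import fitz  # PyMuPDF, imported lazily

    records = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for chunk in chunk_text(page.get_text(), max_words):
                records.append(
                    {"text": chunk, "source": pdf_path, "page": page_number}
                )
    return records


def embed_chunks(records: List[Dict[str, Any]],
                 model_name: str = "all-MiniLM-L6-v2",  # assumed model
                 device: str = "cpu"):
    """Return one embedding per record as a NumPy array."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name, device=device)
    return model.encode([r["text"] for r in records])
```

Keeping the page number and file name next to each chunk is what lets the text route carry more metadata into the vector database than the image route.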
Validation:
- Run the cells in query_qdrant.ipynb, updating the query with your specific request
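Outside the notebook, a query against the text embeddings might look like the sketch below. It is not taken from `query_qdrant.ipynb`: the collection name, the model, and the payload keys `source` and `page` are assumptions, and the query must be embedded with the same model that was used at ingestion time. It relies on `query_points`, which is available in recent `qdrant-client` releases.

```python
from typing import Any, Dict, List, Tuple

Hit = Tuple[float, Dict[str, Any]]


def search(query: str,
           collection: str = "pdf_chunks",         # hypothetical collection name
           limit: int = 5,
           url: str = "http://localhost:6333",
           model_name: str = "all-MiniLM-L6-v2"    # assumed embedding model
           ) -> List[Hit]:
    """Embed the query and return (score, payload) pairs for the nearest points."""
    from qdrant_client import QdrantClient
    from sentence_transformers import SentenceTransformer

    vector = SentenceTransformer(model_name).encode(query).tolist()
    client = QdrantClient(url=url)
    response = client.query_points(
        collection_name=collection,
        query=vector,
        limit=limit,
        with_payload=True,
    )
    return [(point.score, point.payload or {}) for point in response.points]


def format_hits(hits: List[Hit]) -> List[str]:
    """Render each hit as 'score  source  p.page' for quick inspection."""
    return [
        f"{score:.3f}  {payload.get('source', '?')}  p.{payload.get('page', '?')}"
        for score, payload in hits
    ]
```

For example, `print("\n".join(format_hits(search("your question here"))))` lists the best-matching chunks with their scores.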