In this example, we build a visual document indexing flow that uses ColPali to embed PDFs and images, and query the index with natural language.
We appreciate a star ⭐ on the CocoIndex GitHub repo if this is helpful.
- We ingest a list of PDF files and image files from the `source_files` directory.
- For each file:
  - PDF files: convert each page to a high-resolution image (300 DPI)
  - Image files: use the image directly
  - Generate visual embeddings for each page/image using the ColPali model
- We save the embeddings and metadata in the Qdrant vector database.
- We match user-provided natural language queries against the index using ColPali's text-to-visual embedding capability, enabling semantic search across visual document content.
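The "for each file" dispatch above hinges on telling PDFs apart from images. A minimal sketch of that routing decision using the standard library (the `classify` helper is illustrative only, not part of the CocoIndex API):

```python
import mimetypes

def classify(filename: str) -> str:
    """Route a source file the way the flow above does (illustrative helper)."""
    mime, _ = mimetypes.guess_type(filename)
    if mime == "application/pdf":
        return "pdf"      # convert each page to a 300 DPI image first
    if mime and mime.startswith("image/"):
        return "image"    # embed the image directly
    return "skip"         # not a supported document type

print(classify("paper.pdf"))   # pdf
print(classify("scan.png"))    # image
print(classify("notes.txt"))   # skip
```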
Install Qdrant if you don't have one running locally.
You can start Qdrant with Docker:
```sh
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

Install dependencies:
```sh
pip install -e .
```

NOTE: pdf2image requires poppler to be installed manually. Please refer to the pdf2image documentation for the specific installation instructions for your platform.
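As an aside: ColPali produces many 128-dimensional vectors per page, which maps onto Qdrant's multivector collections with MaxSim comparison. If you want to create or inspect the collection yourself, a sketch with `qdrant-client` might look like the following (the collection name is illustrative; the flow's setup step can manage the collection for you):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="colpali_pages",  # hypothetical name
    vectors_config=models.VectorParams(
        size=128,  # ColPali emits 128-dim vectors per token/patch
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)
```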
Setup:

```sh
cocoindex setup main
```

Update index:

```sh
cocoindex update main
```

Run:

```sh
python main.py
```

The example data files used in this demonstration come from the following sources:
- ArXiv Papers: Research papers sourced from ArXiv, an open-access repository of electronic preprints covering various scientific disciplines.
- Healthcare Industry Dataset: Images from the vidore/syntheticDocQA_healthcare_industry_test dataset on Hugging Face, which contains synthetic document question-answering data for healthcare industry documents.
- ESG Reports Dataset: Images from the vidore/esg_reports_eng_v2 dataset on Hugging Face, containing Environmental, Social, and Governance (ESG) reports.
We thank the creators and maintainers of these datasets for making their data available for research and development purposes.
This example uses ColPali, a state-of-the-art vision-language model that enables:
- Direct visual understanding of document layouts, tables, and figures
- Natural language queries against visual document content
- No need for OCR or text extraction - works directly with document images
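Under the hood, these capabilities come from late-interaction (MaxSim) scoring: every query token is matched against its best page patch, and the per-token maxima are summed. A toy numpy sketch of the scoring rule (2-dim vectors for readability; real ColPali embeddings are 128-dim per token/patch):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction score: best-matching page patch per query token, summed.
    query_emb: (num_query_tokens, dim); page_emb: (num_patches, dim)."""
    sims = query_emb @ page_emb.T           # (tokens, patches) similarity matrix
    return float(sims.max(axis=1).sum())    # max over patches, sum over tokens

query  = np.array([[1.0, 0.0], [0.0, 1.0]])               # two query tokens
page_a = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])   # page matching both tokens
page_b = np.array([[-1.0, 0.0], [0.0, -1.0]])             # unrelated page

print(maxsim_score(query, page_a))  # 2.0
print(maxsim_score(query, page_b))  # 0.0
```

Because each query token is scored independently, a page can rank highly even when the matching evidence is scattered across different regions (a table cell here, a figure caption there).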
I used CocoInsight (free beta) to troubleshoot the index generation and understand the data lineage of the pipeline. It simply connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:
```sh
cocoindex server -ci main
```
Then open the CocoInsight UI at https://cocoindex.io/cocoinsight.