Build embedding index from PDF files and query with natural language

In this example, we build several tables from papers in PDF files:

- Metadata (title, authors, abstract) for each paper.
- Author-to-paper mapping, for author-based queries.
- Embeddings for titles and abstract chunks, for semantic search.

If this is helpful, we would appreciate a star ⭐ at CocoIndex on GitHub.

Steps

Indexing Flow

We ingest a set of papers in PDF format.
For each file, we:
- Extract the first page of the paper.
- Convert the first page to Markdown.
- Extract metadata (title, authors, abstract) from the first page.
- Split the abstract into chunks, and compute embeddings for each chunk.
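
The last step above can be sketched in plain Python. This is only an illustration under stated assumptions: `PaperMetadata`, `split_into_chunks`, and `texts_to_embed` are hypothetical names, the word-window chunker is a stand-in rather than the splitter the example actually uses, and the sketch starts from metadata that has already been extracted from the first page.

```python
from dataclasses import dataclass


@dataclass
class PaperMetadata:
    # Hypothetical container for the fields extracted from the first page.
    title: str
    authors: list[str]
    abstract: str


def split_into_chunks(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (a stand-in chunker)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def texts_to_embed(meta: PaperMetadata) -> list[str]:
    """Return the title followed by every abstract chunk: the texts to embed."""
    return [meta.title] + split_into_chunks(meta.abstract)


# Illustrative input: a 100-word placeholder abstract.
paper = PaperMetadata(
    title="An Example Paper",
    authors=["A. Author", "B. Author"],
    abstract=" ".join(f"word{i}" for i in range(100)),
)
chunks = split_into_chunks(paper.abstract)
```

Overlapping windows keep a sentence that straddles a chunk boundary retrievable from either side, at the cost of embedding some words twice.
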

We export the results to the following tables in Postgres with PGVector:
- Metadata (title, authors, abstract) for each paper.
- Author-to-paper mapping, for author-based queries.
- Embeddings for titles and abstract chunks, for semantic search.
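
To make the purpose of these tables concrete, here is a minimal in-memory sketch of the two query styles they support. Everything in it is an assumption for illustration: the row layouts, column names, and file names are hypothetical, the embeddings are toy three-dimensional vectors, and the ranking is done in Python, whereas a real setup would let Postgres with PGVector do the similarity search.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical author-to-paper rows: one row per (author, paper) pair.
author_papers = [
    {"author": "A. Author", "filename": "paper_a.pdf"},
    {"author": "B. Author", "filename": "paper_a.pdf"},
    {"author": "A. Author", "filename": "paper_b.pdf"},
]

# Hypothetical embedding rows: one row per title or abstract chunk.
embeddings = [
    {"filename": "paper_a.pdf", "text": "title of paper a", "embedding": [1.0, 0.0, 0.0]},
    {"filename": "paper_a.pdf", "text": "abstract chunk of paper a", "embedding": [0.8, 0.6, 0.0]},
    {"filename": "paper_b.pdf", "text": "title of paper b", "embedding": [0.0, 0.0, 1.0]},
]


def papers_by_author(author: str) -> list[str]:
    """Author-based query: all papers that list the given author."""
    return sorted({row["filename"] for row in author_papers if row["author"] == author})


def semantic_search(query_embedding: list[float], top_k: int = 2) -> list[dict]:
    """Semantic query: rows ranked by cosine similarity to the query vector."""
    ranked = sorted(
        embeddings,
        key=lambda row: cosine_similarity(query_embedding, row["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


results = semantic_search([0.9, 0.1, 0.0])
```

In practice the query vector would come from embedding the natural-language question with the same model used for the titles and abstract chunks, so that both live in the same vector space.
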

Prerequisite

Install Postgres if you don't already have it.
Install dependencies:
```
pip install -e .
```
Create a .env file from .env.example and fill in OPENAI_API_KEY.
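
Before running the flow, it can help to confirm that the key was actually filled in. The following is a small optional sketch, not part of the example itself: `parse_env_text` and `missing_keys` are hypothetical helpers, the parser handles only simple `KEY=value` lines, and the sample values are placeholders.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required=("OPENAI_API_KEY",)) -> list[str]:
    """Return the required keys that are absent or left empty."""
    return [key for key in required if not values.get(key)]


# Placeholder contents: an unfilled template and a filled-in file.
sample = """
# copied from .env.example
OPENAI_API_KEY=
"""
filled = 'OPENAI_API_KEY="sk-placeholder"\n'
```

To check a real file, pass the result of `pathlib.Path(".env").read_text()` to `parse_env_text` and verify that `missing_keys` returns an empty list.
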

Run

Update the index. On the first run, this also sets up the tables:

cocoindex update --setup main

You can also run the command with -L, which watches for file changes and updates the index automatically:

cocoindex update --setup -L main

CocoInsight

I used CocoInsight (free beta for now) to troubleshoot index generation and understand the data lineage of the pipeline. It connects to your local CocoIndex server with zero pipeline data retention. Run the following command to start CocoInsight:

cocoindex server -ci main

Then open the CocoInsight UI at https://cocoindex.io/cocoinsight.
