A RAG pipeline for semantic search over academic papers, with an embedded Chainlit copilot and LangChain orchestration.
Forked from tahreemrasul/semantic_research_engine and deployed as a reference implementation for retrieval-augmented generation over the arXiv corpus. Original work by Tahreem Rasul; see upstream repository for contribution history.
Retrieves academic papers from the arXiv API based on a user query, embeds them into a Chroma vector database, and exposes a conversational interface for asking questions grounded in the retrieved literature. The application combines a Chainlit-based copilot embedded in a web page with Literal AI observability for tracking prompt performance and generation quality.
```
User Query → arXiv API → Paper Retrieval → Chroma Vector DB
                              ↓
             Embedding (HuggingFace) → RAG Pipeline
                              ↓
             GPT-3.5 Generation → Chainlit Copilot UI
                              ↓
        Literal AI Observability → Performance Tracking
```
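The "similarity search" step that Chroma performs amounts to a nearest-neighbor lookup over embedding vectors. The toy sketch below shows the idea in pure Python with hypothetical paper IDs and made-up 3-dimensional vectors — in the real pipeline the vectors come from a HuggingFace model and the lookup is done by Chroma:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "paper embeddings" -- hypothetical IDs and vectors, stand-ins for
# what HuggingFace embeddings stored in Chroma would look like.
papers = {
    "attention-is-all-you-need": [0.9, 0.1, 0.0],
    "resnet":                    [0.1, 0.8, 0.3],
    "bert":                      [0.8, 0.2, 0.1],
}

def top_k(query_vec, k=2):
    """Return the k paper IDs whose embeddings are closest to the query."""
    ranked = sorted(
        papers,
        key=lambda p: cosine_similarity(query_vec, papers[p]),
        reverse=True,
    )
    return ranked[:k]

results = top_k([1.0, 0.0, 0.0])  # query embedding near the first axis
# → ['attention-is-all-you-need', 'bert']
```

A production vector store replaces this brute-force scan with an approximate-nearest-neighbor index, but the ranking semantics are the same.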
| Layer | Technology | Purpose |
|---|---|---|
| Retrieval | arXiv API + LangChain | Fetch papers matching user query |
| Embedding | HuggingFace Transformers | Semantic vector representations |
| Vector Store | Chroma | Similarity search over paper embeddings |
| Generation | OpenAI GPT-3.5 | Answer synthesis from retrieved context |
| Frontend | Chainlit Copilot | Conversational UI embedded in web app |
| Observability | Literal AI | Prompt optimization and generation tracking |
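For the retrieval layer, the arXiv API returns results as an Atom XML feed (the repo delegates the actual fetching to LangChain's arXiv integration). As a rough illustration of what gets extracted, the sketch below parses a hypothetical, truncated response with only the standard library — no network access:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

# Hypothetical, truncated arXiv Atom response. A live query looks like:
#   http://export.arxiv.org/api/query?search_query=all:electron&max_results=2
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1234.5678v1</id>
    <title>A Hypothetical Paper on Electrons</title>
    <summary>Abstract text goes here.</summary>
  </entry>
</feed>"""

def parse_feed(xml_text):
    """Extract id/title/abstract from each entry of an arXiv Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall(f"{{{ATOM}}}entry"):
        papers.append({
            "id": entry.findtext(f"{{{ATOM}}}id"),
            "title": entry.findtext(f"{{{ATOM}}}title"),
            "abstract": entry.findtext(f"{{{ATOM}}}summary"),
        })
    return papers
```

The title and abstract fields are what get chunked and embedded downstream.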
- LangChain — RAG pipeline orchestration and arXiv integration
- Chroma — Vector database for paper embeddings
- OpenAI — GPT-3.5 for answer generation
- Chainlit — Copilot frontend with embedded web widget
- Literal AI — Observability, prompt management, and performance tracking
- HuggingFace — Embedding models
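Before embedding, each paper is split into overlapping chunks so that retrieved context fits the model's window. The pipeline delegates this to LangChain's text splitters; the minimal sketch below shows the underlying idea, with illustrative (not the repo's actual) chunk-size and overlap values:

```python
def split_text(text, chunk_size=100, overlap=20):
    """Split text into overlapping character chunks, as done before
    embedding documents into a vector store. A stand-in for LangChain's
    text splitters; sizes here are illustrative."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

abstract = ("word " * 60).strip()  # 299-character stand-in for an abstract
chunks = split_text(abstract, chunk_size=100, overlap=20)
```

The overlap ensures a sentence straddling a chunk boundary still appears intact in at least one chunk, which keeps retrieval from missing boundary-spanning context.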
```
├── search_engine.py    # Main Chainlit application + RAG pipeline
├── rag_test.py         # Standalone RAG pipeline test script
├── index.html          # Web frontend with embedded Chainlit copilot
├── requirements.txt    # Python dependencies
└── .env                # API keys (OpenAI, Literal AI)
```
```shell
conda create --name semantic_research_engine python=3.10
conda activate semantic_research_engine
pip install -r requirements.txt
```

Create a `.env` file in the project root:

```
OPENAI_API_KEY=your-key
LITERAL_API_KEY=your-key
```
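The application is expected to read these keys into the environment at startup (typically via python-dotenv's `load_dotenv()`). Purely to document the file format, here is a minimal stand-in parser — real deployments should use python-dotenv rather than this sketch:

```python
import os

def load_env(path=".env"):
    """Minimal .env parser: KEY=value lines, '#' comments, blank lines.
    A stand-in for python-dotenv's load_dotenv, shown only to document
    the expected file format."""
    loaded = {}
    if not os.path.exists(path):
        return loaded
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip()
            # Don't clobber keys already set in the real environment.
            os.environ.setdefault(key.strip(), value.strip())
    return loaded
```

Keep `.env` out of version control; both keys are secrets.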
```shell
# Start the Chainlit backend
chainlit run search_engine.py -w

# In a separate terminal, serve the web frontend
npx http-server
```

The Chainlit copilot runs at `localhost:8000`; the web app is served at `localhost:8080`.
This fork inherits the MIT License from the upstream repository.