
Semantic Research Paper Engine

RAG pipeline for semantic search over academic papers, with an embedded Chainlit copilot and LangChain orchestration.

Forked from tahreemrasul/semantic_research_engine and deployed as a reference implementation for retrieval-augmented generation over the arXiv corpus. Original work by Tahreem Rasul; see upstream repository for contribution history.


Overview

The engine retrieves academic papers from the arXiv API based on a user query, embeds them into a Chroma vector database, and exposes a conversational interface for questions grounded in the retrieved literature. A Chainlit-based copilot embedded in a web page provides the UI, while Literal AI observability tracks prompt performance and generation quality.
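The first stage is a query against the public arXiv API. A minimal, stdlib-only sketch of building that request (the repo itself goes through LangChain's arXiv integration; the `max_results` default here is illustrative):

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def build_arxiv_query(topic: str, max_results: int = 5) -> str:
    """Build an arXiv API search URL for a free-text topic query."""
    params = {
        "search_query": f"all:{topic}",  # search all fields for the topic
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"

print(build_arxiv_query("retrieval augmented generation"))
```

Fetching that URL returns an Atom feed of matching papers, whose abstracts and metadata feed the embedding step.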

Architecture

User Query → arXiv API → Paper Retrieval
                 ↓
HuggingFace Embeddings → Chroma Vector DB
                 ↓
RAG Pipeline → GPT-3.5 Generation → Chainlit Copilot UI
                 ↓
Literal AI Observability → Performance Tracking
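At the heart of the vector-store step is nearest-neighbour search by cosine similarity. A toy, dependency-free sketch of that idea (Chroma plus real HuggingFace embeddings replace the hand-rolled vectors and brute-force scan used here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, docs, k=2):
    """docs: list of (text, embedding). Returns the k most similar texts."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Illustrative 3-d "embeddings"; real ones have hundreds of dimensions.
docs = [
    ("paper on transformers", [0.9, 0.1, 0.0]),
    ("paper on graph theory", [0.0, 0.2, 0.9]),
    ("paper on attention",    [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], docs, k=2))
```

The retrieved chunks are then stuffed into the GPT-3.5 prompt as grounding context.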

Components

| Layer | Technology | Purpose |
|---|---|---|
| Retrieval | arXiv API + LangChain | Fetch papers matching the user query |
| Embedding | HuggingFace Transformers | Semantic vector representations |
| Vector Store | Chroma | Similarity search over paper embeddings |
| Generation | OpenAI GPT-3.5 | Answer synthesis from retrieved context |
| Frontend | Chainlit Copilot | Conversational UI embedded in a web app |
| Observability | Literal AI | Prompt optimization and generation tracking |

Stack

  • LangChain — RAG pipeline orchestration and arXiv integration
  • Chroma — Vector database for paper embeddings
  • OpenAI — GPT-3.5 for answer generation
  • Chainlit — Copilot frontend with embedded web widget
  • Literal AI — Observability, prompt management, and performance tracking
  • HuggingFace — Embedding models
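Before papers are embedded, their text is split into overlapping chunks so each vector covers a bounded context window. A minimal sketch of fixed-size character chunking (the repo relies on LangChain's text splitters; the sizes below are illustrative):

```python
def chunk_text(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` chars, each overlapping the previous by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # advance by size minus overlap each window
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 250, size=100, overlap=20)
print(len(chunks), [len(c) for c in chunks])
```

Overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.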

Project Structure

├── search_engine.py    # Main Chainlit application + RAG pipeline
├── rag_test.py         # Standalone RAG pipeline test script
├── index.html          # Web frontend with embedded Chainlit copilot
├── requirements.txt    # Python dependencies
└── .env                # API keys (OpenAI, Literal AI)

Setup

conda create --name semantic_research_engine python=3.10
conda activate semantic_research_engine
pip install -r requirements.txt

Create a .env file:

OPENAI_API_KEY=your-key
LITERAL_API_KEY=your-key
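These keys are loaded into the process environment at startup, typically via `python-dotenv`. A minimal stdlib equivalent, shown purely for illustration:

```python
import os

def load_env(path: str = ".env") -> None:
    """Parse KEY=value lines from a dotenv-style file into os.environ."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Existing environment variables take precedence because of `setdefault`, matching the usual dotenv convention.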

Usage

# Start Chainlit backend
chainlit run search_engine.py -w

# In a separate terminal, serve the web frontend
npx http-server

The Chainlit backend serves the copilot at localhost:8000; http-server serves the web app at localhost:8080 by default.

License

This fork inherits the MIT License from the upstream repository.
