Course Navigator RAG Bot

This project implements a multi-course RAG (Retrieval-Augmented Generation) assistant with a Telegram interface.
The system ingests course materials, builds a lightweight RAPTOR-style index, retrieves relevant knowledge, and answers questions strictly based on course context.

Overview

The system supports:

  • multiple courses (os-2023, ir-2024, etc.)
  • PDF ingestion with token-based chunking
  • RAPTOR-lite index: Level-0 chunks + Level-1 summaries
  • embeddings for both levels
  • structured retrieval based on Level-1 similarity
  • context construction with token-budget enforcement
  • English answers generated by an LLM
  • Telegram bot interface
  • fully containerized deployment (Docker + uv)
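Token-based chunking (tokenizer.py) can be sketched as a sliding window over tokens with overlap. This is a minimal illustration only: it uses whitespace splitting as a stand-in for the real model tokenizer, and the function name and default sizes are assumptions, not the project's actual values.

```python
def chunk_by_tokens(text: str, max_tokens: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most max_tokens tokens.

    Whitespace splitting stands in for the model-based tokenizer used in
    tokenizer.py; chunk size and overlap are illustrative defaults.
    """
    tokens = text.split()
    chunks = []
    step = max_tokens - overlap  # advance leaves `overlap` tokens of context
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        if window:
            chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break  # last window already reached the end of the text
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.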

Project Structure

project/
  data/
    <course_id>/
      raw/        # original PDFs and materials
      index/      # chunks, summaries, embeddings
  src/
    ingest.py          # PDF ingestion and RAPTOR-lite index builder
    tokenizer.py       # model-based token counter and chunk splitter
    raptor_index.py    # index structures and disk I/O
    rag_pipeline.py    # retrieval + context building + LLM answering
    bot.py             # Telegram bot entry point
    router.py          # aiogram routing (commands, states)
    bot_state.py       # FSM definitions
    config.py          # .env configuration
  Dockerfile
  docker-compose.yml
  pyproject.toml
  README.md

Requirements

  • Python 3.11+
  • uv (dependency manager)
  • Docker (optional, recommended for deployment)
  • Telegram Bot API token
  • OpenAI-compatible API key (for embeddings and LLM)

Installation (local)

Install dependencies

uv sync

Build index for a course

uv run python -m ingest <course_id>

Example:

uv run python -m ingest os-2023

Run the Telegram bot

uv run python -m bot

Environment Variables

Create a .env file in the project root:

OPENAI_API_KEY=your_key
OPENAI_MODEL=gpt-4o-mini
EMBEDDING_MODEL=text-embedding-3-large
TELEGRAM_BOT_TOKEN=your_telegram_token
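config.py presumably loads these variables from the environment. A minimal sketch of such a loader, assuming required keys fail fast and model names fall back to the defaults shown above (the `Settings` shape and function names are assumptions):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    openai_model: str
    embedding_model: str
    telegram_bot_token: str


def load_settings() -> Settings:
    """Read configuration from the environment, failing fast on missing keys."""
    def require(name: str) -> str:
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        return value

    return Settings(
        openai_api_key=require("OPENAI_API_KEY"),
        openai_model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"),
        embedding_model=os.environ.get("EMBEDDING_MODEL", "text-embedding-3-large"),
        telegram_bot_token=require("TELEGRAM_BOT_TOKEN"),
    )
```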

Docker Deployment

Build and run with docker-compose

docker compose build
docker compose up -d

Filesystem layout on server

data/
  os-2023/
    raw/      # upload PDFs here
    index/    # generated automatically by ingest

Build index on the server

docker compose run --rm course-navigator-rag \
  uv run python -m course_navigator_rag.ingest os-2023

How Retrieval Works

  1. Level-1 summaries represent clusters of bottom-level chunks.
  2. A user question is embedded via text-embedding-3-large.
  3. Level-1 summaries are ranked by cosine similarity.
  4. Corresponding Level-0 chunks are collected with a token-budget constraint.
  5. The final context is sent to the LLM (gpt-4o-mini).
  6. The model answers strictly based on retrieved context.

This keeps answers grounded in the course material and minimizes hallucination.
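Steps 3 and 4 can be sketched in plain Python: rank Level-1 summaries by cosine similarity, then collect the corresponding Level-0 chunks until the token budget is spent. The data shapes here are assumptions, and `count_tokens` stands in for the model-based counter in tokenizer.py.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_context(question_emb, summaries, chunks_by_cluster, count_tokens, budget=3000):
    """Rank Level-1 summaries, then gather their Level-0 chunks within a token budget.

    summaries: list of (cluster_id, embedding) pairs (assumed shape);
    chunks_by_cluster: cluster_id -> list of chunk texts (assumed shape).
    """
    ranked = sorted(summaries, key=lambda s: cosine(question_emb, s[1]), reverse=True)
    context, used = [], 0
    for cluster_id, _ in ranked:
        for chunk in chunks_by_cluster[cluster_id]:
            cost = count_tokens(chunk)
            if used + cost > budget:
                return context  # budget exhausted: stop collecting
            context.append(chunk)
            used += cost
    return context
```

Stopping at the first chunk that would overflow the budget keeps the context deterministic in size, at the cost of possibly leaving some budget unused.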

Telegram Bot Flow

  • /start — choose a course
  • After choosing, every message is interpreted as a question
  • The bot retrieves context and replies in English
  • The interface itself (menus and prompts) remains in Russian for a comfortable UX

Extending the System

To add a new course:

  1. Create folders:
    data/<new_course>/raw
    data/<new_course>/index
    
  2. Upload PDFs into raw/
  3. Run ingestion:
    uv run python -m course_navigator_rag.ingest <new_course>
    
  4. Add the course to AVAILABLE_COURSES in router.py.
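On the command line, the steps above might look like this (ml-2025 is a hypothetical course id used only for illustration):

```shell
course="ml-2025"   # hypothetical course id
mkdir -p "data/$course/raw" "data/$course/index"
ls "data/$course"
# upload PDFs into data/$course/raw, then build the index:
#   uv run python -m course_navigator_rag.ingest "$course"
```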

About

A modular Telegram bot with Retrieval-Augmented Generation (RAG), designed to support several courses at the same time.
