This project implements a multi-course RAG (Retrieval-Augmented Generation) assistant with a Telegram interface.
The system ingests course materials, builds a lightweight RAPTOR-style index, retrieves relevant knowledge, and answers questions strictly based on course context.
The system supports:
- multiple courses (`os-2023`, `ir-2024`, etc.)
- PDF ingestion with token-based chunking
- RAPTOR-lite index: Level-0 chunks + Level-1 summaries
- embeddings for both levels
- structured retrieval based on Level-1 similarity
- context construction with token-budget enforcement
- English answers generated by an LLM
- Telegram bot interface
- fully containerized deployment (Docker + uv)
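The token-based chunking mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual `tokenizer.py`: a whitespace split stands in for the real model tokenizer, and the `max_tokens`/`overlap` defaults are assumptions.

```python
def chunk_by_tokens(text: str, max_tokens: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most max_tokens tokens.

    A whitespace split stands in for a real model tokenizer here; the
    project presumably counts tokens with the embedding model's tokenizer.
    """
    tokens = text.split()
    if not tokens:
        return []
    chunks = []
    step = max_tokens - overlap  # consecutive chunks share `overlap` tokens
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.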
```
project/
  data/
    <course_id>/
      raw/            # original PDFs and materials
      index/          # chunks, summaries, embeddings
  src/
    ingest.py         # PDF ingestion and RAPTOR-lite index builder
    tokenizer.py      # model-based token counter and chunk splitter
    raptor_index.py   # index structures and disk I/O
    rag_pipeline.py   # retrieval + context building + LLM answering
    bot.py            # Telegram bot entry point
    router.py         # aiogram routing (commands, states)
    bot_state.py      # FSM definitions
    config.py         # .env configuration
  Dockerfile
  docker-compose.yml
  pyproject.toml
  README.md
```
Requirements:
- Python 3.11+
- uv (dependency manager)
- Docker (optional, recommended for deployment)
- a Telegram Bot API token
- an OpenAI-compatible API key (for embeddings and the LLM)
Install dependencies:

```
uv sync
```

Build the index for a course:

```
uv run python -m ingest <course_id>
```

Example:

```
uv run python -m ingest os-2023
```

Start the bot:

```
uv run python -m bot
```
Create a .env file in the project root:

```
OPENAI_API_KEY=your_key
OPENAI_MODEL=gpt-4o-mini
EMBEDDING_MODEL=text-embedding-3-large
TELEGRAM_BOT_TOKEN=your_telegram_token
```
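A minimal sketch of what `config.py` might look like, assuming it reads these variables from the environment (a loader such as python-dotenv would populate them from `.env` first). The variable names mirror the `.env` example; the defaults and the `Settings` class are illustrative assumptions.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    openai_model: str
    embedding_model: str
    telegram_bot_token: str

def load_settings() -> Settings:
    """Build Settings from environment variables, failing fast on missing keys."""
    def require(name: str) -> str:
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        return value

    return Settings(
        openai_api_key=require("OPENAI_API_KEY"),
        openai_model=os.environ.get("OPENAI_MODEL", "gpt-4o-mini"),
        embedding_model=os.environ.get("EMBEDDING_MODEL", "text-embedding-3-large"),
        telegram_bot_token=require("TELEGRAM_BOT_TOKEN"),
    )
```

Failing fast on missing secrets surfaces configuration mistakes at startup rather than on the first API call.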
Build and run with Docker:

```
docker compose build
docker compose up -d
```
Course data lives under data/:

```
data/
  os-2023/
    raw/      # upload PDFs here
    index/    # generated automatically by ingest
```

Run ingestion inside the container:

```
docker compose run --rm course-navigator-rag \
  uv run python -m course_navigator_rag.ingest os-2023
```
- Level-1 summaries represent clusters of bottom-level chunks.
- A user question is embedded via `text-embedding-3-large`.
- Level-1 summaries are ranked by cosine similarity.
- The corresponding Level-0 chunks are collected under a token-budget constraint.
- The final context is sent to the LLM (`gpt-4o-mini`).
- The model answers strictly based on the retrieved context.
This keeps responses grounded in the course material and minimizes hallucination.
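The retrieval steps above can be sketched as follows. The data shapes, the token counter, and the budget default are illustrative assumptions rather than the project's actual API.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def build_context(question_emb, summaries, chunks_by_summary, count_tokens, budget=3000):
    """Rank Level-1 summaries by similarity, then collect their Level-0 chunks.

    summaries: list of (summary_id, embedding) pairs.
    chunks_by_summary: summary_id -> list of chunk texts.
    count_tokens: callable estimating a chunk's token cost.
    """
    ranked = sorted(summaries, key=lambda s: cosine(question_emb, s[1]), reverse=True)
    context, used = [], 0
    for summary_id, _ in ranked:
        for chunk in chunks_by_summary[summary_id]:
            cost = count_tokens(chunk)
            if used + cost > budget:  # token-budget enforcement
                return "\n\n".join(context)
            context.append(chunk)
            used += cost
    return "\n\n".join(context)
```

Ranking at Level 1 but assembling the prompt from Level-0 chunks is what makes the index "RAPTOR-lite": the summaries act as a cheap routing layer over the raw text.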
- `/start`: choose a course
- After choosing, every message is interpreted as a question
- The bot retrieves context and replies in English
- The interface remains in Russian for a comfortable UX
To add a new course:
- Create the folders `data/<new_course>/raw` and `data/<new_course>/index`
- Upload PDFs into `raw/`
- Run ingestion: `uv run python -m course_navigator_rag.ingest <new_course>`
- Add the course to `AVAILABLE_COURSES` in `router.py`
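A hypothetical sketch of that registry — the actual structure in `router.py` may differ. Course ids are taken from this README; the human-readable titles are invented for illustration.

```python
# Course ids map to display titles shown in the course-selection keyboard.
AVAILABLE_COURSES = {
    "os-2023": "Operating Systems (2023)",  # title is a placeholder
    "ir-2024": "Information Retrieval (2024)",  # title is a placeholder
}

def resolve_course(course_id: str) -> str:
    """Validate a user-selected course id before switching the FSM state."""
    if course_id not in AVAILABLE_COURSES:
        raise KeyError(f"Unknown course: {course_id}")
    return AVAILABLE_COURSES[course_id]
```

Keeping the registry as a plain dict means adding a course is a one-line change once its index has been built.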