This repository contains an experimental study bot that ingests PDF course material, stores it in a vector database and generates language‑aware summaries, flashcards and chat‑style Q&A. The code base is in flux and will be refactored into a modular tutoring system.
- `build_index.py` – loads PDFs from `docs/` and creates a Chroma vector store (see the ingestion sketch after this list).
- `summarize_improved.py` – summarises every chunk with OpenAI models, caches the results and tree‑merges them into a `summary.md` / optional PDF.
- `chat.py` – provides a CLI chat interface backed by the vector store.
- `flashcards.py` – turns retrieved chunks into an Anki deck.
- Several legacy scripts exist and can be ignored during the refactor.
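For orientation, here is a minimal sketch of the ingestion step. It assumes the `llama-index-readers-file` and `llama-index-vector-stores-chroma` integrations are installed (neither is in the pip line below), and the persistence path and collection name are illustrative rather than taken from the scripts:

```python
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Load every PDF under docs/; sub-folders per module are picked up recursively.
documents = SimpleDirectoryReader("docs", recursive=True, required_exts=[".pdf"]).load_data()

# Persist embeddings in a local Chroma collection (path and name are illustrative).
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("course_material")
storage_context = StorageContext.from_defaults(
    vector_store=ChromaVectorStore(chroma_collection=collection)
)

# Chunking and embedding happen inside from_documents using the default node parser.
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```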
- Install Python 3.11+.
- Install the dependencies: `pip install llama-index-core llama-index-llms-openai chromadb tiktoken genanki tenacity tqdm pypandoc`
- Export your OpenAI API key: `export OPENAI_API_KEY=sk-...`
- Optionally set the model via an environment variable or by editing `config.json` (see the sketch below).
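How that override is resolved is not pinned down in this README; the following is a minimal sketch, assuming a hypothetical `OPENAI_MODEL` variable and a `model` key in `config.json` (both names are illustrative):

```python
import json
import os
from pathlib import Path

def resolve_model(default: str = "gpt-4o-mini") -> str:
    """Pick the OpenAI model: environment variable first, then config.json, then a default."""
    # OPENAI_MODEL is a hypothetical variable name, not one the scripts are known to read.
    if env_model := os.getenv("OPENAI_MODEL"):
        return env_model
    cfg = Path("config.json")
    if cfg.exists():
        # "model" is an assumed key; adjust to the actual config.json layout.
        return json.loads(cfg.read_text()).get("model", default)
    return default
```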
Build the index:
`python build_index.py`

Generate summaries (cached, async):

`python summarize_improved.py --no-pdf`

Chat with the material (see the retrieval sketch below):

`python chat.py`

Create flashcards:

`python flashcards.py`

To ingest new PDFs, place them under `docs/` (or a sub‑folder per module) and rerun `build_index.py`.
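As a rough illustration of what the chat step does under the hood, the sketch below re-opens the persisted index and asks a question. It again assumes the `llama-index-vector-stores-chroma` integration plus the illustrative `./chroma_db` path and `course_material` collection name from the ingestion sketch:

```python
import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Re-open the persisted Chroma collection and wrap it as a LlamaIndex vector store.
db = chromadb.PersistentClient(path="./chroma_db")           # illustrative path
collection = db.get_or_create_collection("course_material")  # illustrative name
index = VectorStoreIndex.from_vector_store(ChromaVectorStore(chroma_collection=collection))

# A condense-plus-context chat engine rewrites follow-up questions and
# injects retrieved chunks into the prompt before answering.
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")
print(chat_engine.chat("Summarise the key points of module 1.").response)
```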
- Ingestor – splits PDFs into overlapping text chunks and stores them in Chroma.
- Summariser – summarises each chunk and reduces the chunk summaries per module.
- Merger – combines module summaries into a final exam guide.
- Chat/Q&A – retrieves relevant chunks for user questions.
- Flashcard Builder – converts chunk summaries into Anki cards.
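A minimal sketch of the Flashcard Builder step using genanki; the model/deck IDs, field names, and deck title below are illustrative assumptions rather than what `flashcards.py` actually emits:

```python
import genanki

# Model and deck IDs must be unique and stable; these values are illustrative.
CARD_MODEL = genanki.Model(
    1607392319,
    "Study Bot Card",
    fields=[{"name": "Question"}, {"name": "Answer"}],
    templates=[{
        "name": "Card 1",
        "qfmt": "{{Question}}",
        "afmt": "{{FrontSide}}<hr id='answer'>{{Answer}}",
    }],
)

deck = genanki.Deck(2059400110, "Course Material")
deck.add_note(genanki.Note(
    model=CARD_MODEL,
    fields=["What does build_index.py produce?",
            "A persistent Chroma vector store built from the PDFs in docs/."],
))
genanki.Package(deck).write_to_file("course_material.apkg")
```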
The refactor aims to replace the ad‑hoc scripts with reusable modules and a single CLI entry point. JSON‑based "Learning Units" (see agents.md) will track progress and the relations between pieces of knowledge; a purely illustrative sketch of the idea follows.
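The actual schema is defined in agents.md; the field names below are assumptions made only to illustrate the concept:

```python
from dataclasses import asdict, dataclass, field
import json

@dataclass
class LearningUnit:
    # Illustrative fields only; the real schema lives in agents.md.
    unit_id: str
    title: str
    source_chunks: list[str] = field(default_factory=list)  # chunk IDs in the vector store
    related_units: list[str] = field(default_factory=list)  # links to other Learning Units
    mastery: float = 0.0                                     # learner progress, 0.0 to 1.0

unit = LearningUnit("db-normal-forms", "Normal forms", source_chunks=["chunk-042"])
print(json.dumps(asdict(unit), indent=2))
```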