# Adjust PDF workflow #3
`@@ -0,0 +1,18 @@`

```gitignore
# Python artifacts
__pycache__/
*.py[cod]
*.egg-info/
# Virtual envs
venv/
.env

# Data
chroma/
cache/
*.apkg

# study PDFs kept local
Dev/data/

# Other
.idea/
```
`@@ -0,0 +1,7 @@`

# Contributing

1. Install dependencies using Poetry or `requirements.txt`.
2. Follow the existing module structure under `src/study_tools`.
3. Add tests for new functionality in `Dev/tests`.
4. Run `ruff`, `black`, and `pytest` before submitting a PR (see the sketch below).
5. Document changes in `docs/changelog.md` and update `TODO.md` if needed.
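A one-liner covering step 4 might look like this (a sketch; the exact flags depend on how `ruff` and `black` are configured in this repo):

```bash
ruff check . && black --check . && pytest Dev/tests
```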
`@@ -0,0 +1,22 @@`

# Study Tools Dev Package

This `Dev/` directory houses the refactored implementation of the **Universal Study Tutor**. The old prototype remains in `messy_start/` for reference. Course PDFs should be placed in `Dev/data/`, which is ignored by Git.

## Features
- Configurable PDF ingestion and chunking
- Async summarisation using local Mistral and OpenAI GPT-4o
- CLI tools for building the index, chat, flashcards and maintenance
- Learning Unit JSON schema with status counters and categories
- Externalised configuration via `config.yaml`
- Course PDFs stored locally in `Dev/data/` (see `docs/MIGRATE_LARGE_FILES.md`)

## Quickstart
```bash
python -m pip install -r requirements.txt
python -m study_tools.build_index
python -m study_tools.summarize
python -m study_tools.cli_chat
```

See `docs/overview.md` for more details.
`@@ -0,0 +1,13 @@`

```yaml
agents:
  - name: Ingestor
    role: Split PDFs into sentence-aware chunks and store them in Qdrant.
  - name: Summariser
    role: Summarise chunks using GPT-4o and cache results.
  - name: Tagger
    role: Classify chunks into categories with local Mistral.
  - name: LUManager
    role: Persist Learning Units with status counters and relations.
  - name: Chat
    role: Interactive Q&A and tutoring over the stored materials.
  - name: FlashcardBuilder
    role: Generate Anki-compatible decks from summaries.
```
`@@ -0,0 +1,23 @@`

```yaml
paths:
  docs_dir: data
  chroma_dir: chroma
  cache_dir: cache
chunking:
  chunk_size: 1024
  chunk_overlap: 128
  pages_per_group: 2
  page_overlap: 1
  chunk_group_limit: 6000
models:
  default: gpt-4o
  tagging: mistral-7b-instruct
  summarizer: gpt-4o
context_windows:
  gpt-4o: 128000
  gpt-4-turbo: 128000
  gpt-4: 8192
  gpt-3.5-turbo: 16385
  mistral-7b-instruct: 32768
limits:
  tokens_per_minute: 40000
  token_margin: 512
```
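The scripts below all call `utils.load_config`, which is not part of this diff. A minimal sketch of what it presumably does, assuming it simply parses `config.yaml` with PyYAML (the real helper may resolve paths differently):

```python
from pathlib import Path

import yaml


def load_config(path: str | Path = "config.yaml") -> dict:
    # Hypothetical sketch: the actual utils.load_config is not shown in this PR.
    with open(path, "r", encoding="utf-8") as fh:
        return yaml.safe_load(fh)
```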
`@@ -0,0 +1,11 @@`

# Handling Large PDF Files

Place course PDFs inside `Dev/data/`, which is ignored by Git; they are not versioned by default.

If repository limits become a problem later, you can retroactively move PDFs into Git LFS with:

```bash
git lfs migrate import --include='*.pdf'
```

Otherwise keep the files locally and back them up to Google Drive or GCS as needed.
`@@ -0,0 +1,23 @@`

# TODO Backlog

## P0
- Centralised configuration loader (`utils.load_config`).
- Remove hard-coded paths; read from `config.yaml`.
- Store PDFs in `Dev/data/` (optionally migrate to Git LFS later).

## P1
- OCR fallback and duplicate detection during ingestion.
- Implement KnowledgeNode graph with status counters.
- Tagging pipeline using local Mistral model.
- CLI commands via `python -m study_tools <command>`.

## P2
- Evaluation harness (ROUGE-L, entity overlap, manual rubric).
- Streamlit MVP for progress view.

## P3
- Difficulty-graded exam question generator (IRT).
- Anki `*.apkg` exporter with AnkiConnect.

## P4
- Visual progress dashboard and Obsidian vault export.
`@@ -0,0 +1,7 @@`

# Changelog

## 2025-07-03
- Initial refactor: new `Dev/` package created.
- Configuration moved to `config.yaml`.
- PDFs now stored in `Dev/data/`; Git LFS usage is optional.
- Migrated documentation and created skeleton tests.
`@@ -0,0 +1,12 @@`

# Overview

The Dev package implements the second iteration of the study bot based on the **Hybrid-Edge** architecture:

- **Local tagging** with Mistral-7B-Instruct classifies text chunks into categories.
- **GPT-4o/4.1** performs heavy summarisation and tutoring logic.
- **SQLite** stores metadata and Learning Units; **Qdrant** provides vector search.
- Outputs are plain JSON, which is rendered to Markdown files.

Course PDFs belong in `Dev/data/` and are not tracked in Git.

Scripts read defaults from `config.yaml` so chunk sizes and model names are easily changed.
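The Learning Unit schema itself is not included in this PR; a hypothetical example of the JSON shape, inferred only from the feature list (categories, status counters, relations):

```python
# Hypothetical Learning Unit record -- field names are illustrative, not from the PR.
learning_unit = {
    "id": "lu-0001",
    "title": "Example topic",
    "category": "statistics",
    "status_counters": {"new": 3, "learning": 1, "mastered": 0},
    "relations": ["lu-0002"],  # linked KnowledgeNode / Learning Unit ids
    "source": {"file_name": "lecture01.pdf", "page_start": 3, "page_end": 4},
}
```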
`@@ -0,0 +1,23 @@`

```toml
[tool.poetry]
name = "study-tools"
version = "0.2.0"
description = "Universal Study Tutor"
authors = ["Study Bot Team"]
packages = [{include = "study_tools", from = "src"}]

[tool.poetry.dependencies]
python = "^3.12"
llama-index-core = "*"
llama-index-llms-openai = "*"
chromadb = "*"
tiktoken = "*"
tenacity = "*"
qdrant-client = "*"
genanki = "*"
tqdm = "*"
pyyaml = "*"

[tool.poetry.group.dev.dependencies]
pytest = "*"
ruff = "*"
black = "*"
```

**🛠️ Refactor suggestion (lines +10 to +18): Pin dependency versions for reproducible builds.** Using `"*"` for all dependencies makes builds non-reproducible and can lead to dependency conflicts. Consider pinning to specific version ranges:

```diff
-llama-index-core = "*"
-llama-index-llms-openai = "*"
-chromadb = "*"
-tiktoken = "*"
-tenacity = "*"
-qdrant-client = "*"
-genanki = "*"
-tqdm = "*"
-pyyaml = "*"
+llama-index-core = "^0.10.0"
+llama-index-llms-openai = "^0.1.0"
+chromadb = "^0.4.0"
+tiktoken = "^0.5.0"
+tenacity = "^8.0.0"
+qdrant-client = "^1.7.0"
+genanki = "^2.1.0"
+tqdm = "^4.65.0"
+pyyaml = "^6.0.0"
```
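With the dev group defined above, a standard Poetry setup would be:

```bash
poetry install --with dev
```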
`@@ -0,0 +1,9 @@`

```text
llama-index-core
llama-index-llms-openai
chromadb
tiktoken
tenacity
qdrant-client
genanki
tqdm
pyyaml
```
`@@ -0,0 +1,11 @@`

```python
"""Study Tools package."""

__all__ = [
    "build_index",
    "summarize",
    "cli_chat",
    "flashcards",
    "ingest",
    "reset",
    "utils",
]
```
`@@ -0,0 +1,67 @@`

```python
"""PDF ingestion and vector index creation."""

from pathlib import Path
import shutil

# Heavy imports are done inside functions to allow importing this module without
# optional dependencies.

from .utils import load_config


def extract_pages(pdf_path: Path, pages_per_group: int, overlap: int):
    import fitz  # PyMuPDF
    from llama_index.core import Document

    doc = fitz.open(pdf_path)
    for i in range(0, len(doc), pages_per_group - overlap):
        end = min(i + pages_per_group, len(doc))
        text = "\n\n".join(doc[pg].get_text() for pg in range(i, end))
        meta = {
            "file_path": str(pdf_path),
            "file_name": pdf_path.name,
            "page_start": i + 1,
            "page_end": end,
        }
        yield Document(text=text, metadata=meta)


def main():
    from llama_index.core import VectorStoreIndex, StorageContext, Document
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    from qdrant_client import QdrantClient

    cfg = load_config()
    paths = cfg["paths"]
    docs_dir = Path(paths["docs_dir"])
    chroma_dir = Path(paths["chroma_dir"])

    chunk = cfg["chunking"]

    if chroma_dir.exists():
        shutil.rmtree(chroma_dir)

    docs = []
    for pdf in docs_dir.rglob("*.pdf"):
        docs.extend(
            extract_pages(
                pdf,
                chunk["pages_per_group"],
                chunk["page_overlap"],
            )
        )

    splitter = SentenceSplitter(
        chunk_size=chunk["chunk_size"],
        chunk_overlap=chunk["chunk_overlap"],
    )
    nodes = splitter.get_nodes_from_documents(docs)

    client = QdrantClient(path=str(chroma_dir))
    store = QdrantVectorStore(client, collection_name="study")
    storage = StorageContext.from_defaults(vector_store=store)
    VectorStoreIndex(nodes, storage_context=storage)
    storage.persist(persist_dir=str(chroma_dir))


if __name__ == "__main__":
    main()
```

**Review comment (lines +16 to +17): Potential infinite loop with invalid overlap configuration.** The step of `range(0, len(doc), pages_per_group - overlap)` is zero when `overlap == pages_per_group` (raising `ValueError`) and negative when `overlap > pages_per_group` (silently processing no pages). Add input validation to prevent this edge case:

```diff
 def extract_pages(pdf_path: Path, pages_per_group: int, overlap: int):
+    if overlap >= pages_per_group:
+        raise ValueError(f"Overlap ({overlap}) must be less than pages_per_group ({pages_per_group})")
+    if pages_per_group <= 0 or overlap < 0:
+        raise ValueError("pages_per_group must be positive and overlap must be non-negative")
     import fitz  # PyMuPDF
```

**🛠️ Refactor suggestion: Misleading variable name: using Qdrant but named `chroma_dir`.** The variable is named `chroma_dir` but it holds the Qdrant persistence path. Rename it for clarity:

```diff
-    chroma_dir = Path(paths["chroma_dir"])
+    vector_store_dir = Path(paths["chroma_dir"])  # Consider renaming config key to "vector_store_dir"
```

And update all references:

```diff
-    if chroma_dir.exists():
-        shutil.rmtree(chroma_dir)
+    if vector_store_dir.exists():
+        shutil.rmtree(vector_store_dir)
-    client = QdrantClient(path=str(chroma_dir))
+    client = QdrantClient(path=str(vector_store_dir))
-    storage.persist(persist_dir=str(chroma_dir))
+    storage.persist(persist_dir=str(vector_store_dir))
```
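To make the page-group stepping in `extract_pages` concrete: with the `config.yaml` defaults (`pages_per_group: 2`, `page_overlap: 1`) the step is 1, so a 4-page PDF yields overlapping groups:

```python
pages_per_group, overlap, n_pages = 2, 1, 4  # defaults from config.yaml
groups = [
    (i + 1, min(i + pages_per_group, n_pages))  # 1-based (page_start, page_end)
    for i in range(0, n_pages, pages_per_group - overlap)
]
print(groups)  # [(1, 2), (2, 3), (3, 4), (4, 4)]
```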
`@@ -0,0 +1,43 @@`

```python
"""CLI chat interface."""

import argparse
from pathlib import Path

# heavy imports done in main()

from .utils import load_config


def main():
    from llama_index.core import StorageContext, load_index_from_storage
    from llama_index.llms.openai import OpenAI
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    from qdrant_client import QdrantClient

    cfg = load_config()
    llm = OpenAI(model=cfg["models"]["summarizer"])
    chroma_path = cfg["paths"]["chroma_dir"]
    client = QdrantClient(path=chroma_path)
    store = QdrantVectorStore(client, collection_name="study")
    storage = StorageContext.from_defaults(persist_dir=chroma_path, vector_store=store)
    index = load_index_from_storage(storage)
    engine = index.as_chat_engine(chat_mode="condense_question", llm=llm, verbose=True)

    parser = argparse.ArgumentParser()
    parser.add_argument("question", nargs="*")
    args = parser.parse_args()

    if args.question:
        q = " ".join(args.question)
        print(engine.chat(q).response)
    else:
        print("Ask questions (blank to exit)")
        while True:
            q = input("? ")
            if not q.strip():
                break
            print(engine.chat(q).response)


if __name__ == "__main__":
    main()
```
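Typical invocations (assuming the index has been built and `OPENAI_API_KEY` is set in the environment):

```bash
python -m study_tools.cli_chat "What does the lecture say about chunking?"
python -m study_tools.cli_chat   # no arguments starts the interactive loop
```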
`@@ -0,0 +1,39 @@`

```python
"""Generate Anki deck from summaries."""

import uuid
from pathlib import Path

# heavy imports in main()

from .utils import load_config


def main():
    import genanki
    from llama_index.core import StorageContext, load_index_from_storage
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    from qdrant_client import QdrantClient

    cfg = load_config()
    chroma_path = cfg["paths"]["chroma_dir"]

    client = QdrantClient(path=chroma_path)
    store = QdrantVectorStore(client, collection_name="study")
    storage = StorageContext.from_defaults(persist_dir=chroma_path, vector_store=store)
    index = load_index_from_storage(storage)
    retriever = index.as_retriever(similarity_top_k=50)

    deck = genanki.Deck(uuid.uuid4().int >> 64, "Study-Bot Deck")
    for node in index.docstore.docs.values():
        qa = retriever.query(f"Turn this into Q&A flashcards:\n\n{node.text}").response
        for line in qa.splitlines():
            if "?" in line:
                q, a = line.split("?", 1)
                note = genanki.Note(model=genanki.BASIC_MODEL, fields=[q.strip()+"?", a.strip()])
                deck.add_note(note)

    genanki.Package(deck).write_to_file("study.apkg")
    print("study.apkg ready – import into Anki")


if __name__ == "__main__":
    main()
```

**🛠️ Refactor suggestion: Fix naming inconsistency: `chroma_dir` should be `qdrant_dir`.** The code uses the Qdrant vector store but references a `chroma_dir` configuration key, which is inconsistent:

```diff
-    chroma_path = cfg["paths"]["chroma_dir"]
+    chroma_path = cfg["paths"]["qdrant_dir"]
```

**Review comment (lines +27 to +32): Fix incorrect usage of retriever for Q&A generation.** A retriever only performs similarity search; its result contains retrieved nodes, not generated text, so calling it with a "turn this into flashcards" prompt will not produce Q&A pairs. Use the LLM directly instead:

```diff
-        qa = retriever.query(f"Turn this into Q&A flashcards:\n\n{node.text}").response
-        for line in qa.splitlines():
-            if "?" in line:
-                q, a = line.split("?", 1)
-                note = genanki.Note(model=genanki.BASIC_MODEL, fields=[q.strip()+"?", a.strip()])
-                deck.add_note(note)
+        # Use LLM directly for Q&A generation instead of retriever
+        llm = index.service_context.llm
+        qa_prompt = f"Generate 3-5 question-answer pairs from this text. Format each as 'Q: question? A: answer':\n\n{node.text}"
+        qa_response = llm.complete(qa_prompt).text
+
+        for line in qa_response.splitlines():
+            if line.startswith("Q:") and "A:" in line:
+                try:
+                    q_part, a_part = line.split("A:", 1)
+                    question = q_part.replace("Q:", "").strip()
+                    answer = a_part.strip()
+                    if question and answer:
+                        note = genanki.Note(model=genanki.BASIC_MODEL, fields=[question, answer])
+                        deck.add_note(note)
+                except ValueError:
+                    continue  # Skip malformed lines
```
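A typical run (assuming the index exists; the deck is written to the current directory):

```bash
python -m study_tools.flashcards
# then import study.apkg into Anki
```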
`@@ -0,0 +1,17 @@`

```python
"""Simple document count utility."""

from pathlib import Path

from .utils import load_config


def main():
    from llama_index.core import SimpleDirectoryReader

    cfg = load_config()
    docs_dir = Path(cfg["paths"]["docs_dir"])
    docs = SimpleDirectoryReader(str(docs_dir)).load_data()
    print(f"Loaded {len(docs)} docs")


if __name__ == "__main__":
    main()
```

**🛠️ Refactor suggestion (lines +8 to +14): Add error handling and improve code structure.** The code works but could benefit from better error handling:

```diff
 def main():
-    from llama_index.core import SimpleDirectoryReader
-    cfg = load_config()
-    docs_dir = Path(cfg["paths"]["docs_dir"])
-    docs = SimpleDirectoryReader(str(docs_dir)).load_data()
-    print(f"Loaded {len(docs)} docs")
+    try:
+        from llama_index.core import SimpleDirectoryReader
+        cfg = load_config()
+        docs_dir = Path(cfg["paths"]["docs_dir"])
+
+        if not docs_dir.exists():
+            print(f"Error: Documents directory does not exist: {docs_dir}")
+            return
+
+        docs = SimpleDirectoryReader(str(docs_dir)).load_data()
+        print(f"Loaded {len(docs)} docs")
+    except KeyError as e:
+        print(f"Error: Missing configuration key: {e}")
+    except Exception as e:
+        print(f"Error loading documents: {e}")
```

Consider moving the import to the top of the file for better visibility:

```diff
+from llama_index.core import SimpleDirectoryReader
 from pathlib import Path
 from .utils import load_config
```
`@@ -0,0 +1,25 @@`

```python
"""Remove generated data."""

import shutil
from pathlib import Path

from .utils import load_config


def main():
    cfg = load_config()
    paths = cfg["paths"]
    for key in ("chroma_dir", "cache_dir"):
        p = Path(paths[key])
        if p.exists():
            shutil.rmtree(p)
            print(f"Deleted {p}")
    for f in ("summary.md", "summary.pdf", "study.apkg"):
        fp = Path(f)
        if fp.exists():
            fp.unlink()
            print(f"Deleted {fp}")


if __name__ == "__main__":
    main()
```

**Review comment (lines +12 to +17): `KeyError` & safety guard around path lookup.** `paths[key]` raises `KeyError` when a key is missing from the config; guard the lookup:

```diff
     for key in ("chroma_dir", "cache_dir"):
+        if key not in paths:
+            continue
         p = Path(paths[key])
```

**🛠️ Refactor suggestion (lines +17 to +21): Hard-coded filenames tie the script to the CWD.** `summary.md`, `summary.pdf`, and `study.apkg` are resolved relative to the current working directory, so the cleanup only works when the script is run from the project root.

**💡 Verification agent: Verify Python 3.12 requirement necessity.** Python 3.12 is quite recent and may limit adoption; verify whether Python 3.10+ would suffice.

Lower the Python requirement to 3.10 (no 3.12-only features detected): the codebase uses PEP 604 union types (`Path | str`), which require Python 3.10+, but no features exclusive to 3.11 or 3.12 were found. Please relax the constraint in `pyproject.toml` unless there is another justification for requiring 3.12:

```diff
-python = "^3.12"
+python = "^3.10"
```