# Adjust PDF workflow #3
`@@ -0,0 +1,18 @@`

```gitignore
# Python artifacts
__pycache__/
*.py[cod]
*.egg-info/
# Virtual envs
venv/
.env

# Data
chroma/
cache/
*.apkg

# study PDFs kept local
Dev/data/

# Other
.idea/
```
`@@ -0,0 +1,7 @@`

# Contributing

1. Install dependencies using Poetry or `requirements.txt`.
2. Follow the existing module structure under `src/study_tools`.
3. Add tests for new functionality in `Dev/tests`.
4. Run `ruff`, `black`, and `pytest` before submitting a PR (see the sketch below).
5. Document changes in `docs/changelog.md` and update `TODO.md` if needed.
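A one-liner covering step 4 might look like this (a sketch; the exact flags depend on how `ruff` and `black` are configured in this repo):

```bash
ruff check . && black --check . && pytest Dev/tests
```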
`@@ -0,0 +1,22 @@`

# Study Tools Dev Package

This `Dev/` directory houses the refactored implementation of the **Universal Study Tutor**. The old prototype remains in `messy_start/` for reference. Course PDFs should be placed in `Dev/data/`, which is ignored by Git.

## Features
- Configurable PDF ingestion and chunking
- Async summarisation using local Mistral and OpenAI GPT-4o
- CLI tools for building the index, chat, flashcards and maintenance
- Learning Unit JSON schema with status counters and categories
- Externalised configuration via `config.yaml`
- Course PDFs stored locally in `Dev/data/` (see `docs/MIGRATE_LARGE_FILES.md`)

## Quickstart
```bash
python -m pip install -r requirements.txt
python -m study_tools.build_index
python -m study_tools.summarize
python -m study_tools.cli_chat
```

See `docs/overview.md` for more details.
`@@ -0,0 +1,13 @@`

```yaml
agents:
  - name: Ingestor
    role: Split PDFs into sentence-aware chunks and store them in Qdrant.
  - name: Summariser
    role: Summarise chunks using GPT-4o and cache results.
  - name: Tagger
    role: Classify chunks into categories with local Mistral.
  - name: LUManager
    role: Persist Learning Units with status counters and relations.
  - name: Chat
    role: Interactive Q&A and tutoring over the stored materials.
  - name: FlashcardBuilder
    role: Generate Anki-compatible decks from summaries.
```
`@@ -0,0 +1,23 @@`

```yaml
paths:
  docs_dir: data
  chroma_dir: chroma
  cache_dir: cache
chunking:
  chunk_size: 1024
  chunk_overlap: 128
  pages_per_group: 2
  page_overlap: 1
  chunk_group_limit: 6000
models:
  default: gpt-4o
  tagging: mistral-7b-instruct
  summarizer: gpt-4o
context_windows:
  gpt-4o: 128000
  gpt-4-turbo: 128000
  gpt-4: 8192
  gpt-3.5-turbo: 16385
  mistral-7b-instruct: 32768
limits:
  tokens_per_minute: 40000
  token_margin: 512
```
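The scripts below all call `utils.load_config`, which is not part of this diff. A minimal sketch of what it presumably does, assuming it simply parses `config.yaml` with PyYAML (the real helper may resolve paths differently):

```python
from pathlib import Path

import yaml


def load_config(path: str | Path = "config.yaml") -> dict:
    # Hypothetical sketch: the actual utils.load_config is not shown in this PR.
    with open(path, "r", encoding="utf-8") as fh:
        return yaml.safe_load(fh)
```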
`@@ -0,0 +1,11 @@`

# Handling Large PDF Files

Place course PDFs inside `Dev/data/`, which is ignored by Git; they are not versioned by default.

If repository limits become a problem later, you can retroactively move PDFs into Git LFS with:

```bash
git lfs migrate import --include='*.pdf'
```

Otherwise keep the files locally and back them up to Google Drive or GCS as needed.
`@@ -0,0 +1,23 @@`

# TODO Backlog

## P0
- Centralised configuration loader (`utils.load_config`).
- Remove hard-coded paths; read from `config.yaml`.
- Store PDFs in `Dev/data/` (optionally migrate to Git LFS later).

## P1
- OCR fallback and duplicate detection during ingestion.
- Implement KnowledgeNode graph with status counters.
- Tagging pipeline using local Mistral model.
- CLI commands via `python -m study_tools <command>`.

## P2
- Evaluation harness (ROUGE-L, entity overlap, manual rubric).
- Streamlit MVP for progress view.

## P3
- Difficulty-graded exam question generator (IRT).
- Anki `*.apkg` exporter with AnkiConnect.

## P4
- Visual progress dashboard and Obsidian vault export.
`@@ -0,0 +1,7 @@`

# Changelog

## 2025-07-03
- Initial refactor: new `Dev/` package created.
- Configuration moved to `config.yaml`.
- PDFs now stored in `Dev/data/`; Git LFS usage is optional.
- Migrated documentation and created skeleton tests.
`@@ -0,0 +1,12 @@`

# Overview

The Dev package implements the second iteration of the study bot based on the **Hybrid-Edge** architecture:

- **Local tagging** with Mistral-7B-Instruct classifies text chunks into categories.
- **GPT-4o/4.1** performs heavy summarisation and tutoring logic.
- **SQLite** stores metadata and Learning Units; **Qdrant** provides vector search.
- Outputs are plain JSON, which is rendered to Markdown files.

Course PDFs belong in `Dev/data/` and are not tracked in Git.

Scripts read defaults from `config.yaml` so chunk sizes and model names are easily changed.
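The Learning Unit schema itself is not included in this PR; a hypothetical example of the JSON shape, inferred only from the feature list (categories, status counters, relations):

```python
# Hypothetical Learning Unit record -- field names are illustrative, not from the PR.
learning_unit = {
    "id": "lu-0001",
    "title": "Example topic",
    "category": "statistics",
    "status_counters": {"new": 3, "learning": 1, "mastered": 0},
    "relations": ["lu-0002"],  # linked KnowledgeNode / Learning Unit ids
    "source": {"file_name": "lecture01.pdf", "page_start": 3, "page_end": 4},
}
```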
`@@ -0,0 +1,23 @@`

```toml
[tool.poetry]
name = "study-tools"
version = "0.2.0"
description = "Universal Study Tutor"
authors = ["Study Bot Team"]
packages = [{include = "study_tools", from = "src"}]

[tool.poetry.dependencies]
python = "^3.12"
llama-index-core = "*"
llama-index-llms-openai = "*"
chromadb = "*"
tiktoken = "*"
tenacity = "*"
qdrant-client = "*"
genanki = "*"
tqdm = "*"
pyyaml = "*"

[tool.poetry.group.dev.dependencies]
pytest = "*"
ruff = "*"
black = "*"
```

**🛠️ Refactor suggestion (lines +10 to +18): Pin dependency versions for reproducible builds.** Using `"*"` for all dependencies makes builds non-reproducible and can lead to dependency conflicts. Consider pinning to specific version ranges:

```diff
-llama-index-core = "*"
-llama-index-llms-openai = "*"
-chromadb = "*"
-tiktoken = "*"
-tenacity = "*"
-qdrant-client = "*"
-genanki = "*"
-tqdm = "*"
-pyyaml = "*"
+llama-index-core = "^0.10.0"
+llama-index-llms-openai = "^0.1.0"
+chromadb = "^0.4.0"
+tiktoken = "^0.5.0"
+tenacity = "^8.0.0"
+qdrant-client = "^1.7.0"
+genanki = "^2.1.0"
+tqdm = "^4.65.0"
+pyyaml = "^6.0.0"
```
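With the dev group defined above, a standard Poetry setup would be:

```bash
poetry install --with dev
```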
`@@ -0,0 +1,9 @@`

```text
llama-index-core
llama-index-llms-openai
chromadb
tiktoken
tenacity
qdrant-client
genanki
tqdm
pyyaml
```
`@@ -0,0 +1,11 @@`

```python
"""Study Tools package."""

__all__ = [
    "build_index",
    "summarize",
    "cli_chat",
    "flashcards",
    "ingest",
    "reset",
    "utils",
]
```
`@@ -0,0 +1,67 @@`

```python
"""PDF ingestion and vector index creation."""

from pathlib import Path
import shutil

# Heavy imports are done inside functions to allow importing this module without
# optional dependencies.

from .utils import load_config


def extract_pages(pdf_path: Path, pages_per_group: int, overlap: int):
    import fitz  # PyMuPDF
    from llama_index.core import Document

    doc = fitz.open(pdf_path)
    for i in range(0, len(doc), pages_per_group - overlap):
        end = min(i + pages_per_group, len(doc))
        text = "\n\n".join(doc[pg].get_text() for pg in range(i, end))
        meta = {
            "file_path": str(pdf_path),
            "file_name": pdf_path.name,
            "page_start": i + 1,
            "page_end": end,
        }
        yield Document(text=text, metadata=meta)


def main():
    from llama_index.core import VectorStoreIndex, StorageContext, Document
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    from qdrant_client import QdrantClient

    cfg = load_config()
    paths = cfg["paths"]
    docs_dir = Path(paths["docs_dir"])
    chroma_dir = Path(paths["chroma_dir"])

    chunk = cfg["chunking"]

    if chroma_dir.exists():
        shutil.rmtree(chroma_dir)

    docs = []
    for pdf in docs_dir.rglob("*.pdf"):
        docs.extend(
            extract_pages(
                pdf,
                chunk["pages_per_group"],
                chunk["page_overlap"],
            )
        )

    splitter = SentenceSplitter(
        chunk_size=chunk["chunk_size"],
        chunk_overlap=chunk["chunk_overlap"],
    )
    nodes = splitter.get_nodes_from_documents(docs)

    client = QdrantClient(path=str(chroma_dir))
    store = QdrantVectorStore(client, collection_name="study")
    storage = StorageContext.from_defaults(vector_store=store)
    VectorStoreIndex(nodes, storage_context=storage)
    storage.persist(persist_dir=str(chroma_dir))


if __name__ == "__main__":
    main()
```

**Review comment (lines +16 to +17): Potential infinite loop with invalid overlap configuration.** The step of `range(0, len(doc), pages_per_group - overlap)` is zero when `overlap == pages_per_group` (raising `ValueError`) and negative when `overlap > pages_per_group` (silently processing no pages). Add input validation to prevent this edge case:

```diff
 def extract_pages(pdf_path: Path, pages_per_group: int, overlap: int):
+    if overlap >= pages_per_group:
+        raise ValueError(f"Overlap ({overlap}) must be less than pages_per_group ({pages_per_group})")
+    if pages_per_group <= 0 or overlap < 0:
+        raise ValueError("pages_per_group must be positive and overlap must be non-negative")
     import fitz  # PyMuPDF
```

**🛠️ Refactor suggestion: Misleading variable name: using Qdrant but named `chroma_dir`.** The variable is named `chroma_dir` but it holds the Qdrant persistence path. Rename it for clarity:

```diff
-    chroma_dir = Path(paths["chroma_dir"])
+    vector_store_dir = Path(paths["chroma_dir"])  # Consider renaming config key to "vector_store_dir"
```

And update all references:

```diff
-    if chroma_dir.exists():
-        shutil.rmtree(chroma_dir)
+    if vector_store_dir.exists():
+        shutil.rmtree(vector_store_dir)
-    client = QdrantClient(path=str(chroma_dir))
+    client = QdrantClient(path=str(vector_store_dir))
-    storage.persist(persist_dir=str(chroma_dir))
+    storage.persist(persist_dir=str(vector_store_dir))
```
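To make the page-group stepping in `extract_pages` concrete: with the `config.yaml` defaults (`pages_per_group: 2`, `page_overlap: 1`) the step is 1, so a 4-page PDF yields overlapping groups:

```python
pages_per_group, overlap, n_pages = 2, 1, 4  # defaults from config.yaml
groups = [
    (i + 1, min(i + pages_per_group, n_pages))  # 1-based (page_start, page_end)
    for i in range(0, n_pages, pages_per_group - overlap)
]
print(groups)  # [(1, 2), (2, 3), (3, 4), (4, 4)]
```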
`@@ -0,0 +1,43 @@`

```python
"""CLI chat interface."""

import argparse
from pathlib import Path

# heavy imports done in main()

from .utils import load_config


def main():
    from llama_index.core import StorageContext, load_index_from_storage
    from llama_index.llms.openai import OpenAI
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    from qdrant_client import QdrantClient

    cfg = load_config()
    llm = OpenAI(model=cfg["models"]["summarizer"])
    chroma_path = cfg["paths"]["chroma_dir"]
    client = QdrantClient(path=chroma_path)
    store = QdrantVectorStore(client, collection_name="study")
    storage = StorageContext.from_defaults(persist_dir=chroma_path, vector_store=store)
    index = load_index_from_storage(storage)
    engine = index.as_chat_engine(chat_mode="condense_question", llm=llm, verbose=True)

    parser = argparse.ArgumentParser()
    parser.add_argument("question", nargs="*")
    args = parser.parse_args()

    if args.question:
        q = " ".join(args.question)
        print(engine.chat(q).response)
    else:
        print("Ask questions (blank to exit)")
        while True:
            q = input("? ")
            if not q.strip():
                break
            print(engine.chat(q).response)


if __name__ == "__main__":
    main()
```
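Typical invocations (assuming the index has been built and `OPENAI_API_KEY` is set in the environment):

```bash
python -m study_tools.cli_chat "What does the lecture say about chunking?"
python -m study_tools.cli_chat   # no arguments starts the interactive loop
```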
`@@ -0,0 +1,39 @@`

```python
"""Generate Anki deck from summaries."""

import uuid
from pathlib import Path

# heavy imports in main()

from .utils import load_config


def main():
    import genanki
    from llama_index.core import StorageContext, load_index_from_storage
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    from qdrant_client import QdrantClient

    cfg = load_config()
    chroma_path = cfg["paths"]["chroma_dir"]

    client = QdrantClient(path=chroma_path)
    store = QdrantVectorStore(client, collection_name="study")
    storage = StorageContext.from_defaults(persist_dir=chroma_path, vector_store=store)
    index = load_index_from_storage(storage)
    retriever = index.as_retriever(similarity_top_k=50)

    deck = genanki.Deck(uuid.uuid4().int >> 64, "Study-Bot Deck")
    for node in index.docstore.docs.values():
        qa = retriever.query(f"Turn this into Q&A flashcards:\n\n{node.text}").response
        for line in qa.splitlines():
            if "?" in line:
                q, a = line.split("?", 1)
                note = genanki.Note(model=genanki.BASIC_MODEL, fields=[q.strip()+"?", a.strip()])
                deck.add_note(note)

    genanki.Package(deck).write_to_file("study.apkg")
    print("study.apkg ready – import into Anki")


if __name__ == "__main__":
    main()
```

**🛠️ Refactor suggestion: Fix naming inconsistency: `chroma_dir` should be `qdrant_dir`.** The code uses the Qdrant vector store but references a `chroma_dir` configuration key, which is inconsistent:

```diff
-    chroma_path = cfg["paths"]["chroma_dir"]
+    chroma_path = cfg["paths"]["qdrant_dir"]
```

**Review comment (lines +27 to +32): Fix incorrect usage of retriever for Q&A generation.** A retriever only performs similarity search; its result contains retrieved nodes, not generated text, so calling it with a "turn this into flashcards" prompt will not produce Q&A pairs. Use the LLM directly instead:

```diff
-        qa = retriever.query(f"Turn this into Q&A flashcards:\n\n{node.text}").response
-        for line in qa.splitlines():
-            if "?" in line:
-                q, a = line.split("?", 1)
-                note = genanki.Note(model=genanki.BASIC_MODEL, fields=[q.strip()+"?", a.strip()])
-                deck.add_note(note)
+        # Use LLM directly for Q&A generation instead of retriever
+        llm = index.service_context.llm
+        qa_prompt = f"Generate 3-5 question-answer pairs from this text. Format each as 'Q: question? A: answer':\n\n{node.text}"
+        qa_response = llm.complete(qa_prompt).text
+
+        for line in qa_response.splitlines():
+            if line.startswith("Q:") and "A:" in line:
+                try:
+                    q_part, a_part = line.split("A:", 1)
+                    question = q_part.replace("Q:", "").strip()
+                    answer = a_part.strip()
+                    if question and answer:
+                        note = genanki.Note(model=genanki.BASIC_MODEL, fields=[question, answer])
+                        deck.add_note(note)
+                except ValueError:
+                    continue  # Skip malformed lines
```
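A typical run (assuming the index exists; the deck is written to the current directory):

```bash
python -m study_tools.flashcards
# then import study.apkg into Anki
```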
`@@ -0,0 +1,17 @@`

```python
"""Simple document count utility."""

from pathlib import Path

from .utils import load_config


def main():
    from llama_index.core import SimpleDirectoryReader

    cfg = load_config()
    docs_dir = Path(cfg["paths"]["docs_dir"])
    docs = SimpleDirectoryReader(str(docs_dir)).load_data()
    print(f"Loaded {len(docs)} docs")


if __name__ == "__main__":
    main()
```

**🛠️ Refactor suggestion (lines +8 to +14): Add error handling and improve code structure.** The code works but could benefit from better error handling:

```diff
 def main():
-    from llama_index.core import SimpleDirectoryReader
-    cfg = load_config()
-    docs_dir = Path(cfg["paths"]["docs_dir"])
-    docs = SimpleDirectoryReader(str(docs_dir)).load_data()
-    print(f"Loaded {len(docs)} docs")
+    try:
+        from llama_index.core import SimpleDirectoryReader
+        cfg = load_config()
+        docs_dir = Path(cfg["paths"]["docs_dir"])
+
+        if not docs_dir.exists():
+            print(f"Error: Documents directory does not exist: {docs_dir}")
+            return
+
+        docs = SimpleDirectoryReader(str(docs_dir)).load_data()
+        print(f"Loaded {len(docs)} docs")
+    except KeyError as e:
+        print(f"Error: Missing configuration key: {e}")
+    except Exception as e:
+        print(f"Error loading documents: {e}")
```

Consider moving the import to the top of the file for better visibility:

```diff
+from llama_index.core import SimpleDirectoryReader
 from pathlib import Path
 from .utils import load_config
```
`@@ -0,0 +1,25 @@`

```python
"""Remove generated data."""

import shutil
from pathlib import Path

from .utils import load_config


def main():
    cfg = load_config()
    paths = cfg["paths"]
    for key in ("chroma_dir", "cache_dir"):
        p = Path(paths[key])
        if p.exists():
            shutil.rmtree(p)
            print(f"Deleted {p}")
    for f in ("summary.md", "summary.pdf", "study.apkg"):
        fp = Path(f)
        if fp.exists():
            fp.unlink()
            print(f"Deleted {fp}")


if __name__ == "__main__":
    main()
```

**Review comment (lines +12 to +17): `KeyError` & safety guard around path lookup.** `paths[key]` raises `KeyError` when a key is missing from the config; guard the lookup:

```diff
     for key in ("chroma_dir", "cache_dir"):
+        if key not in paths:
+            continue
         p = Path(paths[key])
```

**🛠️ Refactor suggestion (lines +17 to +21): Hard-coded filenames tie the script to the CWD.** `summary.md`, `summary.pdf`, and `study.apkg` are resolved relative to the current working directory, so the cleanup only works when the script is run from the project root.

**💡 Verification agent: Verify Python 3.12 requirement necessity.** Python 3.12 is quite recent and may limit adoption; verify whether Python 3.10+ would suffice.

Lower the Python requirement to 3.10 (no 3.12-only features detected): the codebase uses PEP 604 union types (`Path | str`), which require Python 3.10+, but no features exclusive to 3.11 or 3.12 were found. Please relax the constraint in `pyproject.toml` unless there is another justification for requiring 3.12:

```diff
-python = "^3.12"
+python = "^3.10"
```