Adjust PDF workflow #3
base: main
Conversation
Walkthrough

This update introduces a comprehensive refactor and modularization of the Universal Study Tutor project. It adds a new …

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant ConfigLoader
    participant PDFIngestor
    participant VectorStore
    participant LLM
    participant FlashcardBuilder
    User->>CLI: Run build_index / summarize / cli_chat / flashcards
    CLI->>ConfigLoader: Load config.yaml
    CLI->>PDFIngestor: Ingest PDFs (build_index)
    PDFIngestor->>VectorStore: Store chunks
    CLI->>LLM: Summarize chunks (summarize)
    LLM->>VectorStore: Store summaries
    CLI->>LLM: Chat with material (cli_chat)
    CLI->>FlashcardBuilder: Generate flashcards (flashcards)
    FlashcardBuilder->>VectorStore: Retrieve summaries
    FlashcardBuilder->>User: Export Anki deck
```
Actionable comments posted: 13
🔭 Outside diff range comments (1)
Dev/requirements.txt (1)
1-10: Pin dependency versions in `Dev/requirements.txt`

To ensure reproducible builds and guard against unintended breaking changes, add version constraints based on the latest PyPI releases:

File: Dev/requirements.txt

```diff
-llama-index-core
-llama-index-llms-openai
-chromadb
-tiktoken
-tenacity
-qdrant-client
-genanki
-tqdm
-pyyaml
+llama-index-core>=0.12.46,<1.0.0
+llama-index-llms-openai>=0.4.7,<1.0.0
+chromadb>=1.0.15,<2.0.0
+tiktoken>=0.9.0,<1.0.0
+tenacity>=9.1.2,<10.0.0
+qdrant-client>=1.14.3,<2.0.0
+genanki>=0.13.1,<1.0.0
+tqdm>=4.67.1,<5.0.0
+pyyaml>=6.0.2,<7.0.0
```

No security advisories were found for these packages at the time of review.
🧹 Nitpick comments (16)
README.md (1)
1-8: Consider restoring a minimal quick-start or installation snippet

The root README now only points to `Dev/README.md`. While that keeps the file uncluttered, having a two-line quick-start (e.g. `pip install -e Dev && study-tools …`) helps newcomers decide whether to keep reading.

Dev/docs/MIGRATE_LARGE_FILES.md (1)
11-11: Missing comma after a leading adverb

Insert a comma after "Otherwise" to avoid a run-on sentence.

```diff
-Otherwise keep the files locally and back them up to Google Drive or GCS as needed.
+Otherwise, keep the files locally and back them up to Google Drive or GCS as needed.
```

Dev/src/study_tools/reset.py (1)
24-25: Expose a CLI via `python -m study_tools.reset`

Wrapping `main()` in a `console_scripts` entry point (or enabling `python -m study_tools.reset`) improves UX and avoids relying on file paths.
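A minimal sketch of the module-runner pattern this suggests, assuming `reset.py` defines a `main()` like the one in this PR (the body below is a placeholder, not the real reset logic):

```python
def main():
    # Placeholder for the real reset logic in study_tools/reset.py.
    print("reset complete")


# With this guard at the bottom of the module, `python -m study_tools.reset`
# invokes main() without relying on file paths. A console_scripts entry
# pointing at study_tools.reset:main would expose the same function as an
# installed command.
if __name__ == "__main__":
    main()
```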
Dev/docs/TODO.md (1)

5-5: Fix hyphenation for compound adjective

The term should be "hard-coded" when used as a compound adjective.

```diff
-- Remove hard coded paths; read from `config.yaml`.
+- Remove hard-coded paths; read from `config.yaml`.
```

Dev/src/study_tools/cli_chat.py (2)
4-4: Remove unused import

The `pathlib.Path` import is not used in the code.

```diff
-from pathlib import Path
```
11-39: Consider refactoring to reduce local variables

The function has 16 local variables, which slightly exceeds the typical limit of 15. While not critical, consider extracting the chat engine setup into a separate function for better modularity.

```diff
+def create_chat_engine():
+    """Create and return a configured chat engine."""
+    from llama_index.core import StorageContext, load_index_from_storage
+    from llama_index.llms.openai import OpenAI
+    from llama_index.vector_stores.qdrant import QdrantVectorStore
+    from qdrant_client import QdrantClient
+
+    cfg = load_config()
+    llm = OpenAI(model=cfg["models"]["summarizer"])
+    chroma_path = cfg["paths"]["chroma_dir"]
+    client = QdrantClient(path=chroma_path)
+    store = QdrantVectorStore(client, collection_name="study")
+    storage = StorageContext.from_defaults(persist_dir=chroma_path, vector_store=store)
+    index = load_index_from_storage(storage)
+    return index.as_chat_engine(chat_mode="condense_question", llm=llm, verbose=True)
+
 def main():
-    from llama_index.core import StorageContext, load_index_from_storage
-    from llama_index.llms.openai import OpenAI
-    from llama_index.vector_stores.qdrant import QdrantVectorStore
-    from qdrant_client import QdrantClient
-
-    cfg = load_config()
-    llm = OpenAI(model=cfg["models"]["summarizer"])
-    chroma_path = cfg["paths"]["chroma_dir"]
-    client = QdrantClient(path=chroma_path)
-    store = QdrantVectorStore(client, collection_name="study")
-    storage = StorageContext.from_defaults(persist_dir=chroma_path, vector_store=store)
-    index = load_index_from_storage(storage)
-    engine = index.as_chat_engine(chat_mode="condense_question", llm=llm, verbose=True)
+    engine = create_chat_engine()
```

Dev/README.md (1)
8-8: Fix formatting: use the number '0' instead of the letter 'O' in "GPT-4o"

The model name should use the number '0' instead of the letter 'O' for consistency with OpenAI's official naming.

```diff
-- Async summarisation using local Mistral and OpenAI GPT‑4o
+- Async summarisation using local Mistral and OpenAI GPT-4o
```

Dev/src/study_tools/flashcards.py (1)
4-4: Remove unused import

```diff
-from pathlib import Path
```

Dev/pyproject.toml (1)
21-23: Pin development dependency versions for consistency

```diff
-pytest = "*"
-ruff = "*"
-black = "*"
+pytest = "^7.0.0"
+ruff = "^0.1.0"
+black = "^23.0.0"
```

Dev/src/study_tools/build_index.py (3)
22-23: Page numbering inconsistency in metadata

The metadata shows `page_start: i + 1` but the range starts from `i`, creating a potential off-by-one confusion. Consider whether pages should be 1-indexed or 0-indexed consistently.

If you want 1-indexed pages (more user-friendly), apply this diff:

```diff
 meta = {
     "file_path": str(pdf_path),
     "file_name": pdf_path.name,
     "page_start": i + 1,
-    "page_end": end,
+    "page_end": end,  # This should be end, not end + 1, since range is exclusive
 }
```
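A standalone way to sanity-check the grouping arithmetic is to print the 1-indexed `(page_start, page_end)` pairs the loop produces. This is a hypothetical helper (not code from the PR) that assumes the step is `pages_per_group - overlap`, mirroring the loop in `extract_pages`:

```python
def page_ranges(num_pages: int, pages_per_group: int, overlap: int):
    """Yield 1-indexed, inclusive (page_start, page_end) pairs."""
    step = pages_per_group - overlap
    for i in range(0, num_pages, step):
        end = min(i + pages_per_group, num_pages)
        # i is a 0-indexed offset, so the 1-indexed start is i + 1;
        # `end` is already usable as a 1-indexed inclusive bound.
        yield (i + 1, end)


print(list(page_ranges(10, 4, 1)))  # → [(1, 4), (4, 7), (7, 10), (10, 10)]
```

Seeing the pairs spelled out makes it easy to confirm that consecutive groups share exactly `overlap` pages under the 1-indexed convention.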
29-29: Remove unused import

The static analysis correctly identifies that `Document` is imported but never used in the main function.

```diff
-    from llama_index.core import VectorStoreIndex, StorageContext, Document
+    from llama_index.core import VectorStoreIndex, StorageContext
```
28-63: Consider refactoring to reduce complexity

The main function has 18 local variables, which exceeds the recommended limit and makes the function harder to maintain. Consider extracting some functionality into helper functions.

Example refactoring:

```python
def _setup_vector_store(vector_store_dir: Path):
    """Set up and return Qdrant vector store components."""
    if vector_store_dir.exists():
        shutil.rmtree(vector_store_dir)
    client = QdrantClient(path=str(vector_store_dir))
    store = QdrantVectorStore(client, collection_name="study")
    storage = StorageContext.from_defaults(vector_store=store)
    return storage


def _collect_documents(docs_dir: Path, chunk_config: dict):
    """Collect and process documents from PDF files."""
    docs = []
    for pdf in docs_dir.rglob("*.pdf"):
        docs.extend(
            extract_pages(
                pdf,
                chunk_config["pages_per_group"],
                chunk_config["page_overlap"],
            )
        )
    return docs
```

Dev/src/study_tools/summarize.py (3)
67-67: Make batch size configurable

The batch size of 5 is hard-coded, which reduces flexibility. Consider making it configurable through the config file.

```diff
-    groups = [nodes[i:i+5] for i in range(0, len(nodes), 5)]
+    batch_size = cfg.get("summarization", {}).get("batch_size", 5)
+    groups = [nodes[i:i+batch_size] for i in range(0, len(nodes), batch_size)]
```
71-71: Make language configurable

The language "English" is hard-coded, which limits the tool's international usability. Consider making it configurable.

```diff
-        summaries.extend(summarise_group(llm, g, "English", cache_dir))
+        language = cfg.get("summarization", {}).get("language", "English")
+        summaries.extend(summarise_group(llm, g, language, cache_dir))
```
46-73: Consider refactoring to reduce complexity

The main function has 19 local variables, which exceeds the recommended limit. Consider extracting some functionality into helper functions.

Example refactoring:

```python
def _setup_summarization_components(cfg):
    """Set up LLM, vector store, and index components."""
    paths = cfg["paths"]
    cache_dir = Path(paths["cache_dir"])
    cache_dir.mkdir(parents=True, exist_ok=True)
    llm = OpenAI(model=cfg["models"]["summarizer"])
    client = QdrantClient(path=paths["chroma_dir"])
    store = QdrantVectorStore(client, collection_name="study")
    storage = StorageContext.from_defaults(persist_dir=paths["chroma_dir"], vector_store=store)
    index = load_index_from_storage(storage)
    return llm, index, cache_dir


def _extract_and_group_nodes(index, batch_size=5):
    """Extract text nodes and group them for processing."""
    nodes = [n for n in index.docstore.docs.values() if isinstance(n, TextNode)]
    groups = [nodes[i:i+batch_size] for i in range(0, len(nodes), batch_size)]
    return groups
```

messy_start/README.md (1)
54-54: Minor grammar improvement suggested

The language tool suggests rewording "retrieves relevant chunks for user questions" as "retrieves relevant chunks of user questions":

```diff
-4. **Chat/Q&A** – retrieves relevant chunks for user questions.
+4. **Chat/Q&A** – retrieves relevant chunks of user questions.
```

However, both phrasings are grammatically correct. The current version emphasizes that chunks are retrieved "for" (in response to) user questions, which may be more accurate to the functionality. Consider keeping it as is if the current meaning is preferred.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (32)
All of the following are excluded by the `!**/*.pdf` path filter:

- messy_start/Corperate_Finance_download/01 General Information 2025F-3.pdf
- messy_start/Corperate_Finance_download/02 Risk Return 2025F.pdf
- messy_start/Corperate_Finance_download/03 Portfolio Theory CAPM 2025F.pdf
- messy_start/Corperate_Finance_download/04 Corporate Valuation 2025F.pdf
- messy_start/Corperate_Finance_download/05 Cost of Capital 2025F.pdf
- messy_start/Corperate_Finance_download/06 Corporate Finance Issuance of Securities 2025F.pdf
- messy_start/Corperate_Finance_download/07 Capital Structure 2025F.pdf
- messy_start/Corperate_Finance_download/08 Risk Management 2025F.pdf
- messy_start/Corperate_Finance_download/09 Corporate Governance 2025F.pdf
- messy_start/Corperate_Finance_download/10 Efficient Markets and Behavioral Finance 2025F.pdf
- messy_start/Corperate_Finance_download/11 Empirical Research 2025F.pdf
- messy_start/Corperate_Finance_download/Aviva Investors Student 2022.pdf
- messy_start/Corperate_Finance_download/Cohn, J. B., and Wardlaw, M. I., 2016, Financing constraints and workplace safety, The Journal of Finance, 71(5), 2017-2058.pdf
- messy_start/Corperate_Finance_download/Cox Communications Inc Student 2019.pdf
- messy_start/Corperate_Finance_download/ESG Financing Constraints and Workplace Safety Solution 2019.pdf
- messy_start/Corperate_Finance_download/ESG Financing Constraints and Workplace Safety Student 2019.pdf
- messy_start/Corperate_Finance_download/First session Solution-4.pdf
- messy_start/Corperate_Finance_download/First session-4.pdf
- messy_start/Corperate_Finance_download/Formulary.pdf
- messy_start/Corperate_Finance_download/Fourth session Solution-6.pdf
- messy_start/Corperate_Finance_download/Fourth session-5.pdf
- messy_start/Corperate_Finance_download/Mock Exam - Solution-1.pdf
- messy_start/Corperate_Finance_download/Mock Exam-1.pdf
- messy_start/Corperate_Finance_download/NormDist.pdf
- messy_start/Corperate_Finance_download/Second session Solution-4.pdf
- messy_start/Corperate_Finance_download/Second session-3.pdf
- messy_start/Corperate_Finance_download/Third session Solution-5.pdf
- messy_start/Corperate_Finance_download/Third session-5.pdf
- messy_start/Corperate_Finance_download/UBS Presentation Uni SG May 2025.pdf
- messy_start/Corperate_Finance_download/Valuation - Consolidated Edison and Chemex AG Solution 2019.pdf
- messy_start/Corperate_Finance_download/Valuation - Consolidated Edison and Chemex AG Student 2019.pdf
- messy_start/Corperate_Finance_download/Valuing Snap after the IPO quiet period Student 2019.pdf
📒 Files selected for processing (23)
- .gitignore (1 hunks)
- Dev/CONTRIBUTING.md (1 hunks)
- Dev/README.md (1 hunks)
- Dev/agents.yaml (1 hunks)
- Dev/config.yaml (1 hunks)
- Dev/docs/MIGRATE_LARGE_FILES.md (1 hunks)
- Dev/docs/TODO.md (1 hunks)
- Dev/docs/changelog.md (1 hunks)
- Dev/docs/overview.md (1 hunks)
- Dev/pyproject.toml (1 hunks)
- Dev/requirements.txt (1 hunks)
- Dev/src/study_tools/__init__.py (1 hunks)
- Dev/src/study_tools/build_index.py (1 hunks)
- Dev/src/study_tools/cli_chat.py (1 hunks)
- Dev/src/study_tools/flashcards.py (1 hunks)
- Dev/src/study_tools/ingest.py (1 hunks)
- Dev/src/study_tools/reset.py (1 hunks)
- Dev/src/study_tools/summarize.py (1 hunks)
- Dev/src/study_tools/utils.py (1 hunks)
- Dev/tests/test_imports.py (1 hunks)
- README.md (1 hunks)
- messy_start/README.md (1 hunks)
- messy_start/schema/learning_unit.schema.json (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
Dev/src/study_tools/ingest.py (1)
- Dev/src/study_tools/utils.py: `load_config` (8-14)

Dev/src/study_tools/reset.py (6)
- Dev/src/study_tools/utils.py: `load_config` (8-14)
- Dev/src/study_tools/flashcards.py: `main` (11-35)
- Dev/src/study_tools/cli_chat.py: `main` (11-39)
- Dev/src/study_tools/ingest.py: `main` (8-13)
- Dev/src/study_tools/summarize.py: `main` (46-73)
- Dev/src/study_tools/build_index.py: `main` (28-63)

Dev/src/study_tools/cli_chat.py (5)
- Dev/src/study_tools/utils.py: `load_config` (8-14)
- Dev/src/study_tools/flashcards.py: `main` (11-35)
- Dev/src/study_tools/ingest.py: `main` (8-13)
- Dev/src/study_tools/summarize.py: `main` (46-73)
- Dev/src/study_tools/build_index.py: `main` (28-63)

Dev/src/study_tools/summarize.py (5)
- Dev/src/study_tools/utils.py: `load_config` (8-14)
- Dev/src/study_tools/cli_chat.py: `main` (11-39)
- Dev/src/study_tools/ingest.py: `main` (8-13)
- Dev/src/study_tools/reset.py: `main` (9-21)
- Dev/src/study_tools/build_index.py: `main` (28-63)
🪛 LanguageTool
Dev/docs/MIGRATE_LARGE_FILES.md
[uncategorized] ~10-~10: A comma may be missing after the conjunctive/linking adverb ‘Otherwise’.
Context: ...ash git lfs migrate import '*.pdf' ``` Otherwise keep the files locally and back them up...
(SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA)
Dev/docs/overview.md
[formatting] ~6-~6: It looks like you are using the letter ‘O’ instead of the number ‘0’. Did you mean “40”?
Context: ...es text chunks into categories. - GPT‑4o/4.1 performs heavy summarisation and ...
(ZERO_O)
Dev/docs/TODO.md
[uncategorized] ~5-~5: This word is normally spelled with a hyphen.
Context: ... loader (utils.load_config). - Remove hard coded paths; read from config.yaml. - Store...
(HARD_CODE_COMPOUND)
Dev/README.md
[formatting] ~8-~8: It looks like you are using the letter ‘O’ instead of the number ‘0’. Did you mean “40”?
Context: ...tion using local Mistral and OpenAI GPT‑4o - CLI tools for building the index, cha...
(ZERO_O)
messy_start/README.md
[uncategorized] ~53-~53: The preposition ‘of’ seems more likely in this position.
Context: ...Chat/Q&A* – retrieves relevant chunks for user questions. 5. *Flashcard Builder...
(AI_HYDRA_LEO_REPLACE_FOR_OF)
🪛 Flake8 (7.2.0)
Dev/tests/test_imports.py
[error] 13-13: expected 2 blank lines, found 1
(E302)
Dev/src/study_tools/build_index.py
[error] 29-29: 'llama_index.core.Document' imported but unused
(F401)
Dev/src/study_tools/flashcards.py
[error] 4-4: 'pathlib.Path' imported but unused
(F401)
Dev/src/study_tools/cli_chat.py
[error] 4-4: 'pathlib.Path' imported but unused
(F401)
🪛 Pylint (3.3.7)
Dev/src/study_tools/build_index.py
[refactor] 28-28: Too many local variables (18/15)
(R0914)
Dev/src/study_tools/flashcards.py
[refactor] 11-11: Too many local variables (19/15)
(R0914)
Dev/src/study_tools/cli_chat.py
[refactor] 11-11: Too many local variables (16/15)
(R0914)
Dev/src/study_tools/summarize.py
[refactor] 46-46: Too many local variables (19/15)
(R0914)
🪛 Ruff (0.11.9)
Dev/src/study_tools/build_index.py
29-29: llama_index.core.Document imported but unused
Remove unused import: llama_index.core.Document
(F401)
Dev/src/study_tools/flashcards.py
4-4: pathlib.Path imported but unused
Remove unused import: pathlib.Path
(F401)
Dev/src/study_tools/cli_chat.py
4-4: pathlib.Path imported but unused
Remove unused import: pathlib.Path
(F401)
🔇 Additional comments (12)
Dev/docs/changelog.md (1)
1-8: LGTM! Well-structured changelog documentation.

The changelog follows standard conventions and clearly documents the major refactor changes, including the new package structure, configuration management, and data organization.
Dev/CONTRIBUTING.md (1)
1-8: LGTM! Comprehensive contributing guidelines.

The guidelines clearly outline the development workflow, emphasize code quality tools, and promote good practices for testing and documentation. This will help maintain consistency across contributions.
.gitignore (1)
14-16: 👍 Good call excluding the data directory

Ignoring `Dev/data/` aligns with the new large-file workflow and keeps clones lean.
3-11: All exported sub-modules are present and importable

A directory check under `Dev/src/study_tools` confirms that `build_index.py`, `summarize.py`, `cli_chat.py`, `flashcards.py`, `ingest.py`, `reset.py`, and `utils.py` all exist. No missing entries in `__all__`.
messy_start/schema/learning_unit.schema.json (1)

1-21: Well-structured JSON schema for learning units.

The schema effectively defines the core data structure with appropriate required fields and clear property descriptions. The status encoding system (0=unknown, >0=seen, -1=known) provides a simple yet effective way to track learning progress.
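To make the status encoding concrete, here is a hypothetical interpreter for the `status` field (illustration only; not part of the schema or the PR):

```python
def describe_status(status: int) -> str:
    """Map the schema's status encoding to a human-readable label:
    0 = unknown, >0 = seen (value counts exposures), -1 = known."""
    if status == -1:
        return "known"
    if status == 0:
        return "unknown"
    if status > 0:
        return f"seen {status} time(s)"
    raise ValueError(f"invalid status: {status}")


print(describe_status(0))   # → unknown
print(describe_status(3))   # → seen 3 time(s)
print(describe_status(-1))  # → known
```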
Dev/config.yaml (1)
1-24: Comprehensive and well-structured configuration.

The configuration effectively centralizes key system parameters with sensible defaults. The chunking parameters (1024 size, 128 overlap) and model selections align well with the hybrid architecture described in the system overview.
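As a rough illustration of how the 1024/128 chunking parameters interact (a hypothetical helper; the real splitting is handled by the indexing library and operates on tokens, not raw character offsets):

```python
def chunk_spans(text_len: int, chunk_size: int = 1024, overlap: int = 128):
    """Return (start, end) offsets; consecutive chunks share `overlap` units."""
    step = chunk_size - overlap  # 896 with the defaults above
    spans = []
    for start in range(0, text_len, step):
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break
    return spans


print(chunk_spans(2500))  # → [(0, 1024), (896, 1920), (1792, 2500)]
```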
Dev/docs/overview.md (1)
1-13: Clear and accurate system overview.

The documentation effectively describes the hybrid-edge architecture and component responsibilities. The mention of configuration externalization and Git exclusion of PDFs shows good architectural awareness.
Dev/docs/TODO.md (1)
1-24: Well-organized development roadmap.

The TODO list effectively prioritizes development tasks with clear categories. The priorities align well with the system's architecture and current implementation state.
Dev/src/study_tools/cli_chat.py (1)
1-44: Well-implemented CLI chat interface.

The code effectively implements both single-question and interactive chat modes with proper argument parsing. The lazy import strategy is good for startup performance, and the chat engine configuration is appropriate for the use case.
Dev/README.md (1)
1-23: Well-structured documentation.

The README provides a clear overview, features list, and quickstart instructions for the refactored implementation.
Dev/agents.yaml (1)
1-14: Well-structured agent configuration.

The agent definitions clearly map to the system's modular architecture and align with the functionality described in the documentation.
messy_start/README.md (1)
1-57: Well-structured documentation for the legacy prototype.

The README provides clear documentation of the original prototype implementation with good coverage of:
- Current state and key components
- Setup instructions with specific versions
- Usage examples with command-line instructions
- Architecture overview with component descriptions
- Migration path to the new refactored implementation
This serves as valuable documentation for understanding the evolution from the legacy prototype to the new `Dev/` implementation.
```python
def test_imports():
    for m in modules:
        importlib.import_module(m)
```
Fix PEP 8 formatting issue.
The static analysis tool correctly identifies a missing blank line before the function definition.
Apply this diff to fix the formatting:

```diff
 ]
-
+
+
 def test_imports():
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Flake8 (7.2.0)
[error] 13-13: expected 2 blank lines, found 1
(E302)
🤖 Prompt for AI Agents
In Dev/tests/test_imports.py around lines 12 to 15, there is a PEP 8 formatting
issue due to a missing blank line before the function definition test_imports.
Add a blank line before the def test_imports(): line to comply with PEP 8 style
guidelines.
```python
def main():
    from llama_index.core import SimpleDirectoryReader
    cfg = load_config()
    docs_dir = Path(cfg["paths"]["docs_dir"])
    docs = SimpleDirectoryReader(str(docs_dir)).load_data()
    print(f"Loaded {len(docs)} docs")
```
🛠️ Refactor suggestion
Add error handling and improve code structure.
The code works but could benefit from better error handling and structure improvements.
Apply this diff to improve error handling and code structure:
```diff
 def main():
-    from llama_index.core import SimpleDirectoryReader
-    cfg = load_config()
-    docs_dir = Path(cfg["paths"]["docs_dir"])
-    docs = SimpleDirectoryReader(str(docs_dir)).load_data()
-    print(f"Loaded {len(docs)} docs")
+    try:
+        from llama_index.core import SimpleDirectoryReader
+        cfg = load_config()
+        docs_dir = Path(cfg["paths"]["docs_dir"])
+
+        if not docs_dir.exists():
+            print(f"Error: Documents directory does not exist: {docs_dir}")
+            return
+
+        docs = SimpleDirectoryReader(str(docs_dir)).load_data()
+        print(f"Loaded {len(docs)} docs")
+    except KeyError as e:
+        print(f"Error: Missing configuration key: {e}")
+    except Exception as e:
+        print(f"Error loading documents: {e}")
```

Consider moving the import to the top of the file for better visibility:

```diff
+from llama_index.core import SimpleDirectoryReader
 from pathlib import Path
 from .utils import load_config
```

🤖 Prompt for AI Agents
In Dev/src/study_tools/ingest.py around lines 8 to 14, improve the code by
moving the import of SimpleDirectoryReader to the top of the file for better
visibility and apply error handling around the document loading process. Wrap
the loading logic in a try-except block to catch and log exceptions, ensuring
the program handles errors gracefully instead of failing silently or crashing.
```python
    for f in ("summary.md", "summary.pdf", "study.apkg"):
        fp = Path(f)
        if fp.exists():
            fp.unlink()
            print(f"Deleted {fp}")
```
🛠️ Refactor suggestion
Hard-coded filenames tie the script to the CWD

`summary.md`, `summary.pdf`, and `study.apkg` are deleted relative to wherever the user runs the script, which may not be the project root. Read the root from `cfg` (e.g. `project_root`) or resolve relative to `Path(__file__).parent.parent`.
🤖 Prompt for AI Agents
In Dev/src/study_tools/reset.py around lines 17 to 21, the filenames are
currently hard-coded and deleted relative to the current working directory,
which may not be the project root. To fix this, determine the project root
directory by reading it from the configuration (e.g., cfg.project_root) or by
resolving it relative to the script location using Path(__file__).parent.parent.
Then, construct the full paths to the files by joining the project root with the
filenames before checking existence and deleting them.
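A sketch of the suggested fix, assuming the module sits at `<root>/src/study_tools/reset.py` (so the project root is two parent directories above the package); the helper name `delete_artifacts` is invented here for illustration:

```python
from pathlib import Path


def project_root() -> Path:
    # Assumption: this module lives at <root>/src/study_tools/reset.py,
    # so parents[2] resolves to <root> regardless of the caller's CWD.
    return Path(__file__).resolve().parents[2]


def delete_artifacts(root: Path) -> None:
    """Delete generated artifacts relative to `root`, not the CWD."""
    for name in ("summary.md", "summary.pdf", "study.apkg"):
        fp = root / name
        if fp.exists():
            fp.unlink()
            print(f"Deleted {fp}")
```

Calling `delete_artifacts(project_root())` then behaves the same no matter where the user invokes the script from.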
```python
    for key in ("chroma_dir", "cache_dir"):
        p = Path(paths[key])
        if p.exists():
            shutil.rmtree(p)
            print(f"Deleted {p}")
    for f in ("summary.md", "summary.pdf", "study.apkg"):
```
KeyError & safety guard around path lookup

`cfg["paths"][key]` will raise if the key is absent. A safer pattern:

```diff
-    for key in ("chroma_dir", "cache_dir"):
-        p = Path(paths[key])
+    for key in ("chroma_dir", "cache_dir"):
+        if key not in paths:
+            continue
+        p = Path(paths[key])
```

Also consider `shutil.rmtree(p, ignore_errors=True)` to cope with permission issues.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    for key in ("chroma_dir", "cache_dir"):
        if key not in paths:
            continue
        p = Path(paths[key])
        if p.exists():
            shutil.rmtree(p)
            print(f"Deleted {p}")
    for f in ("summary.md", "summary.pdf", "study.apkg"):
```
🤖 Prompt for AI Agents
In Dev/src/study_tools/reset.py around lines 12 to 17, the code directly
accesses paths using keys without checking if they exist, which can raise a
KeyError. Modify the code to safely retrieve the path using a method like
dict.get() with a default or check if the key exists before accessing it.
Additionally, update shutil.rmtree calls to include ignore_errors=True to handle
potential permission issues gracefully.
```python
from qdrant_client import QdrantClient
# …
cfg = load_config()
chroma_path = cfg["paths"]["chroma_dir"]
```
🛠️ Refactor suggestion
Fix naming inconsistency: "chroma_dir" should be "qdrant_dir".
The code uses Qdrant vector store but references a "chroma_dir" configuration key, which is inconsistent.
```diff
-    chroma_path = cfg["paths"]["chroma_dir"]
+    chroma_path = cfg["paths"]["qdrant_dir"]
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    chroma_path = cfg["paths"]["qdrant_dir"]
```
🤖 Prompt for AI Agents
In Dev/src/study_tools/flashcards.py at line 18, the configuration key
"chroma_dir" is incorrectly used while the code uses Qdrant vector store. Rename
the key from "chroma_dir" to "qdrant_dir" to maintain naming consistency with
the vector store being used.
```toml
packages = [{include = "study_tools", from = "src"}]
# …
[tool.poetry.dependencies]
python = "^3.12"
```
💡 Verification agent
🧩 Analysis chain
Verify Python 3.12 requirement necessity.
Python 3.12 is quite recent and may limit adoption. Verify if this strict requirement is necessary or if Python 3.10+ would suffice.
🏁 Script executed:

```shell
#!/bin/bash
# Check if the codebase uses Python 3.12+ specific features
rg -n "match|case|\|" --type py
rg -n "Self|TypeAlias|Literal" --type py
rg -n "from __future__ import annotations" --type py
```

Length of output: 784
Lower Python requirement to 3.10 (no 3.12-only features detected)
Our codebase uses PEP 604 union types (Path | str) which require Python 3.10+, but no features exclusive to 3.11 or 3.12 were found. Please relax the constraint unless there’s another justification for requiring 3.12.
• File needing update:
- Dev/pyproject.toml
Suggested change:
```diff
-python = "^3.12"
+python = "^3.10"
```

🤖 Prompt for AI Agents
In Dev/pyproject.toml at line 9, the Python version requirement is set to 3.12
but the code only uses features available from Python 3.10 onwards. Change the
python version specifier from "^3.12" to "^3.10" to relax the version constraint
accordingly.
```python
    for i in range(0, len(doc), pages_per_group - overlap):
        end = min(i + pages_per_group, len(doc))
```
Invalid overlap configuration breaks the page loop

The range calculation `range(0, len(doc), pages_per_group - overlap)` fails if `overlap >= pages_per_group`: a zero step raises `ValueError`, and a negative step silently yields no pages.
Add input validation to prevent this edge case:
```diff
 def extract_pages(pdf_path: Path, pages_per_group: int, overlap: int):
+    if overlap >= pages_per_group:
+        raise ValueError(f"Overlap ({overlap}) must be less than pages_per_group ({pages_per_group})")
+    if pages_per_group <= 0 or overlap < 0:
+        raise ValueError("pages_per_group must be positive and overlap must be non-negative")
     import fitz  # PyMuPDF
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def extract_pages(pdf_path: Path, pages_per_group: int, overlap: int):
    if overlap >= pages_per_group:
        raise ValueError(f"Overlap ({overlap}) must be less than pages_per_group ({pages_per_group})")
    if pages_per_group <= 0 or overlap < 0:
        raise ValueError("pages_per_group must be positive and overlap must be non-negative")
    import fitz  # PyMuPDF
    # ... rest of extract_pages implementation ...
    for i in range(0, len(doc), pages_per_group - overlap):
        end = min(i + pages_per_group, len(doc))
        …
```
🤖 Prompt for AI Agents
In Dev/src/study_tools/build_index.py around lines 16 to 17, the loop step is
calculated as pages_per_group minus overlap, which can be zero or negative if
overlap is greater than or equal to pages_per_group, causing an infinite loop.
Add input validation before the loop to check if overlap is less than
pages_per_group and raise an appropriate error or handle the case to prevent the
loop from running with an invalid step value.
```python
    cfg = load_config()
    paths = cfg["paths"]
    docs_dir = Path(paths["docs_dir"])
    chroma_dir = Path(paths["chroma_dir"])
```
🛠️ Refactor suggestion
Misleading variable name: using Qdrant but named chroma_dir.
The variable is named chroma_dir but the code uses Qdrant for vector storage, not Chroma. This creates confusion and suggests a copy-paste error from a previous Chroma implementation.
```diff
-chroma_dir = Path(paths["chroma_dir"])
+vector_store_dir = Path(paths["chroma_dir"])  # Consider renaming config key to "vector_store_dir"
```

And update all references:

```diff
-if chroma_dir.exists():
-    shutil.rmtree(chroma_dir)
+if vector_store_dir.exists():
+    shutil.rmtree(vector_store_dir)

-client = QdrantClient(path=str(chroma_dir))
+client = QdrantClient(path=str(vector_store_dir))

-storage.persist(persist_dir=str(chroma_dir))
+storage.persist(persist_dir=str(vector_store_dir))
```

📝 Committable suggestion
```python
# Initialize the directory for vector store
vector_store_dir = Path(paths["chroma_dir"])  # Consider renaming config key to "vector_store_dir"

# Remove any existing data
if vector_store_dir.exists():
    shutil.rmtree(vector_store_dir)

# Connect to Qdrant vector store
client = QdrantClient(path=str(vector_store_dir))

# Persist the embeddings/storage
storage.persist(persist_dir=str(vector_store_dir))
```
🤖 Prompt for AI Agents
In Dev/src/study_tools/build_index.py at line 37, rename the variable
`chroma_dir` to a name that reflects its use with Qdrant, such as `qdrant_dir`.
Update all references to this variable throughout the file to maintain
consistency and avoid confusion about the vector storage backend.
```python
def summarise_group(llm: Any, nodes: Sequence[Any], lang: str, cache_dir: Path) -> List[str]:
    results = []
    for node in nodes:
        key = _sha(node)
        fp = cache_dir / f"{key}.md"
        if fp.exists():
            results.append(fp.read_text(encoding="utf-8"))
            continue
        prompt = f"Summarise in {lang}:\n\n{node.text}"
        text = asyncio.run(_complete(llm, prompt))
        fp.write_text(text, encoding="utf-8")
        results.append(text)
    return results
```
🛠️ Refactor suggestion
Add error handling for file operations.
The function performs file I/O operations without proper error handling, which could cause the application to crash on filesystem issues.
```diff
 def summarise_group(llm: Any, nodes: Sequence[Any], lang: str, cache_dir: Path) -> List[str]:
     results = []
     for node in nodes:
         key = _sha(node)
         fp = cache_dir / f"{key}.md"
-        if fp.exists():
-            results.append(fp.read_text(encoding="utf-8"))
-            continue
+        try:
+            if fp.exists():
+                results.append(fp.read_text(encoding="utf-8"))
+                continue
+        except (OSError, IOError) as e:
+            print(f"Warning: Could not read cache file {fp}: {e}")
+            # Continue to regenerate summary
+
         prompt = f"Summarise in {lang}:\n\n{node.text}"
         text = asyncio.run(_complete(llm, prompt))
-        fp.write_text(text, encoding="utf-8")
+        try:
+            fp.write_text(text, encoding="utf-8")
+        except (OSError, IOError) as e:
+            print(f"Warning: Could not write cache file {fp}: {e}")
         results.append(text)
     return results
```

📝 Committable suggestion
```python
def summarise_group(llm: Any, nodes: Sequence[Any], lang: str, cache_dir: Path) -> List[str]:
    results = []
    for node in nodes:
        key = _sha(node)
        fp = cache_dir / f"{key}.md"
        try:
            if fp.exists():
                results.append(fp.read_text(encoding="utf-8"))
                continue
        except (OSError, IOError) as e:
            print(f"Warning: Could not read cache file {fp}: {e}")
            # Continue to regenerate summary
        prompt = f"Summarise in {lang}:\n\n{node.text}"
        text = asyncio.run(_complete(llm, prompt))
        try:
            fp.write_text(text, encoding="utf-8")
        except (OSError, IOError) as e:
            print(f"Warning: Could not write cache file {fp}: {e}")
        results.append(text)
    return results
```
🤖 Prompt for AI Agents
In Dev/src/study_tools/summarize.py around lines 31 to 43, the function performs
file read and write operations without error handling, risking crashes on
filesystem errors. Add try-except blocks around the file reading and writing
code to catch and handle exceptions such as IOError or OSError. Log or handle
these exceptions appropriately to ensure the application continues running
smoothly even if file operations fail.
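The error handling suggested above can be centralised in a pair of small helpers; note that on Python 3 `IOError` is an alias of `OSError`, so catching `OSError` alone covers both. This is a hedged sketch with hypothetical helper names (`read_cache`, `write_cache`), not code from the PR:

```python
from pathlib import Path
from typing import Optional

def read_cache(fp: Path) -> Optional[str]:
    """Return the cached text, or None when the file is missing or unreadable."""
    try:
        return fp.read_text(encoding="utf-8") if fp.exists() else None
    except OSError as e:  # IOError is an alias of OSError on Python 3
        print(f"Warning: could not read cache file {fp}: {e}")
        return None

def write_cache(fp: Path, text: str) -> None:
    """Best-effort write; a failed cache write should not abort the run."""
    try:
        fp.parent.mkdir(parents=True, exist_ok=True)
        fp.write_text(text, encoding="utf-8")
    except OSError as e:
        print(f"Warning: could not write cache file {fp}: {e}")
```

With these, the loop body reduces to: check `read_cache(fp)`, and only on a `None` result generate the summary and `write_cache(fp, text)`, keeping the caching policy in one place.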
```python
            results.append(fp.read_text(encoding="utf-8"))
            continue
        prompt = f"Summarise in {lang}:\n\n{node.text}"
        text = asyncio.run(_complete(llm, prompt))
```
🛠️ Refactor suggestion
Avoid using asyncio.run in synchronous context.
Calling asyncio.run inside a synchronous function raises RuntimeError whenever that function is invoked from code that already has an event loop running, and it also creates and tears down a fresh event loop on every call. It's better to make the entire function async and run the event loop once at the top level.
Make the function async and handle the async context properly:
```diff
-def summarise_group(llm: Any, nodes: Sequence[Any], lang: str, cache_dir: Path) -> List[str]:
+async def summarise_group(llm: Any, nodes: Sequence[Any], lang: str, cache_dir: Path) -> List[str]:
     results = []
     for node in nodes:
         key = _sha(node)
         fp = cache_dir / f"{key}.md"
         if fp.exists():
             results.append(fp.read_text(encoding="utf-8"))
             continue
         prompt = f"Summarise in {lang}:\n\n{node.text}"
-        text = asyncio.run(_complete(llm, prompt))
+        text = await _complete(llm, prompt)
         fp.write_text(text, encoding="utf-8")
         results.append(text)
     return results
```

Then update the main function to use asyncio.run at the top level:

```diff
-    for g in tqdm(groups):
-        summaries.extend(summarise_group(llm, g, "English", cache_dir))
+    async def process_groups():
+        for g in tqdm(groups):
+            summaries.extend(await summarise_group(llm, g, "English", cache_dir))
+
+    asyncio.run(process_groups())
```

📝 Committable suggestion
```python
# Dev/src/study_tools/summarize.py
# … other imports …
import asyncio
from pathlib import Path
from typing import Any, Sequence, List

from tqdm import tqdm

# make helper fully async
async def summarise_group(
    llm: Any,
    nodes: Sequence[Any],
    lang: str,
    cache_dir: Path
) -> List[str]:
    results: List[str] = []
    for node in nodes:
        key = _sha(node)
        fp = cache_dir / f"{key}.md"
        if fp.exists():
            results.append(fp.read_text(encoding="utf-8"))
            continue
        prompt = f"Summarise in {lang}:\n\n{node.text}"
        text = await _complete(llm, prompt)
        fp.write_text(text, encoding="utf-8")
        results.append(text)
    return results

def main():
    # … setup llm, groups, cache_dir, summaries = [] …

    # run all groups under one top-level event loop
    async def process_groups():
        for g in tqdm(groups):
            summaries.extend(await summarise_group(
                llm, g, "English", cache_dir
            ))

    asyncio.run(process_groups())

    # … rest of main …
```
🤖 Prompt for AI Agents
In Dev/src/study_tools/summarize.py at line 40, avoid calling asyncio.run inside
a synchronous function as it can cause issues if called from an async context.
Instead, convert the entire function containing this line to be async, replacing
the direct call to asyncio.run with an await expression. Then, ensure that
asyncio.run is only called once at the top-level entry point of the program to
execute the main async function.
Summary by CodeRabbit

- **New Features**
- **Documentation**
- **Chores**
  - `.gitignore` for common artifacts and IDE files.
- **Tests**