
Conversation

@Jul352mf Jul352mf commented Jul 5, 2025

Summary by CodeRabbit

  • New Features

    • Introduced a modular study bot package with configurable PDF ingestion, chunking, summarization, chat, and flashcard generation.
    • Added CLI tools for index building, summarization, chat interaction, and data reset.
    • Implemented configuration and agent role management via YAML files.
    • Integrated support for local and cloud-based language models.
  • Documentation

    • Added comprehensive READMEs, usage guides, architecture overviews, migration instructions, and a changelog.
    • Introduced a JSON schema for Learning Units and a prioritized development TODO list.
  • Chores

    • Established dependency management with Poetry and requirements files.
    • Added .gitignore for common artifacts and IDE files.
  • Tests

    • Added a basic test to verify module imports.


coderabbitai bot commented Jul 5, 2025

Walkthrough

This update introduces a comprehensive refactor and modularization of the Universal Study Tutor project. It adds a new Dev/ package with structured source code, configuration, documentation, and tests. Key features include PDF ingestion, summarization, chat, flashcard generation, and centralized configuration. Supporting documentation and configuration files are also included, while the legacy prototype is retained in a separate directory.

Changes

File(s) Change Summary
.gitignore Added to exclude Python, IDE, environment, data, and cache files from version control.
Dev/CONTRIBUTING.md Added contribution guidelines covering setup, testing, code style, and documentation.
Dev/README.md Added detailed README for the new Dev package, outlining features and quickstart instructions.
Dev/agents.yaml Added agent configuration YAML specifying six system agents and their roles.
Dev/config.yaml Added centralized YAML configuration for paths, chunking, models, and limits.
Dev/docs/MIGRATE_LARGE_FILES.md Added documentation for handling large PDF files and Git LFS usage.
Dev/docs/TODO.md Added prioritized development task backlog.
Dev/docs/changelog.md Added changelog documenting initial refactor and setup.
Dev/docs/overview.md Added overview of the Dev package architecture and workflow.
Dev/pyproject.toml, Dev/requirements.txt Added Poetry and pip dependency files for project and development requirements.
Dev/src/study_tools/__init__.py Added package initializer with explicit public API via __all__.
Dev/src/study_tools/build_index.py Added module for PDF ingestion and vector index creation.
Dev/src/study_tools/cli_chat.py Added CLI chat interface using vector store and LLM.
Dev/src/study_tools/flashcards.py Added script to generate Anki flashcards from summaries.
Dev/src/study_tools/ingest.py Added script to count and print loaded documents.
Dev/src/study_tools/reset.py Added script to remove generated data and files.
Dev/src/study_tools/summarize.py Added asynchronous summarization pipeline with caching.
Dev/src/study_tools/utils.py Added utility for loading and caching YAML configuration.
Dev/tests/test_imports.py Added test to verify all main modules can be imported.
README.md Simplified to reference new Dev package and its README.
messy_start/README.md Added README for legacy prototype with setup, usage, and architecture overview.
messy_start/schema/learning_unit.schema.json Added JSON schema for LearningUnit objects with required and optional fields.
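For reference, the cached configuration loader added in Dev/src/study_tools/utils.py could be as small as the following sketch. The load_config name comes from the PR; the lru_cache-based caching and the default path are assumptions for illustration only.

```python
# Sketch of a cached YAML config loader, as described for
# Dev/src/study_tools/utils.py. The lru_cache strategy is an assumption.
from functools import lru_cache
from pathlib import Path

import yaml  # provided by the pyyaml dependency in requirements.txt


@lru_cache(maxsize=None)
def load_config(path: str = "config.yaml") -> dict:
    """Parse the YAML configuration once and reuse the cached dict."""
    return yaml.safe_load(Path(path).read_text(encoding="utf-8"))
```

Because the result is cached per path, repeated calls across modules parse the file only once.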

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant ConfigLoader
    participant PDFIngestor
    participant VectorStore
    participant LLM
    participant FlashcardBuilder

    User->>CLI: Run build_index / summarize / cli_chat / flashcards
    CLI->>ConfigLoader: Load config.yaml
    CLI->>PDFIngestor: Ingest PDFs (build_index)
    PDFIngestor->>VectorStore: Store chunks
    CLI->>LLM: Summarize chunks (summarize)
    LLM->>VectorStore: Store summaries
    CLI->>LLM: Chat with material (cli_chat)
    CLI->>FlashcardBuilder: Generate flashcards (flashcards)
    FlashcardBuilder->>VectorStore: Retrieve summaries
    FlashcardBuilder->>User: Export Anki deck

Poem

🐇
In the warren of code, we hop anew,
With configs, docs, and tests in view.
Summaries, flashcards, chat at your call—
Study tools modular, ready for all!
From legacy roots to a fresh Dev sprout,
Rabbits rejoice—let learning hop out!

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

🔭 Outside diff range comments (1)
Dev/requirements.txt (1)

1-10: Pin dependency versions in Dev/requirements.txt

To ensure reproducible builds and guard against unintended breaking changes, add version constraints based on the latest PyPI releases:

File: Dev/requirements.txt

-llama-index-core
-llama-index-llms-openai
-chromadb
-tiktoken
-tenacity
-qdrant-client
-genanki
-tqdm
-pyyaml
+llama-index-core>=0.12.46,<1.0.0
+llama-index-llms-openai>=0.4.7,<1.0.0
+chromadb>=1.0.15,<2.0.0
+tiktoken>=0.9.0,<1.0.0
+tenacity>=9.1.2,<10.0.0
+qdrant-client>=1.14.3,<2.0.0
+genanki>=0.13.1,<1.0.0
+tqdm>=4.67.1,<5.0.0
+pyyaml>=6.0.2,<7.0.0

No security advisories were found for these packages at the time of review.

🧹 Nitpick comments (16)
README.md (1)

1-8: Consider restoring a minimal quick-start or installation snippet

The root README now only points to Dev/README.md. While that keeps the file uncluttered, having a two-line quick-start (e.g. “pip install -e Dev && study-tools …”) helps newcomers decide whether to keep reading.

Dev/docs/MIGRATE_LARGE_FILES.md (1)

11-11: Missing comma after a leading adverb

Insert a comma after the introductory adverb "Otherwise".

-Otherwise keep the files locally and back them up to Google Drive or GCS as needed.
+Otherwise, keep the files locally and back them up to Google Drive or GCS as needed.
Dev/src/study_tools/reset.py (1)

24-25: Expose a CLI via python -m study_tools.reset

Wrapping main() in a console_scripts entry point (or enabling python -m study_tools.reset) improves UX and avoids relying on file paths.
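The `python -m study_tools.reset` route suggested above only needs a standard entry-point guard at the bottom of reset.py, roughly like this sketch (main's real body lives in the PR and is elided here):

```python
# Sketch: adding a module entry point to Dev/src/study_tools/reset.py so it
# can be invoked as `python -m study_tools.reset`. The body of main() is
# elided; the print is a placeholder, not the PR's actual output.
def main() -> None:
    """Remove generated data and files (actual logic lives in the PR)."""
    print("reset complete")


if __name__ == "__main__":  # runs only when executed as a script/module
    main()
```

Equivalently, a console_scripts entry could be declared in Dev/pyproject.toml to expose a dedicated command (name and section would need to match the project's Poetry layout).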

Dev/docs/TODO.md (1)

5-5: Fix hyphenation for compound adjective.

The term should be "hard-coded" when used as a compound adjective.

-- Remove hard coded paths; read from `config.yaml`.
+- Remove hard-coded paths; read from `config.yaml`.
Dev/src/study_tools/cli_chat.py (2)

4-4: Remove unused import.

The pathlib.Path import is not used in the code.

-from pathlib import Path

11-39: Consider refactoring to reduce local variables.

The function has 16 local variables, which slightly exceeds the typical limit of 15. While not critical, consider extracting the chat engine setup into a separate function for better modularity.

+def create_chat_engine():
+    """Create and return a configured chat engine."""
+    from llama_index.core import StorageContext, load_index_from_storage
+    from llama_index.llms.openai import OpenAI
+    from llama_index.vector_stores.qdrant import QdrantVectorStore
+    from qdrant_client import QdrantClient
+    
+    cfg = load_config()
+    llm = OpenAI(model=cfg["models"]["summarizer"])
+    chroma_path = cfg["paths"]["chroma_dir"]
+    client = QdrantClient(path=chroma_path)
+    store = QdrantVectorStore(client, collection_name="study")
+    storage = StorageContext.from_defaults(persist_dir=chroma_path, vector_store=store)
+    index = load_index_from_storage(storage)
+    return index.as_chat_engine(chat_mode="condense_question", llm=llm, verbose=True)
+
 def main():
-    from llama_index.core import StorageContext, load_index_from_storage
-    from llama_index.llms.openai import OpenAI
-    from llama_index.vector_stores.qdrant import QdrantVectorStore
-    from qdrant_client import QdrantClient
-
-    cfg = load_config()
-    llm = OpenAI(model=cfg["models"]["summarizer"])
-    chroma_path = cfg["paths"]["chroma_dir"]
-    client = QdrantClient(path=chroma_path)
-    store = QdrantVectorStore(client, collection_name="study")
-    storage = StorageContext.from_defaults(persist_dir=chroma_path, vector_store=store)
-    index = load_index_from_storage(storage)
-    engine = index.as_chat_engine(chat_mode="condense_question", llm=llm, verbose=True)
+    engine = create_chat_engine()
Dev/README.md (1)

8-8: Fix formatting: Use number '0' instead of letter 'O' in "GPT-4o".

The model name should use the number '0' instead of the letter 'O' for consistency with OpenAI's official naming.

-- Async summarisation using local Mistral and OpenAI GPT‑4o
+- Async summarisation using local Mistral and OpenAI GPT-4o
Dev/src/study_tools/flashcards.py (1)

4-4: Remove unused import.

-from pathlib import Path
Dev/pyproject.toml (1)

21-23: Pin development dependency versions for consistency.

-pytest = "*"
-ruff = "*"
-black = "*"
+pytest = "^7.0.0"
+ruff = "^0.1.0"
+black = "^23.0.0"
Dev/src/study_tools/build_index.py (3)

22-23: Page numbering inconsistency in metadata.

The metadata shows page_start: i + 1 but the range starts from i, creating a potential off-by-one confusion. Consider whether pages should be 1-indexed or 0-indexed consistently.

If you want 1-indexed pages (more user-friendly), apply this diff:

        meta = {
            "file_path": str(pdf_path),
            "file_name": pdf_path.name,
            "page_start": i + 1,
-            "page_end": end,
+            "page_end": end,  # This should be end, not end + 1, since range is exclusive
        }

29-29: Remove unused import.

The static analysis correctly identifies that Document is imported but never used in the main function.

-    from llama_index.core import VectorStoreIndex, StorageContext, Document
+    from llama_index.core import VectorStoreIndex, StorageContext

28-63: Consider refactoring to reduce complexity.

The main function has 18 local variables, which exceeds the recommended limit and makes the function harder to maintain. Consider extracting some functionality into helper functions.

Example refactoring:

def _setup_vector_store(vector_store_dir: Path):
    """Set up and return Qdrant vector store components."""
    if vector_store_dir.exists():
        shutil.rmtree(vector_store_dir)
    
    client = QdrantClient(path=str(vector_store_dir))
    store = QdrantVectorStore(client, collection_name="study")
    storage = StorageContext.from_defaults(vector_store=store)
    return storage

def _collect_documents(docs_dir: Path, chunk_config: dict):
    """Collect and process documents from PDF files."""
    docs = []
    for pdf in docs_dir.rglob("*.pdf"):
        docs.extend(
            extract_pages(
                pdf,
                chunk_config["pages_per_group"],
                chunk_config["page_overlap"],
            )
        )
    return docs
Dev/src/study_tools/summarize.py (3)

67-67: Make batch size configurable.

The batch size of 5 is hard-coded, which reduces flexibility. Consider making it configurable through the config file.

-    groups = [nodes[i:i+5] for i in range(0, len(nodes), 5)]
+    batch_size = cfg.get("summarization", {}).get("batch_size", 5)
+    groups = [nodes[i:i+batch_size] for i in range(0, len(nodes), batch_size)]

71-71: Make language configurable.

The language "English" is hard-coded, which limits the tool's international usability. Consider making it configurable.

-        summaries.extend(summarise_group(llm, g, "English", cache_dir))
+        language = cfg.get("summarization", {}).get("language", "English")
+        summaries.extend(summarise_group(llm, g, language, cache_dir))

46-73: Consider refactoring to reduce complexity.

The main function has 19 local variables, which exceeds the recommended limit. Consider extracting some functionality into helper functions.

Example refactoring:

def _setup_summarization_components(cfg):
    """Set up LLM, vector store, and index components."""
    paths = cfg["paths"]
    cache_dir = Path(paths["cache_dir"])
    cache_dir.mkdir(parents=True, exist_ok=True)
    
    llm = OpenAI(model=cfg["models"]["summarizer"])
    
    client = QdrantClient(path=paths["chroma_dir"])
    store = QdrantVectorStore(client, collection_name="study")
    storage = StorageContext.from_defaults(persist_dir=paths["chroma_dir"], vector_store=store)
    index = load_index_from_storage(storage)
    
    return llm, index, cache_dir

def _extract_and_group_nodes(index, batch_size=5):
    """Extract text nodes and group them for processing."""
    nodes = [n for n in index.docstore.docs.values() if isinstance(n, TextNode)]
    groups = [nodes[i:i+batch_size] for i in range(0, len(nodes), batch_size)]
    return groups
messy_start/README.md (1)

54-54: Minor grammar improvement suggested.

The phrase "retrieves relevant chunks for user questions" could be more natural as "retrieves relevant chunks of user questions" as suggested by the language tool.

-4. **Chat/Q&A** – retrieves relevant chunks for user questions.
+4. **Chat/Q&A** – retrieves relevant chunks of user questions.

However, both phrasings are grammatically correct. The current version emphasizes that chunks are retrieved "for" (in response to) user questions, which may be more accurate to the functionality. Consider keeping it as is if the current meaning is preferred.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9069cfe and 998842e.

⛔ Files ignored due to path filters (32)
  • messy_start/Corperate_Finance_download/01 General Information 2025F-3.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/02 Risk Return 2025F.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/03 Portfolio Theory CAPM 2025F.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/04 Corporate Valuation 2025F.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/05 Cost of Capital 2025F.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/06 Corporate Finance Issuance of Securities 2025F.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/07 Capital Structure 2025F.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/08 Risk Management 2025F.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/09 Corporate Governance 2025F.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/10 Efficient Markets and Behavioral Finance 2025F.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/11 Empirical Research 2025F.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Aviva Investors Student 2022.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Cohn, J. B., and Wardlaw, M. I., 2016, Financing constraints and workplace safety, The Journal of Finance, 71(5), 2017-2058.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Cox Communications Inc Student 2019.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/ESG Financing Constraints and Workplace Safety Solution 2019.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/ESG Financing Constraints and Workplace Safety Student 2019.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/First session Solution-4.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/First session-4.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Formulary.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Fourth session Solution-6.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Fourth session-5.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Mock Exam - Solution-1.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Mock Exam-1.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/NormDist.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Second session Solution-4.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Second session-3.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Third session Solution-5.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Third session-5.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/UBS Presentation Uni SG May 2025.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Valuation - Consolidated Edison and Chemex AG Solution 2019.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Valuation - Consolidated Edison and Chemex AG Student 2019.pdf is excluded by !**/*.pdf
  • messy_start/Corperate_Finance_download/Valuing Snap after the IPO quiet period Student 2019.pdf is excluded by !**/*.pdf
📒 Files selected for processing (23)
  • .gitignore (1 hunks)
  • Dev/CONTRIBUTING.md (1 hunks)
  • Dev/README.md (1 hunks)
  • Dev/agents.yaml (1 hunks)
  • Dev/config.yaml (1 hunks)
  • Dev/docs/MIGRATE_LARGE_FILES.md (1 hunks)
  • Dev/docs/TODO.md (1 hunks)
  • Dev/docs/changelog.md (1 hunks)
  • Dev/docs/overview.md (1 hunks)
  • Dev/pyproject.toml (1 hunks)
  • Dev/requirements.txt (1 hunks)
  • Dev/src/study_tools/__init__.py (1 hunks)
  • Dev/src/study_tools/build_index.py (1 hunks)
  • Dev/src/study_tools/cli_chat.py (1 hunks)
  • Dev/src/study_tools/flashcards.py (1 hunks)
  • Dev/src/study_tools/ingest.py (1 hunks)
  • Dev/src/study_tools/reset.py (1 hunks)
  • Dev/src/study_tools/summarize.py (1 hunks)
  • Dev/src/study_tools/utils.py (1 hunks)
  • Dev/tests/test_imports.py (1 hunks)
  • README.md (1 hunks)
  • messy_start/README.md (1 hunks)
  • messy_start/schema/learning_unit.schema.json (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
Dev/src/study_tools/ingest.py (1)
Dev/src/study_tools/utils.py (1)
  • load_config (8-14)
Dev/src/study_tools/reset.py (6)
Dev/src/study_tools/utils.py (1)
  • load_config (8-14)
Dev/src/study_tools/flashcards.py (1)
  • main (11-35)
Dev/src/study_tools/cli_chat.py (1)
  • main (11-39)
Dev/src/study_tools/ingest.py (1)
  • main (8-13)
Dev/src/study_tools/summarize.py (1)
  • main (46-73)
Dev/src/study_tools/build_index.py (1)
  • main (28-63)
Dev/src/study_tools/cli_chat.py (5)
Dev/src/study_tools/utils.py (1)
  • load_config (8-14)
Dev/src/study_tools/flashcards.py (1)
  • main (11-35)
Dev/src/study_tools/ingest.py (1)
  • main (8-13)
Dev/src/study_tools/summarize.py (1)
  • main (46-73)
Dev/src/study_tools/build_index.py (1)
  • main (28-63)
Dev/src/study_tools/summarize.py (5)
Dev/src/study_tools/utils.py (1)
  • load_config (8-14)
Dev/src/study_tools/cli_chat.py (1)
  • main (11-39)
Dev/src/study_tools/ingest.py (1)
  • main (8-13)
Dev/src/study_tools/reset.py (1)
  • main (9-21)
Dev/src/study_tools/build_index.py (1)
  • main (28-63)
🪛 LanguageTool
Dev/docs/MIGRATE_LARGE_FILES.md

[uncategorized] ~10-~10: A comma may be missing after the conjunctive/linking adverb ‘Otherwise’.
Context: ...ash git lfs migrate import '*.pdf' ``` Otherwise keep the files locally and back them up...

(SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA)

Dev/docs/overview.md

[formatting] ~6-~6: It looks like you are using the letter ‘O’ instead of the number ‘0’. Did you mean “40”?
Context: ...es text chunks into categories. - GPT‑4o/4.1 performs heavy summarisation and ...

(ZERO_O)

Dev/docs/TODO.md

[uncategorized] ~5-~5: This word is normally spelled with a hyphen.
Context: ... loader (utils.load_config). - Remove hard coded paths; read from config.yaml. - Store...

(HARD_CODE_COMPOUND)

Dev/README.md

[formatting] ~8-~8: It looks like you are using the letter ‘O’ instead of the number ‘0’. Did you mean “40”?
Context: ...tion using local Mistral and OpenAI GPT‑4o - CLI tools for building the index, cha...

(ZERO_O)

messy_start/README.md

[uncategorized] ~53-~53: The preposition ‘of’ seems more likely in this position.
Context: ...Chat/Q&A* – retrieves relevant chunks for user questions. 5. *Flashcard Builder...

(AI_HYDRA_LEO_REPLACE_FOR_OF)

🪛 Flake8 (7.2.0)
Dev/tests/test_imports.py

[error] 13-13: expected 2 blank lines, found 1

(E302)

Dev/src/study_tools/build_index.py

[error] 29-29: 'llama_index.core.Document' imported but unused

(F401)

Dev/src/study_tools/flashcards.py

[error] 4-4: 'pathlib.Path' imported but unused

(F401)

Dev/src/study_tools/cli_chat.py

[error] 4-4: 'pathlib.Path' imported but unused

(F401)

🪛 Pylint (3.3.7)
Dev/src/study_tools/build_index.py

[refactor] 28-28: Too many local variables (18/15)

(R0914)

Dev/src/study_tools/flashcards.py

[refactor] 11-11: Too many local variables (19/15)

(R0914)

Dev/src/study_tools/cli_chat.py

[refactor] 11-11: Too many local variables (16/15)

(R0914)

Dev/src/study_tools/summarize.py

[refactor] 46-46: Too many local variables (19/15)

(R0914)

🪛 Ruff (0.11.9)
Dev/src/study_tools/build_index.py

29-29: llama_index.core.Document imported but unused

Remove unused import: llama_index.core.Document

(F401)

Dev/src/study_tools/flashcards.py

4-4: pathlib.Path imported but unused

Remove unused import: pathlib.Path

(F401)

Dev/src/study_tools/cli_chat.py

4-4: pathlib.Path imported but unused

Remove unused import: pathlib.Path

(F401)

🔇 Additional comments (12)
Dev/docs/changelog.md (1)

1-8: LGTM! Well-structured changelog documentation.

The changelog follows standard conventions and clearly documents the major refactor changes, including the new package structure, configuration management, and data organization.

Dev/CONTRIBUTING.md (1)

1-8: LGTM! Comprehensive contributing guidelines.

The guidelines clearly outline the development workflow, emphasize code quality tools, and promote good practices for testing and documentation. This will help maintain consistency across contributions.

.gitignore (1)

14-16: 👍 Good call excluding the data directory

Ignoring Dev/data/ aligns with the new large-file workflow and keeps clones lean.

Dev/src/study_tools/__init__.py (1)

3-11: All exported sub-modules are present and importable
A directory check under Dev/src/study_tools confirms that build_index.py, summarize.py, cli_chat.py, flashcards.py, ingest.py, reset.py, and utils.py all exist. No missing entries in __all__.

messy_start/schema/learning_unit.schema.json (1)

1-21: Well-structured JSON schema for learning units.

The schema effectively defines the core data structure with appropriate required fields and clear property descriptions. The status encoding system (0=unknown, >0=seen, -1=known) provides a simple yet effective way to track learning progress.
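The status encoding called out above (0 = unknown, > 0 = seen, -1 = known) translates directly into a small helper; this is a hypothetical illustration, since the schema itself only defines the field:

```python
# Hypothetical helper interpreting the LearningUnit `status` field from
# messy_start/schema/learning_unit.schema.json:
#   0 = unknown, any positive count = seen, -1 = known.
def describe_status(status: int) -> str:
    if status == -1:
        return "known"
    if status == 0:
        return "unknown"
    if status > 0:
        return f"seen {status}x"
    raise ValueError(f"invalid status: {status}")
```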

Dev/config.yaml (1)

1-24: Comprehensive and well-structured configuration.

The configuration effectively centralizes key system parameters with sensible defaults. The chunking parameters (1024 size, 128 overlap) and model selections align well with the hybrid architecture described in the system overview.
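With those defaults, consecutive chunks advance by size minus overlap, i.e. 1024 - 128 = 896. The PR delegates real chunking to llama-index, so the following window arithmetic is illustrative only:

```python
# Illustrative sliding-window chunker matching the config defaults
# (chunk size 1024, overlap 128). Each window starts `size - overlap`
# after the previous one; the last window is clamped to the text length.
def chunk_spans(length: int, size: int = 1024, overlap: int = 128):
    stride = size - overlap  # 896 with the defaults
    spans = []
    start = 0
    while start < length:
        spans.append((start, min(start + size, length)))
        if start + size >= length:
            break
        start += stride
    return spans
```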

Dev/docs/overview.md (1)

1-13: Clear and accurate system overview.

The documentation effectively describes the hybrid-edge architecture and component responsibilities. The mention of configuration externalization and Git exclusion of PDFs shows good architectural awareness.

Dev/docs/TODO.md (1)

1-24: Well-organized development roadmap.

The TODO list effectively prioritizes development tasks with clear categories. The priorities align well with the system's architecture and current implementation state.

Dev/src/study_tools/cli_chat.py (1)

1-44: Well-implemented CLI chat interface.

The code effectively implements both single-question and interactive chat modes with proper argument parsing. The lazy import strategy is good for startup performance, and the chat engine configuration is appropriate for the use case.

Dev/README.md (1)

1-23: Well-structured documentation.

The README provides clear overview, features list, and quickstart instructions for the refactored implementation.

Dev/agents.yaml (1)

1-14: Well-structured agent configuration.

The agent definitions clearly map to the system's modular architecture and align with the functionality described in the documentation.

messy_start/README.md (1)

1-57: Well-structured documentation for the legacy prototype.

The README provides clear documentation of the original prototype implementation with good coverage of:

  • Current state and key components
  • Setup instructions with specific versions
  • Usage examples with command-line instructions
  • Architecture overview with component descriptions
  • Migration path to the new refactored implementation

This serves as valuable documentation for understanding the evolution from the legacy prototype to the new Dev/ implementation.

Comment on lines +12 to +15

def test_imports():
    for m in modules:
        importlib.import_module(m)

⚠️ Potential issue

Fix PEP 8 formatting issue.

The static analysis tool correctly identifies a missing blank line before the function definition.

Apply this diff to fix the formatting:

 ]
-
+

 def test_imports():

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Flake8 (7.2.0)

[error] 13-13: expected 2 blank lines, found 1

(E302)

🤖 Prompt for AI Agents
In Dev/tests/test_imports.py around lines 12 to 15, there is a PEP 8 formatting
issue due to a missing blank line before the function definition test_imports.
Add a blank line before the def test_imports(): line to comply with PEP 8 style
guidelines.

Comment on lines +8 to +14
def main():
    from llama_index.core import SimpleDirectoryReader
    cfg = load_config()
    docs_dir = Path(cfg["paths"]["docs_dir"])
    docs = SimpleDirectoryReader(str(docs_dir)).load_data()
    print(f"Loaded {len(docs)} docs")


🛠️ Refactor suggestion

Add error handling and improve code structure.

The code works but could benefit from better error handling and structure improvements.

Apply this diff to improve error handling and code structure:

 def main():
-    from llama_index.core import SimpleDirectoryReader
-    cfg = load_config()
-    docs_dir = Path(cfg["paths"]["docs_dir"])
-    docs = SimpleDirectoryReader(str(docs_dir)).load_data()
-    print(f"Loaded {len(docs)} docs")
+    try:
+        from llama_index.core import SimpleDirectoryReader
+        cfg = load_config()
+        docs_dir = Path(cfg["paths"]["docs_dir"])
+        
+        if not docs_dir.exists():
+            print(f"Error: Documents directory does not exist: {docs_dir}")
+            return
+            
+        docs = SimpleDirectoryReader(str(docs_dir)).load_data()
+        print(f"Loaded {len(docs)} docs")
+    except KeyError as e:
+        print(f"Error: Missing configuration key: {e}")
+    except Exception as e:
+        print(f"Error loading documents: {e}")

Consider moving the import to the top of the file for better visibility:

+from llama_index.core import SimpleDirectoryReader
 from pathlib import Path

 from .utils import load_config
🤖 Prompt for AI Agents
In Dev/src/study_tools/ingest.py around lines 8 to 14, improve the code by
moving the import of SimpleDirectoryReader to the top of the file for better
visibility and apply error handling around the document loading process. Wrap
the loading logic in a try-except block to catch and log exceptions, ensuring
the program handles errors gracefully instead of failing silently or crashing.

Comment on lines +17 to +21
for f in ("summary.md", "summary.pdf", "study.apkg"):
    fp = Path(f)
    if fp.exists():
        fp.unlink()
        print(f"Deleted {fp}")

🛠️ Refactor suggestion

Hard-coded filenames tie the script to the CWD

summary.md, summary.pdf, and study.apkg are deleted relative to wherever the user runs the script, which may not be the project root. Read the root from cfg (e.g. project_root) or resolve relative to Path(__file__).parent.parent.

🤖 Prompt for AI Agents
In Dev/src/study_tools/reset.py around lines 17 to 21, the filenames are
currently hard-coded and deleted relative to the current working directory,
which may not be the project root. To fix this, determine the project root
directory by reading it from the configuration (e.g., cfg.project_root) or by
resolving it relative to the script location using Path(__file__).parent.parent.
Then, construct the full paths to the files by joining the project root with the
filenames before checking existence and deleting them.

Comment on lines +12 to +17
for key in ("chroma_dir", "cache_dir"):
p = Path(paths[key])
if p.exists():
shutil.rmtree(p)
print(f"Deleted {p}")
for f in ("summary.md", "summary.pdf", "study.apkg"):

⚠️ Potential issue

KeyError & safety guard around path lookup

cfg["paths"][key] will raise if the key is absent. A safer pattern:

-    for key in ("chroma_dir", "cache_dir"):
-        p = Path(paths[key])
+    for key in ("chroma_dir", "cache_dir"):
+        if key not in paths:
+            continue
+        p = Path(paths[key])

Also consider shutil.rmtree(p, ignore_errors=True) to cope with permission issues.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    for key in ("chroma_dir", "cache_dir"):
-        p = Path(paths[key])
+    for key in ("chroma_dir", "cache_dir"):
+        if key not in paths:
+            continue
+        p = Path(paths[key])
         if p.exists():
             shutil.rmtree(p)
             print(f"Deleted {p}")
     for f in ("summary.md", "summary.pdf", "study.apkg"):
🤖 Prompt for AI Agents
In Dev/src/study_tools/reset.py around lines 12 to 17, the code directly
accesses paths using keys without checking if they exist, which can raise a
KeyError. Modify the code to safely retrieve the path using a method like
dict.get() with a default or check if the key exists before accessing it.
Additionally, update shutil.rmtree calls to include ignore_errors=True to handle
potential permission issues gracefully.

from qdrant_client import QdrantClient

cfg = load_config()
chroma_path = cfg["paths"]["chroma_dir"]

🛠️ Refactor suggestion

Fix naming inconsistency: "chroma_dir" should be "qdrant_dir".

The code uses Qdrant vector store but references a "chroma_dir" configuration key, which is inconsistent.

-    chroma_path = cfg["paths"]["chroma_dir"]
+    chroma_path = cfg["paths"]["qdrant_dir"]
📝 Committable suggestion

Suggested change
-    chroma_path = cfg["paths"]["chroma_dir"]
+    chroma_path = cfg["paths"]["qdrant_dir"]
🤖 Prompt for AI Agents
In Dev/src/study_tools/flashcards.py at line 18, the configuration key
"chroma_dir" is incorrectly used while the code uses Qdrant vector store. Rename
the key from "chroma_dir" to "qdrant_dir" to maintain naming consistency with
the vector store being used.

packages = [{include = "study_tools", from = "src"}]

[tool.poetry.dependencies]
python = "^3.12"

💡 Verification agent

🧩 Analysis chain

Verify Python 3.12 requirement necessity.

Python 3.12 is quite recent and may limit adoption. Verify if this strict requirement is necessary or if Python 3.10+ would suffice.


🏁 Script executed:

#!/bin/bash
# Check if the codebase uses Python 3.12+ specific features
rg -n "match|case|\\|" --type py
rg -n "Self|TypeAlias|Literal" --type py
rg -n "from __future__ import annotations" --type py

Length of output: 784


Lower Python requirement to 3.10 (no 3.12-only features detected)

Our codebase uses PEP 604 union types (Path | str) which require Python 3.10+, but no features exclusive to 3.11 or 3.12 were found. Please relax the constraint unless there’s another justification for requiring 3.12.

• File needing update:

  • Dev/pyproject.toml

Suggested change:

- python = "^3.12"
+ python = "^3.10"
🤖 Prompt for AI Agents
In Dev/pyproject.toml at line 9, the Python version requirement is set to 3.12
but the code only uses features available from Python 3.10 onwards. Change the
python version specifier from "^3.12" to "^3.10" to relax the version constraint
accordingly.

Comment on lines +16 to +17
for i in range(0, len(doc), pages_per_group - overlap):
end = min(i + pages_per_group, len(doc))

⚠️ Potential issue

Invalid overlap configuration breaks the chunking loop.

The range calculation range(0, len(doc), pages_per_group - overlap) misbehaves when overlap >= pages_per_group: a step of zero makes range() raise ValueError, and a negative step silently yields an empty range, so no pages are processed at all.

Add input validation to prevent this edge case:

def extract_pages(pdf_path: Path, pages_per_group: int, overlap: int):
+    if overlap >= pages_per_group:
+        raise ValueError(f"Overlap ({overlap}) must be less than pages_per_group ({pages_per_group})")
+    if pages_per_group <= 0 or overlap < 0:
+        raise ValueError("pages_per_group must be positive and overlap must be non-negative")
    import fitz  # PyMuPDF
📝 Committable suggestion

Suggested change
 def extract_pages(pdf_path: Path, pages_per_group: int, overlap: int):
+    if overlap >= pages_per_group:
+        raise ValueError(f"Overlap ({overlap}) must be less than pages_per_group ({pages_per_group})")
+    if pages_per_group <= 0 or overlap < 0:
+        raise ValueError("pages_per_group must be positive and overlap must be non-negative")
     import fitz  # PyMuPDF
     # ... rest of extract_pages implementation ...
     for i in range(0, len(doc), pages_per_group - overlap):
         end = min(i + pages_per_group, len(doc))
🤖 Prompt for AI Agents
In Dev/src/study_tools/build_index.py around lines 16 to 17, the loop step is
calculated as pages_per_group minus overlap, which can be zero or negative if
overlap is greater than or equal to pages_per_group; a zero step makes range()
raise ValueError and a negative step silently skips the loop. Add input
validation before the loop that requires overlap to be less than
pages_per_group, raising a clear error for invalid values.
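A small self-contained sketch of the validated chunking arithmetic — the function name and signature here are illustrative, not the project's actual `extract_pages` helper:

```python
def page_group_starts(n_pages: int, pages_per_group: int, overlap: int) -> list[int]:
    """Start indices for overlapping page groups, with the step validated up front.

    overlap == pages_per_group would give range() a step of zero (ValueError),
    and overlap > pages_per_group a negative step (empty range), so both are
    rejected explicitly rather than left to fail in surprising ways.
    """
    if pages_per_group <= 0 or overlap < 0:
        raise ValueError("pages_per_group must be positive and overlap non-negative")
    if overlap >= pages_per_group:
        raise ValueError(
            f"overlap ({overlap}) must be less than pages_per_group ({pages_per_group})"
        )
    return list(range(0, n_pages, pages_per_group - overlap))
```

For example, `page_group_starts(10, 4, 1)` produces starts `[0, 3, 6, 9]`, i.e. each group of four pages shares one page with its neighbour.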

cfg = load_config()
paths = cfg["paths"]
docs_dir = Path(paths["docs_dir"])
chroma_dir = Path(paths["chroma_dir"])

🛠️ Refactor suggestion

Misleading variable name: using Qdrant but named chroma_dir.

The variable is named chroma_dir but the code uses Qdrant for vector storage, not Chroma. This creates confusion and suggests a copy-paste error from a previous Chroma implementation.

-    chroma_dir = Path(paths["chroma_dir"])
+    vector_store_dir = Path(paths["chroma_dir"])  # Consider renaming config key to "vector_store_dir"

And update all references:

-    if chroma_dir.exists():
-        shutil.rmtree(chroma_dir)
+    if vector_store_dir.exists():
+        shutil.rmtree(vector_store_dir)
-    client = QdrantClient(path=str(chroma_dir))
+    client = QdrantClient(path=str(vector_store_dir))
-    storage.persist(persist_dir=str(chroma_dir))
+    storage.persist(persist_dir=str(vector_store_dir))
📝 Committable suggestion

Suggested change
chroma_dir = Path(paths["chroma_dir"])
# Initialize the directory for vector store
- chroma_dir = Path(paths["chroma_dir"])
+ vector_store_dir = Path(paths["chroma_dir"]) # Consider renaming config key to "vector_store_dir"
# Remove any existing data
- if chroma_dir.exists():
- shutil.rmtree(chroma_dir)
+ if vector_store_dir.exists():
+ shutil.rmtree(vector_store_dir)
# Connect to Qdrant vector store
- client = QdrantClient(path=str(chroma_dir))
+ client = QdrantClient(path=str(vector_store_dir))
# Persist the embeddings/storage
- storage.persist(persist_dir=str(chroma_dir))
+ storage.persist(persist_dir=str(vector_store_dir))
🤖 Prompt for AI Agents
In Dev/src/study_tools/build_index.py at line 37, rename the variable
`chroma_dir` to a name that reflects its use with Qdrant, such as `qdrant_dir`.
Update all references to this variable throughout the file to maintain
consistency and avoid confusion about the vector storage backend.

Comment on lines +31 to +43
def summarise_group(llm: Any, nodes: Sequence[Any], lang: str, cache_dir: Path) -> List[str]:
results = []
for node in nodes:
key = _sha(node)
fp = cache_dir / f"{key}.md"
if fp.exists():
results.append(fp.read_text(encoding="utf-8"))
continue
prompt = f"Summarise in {lang}:\n\n{node.text}"
text = asyncio.run(_complete(llm, prompt))
fp.write_text(text, encoding="utf-8")
results.append(text)
return results

🛠️ Refactor suggestion

Add error handling for file operations.

The function performs file I/O operations without proper error handling, which could cause the application to crash on filesystem issues.

def summarise_group(llm: Any, nodes: Sequence[Any], lang: str, cache_dir: Path) -> List[str]:
     results = []
     for node in nodes:
         key = _sha(node)
         fp = cache_dir / f"{key}.md"
-        if fp.exists():
-            results.append(fp.read_text(encoding="utf-8"))
-            continue
+        try:
+            if fp.exists():
+                results.append(fp.read_text(encoding="utf-8"))
+                continue
+        except (OSError, IOError) as e:
+            print(f"Warning: Could not read cache file {fp}: {e}")
+            # Continue to regenerate summary
+        
         prompt = f"Summarise in {lang}:\n\n{node.text}"
         text = asyncio.run(_complete(llm, prompt))
-        fp.write_text(text, encoding="utf-8")
+        try:
+            fp.write_text(text, encoding="utf-8")
+        except (OSError, IOError) as e:
+            print(f"Warning: Could not write cache file {fp}: {e}")
         results.append(text)
     return results
📝 Committable suggestion

Suggested change
 def summarise_group(llm: Any, nodes: Sequence[Any], lang: str, cache_dir: Path) -> List[str]:
     results = []
     for node in nodes:
         key = _sha(node)
         fp = cache_dir / f"{key}.md"
-        if fp.exists():
-            results.append(fp.read_text(encoding="utf-8"))
-            continue
+        try:
+            if fp.exists():
+                results.append(fp.read_text(encoding="utf-8"))
+                continue
+        except (OSError, IOError) as e:
+            print(f"Warning: Could not read cache file {fp}: {e}")
+            # Continue to regenerate summary
         prompt = f"Summarise in {lang}:\n\n{node.text}"
         text = asyncio.run(_complete(llm, prompt))
-        fp.write_text(text, encoding="utf-8")
+        try:
+            fp.write_text(text, encoding="utf-8")
+        except (OSError, IOError) as e:
+            print(f"Warning: Could not write cache file {fp}: {e}")
         results.append(text)
     return results
🤖 Prompt for AI Agents
In Dev/src/study_tools/summarize.py around lines 31 to 43, the function performs
file read and write operations without error handling, risking crashes on
filesystem errors. Add try-except blocks around the file reading and writing
code to catch and handle exceptions such as IOError or OSError. Log or handle
these exceptions appropriately to ensure the application continues running
smoothly even if file operations fail.

results.append(fp.read_text(encoding="utf-8"))
continue
prompt = f"Summarise in {lang}:\n\n{node.text}"
text = asyncio.run(_complete(llm, prompt))

🛠️ Refactor suggestion

Avoid using asyncio.run in synchronous context.

Calling asyncio.run inside a synchronous helper raises RuntimeError whenever an event loop is already running (e.g. if this function is ever invoked from async code), and it also spins up a fresh event loop for every node, which is needless overhead. It's better to make the entire function async and run the loop once at the top level.

Make the function async and handle the async context properly:

-def summarise_group(llm: Any, nodes: Sequence[Any], lang: str, cache_dir: Path) -> List[str]:
+async def summarise_group(llm: Any, nodes: Sequence[Any], lang: str, cache_dir: Path) -> List[str]:
     results = []
     for node in nodes:
         key = _sha(node)
         fp = cache_dir / f"{key}.md"
         if fp.exists():
             results.append(fp.read_text(encoding="utf-8"))
             continue
         prompt = f"Summarise in {lang}:\n\n{node.text}"
-        text = asyncio.run(_complete(llm, prompt))
+        text = await _complete(llm, prompt)
         fp.write_text(text, encoding="utf-8")
         results.append(text)
     return results

Then update the main function to use asyncio.run at the top level:

-    for g in tqdm(groups):
-        summaries.extend(summarise_group(llm, g, "English", cache_dir))
+    async def process_groups():
+        for g in tqdm(groups):
+            summaries.extend(await summarise_group(llm, g, "English", cache_dir))
+    
+    asyncio.run(process_groups())
📝 Committable suggestion

Suggested change
# Dev/src/study_tools/summarize.py
# ... other imports ...
import asyncio
from pathlib import Path
from typing import Any, Sequence, List

from tqdm import tqdm

# make the helper fully async
async def summarise_group(
    llm: Any,
    nodes: Sequence[Any],
    lang: str,
    cache_dir: Path,
) -> List[str]:
    results: List[str] = []
    for node in nodes:
        key = _sha(node)
        fp = cache_dir / f"{key}.md"
        if fp.exists():
            results.append(fp.read_text(encoding="utf-8"))
            continue
        prompt = f"Summarise in {lang}:\n\n{node.text}"
        text = await _complete(llm, prompt)
        fp.write_text(text, encoding="utf-8")
        results.append(text)
    return results

def main():
    # ... setup llm, groups, cache_dir, summaries = [] ...

    # run all groups under one top-level event loop
    async def process_groups():
        for g in tqdm(groups):
            summaries.extend(await summarise_group(llm, g, "English", cache_dir))

    asyncio.run(process_groups())
    # ... rest of main ...
🤖 Prompt for AI Agents
In Dev/src/study_tools/summarize.py at line 40, avoid calling asyncio.run inside
a synchronous function as it can cause issues if called from an async context.
Instead, convert the entire function containing this line to be async, replacing
the direct call to asyncio.run with an await expression. Then, ensure that
asyncio.run is only called once at the top-level entry point of the program to
execute the main async function.
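A runnable miniature of the pattern this comment describes — await inside the coroutine, one asyncio.run at the entry point. The `_complete` stub here is a stand-in for the project's LLM call, not its real implementation:

```python
import asyncio

async def _complete(llm, prompt: str) -> str:
    # Stand-in for the real awaitable LLM call (hypothetical behaviour).
    await asyncio.sleep(0)
    return f"summary of: {prompt}"

async def summarise_all(llm, prompts: list[str]) -> list[str]:
    # Await each call inside the coroutine rather than calling asyncio.run()
    # per item; asyncio.run() raises RuntimeError if a loop is already running.
    return [await _complete(llm, p) for p in prompts]

def main() -> list[str]:
    # Exactly one top-level asyncio.run() for the whole batch.
    return asyncio.run(summarise_all(None, ["alpha", "beta"]))
```

The same structure also leaves the door open to `asyncio.gather` later if the per-node calls should run concurrently.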
