
Conversation

@sashimikun

This commit introduces a comprehensive background sync system for automatic index updates:

  • Add sync_scheduler.py with SyncScheduler, SyncMetadataStore, and IndexSyncManager classes for managing periodic sync operations
  • Implement metadata tracking for sync status, commit hashes, and timing with persistent JSON storage
  • Add incremental sync logic that detects repository changes via git fetch/diff and only re-indexes when updates are detected
  • Add REST API endpoints for sync management:
    • POST/GET /api/sync/projects - Add/list sync projects
    • GET/PUT/DELETE /api/sync/projects/{repo_type}/{owner}/{repo}
    • POST .../trigger - Manual sync trigger
    • GET .../check - Check for updates without syncing
    • GET /api/sync/status - Overall scheduler status
  • Integrate scheduler with FastAPI lifespan for automatic start/stop on application startup/shutdown
  • Add DEEPWIKI_SYNC_ENABLED env var to disable sync if needed

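The "metadata tracking … with persistent JSON storage" bullet can be sketched roughly as follows. This is an illustrative sketch only: `SyncRecord`, `JsonMetadataStore`, and their fields are assumed names, not the PR's actual `SyncMetadata`/`SyncMetadataStore` code.

```python
import json
import os
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Dict, Optional

# Hypothetical shape of one sync record; the PR's actual model may differ.
@dataclass
class SyncRecord:
    owner: str
    repo: str
    repo_type: str = "github"
    last_commit: Optional[str] = None
    last_synced_at: Optional[str] = None
    status: str = "idle"

class JsonMetadataStore:
    """Persist sync metadata for all tracked projects in one JSON file."""

    def __init__(self, path: str):
        self.path = path
        self._records: Dict[str, SyncRecord] = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self._records = {k: SyncRecord(**v) for k, v in json.load(f).items()}

    def _key(self, rec: SyncRecord) -> str:
        # Key on repo_type/owner/repo so same-named repos don't collide.
        return f"{rec.repo_type}/{rec.owner}/{rec.repo}"

    def save(self, rec: SyncRecord) -> None:
        rec.last_synced_at = datetime.now(timezone.utc).isoformat()
        self._records[self._key(rec)] = rec
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump({k: asdict(v) for k, v in self._records.items()}, f, indent=2)

    def get(self, repo_type: str, owner: str, repo: str) -> Optional[SyncRecord]:
        return self._records.get(f"{repo_type}/{owner}/{repo}")
```

The composite key is also relevant to the collision issues the reviewer raises below: keying records (and any derived file paths) on `repo_type/owner/repo` rather than `repo` alone avoids clashes between same-named repositories.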
@gemini-code-assist
Contributor

Summary of Changes

Hello @sashimikun, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust and automated system for keeping repository indexes up-to-date. By periodically checking for changes in configured repositories and intelligently re-indexing only when necessary, it significantly enhances the application's ability to maintain a current knowledge base without manual intervention. This feature improves data accuracy and operational efficiency by ensuring that the system always reflects the latest state of the tracked codebases.

Highlights

  • Periodic Index Synchronization: Introduces a comprehensive background synchronization system to automatically update repository indexes, ensuring data freshness.
  • Core Components: Adds sync_scheduler.py containing SyncScheduler for managing operations, SyncMetadataStore for persistent metadata tracking, and IndexSyncManager for handling actual sync logic.
  • Incremental Sync Logic: Implements intelligent incremental synchronization that detects repository changes via git fetch/diff and only re-indexes when updates are identified.
  • New REST API Endpoints: Exposes a suite of new API endpoints under /api/sync/ for managing sync projects (add, list, get, update, delete), manually triggering syncs, checking for updates, and retrieving overall scheduler status.
  • FastAPI Lifespan Integration: Integrates the sync scheduler with FastAPI's lifespan context manager, ensuring the scheduler starts automatically on application startup and stops gracefully on shutdown.
  • Configurable Sync: Adds a DEEPWIKI_SYNC_ENABLED environment variable, allowing the periodic synchronization feature to be easily disabled if not required.
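The lifespan integration and the `DEEPWIKI_SYNC_ENABLED` gate from the highlights above follow a standard pattern: FastAPI's lifespan hook is just an async context manager. Stripped of FastAPI itself, the start/stop wiring looks roughly like this (the `Scheduler` class is a minimal stand-in for the PR's `SyncScheduler`, not its real code):

```python
import asyncio
import os
from contextlib import asynccontextmanager

class Scheduler:
    """Minimal stand-in for the PR's SyncScheduler (names assumed)."""

    def __init__(self, interval: float = 0.01):
        self.interval = interval
        self._task = None
        self.ticks = 0

    async def _loop(self):
        # Periodic check loop; the real scheduler would call the sync
        # manager here instead of counting ticks.
        while True:
            self.ticks += 1
            await asyncio.sleep(self.interval)

    def start(self):
        self._task = asyncio.get_running_loop().create_task(self._loop())

    async def stop(self):
        if self._task:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None

@asynccontextmanager
async def lifespan(app=None):
    # Gate the scheduler on an env var, mirroring DEEPWIKI_SYNC_ENABLED.
    scheduler = Scheduler()
    enabled = os.getenv("DEEPWIKI_SYNC_ENABLED", "true").lower() != "false"
    if enabled:
        scheduler.start()
    try:
        yield scheduler  # application serves requests here
    finally:
        if enabled:
            await scheduler.stop()  # graceful shutdown
```

In a real app this `lifespan` function would be passed as `FastAPI(lifespan=lifespan)`, so the background task starts exactly once per process and is cancelled cleanly on shutdown.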


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and valuable feature: periodic index synchronization. The implementation is comprehensive, covering metadata storage, a background scheduler, and API endpoints for management. The use of FastAPI's lifespan for scheduler management is a great choice. However, I've identified several critical and high-severity issues that need to be addressed. These include a major security vulnerability regarding the storage of access tokens, a critical bug that will cause file-system collisions, and several instances of blocking calls in async code that will severely degrade server performance. Please review the detailed comments for specifics and suggestions.


root_path = get_adalflow_default_root_path()
repo_path = os.path.join(root_path, "repos", metadata.repo)
db_path = os.path.join(root_path, "databases", f"{metadata.repo}.pkl")

critical

Similar to the repo_path, the db_path is constructed using only metadata.repo. This will cause database file collisions for repositories with the same name but from different owners or providers. The database filename should be made unique to avoid data corruption.

Suggested change
db_path = os.path.join(root_path, "databases", f"{metadata.repo}.pkl")
db_path = os.path.join(root_path, "databases", f"{metadata.repo_type}_{metadata.owner}_{metadata.repo}.pkl")

Dict with 'has_updates', 'remote_commit', 'changed_files' keys
"""
root_path = get_adalflow_default_root_path()
repo_path = os.path.join(root_path, "repos", metadata.repo)

critical

The repo_path is constructed using only metadata.repo. This will cause file-system collisions if two different projects (e.g., from different owners or repo providers) have the same repository name. The path should be made unique using repo_type and owner as well.

Suggested change
repo_path = os.path.join(root_path, "repos", metadata.repo)
repo_path = os.path.join(root_path, "repos", metadata.repo_type, metadata.owner, metadata.repo)

enabled: bool = True
created_at: Optional[str] = None
updated_at: Optional[str] = None
access_token: Optional[str] = None # Stored securely, not exposed in API

critical

Storing the access_token in plaintext in a JSON file is a major security vulnerability. If this file is compromised, the access token can be used to access private repositories. You should use a secure secret management system (like HashiCorp Vault, AWS Secrets Manager, etc.) or at least encrypt the token before storing it on disk.
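One lightweight alternative to the secret managers named above is to persist only a *reference* to the secret (e.g. an environment-variable name) in the JSON metadata, and resolve the real token at runtime. This is a sketch of that mitigation, not the PR's code; `access_token_env` and `resolve_token` are illustrative names:

```python
import os
from typing import Optional

def resolve_token(metadata: dict) -> Optional[str]:
    """Resolve the access token from the environment instead of disk.

    The metadata file stores only the env-var NAME (harmless if leaked);
    the secret itself lives in the process environment or a secret store.
    """
    env_name = metadata.get("access_token_env")
    if not env_name:
        return None
    return os.environ.get(env_name)
```

This keeps the on-disk file free of credentials entirely; full encryption-at-rest (e.g. via a KMS-managed key) remains the more robust option for multi-host deployments.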

Comment on lines +217 to +223
result = subprocess.run(
["git", "fetch", "origin"],
cwd=repo_path,
capture_output=True,
text=True,
timeout=60
)

high

This subprocess.run call is blocking. Since this method can be called from an async context, it will block the event loop, severely impacting performance. It should be run in a thread pool using asyncio.to_thread. This applies to all other subprocess.run calls in this class.

Suggested change
result = subprocess.run(
["git", "fetch", "origin"],
cwd=repo_path,
capture_output=True,
text=True,
timeout=60
)
result = await asyncio.to_thread(
subprocess.run,
["git", "fetch", "origin"],
cwd=repo_path,
capture_output=True,
text=True,
timeout=60
)
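The suggested fix generalizes to every blocking `subprocess.run` call in the class, so it may be cleaner to wrap the pattern once. A self-contained sketch (`run_blocking` is an illustrative helper name, not the PR's API; `asyncio.to_thread` requires Python 3.9+):

```python
import asyncio
import subprocess
import sys

async def run_blocking(cmd, cwd=None, timeout=60):
    """Run a blocking subprocess off the event loop.

    asyncio.to_thread hands the call to a worker thread, so the event
    loop stays free to serve other requests while git runs.
    """
    return await asyncio.to_thread(
        subprocess.run,
        cmd,
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=timeout,
    )

# Usage inside the sync manager would then look like:
#   result = await run_blocking(["git", "fetch", "origin"], cwd=repo_path)
```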


try:
# Check for updates
update_info = self.check_for_updates(metadata)

high

check_for_updates is a blocking method due to its use of subprocess.run. Calling it directly in an async function blocks the event loop. It should be run in a thread pool. The same applies to other blocking calls in this function like _pull_latest_changes (line 383), download_repo (line 389), read_all_documents (line 399), and _get_local_commit_hash (line 420).

Suggested change
update_info = self.check_for_updates(metadata)
update_info = await asyncio.to_thread(self.check_for_updates, metadata)

repo_url: str
owner: str
repo: str
repo_type: str # github, gitlab, bitbucket

medium

For better type safety within the application, it's recommended to use typing.Literal for the repo_type field. This will help prevent bugs caused by incorrect repository type strings.

Suggested change
repo_type: str # github, gitlab, bitbucket
repo_type: Literal["github", "gitlab", "bitbucket"] # github, gitlab, bitbucket
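Beyond static type checking, the `Literal` annotation can also drive runtime validation (Pydantic does this automatically for `Literal`-annotated fields; in plain dataclasses you would check it yourself). A small stdlib-only sketch, with `validate_repo_type` as an assumed helper name:

```python
from typing import Literal, get_args

RepoType = Literal["github", "gitlab", "bitbucket"]

def validate_repo_type(value: str) -> RepoType:
    """Reject strings outside the Literal's allowed values at runtime."""
    allowed = get_args(RepoType)  # ("github", "gitlab", "bitbucket")
    if value not in allowed:
        raise ValueError(f"repo_type must be one of {allowed}, got {value!r}")
    return value  # type: ignore[return-value]
```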

Comment on lines +165 to +168
from api.sync_scheduler import (
get_scheduler, start_scheduler, stop_scheduler,
SyncStatus, SyncMetadata
)

medium

There are several unused imports here. start_scheduler and stop_scheduler are imported locally within the lifespan function, and SyncStatus and SyncMetadata are not used in this file. Removing them will clean up the code and avoid potential confusion.

repo_url: str = Field(..., description="Full URL of the repository")
owner: str = Field(..., description="Repository owner/organization")
repo: str = Field(..., description="Repository name")
repo_type: str = Field(default="github", description="Type of repository (github, gitlab, bitbucket)")

medium

For better validation and API documentation, it's recommended to use typing.Literal for the repo_type field. This will ensure only valid repository types are accepted and will be reflected in the OpenAPI schema, improving robustness.

Suggested change
repo_type: str = Field(default="github", description="Type of repository (github, gitlab, bitbucket)")
repo_type: Literal["github", "gitlab", "bitbucket"] = Field(default="github", description="Type of repository (github, gitlab, bitbucket)")

self.check_interval = check_interval_seconds
self.metadata_store = SyncMetadataStore()
self.sync_manager = IndexSyncManager(self.metadata_store)
self._running = False

medium

To avoid other parts of the application accessing the 'private' _running attribute directly, you should expose it via a public property. This improves encapsulation. Please add the following property to the SyncScheduler class, for instance after the __init__ method:

@property
def is_running(self) -> bool:
    """Returns True if the scheduler is running."""
    return self._running

self.sync_manager = IndexSyncManager(self.metadata_store)
self._running = False
self._task: Optional[asyncio.Task] = None
self._manual_sync_queue: asyncio.Queue = asyncio.Queue()

medium

The _manual_sync_queue is initialized here and checked in the scheduler loop, but it's never populated. The trigger_sync method calls sync_manager.sync_project directly. This appears to be unused or incomplete logic and should either be fully implemented or removed to avoid confusion.
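If the queue is kept rather than removed, the completed wiring might look roughly like this: `trigger_sync` enqueues instead of syncing directly, making the scheduler loop the single sync worker and serializing manual and periodic syncs. A hedged sketch (`QueueBackedScheduler` and its methods are assumed names, not the PR's code):

```python
import asyncio

class QueueBackedScheduler:
    """Sketch of routing manual sync requests through the queue."""

    def __init__(self):
        self._queue: asyncio.Queue = asyncio.Queue()
        self.synced = []  # record of processed projects, for illustration

    async def trigger_sync(self, project_key: str):
        # Instead of calling sync_manager.sync_project directly, enqueue
        # the request so only the scheduler loop performs syncs.
        await self._queue.put(project_key)

    async def run_once(self):
        # The real loop would race the queue against the periodic timer;
        # here we simply drain whatever is pending.
        while not self._queue.empty():
            key = self._queue.get_nowait()
            self.synced.append(key)  # stand-in for sync_manager.sync_project
            self._queue.task_done()
```

Whether this is worth the extra machinery depends on whether concurrent syncs of the same project are actually possible; if not, deleting the queue is the simpler fix.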
