feat: Add periodic index synchronization feature [Opus5.1] #403
base: main
Conversation
This commit introduces a comprehensive background sync system for automatic index updates:

- Add sync_scheduler.py with SyncScheduler, SyncMetadataStore, and IndexSyncManager classes for managing periodic sync operations
- Implement metadata tracking for sync status, commit hashes, and timing with persistent JSON storage
- Add incremental sync logic that detects repository changes via git fetch/diff and only re-indexes when updates are detected
- Add REST API endpoints for sync management:
  - POST/GET /api/sync/projects - Add/list sync projects
  - GET/PUT/DELETE /api/sync/projects/{repo_type}/{owner}/{repo}
  - POST .../trigger - Manual sync trigger
  - GET .../check - Check for updates without syncing
  - GET /api/sync/status - Overall scheduler status
- Integrate scheduler with FastAPI lifespan for automatic start/stop on application startup/shutdown
- Add DEEPWIKI_SYNC_ENABLED env var to disable sync if needed
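The DEEPWIKI_SYNC_ENABLED opt-out described above could be parsed along these lines. This is a sketch only; the helper name and the set of accepted "off" values are assumptions, not the PR's actual code:

```python
import os


def sync_enabled(env=None) -> bool:
    """Return True unless DEEPWIKI_SYNC_ENABLED is explicitly turned off.

    The PR describes the variable as an opt-out, so the default is
    enabled; common falsy spellings disable the scheduler.
    """
    env = os.environ if env is None else env
    value = str(env.get("DEEPWIKI_SYNC_ENABLED", "true")).strip().lower()
    return value not in ("false", "0", "no", "off")
```

The FastAPI lifespan hook would then call start_scheduler() only when this check passes.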
Summary of Changes

Hello @sashimikun, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a robust, automated system for keeping repository indexes up to date. By periodically checking configured repositories for changes and intelligently re-indexing only when necessary, it keeps the application's knowledge base current without manual intervention, improving data accuracy and operational efficiency.
Code Review
This pull request introduces a significant and valuable feature: periodic index synchronization. The implementation is comprehensive, covering metadata storage, a background scheduler, and API endpoints for management. The use of FastAPI's lifespan for scheduler management is a great choice. However, I've identified several critical and high-severity issues that need to be addressed. These include a major security vulnerability regarding the storage of access tokens, a critical bug that will cause file-system collisions, and several instances of blocking calls in async code that will severely degrade server performance. Please review the detailed comments for specifics and suggestions.
```python
root_path = get_adalflow_default_root_path()
repo_path = os.path.join(root_path, "repos", metadata.repo)
db_path = os.path.join(root_path, "databases", f"{metadata.repo}.pkl")
```
Similar to the repo_path, the db_path is constructed using only metadata.repo. This will cause database file collisions for repositories with the same name but from different owners or providers. The database filename should be made unique to avoid data corruption.
Suggested change:

```diff
- db_path = os.path.join(root_path, "databases", f"{metadata.repo}.pkl")
+ db_path = os.path.join(root_path, "databases", f"{metadata.repo_type}_{metadata.owner}_{metadata.repo}.pkl")
```
```python
        Dict with 'has_updates', 'remote_commit', 'changed_files' keys
    """
    root_path = get_adalflow_default_root_path()
    repo_path = os.path.join(root_path, "repos", metadata.repo)
```
The repo_path is constructed using only metadata.repo. This will cause file-system collisions if two different projects (e.g., from different owners or repo providers) have the same repository name. The path should be made unique using repo_type and owner as well.
Suggested change:

```diff
- repo_path = os.path.join(root_path, "repos", metadata.repo)
+ repo_path = os.path.join(root_path, "repos", metadata.repo_type, metadata.owner, metadata.repo)
```
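To see the collision concretely (the repository names and owners below are made up for illustration):

```python
import os

root = "/tmp/adalflow"  # stand-in for get_adalflow_default_root_path()

# Flat layout from the original code: only the repo name is used, so
# github/alice/docs and gitlab/bob/docs map to the same directory.
flat_a = os.path.join(root, "repos", "docs")
flat_b = os.path.join(root, "repos", "docs")

# Nested layout from the suggestion: repo_type/owner/repo stays unique.
nested_a = os.path.join(root, "repos", "github", "alice", "docs")
nested_b = os.path.join(root, "repos", "gitlab", "bob", "docs")
```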
```python
enabled: bool = True
created_at: Optional[str] = None
updated_at: Optional[str] = None
access_token: Optional[str] = None  # Stored securely, not exposed in API
```
Storing the access_token in plaintext in a JSON file is a major security vulnerability. If this file is compromised, the access token can be used to access private repositories. You should use a secure secret management system (like HashiCorp Vault, AWS Secrets Manager, etc.) or at least encrypt the token before storing it on disk.
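One stdlib-only mitigation, short of a full secrets manager, is to never persist the token at all and store only a pointer to where it lives. The field name, env-var name, and helpers below are illustrative assumptions, not part of the PR:

```python
import json
import os
from typing import Optional


def serialize_metadata(meta: dict) -> str:
    """Persist sync metadata without the raw token.

    Only the name of the environment variable (or secret-manager key)
    that holds the token is written to disk.
    """
    safe = {k: v for k, v in meta.items() if k != "access_token"}
    safe.setdefault("access_token_env", "DEEPWIKI_SYNC_TOKEN")
    return json.dumps(safe)


def load_token(meta: dict) -> Optional[str]:
    """Resolve the token at runtime instead of reading it from JSON."""
    return os.environ.get(meta.get("access_token_env", "DEEPWIKI_SYNC_TOKEN"))
```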
```python
result = subprocess.run(
    ["git", "fetch", "origin"],
    cwd=repo_path,
    capture_output=True,
    text=True,
    timeout=60
)
```
This subprocess.run call is blocking. Since this method can be called from an async context, it will block the event loop, severely impacting performance. It should be run in a thread pool using asyncio.to_thread. This applies to all other subprocess.run calls in this class.
Suggested change:

```diff
- result = subprocess.run(
-     ["git", "fetch", "origin"],
-     cwd=repo_path,
-     capture_output=True,
-     text=True,
-     timeout=60
- )
+ result = await asyncio.to_thread(
+     subprocess.run,
+     ["git", "fetch", "origin"],
+     cwd=repo_path,
+     capture_output=True,
+     text=True,
+     timeout=60
+ )
```
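The asyncio.to_thread pattern from the suggestion can be exercised end to end. This self-contained sketch swaps the git command for sys.executable so it runs anywhere:

```python
import asyncio
import subprocess
import sys


async def run_blocking_command() -> str:
    # subprocess.run is synchronous; asyncio.to_thread executes it on a
    # worker thread so the event loop keeps servicing other coroutines.
    result = await asyncio.to_thread(
        subprocess.run,
        [sys.executable, "-c", "print('ok')"],
        capture_output=True,
        text=True,
        timeout=60,
    )
    return result.stdout.strip()


print(asyncio.run(run_blocking_command()))  # prints "ok"
```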
```python
try:
    # Check for updates
    update_info = self.check_for_updates(metadata)
```
check_for_updates is a blocking method due to its use of subprocess.run. Calling it directly in an async function blocks the event loop. It should be run in a thread pool. The same applies to other blocking calls in this function like _pull_latest_changes (line 383), download_repo (line 389), read_all_documents (line 399), and _get_local_commit_hash (line 420).
Suggested change:

```diff
- update_info = self.check_for_updates(metadata)
+ update_info = await asyncio.to_thread(self.check_for_updates, metadata)
```
```python
repo_url: str
owner: str
repo: str
repo_type: str  # github, gitlab, bitbucket
```
For better type safety within the application, it's recommended to use typing.Literal for the repo_type field. This will help prevent bugs caused by incorrect repository type strings.
Suggested change:

```diff
- repo_type: str  # github, gitlab, bitbucket
+ repo_type: Literal["github", "gitlab", "bitbucket"]
```
```python
from api.sync_scheduler import (
    get_scheduler, start_scheduler, stop_scheduler,
    SyncStatus, SyncMetadata
)
```
```python
repo_url: str = Field(..., description="Full URL of the repository")
owner: str = Field(..., description="Repository owner/organization")
repo: str = Field(..., description="Repository name")
repo_type: str = Field(default="github", description="Type of repository (github, gitlab, bitbucket)")
```
For better validation and API documentation, it's recommended to use typing.Literal for the repo_type field. This will ensure only valid repository types are accepted and will be reflected in the OpenAPI schema, improving robustness.
Suggested change:

```diff
- repo_type: str = Field(default="github", description="Type of repository (github, gitlab, bitbucket)")
+ repo_type: Literal["github", "gitlab", "bitbucket"] = Field(default="github", description="Type of repository (github, gitlab, bitbucket)")
```
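A reduced stand-in model (other fields omitted) shows the validation this buys, assuming pydantic is available, which it is here since the request models already use Field. Invalid repo_type values are rejected at parse time:

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class SyncProjectRequest(BaseModel):
    # Minimal stand-in for the PR's request model; other fields omitted.
    repo: str
    repo_type: Literal["github", "gitlab", "bitbucket"] = Field(default="github")


ok = SyncProjectRequest(repo="deepwiki", repo_type="gitlab")

try:
    SyncProjectRequest(repo="deepwiki", repo_type="svn")
except ValidationError:
    rejected = True  # invalid repo_type is caught before any handler runs
else:
    rejected = False
```

The Literal also shows up in the generated OpenAPI schema, so API consumers see the allowed values directly.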
```python
self.check_interval = check_interval_seconds
self.metadata_store = SyncMetadataStore()
self.sync_manager = IndexSyncManager(self.metadata_store)
self._running = False
```
To avoid other parts of the application accessing the 'private' _running attribute directly, you should expose it via a public property. This improves encapsulation. Please add the following property to the SyncScheduler class, for instance after the __init__ method:

```python
@property
def is_running(self) -> bool:
    """Returns True if the scheduler is running."""
    return self._running
```