
Conversation

@sashimikun

This commit introduces a comprehensive background sync system for automatic index updates:

  • Add sync_scheduler.py with SyncScheduler, SyncMetadataStore, and IndexSyncManager classes for managing periodic sync operations
  • Implement metadata tracking for sync status, commit hashes, and timing with persistent JSON storage
  • Add incremental sync logic that detects repository changes via git fetch/diff and only re-indexes when updates are detected
  • Add REST API endpoints for sync management:
    • POST/GET /api/sync/projects - Add/list sync projects
    • GET/PUT/DELETE /api/sync/projects/{repo_type}/{owner}/{repo}
    • POST .../trigger - Manual sync trigger
    • GET .../check - Check for updates without syncing
    • GET /api/sync/status - Overall scheduler status
  • Integrate scheduler with FastAPI lifespan for automatic start/stop on application startup/shutdown
  • Add DEEPWIKI_SYNC_ENABLED env var to disable sync if needed

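The "metadata tracking … with persistent JSON storage" bullet can be sketched roughly as follows. This is an illustrative sketch only: `SyncRecord`, `JsonMetadataStore`, and their fields are assumed names, not the PR's actual `SyncMetadata`/`SyncMetadataStore` code.

```python
import json
import os
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Dict, Optional

# Hypothetical shape of one sync record; the PR's actual model may differ.
@dataclass
class SyncRecord:
    owner: str
    repo: str
    repo_type: str = "github"
    last_commit: Optional[str] = None
    last_synced_at: Optional[str] = None
    status: str = "idle"

class JsonMetadataStore:
    """Persist sync metadata for all tracked projects in one JSON file."""

    def __init__(self, path: str):
        self.path = path
        self._records: Dict[str, SyncRecord] = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self._records = {k: SyncRecord(**v) for k, v in json.load(f).items()}

    def _key(self, rec: SyncRecord) -> str:
        # Key on repo_type/owner/repo so same-named repos don't collide.
        return f"{rec.repo_type}/{rec.owner}/{rec.repo}"

    def save(self, rec: SyncRecord) -> None:
        rec.last_synced_at = datetime.now(timezone.utc).isoformat()
        self._records[self._key(rec)] = rec
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump({k: asdict(v) for k, v in self._records.items()}, f, indent=2)

    def get(self, repo_type: str, owner: str, repo: str) -> Optional[SyncRecord]:
        return self._records.get(f"{repo_type}/{owner}/{repo}")
```

The composite key is also relevant to the collision issues the reviewer raises below: keying records (and any derived file paths) on `repo_type/owner/repo` rather than `repo` alone avoids clashes between same-named repositories.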
@gemini-code-assist
Contributor

Summary of Changes

Hello @sashimikun, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust and automated system for keeping repository indexes up-to-date. By periodically checking for changes in configured repositories and intelligently re-indexing only when necessary, it significantly enhances the application's ability to maintain a current knowledge base without manual intervention. This feature improves data accuracy and operational efficiency by ensuring that the system always reflects the latest state of the tracked codebases.

Highlights

  • Periodic Index Synchronization: Introduces a comprehensive background synchronization system to automatically update repository indexes, ensuring data freshness.
  • Core Components: Adds sync_scheduler.py containing SyncScheduler for managing operations, SyncMetadataStore for persistent metadata tracking, and IndexSyncManager for handling actual sync logic.
  • Incremental Sync Logic: Implements intelligent incremental synchronization that detects repository changes via git fetch/diff and only re-indexes when updates are identified.
  • New REST API Endpoints: Exposes a suite of new API endpoints under /api/sync/ for managing sync projects (add, list, get, update, delete), manually triggering syncs, checking for updates, and retrieving overall scheduler status.
  • FastAPI Lifespan Integration: Integrates the sync scheduler with FastAPI's lifespan context manager, ensuring the scheduler starts automatically on application startup and stops gracefully on shutdown.
  • Configurable Sync: Adds a DEEPWIKI_SYNC_ENABLED environment variable, allowing the periodic synchronization feature to be easily disabled if not required.
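The lifespan integration and the `DEEPWIKI_SYNC_ENABLED` gate from the highlights above follow a standard pattern: FastAPI's lifespan hook is just an async context manager. Stripped of FastAPI itself, the start/stop wiring looks roughly like this (the `Scheduler` class is a minimal stand-in for the PR's `SyncScheduler`, not its real code):

```python
import asyncio
import os
from contextlib import asynccontextmanager

class Scheduler:
    """Minimal stand-in for the PR's SyncScheduler (names assumed)."""

    def __init__(self, interval: float = 0.01):
        self.interval = interval
        self._task = None
        self.ticks = 0

    async def _loop(self):
        # Periodic check loop; the real scheduler would call the sync
        # manager here instead of counting ticks.
        while True:
            self.ticks += 1
            await asyncio.sleep(self.interval)

    def start(self):
        self._task = asyncio.get_running_loop().create_task(self._loop())

    async def stop(self):
        if self._task:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None

@asynccontextmanager
async def lifespan(app=None):
    # Gate the scheduler on an env var, mirroring DEEPWIKI_SYNC_ENABLED.
    scheduler = Scheduler()
    enabled = os.getenv("DEEPWIKI_SYNC_ENABLED", "true").lower() != "false"
    if enabled:
        scheduler.start()
    try:
        yield scheduler  # application serves requests here
    finally:
        if enabled:
            await scheduler.stop()  # graceful shutdown
```

In a real app this `lifespan` function would be passed as `FastAPI(lifespan=lifespan)`, so the background task starts exactly once per process and is cancelled cleanly on shutdown.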


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and valuable feature: periodic index synchronization. The implementation is comprehensive, covering metadata storage, a background scheduler, and API endpoints for management. The use of FastAPI's lifespan for scheduler management is a great choice. However, I've identified several critical and high-severity issues that need to be addressed. These include a major security vulnerability regarding the storage of access tokens, a critical bug that will cause file-system collisions, and several instances of blocking calls in async code that will severely degrade server performance. Please review the detailed comments for specifics and suggestions.


root_path = get_adalflow_default_root_path()
repo_path = os.path.join(root_path, "repos", metadata.repo)
db_path = os.path.join(root_path, "databases", f"{metadata.repo}.pkl")

critical

Similar to the repo_path, the db_path is constructed using only metadata.repo. This will cause database file collisions for repositories with the same name but from different owners or providers. The database filename should be made unique to avoid data corruption.

Suggested change
db_path = os.path.join(root_path, "databases", f"{metadata.repo}.pkl")
db_path = os.path.join(root_path, "databases", f"{metadata.repo_type}_{metadata.owner}_{metadata.repo}.pkl")

Dict with 'has_updates', 'remote_commit', 'changed_files' keys
"""
root_path = get_adalflow_default_root_path()
repo_path = os.path.join(root_path, "repos", metadata.repo)

critical

The repo_path is constructed using only metadata.repo. This will cause file-system collisions if two different projects (e.g., from different owners or repo providers) have the same repository name. The path should be made unique using repo_type and owner as well.

Suggested change
repo_path = os.path.join(root_path, "repos", metadata.repo)
repo_path = os.path.join(root_path, "repos", metadata.repo_type, metadata.owner, metadata.repo)

enabled: bool = True
created_at: Optional[str] = None
updated_at: Optional[str] = None
access_token: Optional[str] = None # Stored securely, not exposed in API

critical

Storing the access_token in plaintext in a JSON file is a major security vulnerability. If this file is compromised, the access token can be used to access private repositories. You should use a secure secret management system (like HashiCorp Vault, AWS Secrets Manager, etc.) or at least encrypt the token before storing it on disk.
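One lightweight alternative to the secret managers named above is to persist only a *reference* to the secret (e.g. an environment-variable name) in the JSON metadata, and resolve the real token at runtime. This is a sketch of that mitigation, not the PR's code; `access_token_env` and `resolve_token` are illustrative names:

```python
import os
from typing import Optional

def resolve_token(metadata: dict) -> Optional[str]:
    """Resolve the access token from the environment instead of disk.

    The metadata file stores only the env-var NAME (harmless if leaked);
    the secret itself lives in the process environment or a secret store.
    """
    env_name = metadata.get("access_token_env")
    if not env_name:
        return None
    return os.environ.get(env_name)
```

This keeps the on-disk file free of credentials entirely; full encryption-at-rest (e.g. via a KMS-managed key) remains the more robust option for multi-host deployments.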

Comment on lines +217 to +223
result = subprocess.run(
["git", "fetch", "origin"],
cwd=repo_path,
capture_output=True,
text=True,
timeout=60
)

high

This subprocess.run call is blocking. Since this method can be called from an async context, it will block the event loop, severely impacting performance. It should be run in a thread pool using asyncio.to_thread. This applies to all other subprocess.run calls in this class.

Suggested change
result = subprocess.run(
["git", "fetch", "origin"],
cwd=repo_path,
capture_output=True,
text=True,
timeout=60
)
result = await asyncio.to_thread(
subprocess.run,
["git", "fetch", "origin"],
cwd=repo_path,
capture_output=True,
text=True,
timeout=60
)
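The suggested fix generalizes to every blocking `subprocess.run` call in the class, so it may be cleaner to wrap the pattern once. A self-contained sketch (`run_blocking` is an illustrative helper name, not the PR's API; `asyncio.to_thread` requires Python 3.9+):

```python
import asyncio
import subprocess
import sys

async def run_blocking(cmd, cwd=None, timeout=60):
    """Run a blocking subprocess off the event loop.

    asyncio.to_thread hands the call to a worker thread, so the event
    loop stays free to serve other requests while git runs.
    """
    return await asyncio.to_thread(
        subprocess.run,
        cmd,
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=timeout,
    )

# Usage inside the sync manager would then look like:
#   result = await run_blocking(["git", "fetch", "origin"], cwd=repo_path)
```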


try:
# Check for updates
update_info = self.check_for_updates(metadata)

high

check_for_updates is a blocking method due to its use of subprocess.run. Calling it directly in an async function blocks the event loop. It should be run in a thread pool. The same applies to other blocking calls in this function like _pull_latest_changes (line 383), download_repo (line 389), read_all_documents (line 399), and _get_local_commit_hash (line 420).

Suggested change
update_info = self.check_for_updates(metadata)
update_info = await asyncio.to_thread(self.check_for_updates, metadata)

repo_url: str
owner: str
repo: str
repo_type: str # github, gitlab, bitbucket

medium

For better type safety within the application, it's recommended to use typing.Literal for the repo_type field. This will help prevent bugs caused by incorrect repository type strings.

Suggested change
repo_type: str # github, gitlab, bitbucket
repo_type: Literal["github", "gitlab", "bitbucket"] # github, gitlab, bitbucket
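Beyond static type checking, the `Literal` annotation can also drive runtime validation (Pydantic does this automatically for `Literal`-annotated fields; in plain dataclasses you would check it yourself). A small stdlib-only sketch, with `validate_repo_type` as an assumed helper name:

```python
from typing import Literal, get_args

RepoType = Literal["github", "gitlab", "bitbucket"]

def validate_repo_type(value: str) -> RepoType:
    """Reject strings outside the Literal's allowed values at runtime."""
    allowed = get_args(RepoType)  # ("github", "gitlab", "bitbucket")
    if value not in allowed:
        raise ValueError(f"repo_type must be one of {allowed}, got {value!r}")
    return value  # type: ignore[return-value]
```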

Comment on lines +165 to +168
from api.sync_scheduler import (
get_scheduler, start_scheduler, stop_scheduler,
SyncStatus, SyncMetadata
)

medium

There are several unused imports here. start_scheduler and stop_scheduler are imported locally within the lifespan function, and SyncStatus and SyncMetadata are not used in this file. Removing them will clean up the code and avoid potential confusion.

repo_url: str = Field(..., description="Full URL of the repository")
owner: str = Field(..., description="Repository owner/organization")
repo: str = Field(..., description="Repository name")
repo_type: str = Field(default="github", description="Type of repository (github, gitlab, bitbucket)")

medium

For better validation and API documentation, it's recommended to use typing.Literal for the repo_type field. This will ensure only valid repository types are accepted and will be reflected in the OpenAPI schema, improving robustness.

Suggested change
repo_type: str = Field(default="github", description="Type of repository (github, gitlab, bitbucket)")
repo_type: Literal["github", "gitlab", "bitbucket"] = Field(default="github", description="Type of repository (github, gitlab, bitbucket)")

self.check_interval = check_interval_seconds
self.metadata_store = SyncMetadataStore()
self.sync_manager = IndexSyncManager(self.metadata_store)
self._running = False

medium

To avoid other parts of the application accessing the 'private' _running attribute directly, you should expose it via a public property. This improves encapsulation. Please add the following property to the SyncScheduler class, for instance after the __init__ method:

@property
def is_running(self) -> bool:
    """Returns True if the scheduler is running."""
    return self._running

self.sync_manager = IndexSyncManager(self.metadata_store)
self._running = False
self._task: Optional[asyncio.Task] = None
self._manual_sync_queue: asyncio.Queue = asyncio.Queue()

medium

The _manual_sync_queue is initialized here and checked in the scheduler loop, but it's never populated. The trigger_sync method calls sync_manager.sync_project directly. This appears to be unused or incomplete logic and should either be fully implemented or removed to avoid confusion.
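If the queue is kept rather than removed, the completed wiring might look roughly like this: `trigger_sync` enqueues instead of syncing directly, making the scheduler loop the single sync worker and serializing manual and periodic syncs. A hedged sketch (`QueueBackedScheduler` and its methods are assumed names, not the PR's code):

```python
import asyncio

class QueueBackedScheduler:
    """Sketch of routing manual sync requests through the queue."""

    def __init__(self):
        self._queue: asyncio.Queue = asyncio.Queue()
        self.synced = []  # record of processed projects, for illustration

    async def trigger_sync(self, project_key: str):
        # Instead of calling sync_manager.sync_project directly, enqueue
        # the request so only the scheduler loop performs syncs.
        await self._queue.put(project_key)

    async def run_once(self):
        # The real loop would race the queue against the periodic timer;
        # here we simply drain whatever is pending.
        while not self._queue.empty():
            key = self._queue.get_nowait()
            self.synced.append(key)  # stand-in for sync_manager.sync_project
            self._queue.task_done()
```

Whether this is worth the extra machinery depends on whether concurrent syncs of the same project are actually possible; if not, deleting the queue is the simpler fix.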
