# Integrate harness.py functionality into codegen-on-oss package #14
Conversation
@CodiumAI-Agent /review
**Reviewer's Guide**

This pull request integrates harness.py functionality into the codegen-on-oss package by introducing new modules for codebase analysis.

**File-Level Changes**
@sourcery-ai review
**Walkthrough**

This update introduces a comprehensive codebase analysis and context management system for the Codegen on OSS package. It adds new modules for codebase analysis, snapshot management, and a FastAPI-based REST API server, along with corresponding CLI and example usage scripts. The README is rewritten to serve as a detailed user guide, and new dependencies for FastAPI, Pydantic, and Uvicorn are declared. The core functionalities now include repository analysis, agent execution, snapshot creation and retrieval, and RESTful API endpoints for these operations, with local and remote (bucket store) support for snapshot storage.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant FastAPI_Server
    participant Harness
    participant Snapshot
    User->>CLI: Run "serve" or example script
    CLI->>FastAPI_Server: Start server (if "serve")
    User->>FastAPI_Server: POST /analyze/repository (repo info)
    FastAPI_Server->>Harness: Create & analyze codebase
    Harness-->>FastAPI_Server: Analysis results
    FastAPI_Server-->>User: Return analysis results
    User->>FastAPI_Server: POST /snapshot/create (harness key)
    FastAPI_Server->>Snapshot: Create snapshot from harness
    Snapshot-->>FastAPI_Server: Snapshot ID
    FastAPI_Server-->>User: Return snapshot ID
    User->>FastAPI_Server: POST /agent/run (prompt, harness key)
    FastAPI_Server->>Harness: Run agent on codebase
    Harness-->>FastAPI_Server: Agent result, diff, edited files
    FastAPI_Server-->>User: Return agent run result
```
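For orientation, a client session against the endpoints in the diagram might look like the sketch below. The request fields (`repo_full_name`, `commit`, `language`, `prompt`) and the `owner/repo:latest` harness-key format appear elsewhere in this review; the host, port, and whether `harness_key` travels as a query parameter are assumptions.

```python
import requests

BASE = "http://localhost:8000"  # host and port are assumptions

# 1. Analyze a repository (creates a harness keyed as "owner/repo:latest")
resp = requests.post(f"{BASE}/analyze/repository", json={
    "repo_full_name": "owner/repo",
    "commit": None,
    "language": "python",
})
print(resp.json())

# 2. Create a snapshot from the active harness
resp = requests.post(f"{BASE}/snapshot/create",
                     params={"harness_key": "owner/repo:latest"})
snapshot_id = resp.json()

# 3. Run an agent with a prompt against the same harness
resp = requests.post(f"{BASE}/agent/run",
                     params={"harness_key": "owner/repo:latest"},
                     json={"prompt": "Summarize the repository structure"})
print(resp.json())
```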
@codecov-ai-reviewer review
@coderabbitai review
On it! We are reviewing the PR and will provide feedback shortly.

/gemini review
Hey! 👋 I see one of the checks failed. I am on it! 🫡

✅ Actions performed: Review triggered.
/review

/improve

/korbit-review
@codegen Implement and upgrade this PR with the above considerations and suggestions from @CodiumAI-Agent, @sourcery-ai, @codecov-ai-reviewer, @coderabbitai, /gemini, /korbit-review
Code Review
This PR introduces significant enhancements to the codegen-on-oss package, integrating codebase analysis, context management, and agent execution capabilities. The changes are well-structured and documented, providing a solid foundation for future development. However, there are a few areas that could benefit from further refinement to ensure robustness and maintainability.
Summary of Findings
- **Error Handling in API Endpoints**: The API endpoints in `context_server/server.py` use a generic `except Exception as e` block. While this catches all errors, it can mask specific issues and make debugging more difficult. Consider using more specific exception handling to provide better error messages and improve the server's reliability.
- **Snapshot Storage Consistency**: The `CodebaseContextSnapshot` class attempts to load snapshots from both remote (BucketStore) and local storage. If a snapshot exists in both locations, the code prioritizes loading from the BucketStore. Ensure that the snapshot data is consistent across both storage locations to prevent unexpected behavior.
- **File Extension Handling**: The `_get_file_statistics` method in `harness_integration.py` calculates file statistics based on file extensions. Ensure that this method handles edge cases, such as files without extensions or files with multiple extensions, to provide accurate file statistics.
Merge Readiness
The pull request introduces valuable new features and is generally well-implemented. However, addressing the error handling and snapshot storage consistency issues is crucial for ensuring the stability and reliability of the codebase. I recommend addressing the high severity issues before merging. I am unable to directly approve this pull request, and users should have others review and approve this code before merging.
```python
except Exception as e:
    logger.error(f"Repository analysis failed: {e}")
    raise HTTPException(status_code=500, detail=str(e))
```
Using a broad `except Exception as e` can mask underlying issues. It's better to catch specific exceptions to handle them appropriately and provide more informative error messages. For example, you could catch `RepoNotFoundError`, `AnalysisError`, etc.

What do you think about refining this to catch specific exceptions?
```python
except RepoNotFoundError as e:
    logger.error(f"Repository not found: {e}")
    raise HTTPException(status_code=404, detail=str(e))
except AnalysisError as e:
    logger.error(f"Repository analysis failed: {e}")
    raise HTTPException(status_code=500, detail=str(e))
except Exception as e:
    logger.error(f"Repository analysis failed: {e}")
    raise HTTPException(status_code=500, detail=str(e))
```

```python
    logger.warning(f"Failed to load snapshot {key}: {e}")
except Exception as e:
    logger.warning(f"Failed to list snapshots from bucket store: {e}")
```
Similar to the previous comment, using a broad `except Exception as e` can hide potential issues. Consider catching specific exceptions like `BucketStoreError`, `SnapshotNotFoundError`, etc., to provide better error handling and logging.

What do you think about refining this to catch specific exceptions?
```python
except BucketStoreError as e:
    logger.warning(f"Failed to list snapshots from bucket store: {e}")
except Exception as e:
    logger.warning(f"Failed to list snapshots from bucket store: {e}")
```

```python
ext = Path(file.path).suffix
if ext in file_extensions:
    file_extensions[ext] += 1
else:
    file_extensions[ext] = 1
```
Consider handling cases where a file might not have an extension. Currently, `Path(file.path).suffix` will return an empty string if there is no extension, which is fine, but it might be useful to log these instances, or handle them differently.

What do you think about adding a check for an empty extension, and perhaps logging it, or assigning a default extension like `.unknown`?
```python
ext = Path(file.path).suffix
if not ext:
    logger.warning(f"File {file.path} has no extension.")
    ext = ".unknown"  # Or handle it as you see fit
if ext in file_extensions:
    file_extensions[ext] += 1
else:
    file_extensions[ext] = 1
```
```python
if self.bucket_store:
    try:
        self.snapshot_data = self._load_remote(snapshot_id)
        logger.info(f"Loaded snapshot {snapshot_id} from remote storage")
        return self.snapshot_data
    except Exception as e:
        logger.warning(f"Failed to load snapshot from remote: {e}")
```
If a snapshot exists in both the bucket store and local storage, the bucket store version will always be loaded. This might be fine, but it's worth considering whether you want to add a check to see if the local version is newer, or give the user an option to prefer local snapshots.
What do you think about adding a check for a newer local snapshot, or a user preference?
```python
if self.bucket_store:
    try:
        self.snapshot_data = self._load_remote(snapshot_id)
        logger.info(f"Loaded snapshot {snapshot_id} from remote storage")
        return self.snapshot_data
    except Exception as e:
        logger.warning(f"Failed to load snapshot from remote: {e}")

# Check if local snapshot is newer
try:
    local_data = self._load_local(snapshot_id)
    if self.bucket_store and Path(f"./snapshots/snapshot_{snapshot_id}.json").stat().st_mtime > datetime.fromisoformat(self.snapshot_data['created_at']).timestamp():
        logger.info(f"Local snapshot {snapshot_id} is newer, loading from local storage")
        return local_data
except Exception:  # a bare `except:` would also swallow KeyboardInterrupt
    pass
```
**PR Reviewer Guide 🔍**

(Review updated until commit 18a0737)

Here are some key observations to aid the review process:
Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here.

**PR Reviewer Guide 🔍**

Here are some key observations to aid the review process:
It looks like you've requested a lot of reviews from different bots! I'm just chiming in to say that I'm here to help if you have any specific questions about my previous review or need further assistance with the code. Let me know how I can be of service!

I'll review PR #14 "Integrate harness.py functionality into codegen-on-oss package" right away! I'll examine the changes and provide feedback shortly.
**PR Code Suggestions ✨**

Latest suggestions up to 18a0737

Previous suggestions:

- Suggestions up to commit 18a0737
- Suggestions up to commit dd9b5d0
- Suggestions up to commit 66cd331
**PR Description**

This pull request introduces a new suite of tools for comprehensive codebase analysis, context management, and AI agent execution. It aims to enhance the existing codegen-on-oss package by providing capabilities for in-depth repository analysis, snapshotting codebase states, and running AI agents with saved context, all accessible through a REST API.

**Key Technical Changes**

Key changes include the addition of …

**Architecture Decisions**

The architecture adopts a modular design, separating analysis, snapshotting, and server functionalities into distinct modules. The use of FastAPI enables easy API definition and interaction. The choice of JSON for snapshot storage facilitates portability and readability. The integration with …

**Dependencies and Interactions**

This PR introduces new dependencies: …

**Risk Considerations**

Potential risks include increased memory usage due to storing active harnesses in the server, lack of authentication/authorization on the API endpoints, and potential race conditions during snapshot creation/loading. Error handling and input validation should be carefully reviewed. The performance of large repository analysis and snapshotting needs to be evaluated.

**Notable Implementation Details**

The …
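Since snapshots are stored as JSON, a payload presumably resembles the sketch below; the field names mirror the `SnapshotInfo` model quoted later in this thread (`snapshot_id`, `created_at`, `repo_info`), while the concrete values are invented.

```python
import json
from datetime import datetime, timezone

# Illustrative snapshot payload; only the three top-level field names are
# taken from the SnapshotInfo model quoted in this review.
snapshot = {
    "snapshot_id": "abc123",
    "created_at": datetime.now(timezone.utc).isoformat(),
    "repo_info": {
        "repo_full_name": "owner/repo",
        "commit": "abcdef0",
        "language": "python",
    },
}

# The local file name pattern snapshot_<id>.json also appears in this review.
with open(f"snapshot_{snapshot['snapshot_id']}.json", "w") as f:
    json.dump(snapshot, f, indent=2)
```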
```python
class CodebaseAnalysisHarness:
    """
    A comprehensive harness for analyzing codebases, generating diffs, and tracking file changes.
    Integrates core functionality from the original harness.py with enhanced analysis capabilities.
    """
```
The `CodebaseAnalysisHarness` class should implement proper cleanup mechanisms. Consider adding a `__del__` method or context manager support (via `__enter__` and `__exit__`) to ensure proper resource cleanup, especially for the codebase object which might hold file handles. This prevents resource leaks in long-running applications.
```python
class CodebaseAnalysisHarness:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if hasattr(self, 'codebase'):
            self.codebase.cleanup()  # Implement appropriate cleanup

    def __del__(self):
        if hasattr(self, 'codebase'):
            self.codebase.cleanup()
```
| "file_count": file_count, | ||
| "file_extensions": file_extensions, | ||
| } | ||
|
|
||
| def get_diff(self, base: Optional[str] = None) -> str: | ||
| """ | ||
| Get the diff between the current state and a base commit. | ||
| Args: | ||
| base: The base commit to compare against (defaults to self.base_commit) | ||
| Returns: | ||
| A string containing the diff | ||
| """ | ||
| base_commit = base or self.base_commit | ||
| if not base_commit: |
The `get_diff` method should have input validation for the base commit. Also, consider adding a size limit for the diff to prevent memory issues with very large diffs. Add error handling for git command failures.
| "file_count": file_count, | |
| "file_extensions": file_extensions, | |
| } | |
| def get_diff(self, base: Optional[str] = None) -> str: | |
| """ | |
| Get the diff between the current state and a base commit. | |
| Args: | |
| base: The base commit to compare against (defaults to self.base_commit) | |
| Returns: | |
| A string containing the diff | |
| """ | |
| base_commit = base or self.base_commit | |
| if not base_commit: | |
| def get_diff(self, base: Optional[str] = None, max_diff_size: int = 1024 * 1024) -> str: | |
| base_commit = base or self.base_commit | |
| if not base_commit: | |
| logger.warning("No base commit specified for diff generation") | |
| return "" | |
| try: | |
| diff = self.codebase.get_diff(base=base_commit) | |
| if len(diff) > max_diff_size: | |
| raise ValueError(f"Diff size exceeds maximum allowed size of {max_diff_size} bytes") | |
| return diff | |
| except Exception as e: | |
| logger.error(f"Failed to generate diff: {e}") | |
| raise |
```python
    prompt: str = Field(..., description="Prompt to send to the agent")
    model: Optional[str] = Field(None, description="Model to use for the agent")
    metadata: Optional[Dict[str, Any]] = Field(None, description="Metadata for the agent run")


class SnapshotInfo(BaseModel):
    snapshot_id: str = Field(..., description="ID of the snapshot")
    created_at: str = Field(..., description="Creation timestamp")
    repo_info: Dict[str, Any] = Field(..., description="Repository information")


# Create FastAPI app
app = FastAPI(
    title="Code Context Retrieval Server",
    description="API for codebase analysis, context management, and agent execution",
```
The server lacks proper authentication and rate limiting. In a production environment, this could lead to security vulnerabilities and resource exhaustion. Consider adding authentication middleware and rate limiting to protect the API endpoints.
```python
from fastapi import FastAPI, HTTPException, Query, Depends
from fastapi.security import APIKeyHeader
from slowapi import Limiter
from slowapi.util import get_remote_address

api_key_header = APIKeyHeader(name="X-API-Key")
limiter = Limiter(key_func=get_remote_address)

app = FastAPI(
    title="Code Context Retrieval Server",
    description="API for codebase analysis, context management, and agent execution",
    version="0.1.0"
)
app.state.limiter = limiter

async def verify_api_key(api_key: str = Depends(api_key_header)):
    if not is_valid_api_key(api_key):  # Implement your validation logic
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key
```
The global storage of active harnesses (`active_harnesses` dict) could lead to memory leaks in a long-running server. Implement a cache eviction strategy and maximum capacity to prevent unbounded memory growth.
```python
from cachetools import TTLCache
from datetime import timedelta

# Cache harnesses with a 1-hour TTL and maximum of 100 entries
active_harnesses: TTLCache = TTLCache(
    maxsize=100,
    ttl=timedelta(hours=1).total_seconds()
)

def cleanup_inactive_harnesses():
    """Manually cleanup inactive harnesses"""
    for key in list(active_harnesses.keys()):
        if key not in active_harnesses:  # Check if already evicted
            continue
        harness = active_harnesses[key]
        if harness.is_inactive():  # Implement inactivity check
            del active_harnesses[key]
```
```python
class CodebaseContextSnapshot:
    """
    Allows saving and restoring codebase state and analysis results.
    Integrates with S3-compatible storage via BucketStore.
    """
```
The snapshot functionality doesn't handle concurrent access safely. When using this in a multi-threaded environment (like the FastAPI server), there could be race conditions when creating/loading snapshots. Consider adding proper locking mechanisms.
```python
from threading import Lock

class CodebaseContextSnapshot:
    _snapshot_locks = {}
    _global_lock = Lock()

    def __init__(self, ...):
        with self._global_lock:
            self._snapshot_lock = self._snapshot_locks.setdefault(self.snapshot_id, Lock())

    def create_snapshot(self, local_path: Optional[Union[str, Path]] = None) -> str:
        with self._snapshot_lock:
            # Existing snapshot creation code
            pass
```
/korbit-review

@codegen Implement and upgrade this PR with the above considerations and suggestions from @CodiumAI-Agent, @sourcery-ai, @codecov-ai-reviewer, @coderabbitai, /gemini, /korbit-review

Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here.

Persistent review updated to latest commit 18a0737
Code Review
This PR introduces significant enhancements to the codegen-on-oss package, including codebase analysis, context management, and a REST API server. The changes are well-structured and documented. However, there are some areas where error handling and code clarity could be improved.
Summary of Findings
- **Git Executable Not Found**: The code introduces a `GitExecutableNotFoundError` exception and attempts to locate the git executable using `shutil.which("git")`. However, the exception is not handled in the main execution flow, which could lead to unhandled exceptions if git is not found.
- **Type Hinting**: In `codebase_context.py`, several type hints are missing or could be more specific, especially for class attributes. Adding more precise type hints would improve code readability and maintainability.
- **Error Handling in `get_directory`**: In `codebase_context.py`, the `get_directory` method raises an `AssertionError` when `allow_external` is not enabled and the directory is not a subdirectory. Consider using a more specific exception type and providing more context in the error message.
- **Inconsistent Use of List Comprehensions vs. Loops**: In `codebase_context.py`, there's inconsistent use of list comprehensions and loops. Standardizing the approach would improve code readability.
- **Potential UnicodeDecodeError**: In `codebase_context.py`, the `_process_diff_files` method catches `UnicodeDecodeError` but only logs a warning. Consider adding more robust error handling or skipping the file.
Merge Readiness

The pull request is not quite ready for merging. There are a few issues that need to be addressed before merging, particularly the error handling for the `GitExecutableNotFoundError` and the `AssertionError` in `get_directory`. Additionally, improving the type hinting in `codebase_context.py` would enhance the code's maintainability. I am unable to directly approve this pull request, and recommend that another reviewer approve this code after the changes have been made.
```python
absolute_path = self.to_absolute(directory_path)
if not self.is_subdir(absolute_path) and not self.config.allow_external:
    raise AssertionError(
        f"Directory {absolute_path} is not part of repo path {self.repo_path}"
    )
    return None
```
Consider using a more specific exception type (e.g., `ValueError`) and providing more context in the error message. Also, the `return None` is unreachable due to the `AssertionError` being raised.

```python
raise ValueError(
    f"Directory {absolute_path} is not part of repo path {self.repo_path} and external paths are disallowed."
)
```

```python
if not git_executable:
    raise GitExecutableNotFoundError()
```
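The summary above notes that `GitExecutableNotFoundError` is never handled in the main execution flow. A minimal, self-contained sketch of catching it at an entry point follows; the exception definition, helper, and message here are stand-ins, not the PR's actual code.

```python
import shutil
import sys


class GitExecutableNotFoundError(RuntimeError):
    """Stand-in for the exception introduced in the PR (definition assumed)."""


def require_git() -> str:
    # Mirrors the PR's shutil.which("git") lookup
    git_executable = shutil.which("git")
    if not git_executable:
        raise GitExecutableNotFoundError("git executable not found on PATH")
    return git_executable


def main() -> int:
    try:
        require_git()
    except GitExecutableNotFoundError as e:
        print(f"error: {e}; please install git and retry", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```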
```python
transaction_manager: TransactionManager
pending_syncs: list[
    DiffLite
]  # Diffs that have been applied to disk, but not the graph (to be used for sync graph)
all_syncs: list[
    DiffLite
]  # All diffs that have been applied to the graph (to be used for graph reset)
```
Consider adding type hints for these class attributes to improve code readability and maintainability.

```python
transaction_manager: TransactionManager
pending_syncs: list[DiffLite] = field(default_factory=list)  # Diffs that have been applied to disk, but not the graph (to be used for sync graph)
all_syncs: list[DiffLite] = field(default_factory=list)  # All diffs that have been applied to the graph (to be used for graph reset)
```

```python
# Start the dependency manager. This may or may not run asynchronously, depending on the implementation
if self.dependency_manager is not None:
    # Check if its inital start or a reparse
    if (
        not self.dependency_manager.ready()
        and not self.dependency_manager.error()
    ):
        # TODO: We do not reparse dependencies during syncs as it is expensive. We should probably add a flag for this
        logger.info("> Starting dependency manager")
        self.dependency_manager.start(async_start=False)
```
Consider handling the `UnicodeDecodeError` more robustly. Simply logging a warning might not be sufficient. Perhaps skip the file or use a different encoding.

```python
except UnicodeDecodeError:
    logger.warning(
        f"Can't read file at:{filepath} since it contains non-unicode characters. File will be ignored!"
    )
    continue
```

Persistent review updated to latest commit 18a0737
I'll review PR #14 "Integrate harness.py functionality into codegen-on-oss package" right away. Let me examine the changes and provide a thorough review. |
|
It looks like you're managing a complex review process with multiple bots, which is great for thoroughness! I'm here to help clarify any of my previous suggestions or address new concerns that have arisen. Please let me know if you have any specific questions for me. |
**PR Description**

This pull request introduces significant enhancements to the codegen-on-oss package, transforming it into a more comprehensive toolkit for codebase analysis, context management, and AI-driven code modification. The primary goal is to provide a robust and modular system for understanding and manipulating large codebases, facilitating tasks such as automated refactoring, security auditing, and AI-assisted code generation.

**Key Technical Changes**

Key changes include: …

**Architecture Decisions**

The architecture emphasizes modularity and separation of concerns. The codebase analysis is decoupled from the context management, allowing for independent use and extension. The introduction of a FastAPI server follows a microservices pattern, enabling remote access to codebase analysis and manipulation capabilities. The use of …

**Dependencies and Interactions**

This pull request introduces new dependencies, including …

**Risk Considerations**

Potential risks include: …

**Notable Implementation Details**

Notable implementation details include: …
I've completed my review and didn't find any issues... but I did find this horse.

```
      .''
  ._.-.___.' (`\
 //(        ( `'
'/ )\ ).__. )
' <' `\ ._/'\
   `   \     \
```

**Files scanned**
| File Path | Reviewed |
|---|---|
| codegen-on-oss/codegen_on_oss/context_server/server.py | ✅ |
| codegen-on-oss/codegen_on_oss/snapshot/context_snapshot.py | ✅ |
Hey @codegen-sh[bot] - I've reviewed your changes - here's some feedback:
- Consider making server dependencies like FastAPI and Uvicorn optional extras for users who only need the core analysis features.
- Ensure the core analysis and snapshotting logic remains easily usable independently of the new API server.
Here's what I looked at during the review
- 🟡 General issues: 4 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
```python
# Global storage for active harnesses and snapshots
active_harnesses: dict[str, CodebaseAnalysisHarness] = {}
bucket_store: BucketStore | None = None
```
suggestion: Global mutable `bucket_store` variable usage.

Mutable globals in async contexts can cause race conditions; encapsulate `bucket_store` or use dependency injection for safer access.

Suggested implementation:

```python
# Global storage for active harnesses and snapshots
active_harnesses: dict[str, CodebaseAnalysisHarness] = {}

app = FastAPI(
    title="Code Context Retrieval Server",
    description="API for codebase analysis, context management, and agent execution",
    version="0.1.0",
)

@app.on_event("startup")
async def startup_event():
    # Initialize the bucket_store and attach it to the app state for safe access.
    app.state.bucket_store = BucketStore()
```

Ensure that endpoints or functions that previously referenced the bucket_store global now reference it via dependency injection or by accessing it from the request object, for example:

```python
from fastapi import Request

@app.get("/some-endpoint")
async def some_endpoint(request: Request):
    bucket_store = request.app.state.bucket_store
    # Use bucket_store here
```

Adjust other parts of the code to remove direct dependencies on the global variable.
```python
    Returns:
        The result of the agent run
    """
    tags = [str(value) for value in self.metadata.values() if value]
```
question: Dynamic tag generation from metadata.
Check that metadata keys and values don’t generate unexpected or overly verbose tags when passed to the agent.
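One way to keep such tags bounded is sketched below; the length cap and truncation policy are assumptions, and only the `str(value) for value in self.metadata.values() if value` pattern comes from the snippet above.

```python
MAX_TAG_LEN = 64  # cap is an assumption, not from the PR


def metadata_to_tags(metadata: dict) -> list[str]:
    """Convert metadata values to short, predictable string tags."""
    tags = []
    for value in metadata.values():
        if not value:
            continue  # skip falsy values, matching the PR's filter
        tag = str(value).strip()
        if len(tag) > MAX_TAG_LEN:
            tag = tag[:MAX_TAG_LEN]  # truncate overly verbose values
        tags.append(tag)
    return tags


# A nested dict stringifies verbosely without such a cap:
print(metadata_to_tags({"repo": "owner/repo", "extra": {"a": 1, "b": 2}}))
```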
```
- **Performance Profiling**: Log metrics for each parsing run
- **Error Logging**: Log errors to help pinpoint parsing failures

For more details on the original functionality, see the "Package Structure" and "Getting Started" sections below.
```
suggestion: Reference to sections below might be ambiguous.

With sections now titled (e.g., "Original Package Structure"), please update this reference to point to the exact section containing the original functionality details.

Suggested implementation:

```
For more details on the original functionality, see the "Original Functionality" section above.
```

If additional references in the document refer to "Package Structure" and "Getting Started" for original functionality details, you may want to update those as well to reduce ambiguity.
```
# Event Handler with codebase snapshotting

This project is designed to using Modal snapshotting to provide parsed codebase instances with minimal latency, make it more manageable to write event based handlers.
```
issue (typo): Grammatical error in sentence.

Replace "designed to using" with "designed to use" or "designed for using."

Suggested implementation:

```
This project is designed to use Modal snapshotting to provide parsed codebase instances with minimal latency, making it more manageable to write event based handlers.
```

The change updates the wording from "designed to using" to "designed to use" and also adjusts "make" to "making" for grammatical consistency.
```python
from codegen.sdk.codebase.config_parser import (
    ConfigParser,
    get_config_parser_for_language,
)
```
issue (complexity): Consider keeping simple one-line expressions and type annotations on one line to reduce vertical noise.

In several places simple function calls or annotations are broken over multiple lines and cause extra vertical noise. For example, instead of:

```python
self.config_parser = get_config_parser_for_language(
    context.programming_language,
    self
)
```

use the concise version:

```python
self.config_parser = get_config_parser_for_language(context.programming_language, self)
```

Similarly, inline conditionals that are simple can remain on one line. Only use multi-line formatting when the call or expression exceeds your line-length limit or when grouping parameters enhances clarity. This reduces perceived complexity without losing functionality.
```python
# Filter out test and tutorial functions first
functions = [f for f in codebase.functions if not any(pattern in f.name.lower() for pattern in ["test", "tutorial"]) and not any(pattern in f.filepath.lower() for pattern in ["test", "tutorial"])]
```
issue (code-quality): We've found these issues:

- Invert any/all to simplify comparisons [×2] (`invert-any-all`)
- Use named expression to simplify assignment and conditional (`use-named-expression`)
```python
modal_cls = modal.Cls.from_name(
    app_name="Events", name="CustomEventHandlersAPI"
)
```
issue (code-quality): Inline variable that is immediately returned (`inline-immediately-returned-variable`)
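Applied to the `modal.Cls.from_name` snippet above, the rule amounts to the following before/after; the surrounding functions are invented for illustration, and the call signature is taken from the quoted snippet.

```python
import modal  # assumes the Modal SDK is installed


def get_handler_cls_before():
    modal_cls = modal.Cls.from_name(app_name="Events", name="CustomEventHandlersAPI")
    return modal_cls


def get_handler_cls_after():
    # inline-immediately-returned-variable: return the expression directly
    return modal.Cls.from_name(app_name="Events", name="CustomEventHandlersAPI")
```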
| logger.info(f"Found violation: {violation}") | ||
| # Check for react-dev-overlay imports | ||
| for i, line in enumerate(lines): | ||
| line_num = i + 1 |
suggestion (code-quality): Move assignments closer to their usage (`move-assign`)
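Because the rendered diff above collapsed to identical lines, here is a generic sketch of what `move-assign` suggests for the `line_num` snippet; the sample data and surrounding logic are invented.

```python
lines = ["import os", "import react_dev_overlay"]  # sample data, invented

# Before: line_num computed at the top of the loop but used only in the branch
for i, line in enumerate(lines):
    line_num = i + 1
    if "react_dev_overlay" in line:
        print(f"violation at line {line_num}")

# After: the assignment moves next to its only usage
for i, line in enumerate(lines):
    if "react_dev_overlay" in line:
        line_num = i + 1
        print(f"violation at line {line_num}")
```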
| comment = "## Development Import Violations Found\n\n" | ||
| comment += "The following files contain imports of development modules that should not be used in production code:\n\n" | ||
|
|
suggestion (code-quality): Replace assignment and augmented assignment with single assignment (`merge-assign-and-aug-assign`)

```diff
-comment = "## Development Import Violations Found\n\n"
-comment += "The following files contain imports of development modules that should not be used in production code:\n\n"
+comment = (
+    "## Development Import Violations Found\n\n"
+    + "The following files contain imports of development modules that should not be used in production code:\n\n"
+)
```
```python
except ValueError:
    raise rest.ApiException(status=0, reason=(f"Failed to parse `{string}` as datetime object"))
```
suggestion (code-quality): Explicitly raise from a previous error (`raise-from-previous-error`)

```diff
-except ValueError:
-    raise rest.ApiException(status=0, reason=(f"Failed to parse `{string}` as datetime object"))
+except ValueError as e:
+    raise rest.ApiException(
+        status=0, reason=(f"Failed to parse `{string}` as datetime object")
+    ) from e
```
Tip: For best results, initiate chat on the files or code changes.

A summary of the edits CodeRabbit can apply:
A plan of each step that CodeRabbit will take in order to execute the edits for each file (with inline rough draft snippets):

✏️ codegen-on-oss/codegen_on_oss/bucket_store.py

Immediately after the existing import of boto3 (after line 5), insert the following imports:

```python
import json
import os
import tempfile
from typing import Any, Dict, List
```

After the end of the upload_file method (just below line 27), insert the put_json method:

```python
def put_json(self, data: Any, remote_path: str) -> str:
    """
    Uploads JSON-serializable data to S3 and returns the key.
    """
    local_file = tempfile.NamedTemporaryFile(delete=False, suffix=".json")
    try:
        with open(local_file.name, "w", encoding="utf-8") as f:
            json.dump(data, f)
        return self.upload_file(local_file.name, remote_path)
    finally:
        try:
            os.remove(local_file.name)
        except Exception:
            pass
```

Immediately after put_json, insert the get_json method:

```python
def get_json(self, remote_path: str) -> Any:
    """
    Downloads JSON data from S3 and returns the parsed object.
    """
    local_file = tempfile.NamedTemporaryFile(delete=False, suffix=".json")
    try:
        self.s3_client.download_file(
            self.bucket_name,
            remote_path,
            local_file.name,
        )
        with open(local_file.name, "r", encoding="utf-8") as f:
            return json.load(f)
    finally:
        try:
            os.remove(local_file.name)
        except Exception:
            pass
```

Immediately after get_json, insert the list_keys method:

```python
def list_keys(self, prefix: str = "") -> List[str]:
    """
    Lists keys in the S3 bucket under the given prefix.
    """
    paginator = self.s3_client.get_paginator("list_objects_v2")
    keys: List[str] = []
    for page in paginator.paginate(Bucket=self.bucket_name, Prefix=f"{self.key_prefix}/{prefix}"):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

✏️ codegen-on-oss/codegen_on_oss/context_server/server.py

Replace the existing import block (lines 5–15) with the following to include re, asyncio, datetime, typing hints, and Pydantic validators:

```python
import json
import re
import asyncio
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional

from fastapi import FastAPI, HTTPException, Query
from loguru import logger
from pydantic import BaseModel, Field, validator

from codegen_on_oss.analysis.harness_integration import CodebaseAnalysisHarness
from codegen_on_oss.bucket_store import BucketStore
from codegen_on_oss.snapshot.context_snapshot import CodebaseContextSnapshot
```

Immediately after the global declarations on lines 47–48, insert:

```python
# Track when harnesses were last accessed and expire after inactivity
last_accessed: Dict[str, datetime] = {}
MAX_INACTIVE_TIME = 3600  # seconds
```

After the existing startup handler, add:

```python
@app.on_event("startup")
async def setup_cleanup_task():
    """Launch background task to remove inactive harnesses."""
    asyncio.create_task(cleanup_inactive_harnesses())


async def cleanup_inactive_harnesses():
    """Periodically purge harnesses idle longer than MAX_INACTIVE_TIME."""
    while True:
        now = datetime.now()
        stale = [
            key
            for key, t in last_accessed.items()
            if (now - t).total_seconds() > MAX_INACTIVE_TIME
        ]
        for key in stale:
            logger.info(f"Removing inactive harness: {key}")
            active_harnesses.pop(key, None)
            last_accessed.pop(key, None)
        await asyncio.sleep(300)  # run every 5 minutes
```

Within the repository info model, add validators:

```python
@validator("repo_full_name")
def validate_repo_full_name(cls, v):
    if not re.match(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$", v):
        raise ValueError("repo_full_name must be in format owner/repo")
    return v


@validator("commit")
def validate_commit(cls, v):
    if v and not re.match(r"^[0-9a-f]{7,40}$", v):
        raise ValueError("commit must be a valid git hash")
    return v


@validator("language")
def validate_language(cls, v):
    allowed = {"python", "javascript", "typescript", "java", "go"}
    low = v.lower()
    if low not in allowed:
        raise ValueError(f"language must be one of: {', '.join(sorted(allowed))}")
    return low
```

In the analyze endpoint, record access time and return a generic error:

```python
last_accessed[harness_key] = datetime.now()

raise HTTPException(status_code=500, detail="Repository analysis failed") from e
```

Replace the body of the file-statistics endpoint with:

```python
if harness_key not in active_harnesses:
    raise HTTPException(status_code=404, detail=f"Harness {harness_key} not found")

last_accessed[harness_key] = datetime.now()

try:
    harness = active_harnesses[harness_key]
    if not harness.analysis_results:
        harness.analyze_codebase()
    return harness.analysis_results.get("file_stats", {})
except Exception as e:
    logger.error(f"Failed to get file stats: {e}")
    raise HTTPException(status_code=500, detail="Failed to get file statistics")
```

In the snapshot-creation endpoint:

```python
last_accessed[harness_key] = datetime.now()

raise HTTPException(status_code=500, detail="Failed to create snapshot") from e
```

In the agent-run endpoint:

```python
last_accessed[harness_key] = datetime.now()

raise HTTPException(status_code=500, detail="Agent execution failed") from e
```

✏️ codegen-on-oss/codegen_on_oss/cli.py

Immediately below line 167 (the comment "# Import here to avoid circular imports"), insert:

```python
from codegen_on_oss.context_server.server import create_app
```

After the newly added import, insert the application factory invocation so that the FastAPI app instance is created. Add:

```python
app = create_app()
```

Replace the existing uvicorn.run block (lines 170–176) with a call that uses the app instance instead of a module path string. Change:

```python
uvicorn.run(
    "codegen_on_oss.context_server:app",
    host=host,
    port=port,
    reload=reload,
    log_level=log_level,
)
```

to:

```python
uvicorn.run(
    app,
    host=host,
    port=port,
    reload=reload,
    log_level=log_level,
)
```

Ensure there are no leftover references to the string-based uvicorn invocation. Remove or comment out any redundant uvicorn.run calls that use module path strings. Save the file and test the CLI serve command by running …

✏️ codegen-on-oss/codegen_on_oss/analysis/harness_integration.py

Replace lines 182–184 with the following block, preserving the existing logger call and indentation (8 spaces):

```python
except Exception as agent_error:
    logger.error(f"Agent run failed with error: {agent_error}")
    raise RuntimeError("Agent execution failed") from agent_error
```

✏️ codegen-on-oss/codegen_on_oss/snapshot/context_snapshot.py

Within the snapshot loading code, replace:

```python
if snapshot_path.exists():
    with open(snapshot_path) as f:
        return json.load(f)
```

with the following try/except to catch and log JSON decode errors, then continue to the next directory:

```python
if snapshot_path.exists():
    try:
        with open(snapshot_path, "r") as f:
            return json.load(f)
    except json.JSONDecodeError as e:
        logger.warning(f"Invalid JSON in snapshot file {snapshot_path}: {e}")
        continue
```
**PR Review: Integrate harness.py functionality into codegen-on-oss package**

I've reviewed the changes in this PR that integrate functionality from …

**Strengths**

…

**Issues and Suggestions**

…

**Recommendations**

…

Overall, this PR represents a significant enhancement to the …
Actionable comments posted: 5
♻️ Duplicate comments (6)
codegen-on-oss/codegen_on_oss/context_server/server.py (6)
**239-246**: ⚠️ Potential issue: Add null check for harness.metadata

If the harness was created without metadata, `harness.metadata.update(...)` will raise an AttributeError.

Add a null check for `harness.metadata`:

```diff
 # Update metadata if provided
 if request.metadata:
+    if harness.metadata is None:
+        harness.metadata = {}
     harness.metadata.update(request.metadata)
```
**97-99**: 🛠️ Refactor suggestion: Store harness with wrapper object for TTL tracking

To work with the proposed cleanup mechanism, update how harnesses are stored:

```diff
 # Store the harness for later use
-active_harnesses[harness_key] = harness
+active_harnesses[harness_key] = HarnessEntry(harness)
```
**89-104**: 🛠️ Refactor suggestion: Use specific exceptions instead of broad exception handling

Using a broad `except Exception as e` can mask underlying issues. It's better to catch specific exceptions to provide more informative error messages.

Catch specific exceptions:

```diff
 try:
     # Create a new harness for the repository
     harness = CodebaseAnalysisHarness.from_repo(
         repo_full_name=repo_info.repo_full_name,
         commit=repo_info.commit,
         language=repo_info.language,
     )

     # Store the harness for later use
     active_harnesses[harness_key] = harness

     # Perform analysis
     results = harness.analyze_codebase()
-except Exception as e:
+except ValueError as e:  # For invalid input parameters
+    logger.error(f"Invalid repository parameters: {e}")
+    raise HTTPException(status_code=400, detail=str(e)) from e
+except FileNotFoundError as e:  # If repository or files can't be found
+    logger.error(f"Repository not found: {e}")
+    raise HTTPException(status_code=404, detail=str(e)) from e
+except Exception as e:  # Fallback for other exceptions
     logger.error(f"Repository analysis failed: {e}")
     raise HTTPException(status_code=500, detail=str(e)) from e
```

Apply similar improvements to other exception handling blocks in the file.
**46-48**: ⚠️ Potential issue: Implement a cleanup mechanism for harnesses

The global storage of active harnesses could lead to memory leaks in a long-running server. There's no mechanism to remove harnesses that are no longer needed.

Implement a cache eviction strategy with a time-to-live (TTL) mechanism:

```diff
-# Global storage for active harnesses and snapshots
-active_harnesses: dict[str, CodebaseAnalysisHarness] = {}
+from datetime import datetime, timedelta
+
+# Global storage for active harnesses and snapshots
+class HarnessEntry:
+    def __init__(self, harness: CodebaseAnalysisHarness):
+        self.harness = harness
+        self.last_accessed = datetime.now()
+
+    def access(self):
+        self.last_accessed = datetime.now()
+        return self.harness
+
+active_harnesses: dict[str, HarnessEntry] = {}
+
+def cleanup_inactive_harnesses():
+    """Remove harnesses that haven't been accessed in the last hour"""
+    now = datetime.now()
+    expired_keys = [
+        key for key, entry in active_harnesses.items()
+        if (now - entry.last_accessed) > timedelta(hours=1)
+    ]
+    for key in expired_keys:
+        logger.info(f"Removing inactive harness: {key}")
+        del active_harnesses[key]
```

Then, update each access to harnesses:

```diff
-harness = active_harnesses[harness_key]
+harness = active_harnesses[harness_key].access()
```

And add a periodic task to run the cleanup:

```diff
 @app.on_event("startup")
 async def startup_event():
     """Initialize resources on server startup."""
     global bucket_store
+
+    # Schedule periodic cleanup of inactive harnesses
+    from apscheduler.schedulers.asyncio import AsyncIOScheduler
+    scheduler = AsyncIOScheduler()
+    scheduler.add_job(cleanup_inactive_harnesses, 'interval', minutes=15)
+    scheduler.start()
+    logger.info("Scheduled periodic cleanup of inactive harnesses")
```
**40-44**: 🛠️ Refactor suggestion: Add authentication and rate limiting

The server lacks proper authentication and rate limiting. In a production environment, this could lead to security vulnerabilities and resource exhaustion.

Consider adding authentication middleware and rate limiting:

```diff
-from fastapi import FastAPI, HTTPException, Query
+from fastapi import FastAPI, HTTPException, Query, Depends
+from fastapi.security import APIKeyHeader
+from slowapi import Limiter
+from slowapi.util import get_remote_address

+api_key_header = APIKeyHeader(name="X-API-Key")
+limiter = Limiter(key_func=get_remote_address)

 app = FastAPI(
     title="Code Context Retrieval Server",
     description="API for codebase analysis, context management, and agent execution",
     version="0.1.0",
 )
+app.state.limiter = limiter

+def verify_api_key(api_key: str = Depends(api_key_header)):
+    # Implement your API key validation logic
+    # This is a simple example, replace with your actual validation
+    valid_api_keys = ["your-secret-api-key"]  # Store securely, e.g., in environment variables
+    if api_key not in valid_api_keys:
+        raise HTTPException(status_code=403, detail="Invalid API key")
+    return api_key
```

Then, add the authentication and rate limiting to the endpoints:

```diff
-@app.post("/analyze/repository", response_model=dict[str, Any])
+@app.post("/analyze/repository", response_model=dict[str, Any], dependencies=[Depends(verify_api_key)])
+@limiter.limit("10/minute")
 async def analyze_repository(repo_info: RepositoryInfo):
```
**50-60**: ⚠️ Potential issue: Fix BucketStore initialization

The `BucketStore` is being initialized without a bucket name parameter, but the constructor requires one. This will cause an error when the server starts up.

Consider reading from environment variables:

```diff
 @app.on_event("startup")
 async def startup_event():
     """Initialize resources on server startup."""
     global bucket_store

     try:
-        bucket_store = BucketStore()
+        import os
+        bucket_name = os.environ.get("BUCKET_NAME")
+        if bucket_name:
+            bucket_store = BucketStore(bucket_name=bucket_name)
+            logger.info(f"Initialized bucket store for remote storage using bucket: {bucket_name}")
+        else:
+            logger.warning("BUCKET_NAME environment variable not set")
+            logger.info("Continuing without remote storage capabilities")
         logger.info("Initialized bucket store for remote storage")
     except Exception as e:
         logger.warning(f"Failed to initialize bucket store: {e}")
         logger.info("Continuing without remote storage capabilities")
```
🧹 Nitpick comments (3)
codegen-on-oss/codegen_on_oss/context_server/server.py (3)
**159-204**: Simplify snapshot listing logic by extracting helper functions

The `/snapshot/list` endpoint contains complex nested try-except logic which makes it hard to read and maintain.

Extract helper functions to simplify the logic:

```diff
+def load_remote_snapshot(key: str, bucket_store: BucketStore) -> SnapshotInfo | None:
+    """Load a snapshot from remote storage."""
+    try:
+        data = bucket_store.get_json(key)
+        return SnapshotInfo(
+            snapshot_id=data["snapshot_id"],
+            created_at=data["created_at"],
+            repo_info=data["repo_info"],
+        )
+    except Exception as e:
+        logger.warning(f"Failed to load snapshot {key}: {e}")
+        return None
+
+
+def load_local_snapshot(snapshot_file: Path) -> SnapshotInfo | None:
+    """Load a snapshot from local storage."""
+    try:
+        with open(snapshot_file) as f:
+            data = json.load(f)
+        return SnapshotInfo(
+            snapshot_id=data["snapshot_id"],
+            created_at=data["created_at"],
+            repo_info=data["repo_info"],
+        )
+    except Exception as e:
+        logger.warning(f"Failed to load local snapshot {snapshot_file}: {e}")
+        return None


 @app.get("/snapshot/list", response_model=list[SnapshotInfo])
 async def list_snapshots():
     """
     List all available snapshots.
     """
     snapshots = []

     # List snapshots from bucket store if available
     if bucket_store:
         try:
             keys = bucket_store.list_keys(prefix="snapshots/")
-            for key in keys:
-                try:
-                    data = bucket_store.get_json(key)
-                    snapshots.append(
-                        SnapshotInfo(
-                            snapshot_id=data["snapshot_id"],
-                            created_at=data["created_at"],
-                            repo_info=data["repo_info"],
-                        )
-                    )
-                except Exception as e:
-                    logger.warning(f"Failed to load snapshot {key}: {e}")
+            remote_snapshots = [load_remote_snapshot(key, bucket_store) for key in keys]
+            snapshots.extend([s for s in remote_snapshots if s])
         except Exception as e:
             logger.warning(f"Failed to list snapshots from bucket store: {e}")

     # List local snapshots
     for directory in [Path("./snapshots"), Path("./data/snapshots")]:
         if directory.exists():
-            for snapshot_file in directory.glob("snapshot_*.json"):
-                try:
-                    with open(snapshot_file) as f:
-                        data = json.load(f)
-                    snapshots.append(
-                        SnapshotInfo(
-                            snapshot_id=data["snapshot_id"],
-                            created_at=data["created_at"],
-                            repo_info=data["repo_info"],
-                        )
-                    )
-                except Exception as e:
-                    logger.warning(
-                        f"Failed to load local snapshot {snapshot_file}: {e}"
-                    )
+            local_snapshots = [load_local_snapshot(f) for f in directory.glob("snapshot_*.json")]
+            snapshots.extend([s for s in local_snapshots if s])

     return snapshots
```
**146-152**: Improve error handling in snapshot creation

The current error handling is too broad. Catch specific exceptions for better error reporting.

```diff
 try:
     snapshot = CodebaseContextSnapshot(harness=harness, bucket_store=bucket_store)
     snapshot_id = snapshot.create_snapshot(local_path=local_path)
-except Exception as e:
+except FileNotFoundError as e:
+    logger.error(f"Snapshot creation failed - directory not found: {e}")
+    raise HTTPException(status_code=404, detail=f"Directory not found: {str(e)}") from e
+except PermissionError as e:
+    logger.error(f"Snapshot creation failed - permission denied: {e}")
+    raise HTTPException(status_code=403, detail="Permission denied writing snapshot") from e
+except Exception as e:
     logger.error(f"Snapshot creation failed: {e}")
     raise HTTPException(status_code=500, detail=str(e)) from e
```
**212-223**: Provide more specific error details in snapshot loading

The current error handling returns a 404 for all types of errors, but some might be server errors:

```diff
 try:
     snapshot = CodebaseContextSnapshot(
         snapshot_id=snapshot_id, bucket_store=bucket_store
     )
     data = snapshot.load_snapshot()
-except Exception as e:
-    logger.error(f"Failed to load snapshot {snapshot_id}: {e}")
-    raise HTTPException(
-        status_code=404, detail=f"Snapshot {snapshot_id} not found"
-    ) from e
+except FileNotFoundError as e:
+    logger.error(f"Failed to load snapshot {snapshot_id}: {e}")
+    raise HTTPException(
+        status_code=404, detail=f"Snapshot {snapshot_id} not found"
+    ) from e
+except Exception as e:
+    logger.error(f"Error loading snapshot {snapshot_id}: {e}")
+    raise HTTPException(
+        status_code=500, detail=f"Error loading snapshot: {str(e)}"
+    ) from e
 else:
     return data
```
📜 Review details

- Configuration used: CodeRabbit UI
- Review profile: CHILL
- Plan: Pro

📒 Files selected for processing (2)

- codegen-on-oss/codegen_on_oss/context_server/server.py (1 hunks)
- codegen-on-oss/codegen_on_oss/snapshot/context_snapshot.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

- codegen-on-oss/codegen_on_oss/snapshot/context_snapshot.py

🧰 Additional context used

🧬 Code Graph Analysis (1)

codegen-on-oss/codegen_on_oss/context_server/server.py (3)

- codegen-on-oss/codegen_on_oss/analysis/harness_integration.py (4): CodebaseAnalysisHarness (15-207), from_repo (41-69), analyze_codebase (71-97), run_agent (156-184)
- codegen-on-oss/codegen_on_oss/bucket_store.py (1): BucketStore (11-26)
- codegen-on-oss/codegen_on_oss/snapshot/context_snapshot.py (3): CodebaseContextSnapshot (47-262), create_snapshot (75-113), load_snapshot (115-147)

⏰ Context from checks skipped due to timeout of 90000ms (2)

- GitHub Check: Build 3.13 ubuntu-22.04-arm
- GitHub Check: Build 3.12 ubuntu-22.04-arm
```python
async def analyze_repository(repo_info: RepositoryInfo):
    """
    Analyze a repository and return the results.

    Creates a new CodebaseAnalysisHarness for the repository and performs analysis.
    """
    harness_key = f"{repo_info.repo_full_name}:{repo_info.commit or 'latest'}"
```
🛠️ Refactor suggestion: Add input validation for repository name and commit hash

There's no validation for repository name format or commit hash, which could lead to security issues like path traversal or command injection.

Add regex-based validation to the RepositoryInfo model:

```diff
 from pydantic import BaseModel, Field
+from pydantic import validator
+import re

 class RepositoryInfo(BaseModel):
     repo_full_name: str = Field(..., description="Full repository name (owner/repo)")
     commit: str | None = Field(None, description="Commit hash to analyze")
     language: str = Field("python", description="Primary language of the repository")
+
+    @validator('repo_full_name')
+    def validate_repo_name(cls, v):
+        if not re.match(r'^[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+$', v):
+            raise ValueError('Invalid repository name format. Must be in the format "owner/repo"')
+        return v
+
+    @validator('commit')
+    def validate_commit(cls, v):
+        if v is not None and not re.match(r'^[0-9a-f]{7,40}$', v):
+            raise ValueError('Invalid commit hash format. Must be a valid Git commit hash')
+        return v
```
🤖 Prompt for AI Agents (early access)
In codegen-on-oss/codegen_on_oss/context_server/server.py around lines 81 to 88, add regex-based validation to the RepositoryInfo model to ensure the repository name and commit hash conform to expected safe formats. This prevents security risks such as path traversal or command injection by validating inputs before processing. Update the RepositoryInfo class to include regex validators for repo_full_name and commit fields.
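A quick check of how the proposed validators behave (this uses the Pydantic v1-style `@validator` shown in the suggestion; on Pydantic v2 the equivalent decorator is `field_validator`):

```python
from pydantic import ValidationError

# A well-formed owner/repo name with a short hex commit is accepted.
RepositoryInfo(repo_full_name="octocat/hello-world", commit="27f0eca")

try:
    # Path-traversal-style input fails the owner/repo pattern.
    RepositoryInfo(repo_full_name="../etc/passwd")
except ValidationError as exc:
    print(exc)  # rejected: repository name must match the "owner/repo" pattern
```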
```python
harness = active_harnesses[harness_key]
```
🛠️ Refactor suggestion
Update harness access for TTL tracking
Update harness retrieval to work with the proposed TTL tracking:
```diff
-harness = active_harnesses[harness_key]
+harness = active_harnesses[harness_key].access()
```
🤖 Prompt for AI Agents (early access)
In codegen-on-oss/codegen_on_oss/context_server/server.py around lines 122 to 123, update the way the harness is accessed from active_harnesses to integrate TTL tracking. Modify the retrieval logic to ensure that accessing the harness also updates or respects its TTL, such as by refreshing its timestamp or using a method that handles TTL internally, instead of directly indexing the dictionary.
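The wrapper type that would provide `.access()` is not shown in the suggestion; a minimal sketch of what it might look like (the `TrackedHarness` name and the one-hour default TTL are assumptions, not code from this PR):

```python
import time


class TrackedHarness:
    """Wraps a harness and records when it was last used, for TTL-based eviction."""

    def __init__(self, harness, ttl_seconds: float = 3600.0):
        self.harness = harness
        self.ttl_seconds = ttl_seconds
        self.last_accessed = time.monotonic()

    def access(self):
        """Refresh the last-access timestamp and return the wrapped harness."""
        self.last_accessed = time.monotonic()
        return self.harness

    @property
    def expired(self) -> bool:
        """True once the entry has gone unused for longer than its TTL."""
        return time.monotonic() - self.last_accessed > self.ttl_seconds
```

A periodic cleanup task could then drop entries whose `expired` property is true from `active_harnesses`, keeping server memory bounded.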
```python
harness = active_harnesses[harness_key]
```
🛠️ Refactor suggestion
Update harness access for TTL tracking
Update harness retrieval to work with the proposed TTL tracking:
```diff
-harness = active_harnesses[harness_key]
+harness = active_harnesses[harness_key].access()
```
🤖 Prompt for AI Agents (early access)
In codegen-on-oss/codegen_on_oss/context_server/server.py around lines 237 to 238, the harness is accessed directly from active_harnesses without considering TTL tracking. Update the harness retrieval logic to integrate with the TTL tracking mechanism, ensuring that accessing the harness also refreshes or updates its TTL as per the new design. Modify the code to retrieve the harness in a way that supports TTL updates, such as using a method or function that handles TTL refresh on access.
```python
harness = active_harnesses[harness_key]
```
🛠️ Refactor suggestion
Update harness access for TTL tracking
Update harness retrieval to work with the proposed TTL tracking:
```diff
-harness = active_harnesses[harness_key]
+harness = active_harnesses[harness_key].access()
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents (early access)
In codegen-on-oss/codegen_on_oss/context_server/server.py at lines 144-145, update the way the harness is accessed from active_harnesses to integrate TTL tracking. Modify the retrieval logic to ensure that accessing the harness also refreshes or updates its TTL, so the harness does not expire prematurely. Implement this by replacing the direct dictionary access with a method or mechanism that updates the TTL upon retrieval.
```python
async def run_agent(
    request: AgentRunRequest,
    harness_key: str = Query(..., description="Key of the active harness"),
):
    """
    Run an agent on the codebase with the given prompt.
    """
    if harness_key not in active_harnesses:
        raise HTTPException(status_code=404, detail=f"Harness {harness_key} not found")

    harness = active_harnesses[harness_key]

    try:
        # Update metadata if provided
        if request.metadata:
            harness.metadata.update(request.metadata)

        # Run the agent
        result = harness.run_agent(prompt=request.prompt, model=request.model)
    except Exception as e:
        logger.error(f"Agent run failed: {e}")
        raise HTTPException(status_code=500, detail=str(e)) from e
    else:
        return {
            "harness_key": harness_key,
            "result": result,
        }
```
🛠️ Refactor suggestion
Sanitize error messages in API responses
The current implementation exposes raw exception messages in API responses, which might contain sensitive information.
Sanitize the error messages:
```diff
+def sanitize_error_message(message: str) -> str:
+    """Remove potentially sensitive information from error messages."""
+    # Remove file paths
+    sanitized = re.sub(r'(/[^\s:;]*)+', '[PATH]', str(message))
+    # Remove URLs
+    sanitized = re.sub(r'https?://[^\s]+', '[URL]', sanitized)
+    return sanitized

 @app.post("/agent/run", response_model=dict[str, Any])
 async def run_agent(
     request: AgentRunRequest,
     harness_key: str = Query(..., description="Key of the active harness"),
 ):
     """
     Run an agent on the codebase with the given prompt.
     """
     if harness_key not in active_harnesses:
         raise HTTPException(status_code=404, detail=f"Harness {harness_key} not found")

     harness = active_harnesses[harness_key].access()

     try:
         # Update metadata if provided
         if request.metadata:
             if harness.metadata is None:
                 harness.metadata = {}
             harness.metadata.update(request.metadata)

         # Run the agent
         result = harness.run_agent(prompt=request.prompt, model=request.model)
     except Exception as e:
         logger.error(f"Agent run failed: {e}")
-        raise HTTPException(status_code=500, detail=str(e)) from e
+        raise HTTPException(status_code=500, detail=sanitize_error_message(str(e))) from e
     else:
         return {
             "harness_key": harness_key,
             "result": result,
         }
```

Apply similar sanitization to other error messages in the API responses throughout the file.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents (early access)
In codegen-on-oss/codegen_on_oss/context_server/server.py around lines 227 to 253, the current error handling exposes raw exception messages in API responses, risking sensitive information leakage. Modify the except block to replace the raw exception message with a sanitized, generic error message before raising the HTTPException. Additionally, review other error handling in this file and apply similar sanitization to all API response error messages to ensure no sensitive details are exposed.
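As a quick sanity check of the proposed helper (standalone, copied from the diff above):

```python
import re


def sanitize_error_message(message: str) -> str:
    """Remove potentially sensitive information from error messages."""
    sanitized = re.sub(r'(/[^\s:;]*)+', '[PATH]', str(message))
    sanitized = re.sub(r'https?://[^\s]+', '[URL]', sanitized)
    return sanitized


msg = "Failed to open /home/runner/work/repo/src/server.py: permission denied"
print(sanitize_error_message(msg))
# -> "Failed to open [PATH]: permission denied"
```

One caveat: because the path pattern runs first, the path portion of a URL is rewritten before the URL pattern ever sees it, so a full URL comes out as e.g. `https:[PATH]`; swapping the order of the two substitutions would avoid that.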
BucketStore Implementation Issue

The `BucketStore` class does not yet implement the `get_json`, `put_json`, and `list_keys` methods that `CodebaseContextSnapshot` relies on when saving and loading snapshots. Here's a suggested implementation to add the missing methods:

```python
import json
from typing import Any


def get_json(self, key: str) -> dict[str, Any]:
    """Get a JSON object from the bucket."""
    response = self.s3_client.get_object(Bucket=self.bucket_name, Key=key)
    return json.loads(response['Body'].read().decode('utf-8'))


def put_json(self, key: str, data: dict[str, Any]) -> None:
    """Put a JSON object in the bucket."""
    self.s3_client.put_object(
        Bucket=self.bucket_name,
        Key=key,
        Body=json.dumps(data).encode('utf-8'),
        ContentType='application/json',
    )


def list_keys(self, prefix: str = "") -> list[str]:
    """List all keys in the bucket with the given prefix."""
    response = self.s3_client.list_objects_v2(
        Bucket=self.bucket_name,
        Prefix=prefix,
    )
    return [obj['Key'] for obj in response.get('Contents', [])]
```

This will ensure compatibility with the way `CodebaseContextSnapshot` reads and writes snapshot data.
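As a quick illustration of the round trip (the bucket name, key layout, and `BucketStore` constructor signature here are assumptions made for the example):

```python
store = BucketStore(bucket_name="codegen-snapshots")  # assumed constructor signature

# Write snapshot metadata, read it back, and list the stored keys.
store.put_json(
    "snapshots/abc123/metadata.json",
    {"repo": "owner/repo", "commit": "f4656a2"},
)
print(store.get_json("snapshots/abc123/metadata.json")["repo"])  # -> "owner/repo"
print(store.list_keys(prefix="snapshots/abc123/"))  # -> ["snapshots/abc123/metadata.json"]
```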
Testing Recommendations

I noticed there are no tests for the new functionality added in this PR. Adding tests would help ensure the correctness and robustness of the implementation. Here are some suggestions for tests that could be added (a sketch follows below):

- Unit tests for `CodebaseAnalysisHarness`, covering `from_repo`, `analyze_codebase`, and `run_agent`
- Round-trip tests for `CodebaseContextSnapshot` (`create_snapshot` followed by `load_snapshot`, both locally and against a bucket store)
- API tests for the FastAPI endpoints, including the 404 paths for unknown harness keys and snapshot IDs
These tests would help catch any issues early and ensure the functionality works as expected across different scenarios.
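A minimal sketch of one such endpoint test, assuming the FastAPI `app` object is importable from `codegen_on_oss.context_server.server` (module path inferred from the files in this PR):

```python
from fastapi.testclient import TestClient

from codegen_on_oss.context_server.server import app

client = TestClient(app)


def test_run_agent_unknown_harness_returns_404():
    # No harness has been created yet, so any key should yield a 404.
    response = client.post(
        "/agent/run",
        params={"harness_key": "owner/repo:latest"},
        json={"prompt": "Summarize the codebase"},
    )
    assert response.status_code == 404
```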
Error Handling Improvements

While the PR includes custom exception classes and basic error handling, there are a few areas where error handling could be improved (see the sketch after this list):

- Catch specific exceptions such as `FileNotFoundError` instead of broad `Exception` clauses, so missing snapshots surface as 404s rather than 500s
- Sanitize exception messages before including them in API responses
- Log failures with enough context (harness key, snapshot ID) to make them debuggable
These improvements would make the code more robust and provide better feedback to users when things go wrong.
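One possible shape for this (the exception names below are illustrative, not taken from the PR) is a small hierarchy that the API layer maps onto HTTP status codes:

```python
class HarnessError(Exception):
    """Base class for errors raised by the analysis harness."""


class SnapshotNotFoundError(HarnessError):
    """Raised when a requested snapshot does not exist."""


class AgentRunError(HarnessError):
    """Raised when an agent run fails part-way through."""


def status_for(exc: Exception) -> int:
    """Map a raised exception onto an HTTP status code, defaulting to 500."""
    if isinstance(exc, SnapshotNotFoundError):
        return 404
    return 500
```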
I've completed a thorough review of PR #14 "Integrate harness.py functionality into codegen-on-oss package". The PR is well-structured and effectively integrates the functionality from `harness.py`. I've left several detailed comments on the PR with specific feedback and suggestions for improvement:

- The `BucketStore` methods (`get_json`, `put_json`, `list_keys`) that the snapshot module depends on
- Missing tests for the new modules
- Error handling that is overly broad and exposes raw exception messages
Overall, this is a valuable addition to the package that will enhance its capabilities for codebase analysis and context management. With the suggested improvements, it will be even more robust and user-friendly. Let me know if you'd like me to elaborate on any specific aspect of the review or if you have any questions!
Force-pushed from 27f0eca to f4656a2.
This PR integrates the functionality from `src/codegen/extensions/swebench/harness.py` into the `codegen-on-oss` package to enhance its codebase analysis and context saving capabilities.

Changes Made
1. Created New Modules
- CodebaseAnalysisHarness (`codegen_on_oss/analysis/harness_integration.py`): adapts the core analysis and agent-execution functionality from `harness.py`
- CodebaseContextSnapshot (`codegen_on_oss/snapshot/context_snapshot.py`): creates, saves, and loads codebase context snapshots, locally or in a bucket store
- CodeContextRetrievalServer (`codegen_on_oss/context_server/server.py`): exposes analysis, snapshot, and agent operations over a REST API

2. Updated Package Structure

- Added the new dependencies to `pyproject.toml`
- Updated `__init__.py` files to expose the new modules
- Added a `serve` CLI command to start the CodeContextRetrievalServer

3. Documentation
Key Features
How to Use
You can now use the enhanced `codegen-on-oss` package as a backend CodeContextRetrievalServer by starting the server with the new `serve` CLI command and calling its REST endpoints; a rough client sketch follows.
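For illustration, a client session against a locally running server might look like this (the endpoint paths and payload fields follow the review comments above; the port and harness-key format are assumptions):

```python
import requests

BASE = "http://localhost:8000"  # assumed default bind address

# Analyze a repository; fields follow the RepositoryInfo model reviewed above.
resp = requests.post(
    f"{BASE}/analyze/repository",
    json={"repo_full_name": "owner/repo", "language": "python"},
)
resp.raise_for_status()
print(resp.json())

# Run an agent against the harness created by the analysis step.
resp = requests.post(
    f"{BASE}/agent/run",
    params={"harness_key": "owner/repo:latest"},
    json={"prompt": "Describe the repository structure"},
)
print(resp.json())
```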
Summary by Sourcery
Integrate harness.py functionality into the codegen-on-oss package, enhancing its codebase analysis and context management capabilities
New Features:
Enhancements:
Build:
CI:
Documentation:
Summary by CodeRabbit