Skip to content
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,9 @@ jobs:
require: write
username: ${{ github.triggering_actor }}
error-if-missing: true
# Allow the codegen-sh bot to bypass permission check
allow-bot: true
bot-list: 'codegen-sh[bot]'
Comment on lines 19 to +24
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security concern: Adding bot bypass should be carefully reviewed. Consider adding more granular controls for bot permissions and documenting the security implications of allowing bot access. Also, consider using GitHub environments to manage access control more securely.

Suggested change
require: write
username: ${{ github.triggering_actor }}
error-if-missing: true
# Allow the codegen-sh bot to bypass permission check
allow-bot: true
bot-list: 'codegen-sh[bot]'
allow-bot: true
bot-list: 'codegen-sh[bot]'
# Add security constraints
allowed-actions: ['pull_request']
required-checks: ['unit-tests']

Comment on lines 19 to +24
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Concern: The addition of the bot access bypass ('allow-bot' and 'bot-list') parameters should be carefully reviewed. Ensure that 'codegen-sh[bot]' has the minimum required permissions and that the bot's access is properly audited.

Suggested change
require: write
username: ${{ github.triggering_actor }}
error-if-missing: true
# Allow the codegen-sh bot to bypass permission check
allow-bot: true
bot-list: 'codegen-sh[bot]'
# Add explicit permission restrictions and audit logging
- uses: your-permission-check-action@v1
with:
require: write
username: ${{ github.triggering_actor }}
error-if-missing: true
allow-bot: true
bot-list: 'codegen-sh[bot]'
audit-log: true
allowed-operations: ['push', 'pull_request']
max-files: 100


unit-tests:
needs: access-check
Expand Down
121 changes: 121 additions & 0 deletions codegen-on-oss/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,9 @@ The **Codegen on OSS** package provides a modular pipeline that:
- **Parses repositories** using the codegen tool.
- **Profiles performance** and logs metrics for each parsing run.
- **Logs errors** to help pinpoint parsing failures or performance bottlenecks.
- **Analyzes codebases** with comprehensive metrics and context tracking.
- **Saves and restores codebase state** for later use.
- **Provides a REST API** for accessing all functionality.

______________________________________________________________________

Expand Down Expand Up @@ -335,3 +338,121 @@ codegen_on_oss.parser.ParseRunError: LOW_IMPORT_RESOLUTION_RATE
| Lightning-AI/lightning | codebase_init | 24.256577352999557 | 24.256577352999557 | 211.3604081 | 1535971328 | 966184960 | |
| Lightning-AI/lightning | post_init_validation | 0.137609629000508 | 24.394186982000065 | 211.5082702 | 1536241664 | 270336 | |
| Lightning-AI/lightning | TOTAL | 24.394700584999555 | 24.394700584999555 | 211.5088282 | 1536241664 | 0 | |

## New Features

### Codebase Analysis and Context Management

The package now includes powerful features for comprehensive codebase analysis and context management:

#### CodebaseAnalysisHarness

The `CodebaseAnalysisHarness` class in the `analysis` module provides:

- Comprehensive codebase analysis
- File structure tracking
- Diff generation and file tracking
- Integration with the core functionality from `harness.py`

```python
from codegen_on_oss.analysis.harness_integration import CodebaseAnalysisHarness

# Create a harness from a repository
harness = CodebaseAnalysisHarness.from_repo("owner/repo")

# Analyze the codebase
results = harness.analyze_codebase()

# Get a diff against a specific commit
diff = harness.diff_versus_commit("abc123")

# Extract modified files from a patch
files = harness.files_in_patch(diff)
```

#### CodebaseContextSnapshot

The `CodebaseContextSnapshot` class in the `snapshot` module allows:

- Saving and restoring codebase state
- Integration with S3-compatible storage via BucketStore
- Preserving analysis results and context

```python
from codegen_on_oss.snapshot.context_snapshot import CodebaseContextSnapshot
from codegen_on_oss.bucket_store import BucketStore

# Create a bucket store for S3 integration
bucket_store = BucketStore(
bucket_name="my-bucket",
endpoint_url="https://s3.amazonaws.com",
)

# Create a snapshot from a harness
snapshot = CodebaseContextSnapshot(harness, bucket_store)
snapshot_id = snapshot.create_snapshot()

# Load a snapshot later
loaded_snapshot = CodebaseContextSnapshot.load_snapshot(
snapshot_id,
bucket_store=bucket_store,
)
```

### Code Context Retrieval Server

The package now includes a FastAPI server that provides endpoints for analysis, context management, and agent execution:

```bash
# Start the server
cgparse serve --host 0.0.0.0 --port 8000
```

The server provides the following endpoints:

- `/analyze` - Analyze a codebase and return the results
- `/snapshot/create` - Create a snapshot of a codebase
- `/snapshot/list` - List available snapshots
- `/snapshot/load/{snapshot_id}` - Load a snapshot by ID
- `/agent/execute` - Execute an agent with the given context

Example API usage:

```python
import requests

# Analyze a codebase
response = requests.post(
"http://localhost:8000/analyze",
json={
"repository": {
"repo_full_name": "owner/repo",
"language": "python",
},
},
)
results = response.json()

# Create a snapshot
response = requests.post(
"http://localhost:8000/snapshot/create",
json={
"repository": {
"repo_full_name": "owner/repo",
"language": "python",
},
"tags": ["production", "v1.0"],
},
)
snapshot_id = response.json()["snapshot_id"]

# Execute an agent with context
response = requests.post(
"http://localhost:8000/agent/execute",
json={
"snapshot_id": snapshot_id,
"prompt": "Fix the bug in the login component",
},
)
agent_results = response.json()
```
164 changes: 164 additions & 0 deletions codegen-on-oss/codegen_on_oss/analysis/harness_integration.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,164 @@
"""
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be beneficial to add unit tests for the CodebaseAnalysisHarness class to ensure that the analysis functionality works as expected. Consider adding tests for methods like analyze_codebase, diff_versus_commit, and files_in_patch.

CodebaseAnalysisHarness - Integration of the harness.py functionality from swebench.
This module provides comprehensive codebase analysis capabilities by integrating
the core functionality from the swebench harness.py module.
"""

import json
import subprocess
from pathlib import Path
from typing import Dict, List, Optional, Set, Union
Comment on lines +8 to +11
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Remove unused imports and update type annotations

The subprocess module is imported but not used, and several typing imports are deprecated in modern Python.

import json
-import subprocess
from pathlib import Path
-from typing import Dict, List, Optional, Set, Union
+from typing import Optional, Union

For modern type annotations, consider using Python 3.10+ syntax:

-from typing import Dict, List, Optional, Set, Union
+from typing import Optional, Union
# Then use built-in types like dict, list instead of Dict, List
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import json
import subprocess
from pathlib import Path
from typing import Dict, List, Optional, Set, Union
import json
from pathlib import Path
from typing import Optional, Union
🧰 Tools
🪛 Ruff (0.8.2)

9-9: subprocess imported but unused

Remove unused import: subprocess

(F401)


11-11: typing.Dict is deprecated, use dict instead

(UP035)


11-11: typing.List is deprecated, use list instead

(UP035)


11-11: typing.Set is deprecated, use set instead

(UP035)


11-11: typing.Set imported but unused

Remove unused import: typing.Set

(F401)

🤖 Prompt for AI Agents (early access)
In codegen-on-oss/codegen_on_oss/analysis/harness_integration.py around lines 8 to 11, remove the unused import of the subprocess module and replace deprecated typing imports with modern Python 3.10+ type annotation syntax. Update the type annotations accordingly to use built-in generic types like list, dict, set, and union instead of importing them from typing.


from loguru import logger

from codegen import Codebase
from codegen.configs.models.codebase import CodebaseConfig


class CodebaseAnalysisHarness:
"""
A harness for comprehensive codebase analysis, integrating functionality
from the swebench harness.py module.
"""

def __init__(
self,
codebase: Codebase,
metadata: Optional[Dict] = None,
tags: Optional[List[str]] = None,
):
"""
Initialize the CodebaseAnalysisHarness with a codebase.
Args:
codebase: The Codebase object to analyze
metadata: Optional metadata to associate with the analysis
tags: Optional tags to categorize the analysis
"""
self.codebase = codebase
self.metadata = metadata or {}
self.tags = tags or []
self.analysis_results = {}

Check failure on line 42 in codegen-on-oss/codegen_on_oss/analysis/harness_integration.py

View workflow job for this annotation

GitHub Actions / mypy

error: Need type annotation for "analysis_results" (hint: "analysis_results: dict[<type>, <type>] = ...") [var-annotated]

Comment on lines +35 to +43
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class lacks proper input validation for repository names and potential rate limiting handling. Consider adding validation for repository names and implement retry logic for network operations.

Suggested change
codebase: The Codebase object to analyze
metadata: Optional metadata to associate with the analysis
tags: Optional tags to categorize the analysis
"""
self.codebase = codebase
self.metadata = metadata or {}
self.tags = tags or []
self.analysis_results = {}
def __init__(
self,
codebase: Codebase,
metadata: Optional[Dict] = None,
tags: Optional[List[str]] = None,
max_retries: int = 3,
retry_delay: float = 1.0,
):
if not codebase.repo_name or '/' not in codebase.repo_name:
raise ValueError("Invalid repository name format. Expected 'owner/repo'")
self.codebase = codebase
self.metadata = metadata or {}
self.tags = tags or []
self.max_retries = max_retries
self.retry_delay = retry_delay
self.analysis_results = {}

@classmethod
def from_repo(
cls,
repo_full_name: str,
commit: Optional[str] = None,
language: str = "python",
disable_file_parse: bool = False,
) -> "CodebaseAnalysisHarness":
"""
Create a CodebaseAnalysisHarness from a repository.
Args:
repo_full_name: The full name of the repository (e.g., "owner/repo")
commit: Optional commit hash to checkout
language: The primary language of the codebase
disable_file_parse: Whether to disable file parsing
Returns:
A new CodebaseAnalysisHarness instance
"""
config = CodebaseConfig(
disable_file_parse=disable_file_parse,
)
codebase = Codebase.from_repo(
repo_full_name=repo_full_name,
commit=commit,
language=language,

Check failure on line 70 in codegen-on-oss/codegen_on_oss/analysis/harness_integration.py

View workflow job for this annotation

GitHub Actions / mypy

error: Argument "language" to "from_repo" of "Codebase" has incompatible type "str"; expected "Literal['python', 'typescript'] | ProgrammingLanguage | None" [arg-type]
config=config,
)
return cls(codebase=codebase)

def analyze_codebase(self) -> Dict:
"""
Perform comprehensive analysis of the codebase.
Returns:
A dictionary containing analysis results
"""
logger.info(f"Analyzing codebase: {self.codebase.repo_name}")

Check failure on line 82 in codegen-on-oss/codegen_on_oss/analysis/harness_integration.py

View workflow job for this annotation

GitHub Actions / mypy

error: "Codebase[SourceFile[Any, Any, Any, Any, Any, Any], Directory[Any, Any, Any, Any, Any, Any, Any], Symbol[Any, Any], Class[Function[Decorator[Any, Any, Parameter[Any, Any]], CodeBlock[Any, Any], Parameter[Any, Any], Type[Any]], Decorator[Any, Any, Parameter[Any, Any]], CodeBlock[Any, Any], Parameter[Any, Any], Type[Any]], Function[Decorator[Any, Any, Parameter[Any, Any]], CodeBlock[Any, Any], Parameter[Any, Any], Type[Any]], Import[Any], Assignment[Any], Interface[Any, Any, Any, Any], TypeAlias[Any, Any], Parameter[Any, Any], CodeBlock[Any, Any]]" has no attribute "repo_name" [attr-defined]

# Collect basic codebase statistics
stats = {
"repo_name": self.codebase.repo_name,

Check failure on line 86 in codegen-on-oss/codegen_on_oss/analysis/harness_integration.py

View workflow job for this annotation

GitHub Actions / mypy

error: "Codebase[SourceFile[Any, Any, Any, Any, Any, Any], Directory[Any, Any, Any, Any, Any, Any, Any], Symbol[Any, Any], Class[Function[Decorator[Any, Any, Parameter[Any, Any]], CodeBlock[Any, Any], Parameter[Any, Any], Type[Any]], Decorator[Any, Any, Parameter[Any, Any]], CodeBlock[Any, Any], Parameter[Any, Any], Type[Any]], Function[Decorator[Any, Any, Parameter[Any, Any]], CodeBlock[Any, Any], Parameter[Any, Any], Type[Any]], Import[Any], Assignment[Any], Interface[Any, Any, Any, Any], TypeAlias[Any, Any], Parameter[Any, Any], CodeBlock[Any, Any]]" has no attribute "repo_name" [attr-defined]
"language": self.codebase.language,
"file_count": len(self.codebase.files),

Check failure on line 88 in codegen-on-oss/codegen_on_oss/analysis/harness_integration.py

View workflow job for this annotation

GitHub Actions / mypy

error: Argument 1 to "len" has incompatible type overloaded function; expected "Sized" [arg-type]
"metadata": self.metadata,
"tags": self.tags,
}

# Get file structure
file_structure = self._get_file_structure()
stats["file_structure"] = file_structure

# Store the results
self.analysis_results = stats
return stats

def _get_file_structure(self) -> Dict:
"""
Get the file structure of the codebase.
Returns:
A dictionary representing the file structure
"""
structure = {}
for file_path in self.codebase.files:
parts = file_path.split("/")
current = structure
for i, part in enumerate(parts):
if i == len(parts) - 1: # This is a file
current.setdefault("files", []).append(part)
else: # This is a directory
Comment on lines +97 to +115
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class lacks proper input validation and error handling. Consider adding parameter validation and more specific error types. The _get_file_structure method could be optimized to handle large codebases better.

Suggested change
# Store the results
self.analysis_results = stats
return stats
def _get_file_structure(self) -> Dict:
"""
Get the file structure of the codebase.
Returns:
A dictionary representing the file structure
"""
structure = {}
for file_path in self.codebase.files:
parts = file_path.split("/")
current = structure
for i, part in enumerate(parts):
if i == len(parts) - 1: # This is a file
current.setdefault("files", []).append(part)
else: # This is a directory
def _get_file_structure(self) -> Dict:
if not self.codebase.files:
raise ValueError('No files found in codebase')
structure = {}
for file_path in self.codebase.files:
try:
parts = file_path.split('/')
current = structure
for i, part in enumerate(parts[:-1]):
current = current.setdefault('dirs', {}).setdefault(part, {})
current.setdefault('files', []).append(parts[-1])
except Exception as e:
logger.error(f'Error processing file {file_path}: {str(e)}')
return structure

current.setdefault("dirs", {}).setdefault(part, {})
current = current["dirs"][part]
return structure

def diff_versus_commit(self, commit: str) -> str:
"""
Take a diff of current contents versus the specified commit.
Args:
commit: The commit hash to diff against
Returns:
The diff output as a string
"""
return self.codebase.get_diff(base=commit)

def files_in_patch(self, patch: str) -> List[str]:
"""
Extract the list of modified files from a unified diff patch string.
Args:
patch: The unified diff patch string
Returns:
A list of modified file paths
"""
files = []
for line in patch.split("\n"):
if line.startswith("--- a/") or line.startswith("+++ b/"):
fname = line.split("/", 1)[1]
if fname not in files:
files.append(fname)
return files

def save_analysis_results(self, output_path: Union[str, Path]) -> None:
"""
Save the analysis results to a JSON file.
Args:
output_path: The path to save the results to
"""
output_path = Path(output_path)
output_path.parent.mkdir(parents=True, exist_ok=True)

with open(output_path, "w") as f:
json.dump(self.analysis_results, f, indent=2)

logger.info(f"Analysis results saved to {output_path}")

41 changes: 41 additions & 0 deletions codegen-on-oss/codegen_on_oss/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -124,5 +124,46 @@ def run(
parser.parse(repo_url, commit_hash)


@cli.command()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Potential logger duplication in the serve command.

Each invocation of the serve command adds a new logger sink, causing duplicate log output when run repeatedly (e.g., in tests). Move logger setup out of the command or check for existing sinks before adding.

Suggested implementation:

# Global logger configuration
from loguru import logger
if not logger._core.handlers:
    logger.add("codegen_on_oss.log", rotation="1 MB")
@cli.command()
@click.option(
    "--host",
    type=str,
    default="0.0.0.0",
    help="Host to bind the server to",
)
@click.option(
    "--port",
    type=int,
    default=8000,
    help="Port to bind the server to",
)
def serve(host, port):
    # Removed duplicate logger sink addition, as global configuration is already set up.
    ...

Make sure that any other logger.add() calls related to the serve command are removed or similarly guarded elsewhere in the file. Adjust the file path and log sink configuration according to your project's logging conventions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLI command serve is a great addition, but it would be helpful to add a test for it in the test_cli.py file to ensure it works correctly.

@click.option(
"--host",
type=str,
default="0.0.0.0",
help="Host to bind the server to",
)
@click.option(
"--port",
type=int,
default=8000,
help="Port to bind the server to",
)
@click.option(
"--debug",
is_flag=True,
help="Debug mode",
)
def serve(
host: str = "0.0.0.0",
port: int = 8000,
debug: bool = False,
):

This comment was marked as resolved.

"""
Start the Code Context Retrieval Server.

This server provides endpoints for codebase analysis, context management,
and agent execution.
"""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing debug parameter impact in docstring category Documentation

Tell me more
What is the issue?

The serve function's docstring doesn't explain the important impact of the debug parameter on logging levels.

Why this matters

Users won't understand how the debug flag affects server behavior and logging, which could impact troubleshooting.

Suggested change ∙ Feature Preview
"""
Start the Code Context Retrieval Server.

This server provides endpoints for codebase analysis, context management,
and agent execution. When debug mode is enabled, logging level is set to DEBUG
for detailed output.
"""
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

logger.add(
sys.stdout,
format="{time: HH:mm:ss} {level} {message}",
level="DEBUG" if debug else "INFO",
)

from codegen_on_oss.context_server import start_server
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Late Import of Critical Server Component category Functionality

Tell me more
What is the issue?

The import statement for start_server is placed inside the serve function instead of at the module level with other imports.

Why this matters

Late imports can mask import errors until runtime and cause the server to fail unexpectedly when the serve command is invoked.

Suggested change ∙ Feature Preview

Move the import statement to the module level with other imports at the top of the file:

import sys
from pathlib import Path

import click
from loguru import logger
from codegen_on_oss.context_server import start_server

# ... rest of the code ...
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.


logger.info(f"Starting Code Context Retrieval Server on {host}:{port}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLI implementation should include better error handling and progress feedback. Consider adding a progress bar for long-running operations and more detailed error messages. Also, add command validation and help text.

Suggested change
@cli.command()
@click.option(
"--host",
type=str,
default="0.0.0.0",
help="Host to bind the server to",
)
@click.option(
"--port",
type=int,
default=8000,
help="Port to bind the server to",
)
@click.option(
"--debug",
is_flag=True,
help="Debug mode",
)
def serve(
host: str = "0.0.0.0",
port: int = 8000,
debug: bool = False,
):
"""
Start the Code Context Retrieval Server.
This server provides endpoints for codebase analysis, context management,
and agent execution.
"""
logger.add(
sys.stdout,
format="{time: HH:mm:ss} {level} {message}",
level="DEBUG" if debug else "INFO",
)
from codegen_on_oss.context_server import start_server
logger.info(f"Starting Code Context Retrieval Server on {host}:{port}")
from tqdm import tqdm
from typing import Tuple
def validate_server_params(host: str, port: int) -> Tuple[bool, str]:
import socket
try:
socket.inet_aton(host)
if not (0 <= port <= 65535):
return False, 'Port must be between 0 and 65535'
return True, ''
except:
return False, 'Invalid host address'
@cli.command(help='Start the Code Context Retrieval Server')
@click.option('--host', type=str, default='0.0.0.0',
help='Host address to bind the server to (default: 0.0.0.0)')
@click.option('--port', type=int, default=8000,
help='Port number to bind the server to (default: 8000)')
def serve(host: str, port: int):
valid, error = validate_server_params(host, port)
if not valid:
raise click.UsageError(error)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete Server Configuration Logging category Logging

Tell me more
What is the issue?

Server startup log message doesn't include key configuration information about debug mode

Why this matters

Missing debug mode status in logs makes it harder to verify the server's running configuration during deployment and debugging.

Suggested change ∙ Feature Preview
logger.info(f"Starting Code Context Retrieval Server on {host}:{port} (debug={debug})")
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

start_server(host=host, port=port)


if __name__ == "__main__":
cli()
9 changes: 9 additions & 0 deletions codegen-on-oss/codegen_on_oss/context_server/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
"""Context server module for code context retrieval."""

from codegen_on_oss.context_server.server import (
app,
start_server,
)

__all__ = ["app", "start_server"]

Loading
Loading