diff --git a/codegen-on-oss/README.md b/codegen-on-oss/README.md index a7700eb77..104605136 100644 --- a/codegen-on-oss/README.md +++ b/codegen-on-oss/README.md @@ -1,337 +1,146 @@ -# Overview +# Codegen On OSS -The **Codegen on OSS** package provides a modular pipeline that: +This package provides a set of tools for analyzing and visualizing codebases. It's designed to help developers understand code structure, identify issues, and optimize their codebase. -- **Collects repository URLs** from different sources (e.g., CSV files or GitHub searches). -- **Parses repositories** using the codegen tool. -- **Profiles performance** and logs metrics for each parsing run. -- **Logs errors** to help pinpoint parsing failures or performance bottlenecks. +## Features -______________________________________________________________________ +- **Code Analysis**: Analyze code structure, dependencies, and relationships +- **Call Graph Visualization**: Visualize function call graphs and paths +- **Dead Code Detection**: Identify unused code in your codebase +- **Diff Analysis**: Analyze code changes and their impact +- **Issue Detection**: Find potential issues in your code -## Package Structure +## Installation -The package is composed of several modules: - -- `sources` - - - Defines the Repository source classes and settings. Settings are all configurable via environment variables - - - Github Source - - ```python - class GithubSettings(SourceSettings): - language: Literal["python", "typescript"] = "python" - heuristic: Literal[ - "stars", - "forks", - "updated", - # "watchers", - # "contributors", - # "commit_activity", - # "issues", - # "dependency", - ] = "stars" - github_token: str | None = None - ``` - - - The three options available now are the three supported by the Github API. - - Future Work Additional options will require different strategies - - - CSV Source - - - Simply reads repo URLs from CSV - -- `cache` - - - Currently only specifies the cache directory. 
It is used for caching git repositories pulled by the pipeline `--force-pull` can be used to re-pull from the remote. - -- `cli` - - - Built with Click, the CLI provides two main commands: - - `run-one`: Parses a single repository specified by URL. - - `run`: Iterates over repositories obtained from a selected source and parses each one. - -- **`metrics`** - - - Provides profiling tools to measure performance during the parse: - - `MetricsProfiler`: A context manager that creates a profiling session. - - `MetricsProfile`: Represents a "span" or a "run" of a specific repository. Records step-by-step metrics (clock duration, CPU time, memory usage) and writes them to a CSV file specified by `--output-path` - -- **`parser`** - - Contains the `CodegenParser` class that orchestrates the parsing process: - - - Clones the repository (or forces a pull if specified). - - Initializes a `Codebase` (from the codegen tool). - - Runs post-initialization validation. - - Integrates with the `MetricsProfiler` to log measurements at key steps. - -______________________________________________________________________ - -## Getting Started - -1. **Configure the Repository Source** +```bash +pip install codegen-on-oss +``` - Decide whether you want to read from a CSV file or query GitHub: +## Usage - - For CSV, ensure that your CSV file (default: `input.csv`) exists and contains repository URLs in its first column \[`repo_url`\] and commit hash \[`commit_hash`\] (or empty) in the second column. - - For GitHub, configure your desired settings (e.g., `language`, `heuristic`, and optionally a GitHub token) via environment variables (`GITHUB_` prefix) +### Codebase Analysis -1. 
**Run the Parser** +```python +from codegen_on_oss.analyzers.context.codebase import Codebase - Use the CLI to start parsing: +# Create a codebase object +codebase = Codebase("/path/to/your/codebase") - - To parse one repository: +# Get all functions in the codebase +functions = codebase.functions - ```bash - uv run cgparse run-one --help - ``` +# Get all classes in the codebase +classes = codebase.classes - - To parse multiple repositories from a source: +# Get a specific function +main_function = codebase.get_function("main") - ```bash - uv run cgparse run --help - ``` +# Get a specific class +user_class = codebase.get_class("User") +``` -1. **Review Metrics and Logs** +### Call Graph Visualization - After parsing, check the CSV (default: `metrics.csv` ) to review performance measurements per repository. Error logs are written to the specified error output file (default: `errors.log`) +```python +from codegen_on_oss.visualizers import CallGraphFromNode +import matplotlib.pyplot as plt +import networkx as nx -______________________________________________________________________ +# Create a call graph visualizer +visualizer = CallGraphFromNode(function_name="main", max_depth=3) -## Running on Modal +# Generate the graph +G = visualizer.visualize(codebase) -```shell -$ uv run modal run modal_run.py +# Visualize the graph +plt.figure(figsize=(12, 8)) +pos = nx.spring_layout(G) +nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=1500, font_size=10) +plt.title("Call Graph from main") +plt.show() ``` -Codegen runs this parser on modal using the CSV source file `input.csv` tracked in this repository. +### Dead Code Detection -### Modal Configuration +```python +from codegen_on_oss.visualizers import DeadCodeVisualizer +import matplotlib.pyplot as plt +import networkx as nx -- **Compute Resources**: Allocates 4 CPUs and 16GB of memory. -- **Secrets & Volumes**: Uses secrets (for bucket credentials) and mounts a volume for caching repositories. 
-- **Image Setup**: Builds on a Debian slim image with Python 3.12, installs required packages (`uv` and `git` ) -- **Environment Configuration**: Environment variables (e.g., GitHub settings) are injected at runtime. +# Create a dead code visualizer +visualizer = DeadCodeVisualizer(exclude_test_files=True, exclude_decorated=True) -The function `parse_repo_on_modal` performs the following steps: +# Generate the graph +G = visualizer.visualize(codebase) -1. **Environment Setup**: Updates environment variables and configures logging using Loguru. -1. **Source Initialization**: Creates a repository source based on the provided type (e.g., GitHub). -1. **Metrics Profiling**: Instantiates `MetricsProfiler` to capture and log performance data. -1. **Repository Parsing**: Iterates over repository URLs and parses each using the `CodegenParser`. -1. **Error Handling**: Logs any exceptions encountered during parsing. -1. **Result Upload**: Uses the `BucketStore` class to upload the configuration, logs, and metrics to an S3 bucket. - -### Bucket Storage - -**Bucket (public):** [codegen-oss-parse](https://s3.amazonaws.com/codegen-oss-parse/) +# Visualize the graph +plt.figure(figsize=(15, 10)) +pos = nx.spring_layout(G) +nx.draw(G, pos, with_labels=True, node_color="red", node_size=1500, font_size=10) +plt.title("Dead Code Visualization") +plt.show() +``` -The results of each run are saved under the version of `codegen` lib that the run installed and the source type it was run with. 
Within this prefix: +### Filtered Call Graph + +```python +from codegen_on_oss.visualizers import CallGraphFilter +import matplotlib.pyplot as plt +import networkx as nx + +# Create a filtered call graph visualizer +visualizer = CallGraphFilter( + function_name="process_request", + class_name="ApiHandler", + method_names=["get", "post", "put", "delete"], + max_depth=3 +) + +# Generate the graph +G = visualizer.visualize(codebase) + +# Visualize the graph +plt.figure(figsize=(12, 8)) +pos = nx.spring_layout(G) +nx.draw(G, pos, with_labels=True, node_size=1500, font_size=10) +plt.title("API Endpoints Call Graph") +plt.show() +``` -- Source Settings - - `https://s3.amazonaws.com/codegen-oss-parse/{version}/{source}/config.json` -- Metrics - - `https://s3.amazonaws.com/codegen-oss-parse/{version}/{source}/metrics.csv` -- Logs - - `https://s3.amazonaws.com/codegen-oss-parse/{version}/{source}/output.logs` +### Call Paths Between Functions + +```python +from codegen_on_oss.visualizers import CallPathsBetweenNodes +import matplotlib.pyplot as plt +import networkx as nx + +# Create a call paths visualizer +visualizer = CallPathsBetweenNodes( + start_function_name="start_process", + end_function_name="end_process", + max_depth=5 +) + +# Generate the graph +G = visualizer.visualize(codebase) + +# Visualize the graph +plt.figure(figsize=(12, 8)) +pos = nx.spring_layout(G) +nx.draw(G, pos, with_labels=True, node_color="lightgreen", node_size=1500, font_size=10) +plt.title("Call Paths Visualization") +plt.show() +``` -______________________________________________________________________ +## Advanced Usage -### Running it yourself +See the `examples` directory for more detailed examples of how to use the package. -You can also run `modal_run.py` yourself. It is designed to be run via Modal for cloud-based parsing. It offers additional configuration methods: +## Contributing -```shell -$ uv run modal run modal_run.py -``` +Contributions are welcome! 
Please feel free to submit a Pull Request. -- **CSV and Repository Volumes:** - The script defines two Modal volumes: - - - `codegen-oss-input-volume`: For uploading and reloading CSV inputs. - - `codegen-oss-repo-volume`: For caching repository data during parsing. - The repository and input volume names are configurable via environment variables (`CODEGEN_MODAL_REPO_VOLUME` and `CODEGEN_MODAL_INPUT_VOLUME`). - -- **Secrets Handling:** - The script loads various credentials via Modal secrets. It first checks for a pre-configured Modal secret (`codegen-oss-bucket-credentials` configurable via environment variable `CODEGEN_MODAL_SECRET_NAME`) and falls back to dynamically created Modal secret from local `.env` or environment variables if not found. - -- **Entrypoint Parameters:** - The main function supports multiple source types: - - - **csv:** Uploads a CSV file (`--csv-file input.csv`) for batch processing. - - **single:** Parses a single repository specified by its URL (`--single-url "https://github.com/codegen-sh/codegen-sdk.git"`) and an optional commit hash (`--single-commit ...`) - - **github:** Uses GitHub settings, language (`--github-language python`) and heuristic (`--github-heuristic stars`) to query for top repositories. - -- **Result Storage:** - Upon completion, logs and metrics are automatically uploaded to the S3 bucket specified by the environment variable `BUCKET_NAME` (default: `codegen-oss-parse`). This allows for centralized storage and easy retrieval of run outputs. The AWS Credentials provided in the secret are used for this operation. - -______________________________________________________________________ - -## Extensibility - -**Adding New Sources:** - -You can define additional repository sources by subclassing `RepoSource` and providing a corresponding settings class. Make sure to set the `source_type` and register your new source by following the pattern established in `CSVInputSource` or `GithubSource`. 
- -**Improving Testing:** - -The detailed metrics collected can help you understand where parsing failures occur or where performance lags. Use these insights to improve error handling and optimize the codegen parsing logic. - -**Containerization and Automation:** - -There is a Dockerfile that can be used to create an image capable of running the parse tests. Runtime environment variables can be used to configure the run and output. - -**Input & Configuration** - -Explore a better CLI for providing options to the Modal run. - -______________________________________________________________________ - -## Example Log Output - -```shell -[codegen-on-oss*] codegen/codegen-on-oss/$ uv run cgparse run --source csv - 21:32:36 INFO Cloning repository https://github.com/JohnSnowLabs/spark-nlp.git - 21:36:57 INFO { - "profile_name": "https://github.com/JohnSnowLabs/spark-nlp.git", - "step": "codebase_init", - "delta_time": 7.186550649999845, - "cumulative_time": 7.186550649999845, - "cpu_time": 180.3553702, - "memory_usage": 567525376, - "memory_delta": 317095936, - "error": null -} - 21:36:58 INFO { - "profile_name": "https://github.com/JohnSnowLabs/spark-nlp.git", - "step": "post_init_validation", - "delta_time": 0.5465090990001045, - "cumulative_time": 7.733059748999949, - "cpu_time": 180.9174761, - "memory_usage": 569249792, - "memory_delta": 1724416, - "error": null -} - 21:36:58 ERROR Repository: https://github.com/JohnSnowLabs/spark-nlp.git -Traceback (most recent call last): - - File "/home/codegen/codegen/codegen-on-oss/.venv/bin/cgparse", line 10, in - sys.exit(cli()) - │ │ └ - │ └ - └ - File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1161, in __call__ - return self.main(*args, **kwargs) - │ │ │ └ {} - │ │ └ () - │ └ - └ - File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1082, in main - rv = self.invoke(ctx) - │ │ └ - │ └ - └ - File 
"/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1697, in invoke - return _process_result(sub_ctx.command.invoke(sub_ctx)) - │ │ │ │ └ - │ │ │ └ - │ │ └ - │ └ - └ ._process_result at 0x7f466597fb00> - File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 1443, in invoke - return ctx.invoke(self.callback, **ctx.params) - │ │ │ │ │ └ {'source': 'csv', 'output_path': 'metrics.csv', 'error_output_path': 'errors.log', 'cache_dir': PosixPath('/home/.cache... - │ │ │ │ └ - │ │ │ └ - │ │ └ - │ └ - └ - File "/home/codegen/codegen/codegen-on-oss/.venv/lib/python3.12/site-packages/click/core.py", line 788, in invoke - return __callback(*args, **kwargs) - │ └ {'source': 'csv', 'output_path': 'metrics.csv', 'error_output_path': 'errors.log', 'cache_dir': PosixPath('/home/.cache... - └ () - - File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/cli.py", line 121, in run - parser.parse(repo_url) - │ │ └ 'https://github.com/JohnSnowLabs/spark-nlp.git' - │ └ - └ - - File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/parser.py", line 52, in parse - with self.metrics_profiler.start_profiler( - │ │ └ - │ └ - └ - - File "/home/.local/share/uv/python/cpython-3.12.6-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 158, in __exit__ - self.gen.throw(value) - │ │ │ └ ParseRunError() - │ │ └ - │ └ - └ - -> File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/metrics.py", line 41, in start_profiler - yield profile - └ - - File "/home/codegen/codegen/codegen-on-oss/codegen_on_oss/parser.py", line 64, in parse - raise ParseRunError(validation_status) - │ └ - └ - -codegen_on_oss.parser.ParseRunError: LOW_IMPORT_RESOLUTION_RATE - 21:36:58 INFO { - "profile_name": "https://github.com/JohnSnowLabs/spark-nlp.git", - "step": "TOTAL", - "delta_time": 7.740976418000173, - "cumulative_time": 7.740976418000173, - "cpu_time": 180.9221699, - "memory_usage": 569249792, - "memory_delta": 0, - 
"error": "LOW_IMPORT_RESOLUTION_RATE" -} - 21:36:58 INFO Cloning repository https://github.com/Lightning-AI/lightning.git - 21:37:53 INFO { - "profile_name": "https://github.com/Lightning-AI/lightning.git", - "step": "codebase_init", - "delta_time": 24.256577352999557, - "cumulative_time": 24.256577352999557, - "cpu_time": 211.3604081, - "memory_usage": 1535971328, - "memory_delta": 966184960, - "error": null -} - 21:37:53 INFO { - "profile_name": "https://github.com/Lightning-AI/lightning.git", - "step": "post_init_validation", - "delta_time": 0.137609629000508, - "cumulative_time": 24.394186982000065, - "cpu_time": 211.5082702, - "memory_usage": 1536241664, - "memory_delta": 270336, - "error": null -} - 21:37:53 INFO { - "profile_name": "https://github.com/Lightning-AI/lightning.git", - "step": "TOTAL", - "delta_time": 24.394700584999555, - "cumulative_time": 24.394700584999555, - "cpu_time": 211.5088282, - "memory_usage": 1536241664, - "memory_delta": 0, - "error": null -} -``` +## License -## Example Metrics Output +This project is licensed under the MIT License - see the LICENSE file for details. 
-| profile_name | step | delta_time | cumulative_time | cpu_time | memory_usage | memory_delta | error | -| ---------------------- | -------------------- | ------------------ | ------------------ | ----------- | ------------ | ------------ | -------------------------- | -| JohnSnowLabs/spark-nlp | codebase_init | 7.186550649999845 | 7.186550649999845 | 180.3553702 | 567525376 | 317095936 | | -| JohnSnowLabs/spark-nlp | post_init_validation | 0.5465090990001045 | 7.733059748999949 | 180.9174761 | 569249792 | 1724416 | | -| JohnSnowLabs/spark-nlp | TOTAL | 7.740976418000173 | 7.740976418000173 | 180.9221699 | 569249792 | 0 | LOW_IMPORT_RESOLUTION_RATE | -| Lightning-AI/lightning | codebase_init | 24.256577352999557 | 24.256577352999557 | 211.3604081 | 1535971328 | 966184960 | | -| Lightning-AI/lightning | post_init_validation | 0.137609629000508 | 24.394186982000065 | 211.5082702 | 1536241664 | 270336 | | -| Lightning-AI/lightning | TOTAL | 24.394700584999555 | 24.394700584999555 | 211.5088282 | 1536241664 | 0 | | diff --git a/codegen-on-oss/codegen_on_oss/analyzers/context/graph/__init__.py b/codegen-on-oss/codegen_on_oss/analyzers/context/graph/__init__.py index 979afe76f..ff41860a4 100644 --- a/codegen-on-oss/codegen_on_oss/analyzers/context/graph/__init__.py +++ b/codegen-on-oss/codegen_on_oss/analyzers/context/graph/__init__.py @@ -21,7 +21,7 @@ def build_dependency_graph(edges: list[dict[str, Any]]) -> nx.DiGraph: Returns: NetworkX DiGraph representing the dependencies """ - graph = nx.DiGraph() + graph: nx.DiGraph = nx.DiGraph() for edge in edges: source = edge.get("source") diff --git a/codegen-on-oss/codegen_on_oss/analyzers/diff_lite.py b/codegen-on-oss/codegen_on_oss/analyzers/diff_lite.py index 934b68d70..80d8d9024 100644 --- a/codegen-on-oss/codegen_on_oss/analyzers/diff_lite.py +++ b/codegen-on-oss/codegen_on_oss/analyzers/diff_lite.py @@ -124,9 +124,12 @@ def from_git_diff(cls, git_diff: Diff) -> Self: if git_diff.a_blob: old = 
git_diff.a_blob.data_stream.read() + # Ensure path is never None + path = Path(git_diff.a_path) if git_diff.a_path else Path("") + return cls( change_type=ChangeType.from_git_change_type(git_diff.change_type), - path=Path(git_diff.a_path) if git_diff.a_path else None, + path=path, rename_from=Path(git_diff.rename_from) if git_diff.rename_from else None, rename_to=Path(git_diff.rename_to) if git_diff.rename_to else None, old_content=old, diff --git a/codegen-on-oss/codegen_on_oss/analyzers/issues.py b/codegen-on-oss/codegen_on_oss/analyzers/issues.py index c20ddc3ea..f6ff6d9b4 100644 --- a/codegen-on-oss/codegen_on_oss/analyzers/issues.py +++ b/codegen-on-oss/codegen_on_oss/analyzers/issues.py @@ -11,7 +11,7 @@ from collections.abc import Callable from dataclasses import asdict, dataclass, field from enum import Enum -from typing import Any +from typing import Any, Dict, List, Optional # Configure logging logging.basicConfig( @@ -189,10 +189,10 @@ def to_dict(self) -> dict[str, Any]: result["suggestion"] = self.suggestion if self.related_symbols: - result["related_symbols"] = self.related_symbols + result["related_symbols"] = self.related_symbols # type: ignore if self.related_locations: - result["related_locations"] = [ + result["related_locations"] = [ # type: ignore loc.to_dict() for loc in self.related_locations ] @@ -242,7 +242,7 @@ def __init__(self, issues: list[Issue] | None = None): issues: Initial list of issues """ self.issues = issues or [] - self._filters = [] + self._filters: list[tuple[Callable[[Issue], bool], str]] = [] def add_issue(self, issue: Issue): """ @@ -333,7 +333,7 @@ def group_by_severity(self) -> dict[IssueSeverity, list[Issue]]: Returns: Dictionary mapping severities to lists of issues """ - result = {severity: [] for severity in IssueSeverity} + result: dict[IssueSeverity, list[Issue]] = {severity: [] for severity in IssueSeverity} for issue in self.issues: result[issue.severity].append(issue) @@ -347,7 +347,7 @@ def 
group_by_category(self) -> dict[IssueCategory, list[Issue]]: Returns: Dictionary mapping categories to lists of issues """ - result = {category: [] for category in IssueCategory} + result: dict[IssueCategory, list[Issue]] = {category: [] for category in IssueCategory} for issue in self.issues: if issue.category: @@ -362,7 +362,7 @@ def group_by_file(self) -> dict[str, list[Issue]]: Returns: Dictionary mapping file paths to lists of issues """ - result = {} + result: dict[str, list[Issue]] = {} for issue in self.issues: if issue.location.file not in result: @@ -381,7 +381,7 @@ def statistics(self) -> dict[str, Any]: """ by_severity = self.group_by_severity() by_category = self.group_by_category() - by_status = {status: [] for status in IssueStatus} + by_status: dict[IssueStatus, list[Issue]] = {status: [] for status in IssueStatus} for issue in self.issues: by_status[issue.status].append(issue) @@ -506,7 +506,7 @@ def create_issue( message=message, severity=severity, location=location, - category=category, + category=category if category != "" else None, # type: ignore symbol=symbol, suggestion=suggestion, ) diff --git a/codegen-on-oss/codegen_on_oss/visualizers/__init__.py b/codegen-on-oss/codegen_on_oss/visualizers/__init__.py new file mode 100644 index 000000000..b70671f05 --- /dev/null +++ b/codegen-on-oss/codegen_on_oss/visualizers/__init__.py @@ -0,0 +1,22 @@ +"""Visualizers for codebase analysis. + +This package contains visualizers for analyzing and visualizing different aspects +of a codebase, such as call graphs, dead code, and more. 
+""" + +from codegen_on_oss.visualizers.call_graph_from_node import ( + CallGraphFilter, + CallGraphFromNode, + CallPathsBetweenNodes, +) +from codegen_on_oss.visualizers.codebase_visualizer import CodebaseVisualizer +from codegen_on_oss.visualizers.dead_code import DeadCodeVisualizer + +__all__ = [ + "CodebaseVisualizer", + "CallGraphFromNode", + "CallGraphFilter", + "CallPathsBetweenNodes", + "DeadCodeVisualizer", +] + diff --git a/codegen-on-oss/codegen_on_oss/visualizers/call_graph_from_node.py b/codegen-on-oss/codegen_on_oss/visualizers/call_graph_from_node.py new file mode 100644 index 000000000..0b7b1d146 --- /dev/null +++ b/codegen-on-oss/codegen_on_oss/visualizers/call_graph_from_node.py @@ -0,0 +1,314 @@ +from typing import Optional, Union + +import networkx as nx + +from codegen_on_oss.analyzers.context.graph import Class +from codegen_on_oss.analyzers.context.graph.function import Function +from codegen_on_oss.analyzers.context.graph.function_call import FunctionCall +from codegen_on_oss.visualizers.codebase_visualizer import CodebaseVisualizer + + +class CallGraphFromNode(CodebaseVisualizer): + """Creates a directed call graph for a given function. + + Starting from the specified function, it recursively iterates through its function calls + and the functions called by them, building a graph of the call paths to a maximum depth. + The root of the directed graph is the starting function, each node represents a function call, + and an edge from node A to node B indicates that function A calls function B. + """ + + def __init__( + self, + function_name: str, + max_depth: int = 5, + graph_external_modules: bool = False, + ): + """Initialize the call graph visualizer. 
+ + Args: + function_name: Name of the function to trace + max_depth: Maximum depth of the call graph + graph_external_modules: Whether to include external module calls + """ + self.function_name = function_name + self.max_depth = max_depth + self.graph_external_modules = graph_external_modules + self.G: nx.DiGraph = nx.DiGraph() + + def visualize(self, codebase) -> nx.DiGraph: + """Create a call graph visualization starting from the specified function. + + Args: + codebase: The codebase to analyze + + Returns: + A directed graph representing the call paths + """ + # Get the function to trace + function_to_trace = codebase.get_function(self.function_name) + + # Set starting node + self.G.add_node(function_to_trace, color="yellow") + + # Add all the children (and sub-children) to the graph + self._create_downstream_call_trace(function_to_trace) + + return self.G + + def _create_downstream_call_trace( + self, + parent: Optional[Union[FunctionCall, Function]] = None, + depth: int = 0 + ): + """Creates call graph for parent. + + This function recurses through the call graph of a function and creates a visualization. 
+ + Args: + parent: The function for which a call graph will be created + depth: The current depth of the recursive stack + """ + # If the maximum recursive depth has been reached, return + if self.max_depth <= depth: + return + + if isinstance(parent, FunctionCall): + src_call, src_func = parent, parent.function_definition + else: + src_call, src_func = parent, parent + + # Iterate over all call paths of the symbol + for call in src_func.function_calls: + # The symbol being called + func = call.function_definition + + # Ignore direct recursive calls (external modules are strings and have no `.name`) + if getattr(func, "name", None) == src_func.name: + continue + + # If the function being called is not from an external module + if not isinstance(func, str): # External modules are represented as strings + # Add `call` to the graph and an edge from `src_call` to `call` + self.G.add_node(call) + self.G.add_edge(src_call, call) + + # Recursive call to function call + self._create_downstream_call_trace(call, depth + 1) + elif self.graph_external_modules: + # Add `call` to the graph and an edge from `src_call` to `call` + self.G.add_node(call) + self.G.add_edge(src_call, call) + + +class CallGraphFilter(CodebaseVisualizer): + """Creates a filtered call graph visualization. + + This visualizer shows a call graph from a given function or symbol, + filtering out test files and class declarations and including only methods + with specific names (by default: post, get, patch, delete). + """ + + def __init__( + self, + function_name: str, + class_name: str, + method_names: Optional[list[str]] = None, + max_depth: int = 5, + skip_class_declarations: bool = True, + ): + """Initialize the filtered call graph visualizer. 
+ + Args: + function_name: Name of the function to trace + class_name: Name of the class to filter methods from + method_names: List of method names to include (defaults to HTTP methods) + max_depth: Maximum depth of the call graph + skip_class_declarations: Whether to skip class declarations in the graph + """ + self.function_name = function_name + self.class_name = class_name + self.method_names = method_names or ["post", "get", "patch", "delete"] + self.max_depth = max_depth + self.skip_class_declarations = skip_class_declarations + self.G: nx.DiGraph = nx.DiGraph() + + def visualize(self, codebase) -> nx.DiGraph: + """Create a filtered call graph visualization. + + Args: + codebase: The codebase to analyze + + Returns: + A directed graph representing the filtered call paths + """ + # Get the function to trace + func_to_trace = codebase.get_function(self.function_name) + + # Get the class to filter methods from + cls = codebase.get_class(self.class_name) + + # Add the main symbol as a node + self.G.add_node(func_to_trace, color="red") + + # Start the recursive traversal + self._create_filtered_downstream_call_trace(func_to_trace, 1, cls) + + return self.G + + def _create_filtered_downstream_call_trace( + self, + parent: Union[FunctionCall, Function], + current_depth: int, + cls: Class + ): + """Creates a filtered call graph. 
+ + Args: + parent: The function or call to trace from + current_depth: Current depth in the call graph + cls: The class to filter methods from + """ + if current_depth > self.max_depth: + return + + # If parent is of type Function + if isinstance(parent, Function): + # Set both src_call, src_func to parent + src_call, src_func = parent, parent + else: + # Get the first callable of parent + src_call, src_func = parent, parent.function_definition + + # Iterate over all call paths of the symbol + for call in src_func.function_calls: + # The symbol being called + func = call.function_definition + + if self.skip_class_declarations and isinstance(func, Class): + continue + + # If the function being called is not from an external module and is not defined in a test file + if not isinstance(func, str) and not str(getattr(getattr(func, 'file', None), 'filepath', '')).startswith("test"): + # Add `call` to the graph and an edge from `src_call` to `call` + metadata = {} + if isinstance(func, Function) and getattr(func, 'is_method', False) and func.name in self.method_names: + name = f"{func.parent_class.name}.{func.name}" + metadata = {"color": "yellow", "name": name} + self.G.add_node(call, **metadata) + self.G.add_edge(src_call, call, symbol=cls) # Add edge from current to successor + + # Recursively add successors of the current symbol + self._create_filtered_downstream_call_trace(call, current_depth + 1, cls) + + +class CallPathsBetweenNodes(CodebaseVisualizer): + """Visualizes call paths between two specified functions. + + This visualizer generates and visualizes a call graph between two specified functions. + It starts from a given function and recursively traverses through its function calls, + building a directed graph of the call paths. The visualizer then identifies all simple + paths between the start and end functions, creating a subgraph that includes only the + nodes in these paths. 
+ """ + + def __init__( + self, + start_function_name: str, + end_function_name: str, + max_depth: int = 5, + ): + """Initialize the call paths visualizer. + + Args: + start_function_name: Name of the starting function + end_function_name: Name of the ending function + max_depth: Maximum depth of the call graph + """ + self.start_function_name = start_function_name + self.end_function_name = end_function_name + self.max_depth = max_depth + self.G: nx.DiGraph = nx.DiGraph() + + def visualize(self, codebase) -> nx.DiGraph: + """Create a visualization of call paths between two functions. + + Args: + codebase: The codebase to analyze + + Returns: + A directed graph representing the call paths + """ + # Get the start and end functions + start = codebase.get_function(self.start_function_name) + end = codebase.get_function(self.end_function_name) + + # Set starting node as blue + self.G.add_node(start, color="blue") + # Set ending node as red + self.G.add_node(end, color="red") + + # Start the recursive traversal + self._create_downstream_call_trace(start, end, 1) + + # Find all the simple paths between start and end + try: + all_paths = list(nx.all_simple_paths(self.G, source=start, target=end)) + + # Collect all nodes that are part of these paths + nodes_in_paths = set() + for path in all_paths: + nodes_in_paths.update(path) + + # Create a new subgraph with only the nodes in the paths + self.G = self.G.subgraph(nodes_in_paths) + except (nx.NetworkXNoPath, nx.NodeNotFound): + # If no path exists, return the original graph + pass + + return self.G + + def _create_downstream_call_trace( + self, + parent: Union[FunctionCall, Function], + end: Function, + current_depth: int + ): + """Creates a call graph between two functions. 
+ + Args: + parent: The current function or call in the traversal + end: The target end function + current_depth: Current depth in the call graph + """ + if current_depth > self.max_depth: + return + + # If parent is of type Function + if isinstance(parent, Function): + # Set both src_call, src_func to parent + src_call, src_func = parent, parent + else: + # Get the first callable of parent + src_call, src_func = parent, parent.function_definition + + # Iterate over all call paths of the symbol + for call in src_func.function_calls: + # The symbol being called + func = call.function_definition + + # Ignore direct recursive calls (external modules are strings and have no `.name`) + if getattr(func, "name", None) == src_func.name: + continue + + # If the function being called is not from an external module + if not isinstance(func, str): + # Add `call` to the graph and an edge from `src_call` to `call` + self.G.add_node(call) + self.G.add_edge(src_call, call) + + if func == end: + self.G.add_edge(call, end) + return + # Recursive call to function call + self._create_downstream_call_trace(call, end, current_depth + 1) + diff --git a/codegen-on-oss/codegen_on_oss/visualizers/codebase_visualizer.py b/codegen-on-oss/codegen_on_oss/visualizers/codebase_visualizer.py new file mode 100644 index 000000000..0b8c3c6ab --- /dev/null +++ b/codegen-on-oss/codegen_on_oss/visualizers/codebase_visualizer.py @@ -0,0 +1,25 @@ +from abc import ABC, abstractmethod + +import networkx as nx + + +class CodebaseVisualizer(ABC): + """Base class for codebase visualizers. + + This abstract class defines the interface for all codebase visualizers. + Subclasses must implement the visualize method to create a visualization + of some aspect of the codebase. + """ + + @abstractmethod + def visualize(self, codebase) -> nx.DiGraph: + """Create a visualization of the codebase. 
+
+        Args:
+            codebase: The codebase to visualize
+
+        Returns:
+            A directed graph representing the visualization
+        """
+        pass
+
diff --git a/codegen-on-oss/codegen_on_oss/visualizers/dead_code.py b/codegen-on-oss/codegen_on_oss/visualizers/dead_code.py
new file mode 100644
index 000000000..59fe32b57
--- /dev/null
+++ b/codegen-on-oss/codegen_on_oss/visualizers/dead_code.py
@@ -0,0 +1,77 @@
+import networkx as nx
+
+from codegen_on_oss.analyzers.context.graph.function import Function
+from codegen_on_oss.analyzers.context.graph.import_resolution import Import
+from codegen_on_oss.analyzers.context.graph.symbol import Symbol
+from codegen_on_oss.visualizers.codebase_visualizer import CodebaseVisualizer
+
+
+class DeadCodeVisualizer(CodebaseVisualizer):
+    """Visualizes dead code in the codebase.
+
+    This visualizer identifies functions that have no usages and are not in test files
+    or decorated. These functions are considered 'dead code' and are added to a directed
+    graph. The visualizer then explores the dependencies of these dead code functions,
+    adding them to the graph as well. This process helps to identify not only directly
+    unused code but also code that might only be used by other dead code (second-order
+    dead code).
+    """
+
+    def __init__(self, exclude_test_files: bool = True, exclude_decorated: bool = True):
+        """Initialize the dead code visualizer.
+
+        Args:
+            exclude_test_files: Whether to exclude test files from analysis
+            exclude_decorated: Whether to exclude decorated functions from analysis
+        """
+        self.exclude_test_files = exclude_test_files
+        self.exclude_decorated = exclude_decorated
+
+    def visualize(self, codebase) -> nx.DiGraph:
+        """Create a visualization of dead code in the codebase.
+
+        Args:
+            codebase: The codebase to analyze
+
+        Returns:
+            A directed graph representing the dead code and its dependencies
+        """
+        # Create a directed graph to visualize dead and second-order dead code
+        G = nx.DiGraph()
+
+        # First, identify all dead code
+        dead_code: list[Function] = []
+
+        # Iterate through all functions in the codebase
+        for function in codebase.functions:
+            # Skip functions in files whose path contains "test"
+            if self.exclude_test_files and "test" in function.file.filepath:
+                continue
+
+            if self.exclude_decorated and function.decorators:
+                continue
+
+            # Check if the function has no usages
+            if not function.symbol_usages:
+                # Add the function to the dead code list
+                dead_code.append(function)
+                # Add the function to the graph as dead code
+                G.add_node(function, color="red")
+
+        # Now, find second-order dead code
+        for symbol in dead_code:
+            # Walk the dependencies of each dead function
+            for dep in symbol.dependencies:
+                if isinstance(dep, Import):
+                    dep = dep.imported_symbol
+                if isinstance(dep, Symbol):
+                    if not (self.exclude_test_files and "test" in getattr(dep, "name", "")):
+                        G.add_node(dep)
+                        G.add_edge(symbol, dep, color="red")
+                        for usage_symbol in dep.symbol_usages:
+                            if isinstance(usage_symbol, Function):
+                                if not (self.exclude_test_files and "test" in usage_symbol.name):
+                                    G.add_edge(usage_symbol, dep)
+
+        return G
+
diff --git a/codegen-on-oss/examples/visualization_examples.py b/codegen-on-oss/examples/visualization_examples.py
new file mode 100644
index 000000000..41e5307b4
--- /dev/null
+++ b/codegen-on-oss/examples/visualization_examples.py
@@ -0,0 +1,174 @@
+"""Examples of using the visualization capabilities of codegen-on-oss.
+
+This module demonstrates how to use the various visualization tools
+provided by the codegen-on-oss package to analyze and understand code structure.
+"""
+
+import networkx as nx
+from matplotlib import pyplot as plt
+
+from codegen_on_oss.analyzers.context.codebase import Codebase
+from codegen_on_oss.visualizers import (
+    CallGraphFilter,
+    CallGraphFromNode,
+    CallPathsBetweenNodes,
+    DeadCodeVisualizer,
+)
+
+
+def visualize_call_graph(codebase_path: str, function_name: str):
+    """Visualize the call graph starting from a specific function.
+
+    Args:
+        codebase_path: Path to the codebase
+        function_name: Name of the function to start from
+    """
+    # Create a codebase object
+    codebase = Codebase(codebase_path)
+
+    # Create a call graph visualizer
+    visualizer = CallGraphFromNode(function_name=function_name, max_depth=3)
+
+    # Generate the graph
+    G = visualizer.visualize(codebase)
+
+    # Visualize the graph
+    plt.figure(figsize=(12, 8))
+    pos = nx.spring_layout(G)
+    nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=1500, font_size=10)
+    plt.title(f"Call Graph from {function_name}")
+    plt.show()
+
+
+def visualize_filtered_call_graph(codebase_path: str, function_name: str, class_name: str):
+    """Visualize a filtered call graph showing only specific method types.
+
+    Args:
+        codebase_path: Path to the codebase
+        function_name: Name of the function to start from
+        class_name: Name of the class to filter methods from
+    """
+    # Create a codebase object
+    codebase = Codebase(codebase_path)
+
+    # Create a filtered call graph visualizer
+    visualizer = CallGraphFilter(
+        function_name=function_name,
+        class_name=class_name,
+        method_names=["get", "post", "put", "delete"],
+        max_depth=3
+    )
+
+    # Generate the graph
+    G = visualizer.visualize(codebase)
+
+    # Visualize the graph
+    plt.figure(figsize=(12, 8))
+    pos = nx.spring_layout(G)
+
+    # Draw nodes with different colors based on attributes
+    node_colors = []
+    for node in G.nodes():
+        if G.nodes[node].get("color") == "red":
+            node_colors.append("red")
+        elif G.nodes[node].get("color") == "yellow":
+            node_colors.append("yellow")
+        else:
+            node_colors.append("lightblue")
+
+    nx.draw(G, pos, with_labels=True, node_color=node_colors, node_size=1500, font_size=10)
+    plt.title(f"Filtered Call Graph from {function_name}")
+    plt.show()
+
+
+def visualize_call_paths(codebase_path: str, start_function: str, end_function: str):
+    """Visualize all call paths between two functions.
+
+    Args:
+        codebase_path: Path to the codebase
+        start_function: Name of the starting function
+        end_function: Name of the ending function
+    """
+    # Create a codebase object
+    codebase = Codebase(codebase_path)
+
+    # Create a call paths visualizer
+    visualizer = CallPathsBetweenNodes(
+        start_function_name=start_function,
+        end_function_name=end_function,
+        max_depth=5
+    )
+
+    # Generate the graph
+    G = visualizer.visualize(codebase)
+
+    # Visualize the graph
+    plt.figure(figsize=(12, 8))
+    pos = nx.spring_layout(G)
+
+    # Draw nodes with different colors based on attributes
+    node_colors = []
+    for node in G.nodes():
+        if G.nodes[node].get("color") == "blue":
+            node_colors.append("blue")
+        elif G.nodes[node].get("color") == "red":
+            node_colors.append("red")
+        else:
+            node_colors.append("lightgreen")
+
+    nx.draw(G, pos, with_labels=True, node_color=node_colors, node_size=1500, font_size=10)
+    plt.title(f"Call Paths from {start_function} to {end_function}")
+    plt.show()
+
+
+def visualize_dead_code(codebase_path: str):
+    """Visualize dead code in the codebase.
+
+    Args:
+        codebase_path: Path to the codebase
+    """
+    # Create a codebase object
+    codebase = Codebase(codebase_path)
+
+    # Create a dead code visualizer
+    visualizer = DeadCodeVisualizer(exclude_test_files=True, exclude_decorated=True)
+
+    # Generate the graph
+    G = visualizer.visualize(codebase)
+
+    # Visualize the graph
+    plt.figure(figsize=(15, 10))
+    pos = nx.spring_layout(G)
+
+    # Draw nodes with different colors based on attributes
+    node_colors = []
+    for node in G.nodes():
+        if G.nodes[node].get("color") == "red":
+            node_colors.append("red")
+        else:
+            node_colors.append("gray")
+
+    # Draw edges with different colors based on attributes
+    edge_colors = []
+    for u, v in G.edges():
+        if G.edges[u, v].get("color") == "red":
+            edge_colors.append("red")
+        else:
+            edge_colors.append("black")
+
+    nx.draw(G, pos, with_labels=True, node_color=node_colors, edge_color=edge_colors,
+            node_size=1500, font_size=10, arrows=True)
+    plt.title("Dead Code Visualization")
+    plt.show()
+
+
+if __name__ == "__main__":
+    # Example usage
+    codebase_path = "path/to/your/codebase"
+
+    # Uncomment the examples you want to run
+    # visualize_call_graph(codebase_path, "main")
+    # visualize_filtered_call_graph(codebase_path, "process_request", "ApiHandler")
+    # visualize_call_paths(codebase_path, "start_process", "end_process")
+    # visualize_dead_code(codebase_path)
+
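Note on the dead-code pass: the idea of first-order versus second-order dead code described in the `DeadCodeVisualizer` docstring can be checked on a toy graph without loading a real codebase. The sketch below is only an illustration under stated assumptions. The `CALLS` mapping, the `ENTRY_POINTS` set, and `find_dead_code` are hypothetical names that are not part of this PR, and plain dictionaries stand in for the package's `Function` and `Symbol` objects. An explicit entry-point set replaces the visualizer's test-file and decorator filters.

```python
from collections import defaultdict

# Hypothetical toy call graph (caller -> callees); not the package's data model.
CALLS = {
    "main": ["load", "run"],
    "run": ["helper"],
    "old_entry": ["old_helper", "helper"],
    "load": [],
    "helper": [],
    "old_helper": [],
}
# Assumption: entry points are never reported as dead, even with no callers.
ENTRY_POINTS = {"main"}


def find_dead_code(calls, entry_points):
    """Return (dead, second_order) sets of function names."""
    # Invert the graph so each function maps to the set of its callers.
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)

    # First order: functions nobody calls that are not entry points.
    dead = {f for f in calls if not callers[f] and f not in entry_points}

    # Second order: functions that are called, but only by dead functions.
    second_order = {
        f
        for f in calls
        if f not in dead
        and f not in entry_points
        and callers[f]
        and callers[f] <= dead
    }
    return dead, second_order


dead, second_order = find_dead_code(CALLS, ENTRY_POINTS)
print("dead:", sorted(dead))
print("second-order dead:", sorted(second_order))
```

In this toy graph `helper` is not flagged because `run` still calls it, which is the case the extra `usage_symbol -> dep` edges in the visualizer make visible.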