diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md index 574486da9..2457ed38a 100644 --- a/.github/pull_request_template.md +++ b/.github/pull_request_template.md @@ -1,10 +1,15 @@ # Motivation + # Content + + # Testing + + # Please check the following before marking your PR as ready for review - [ ] I have added tests for my changes diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 1493e389e..1274b4b1a 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -103,3 +103,13 @@ repos: language: system pass_filenames: false always_run: true + - repo: https://github.com/hukkin/mdformat + rev: 0.7.22 # Use the ref you want to point at + hooks: + - id: mdformat + # Optionally add plugins + additional_dependencies: + - mdformat-gfm + - mdformat-ruff + - mdformat-config + - mdformat-pyproject diff --git a/CLA.md b/CLA.md index a4fc53632..1c22a4b92 100644 --- a/CLA.md +++ b/CLA.md @@ -7,44 +7,49 @@ **Project Owner/Organization:** Codegen, Inc. 1. **Definitions** - 1. **“You”** or **“Contributor”** means the individual or entity (and its Affiliates) that Submits a Contribution. - 2. **“Contribution”** means any work of authorship (including any modifications or additions) that is intentionally Submitted by You for inclusion in the Project, in any form (including but not limited to source code, documentation, or other materials). - 3. **“Submit”** or **“Submitted”** means any act of transferring a Contribution to Codegen, Inc. via pull request, email, or any other method of communication for the purpose of inclusion in the Project. -2. **Grant of Copyright License** - Subject to the terms and conditions of this CLA, You hereby grant to Codegen, Inc. and to recipients of software distributed by Codegen, Inc.: + 1. **“You”** or **“Contributor”** means the individual or entity (and its Affiliates) that Submits a Contribution. + 1. **“Contribution”** means any work of authorship (including any modifications or additions) that is intentionally Submitted by You for inclusion in the Project, in any form (including but not limited to source code, documentation, or other materials). + 1. **“Submit”** or **“Submitted”** means any act of transferring a Contribution to Codegen, Inc. via pull request, email, or any other method of communication for the purpose of inclusion in the Project. - - A perpetual, worldwide, non-exclusive, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute Your Contributions and such derivative works. -3. **Grant of Patent License** +1. **Grant of Copyright License** - Subject to the terms and conditions of this CLA, You hereby grant to Codegen, Inc. and to recipients of software distributed by Codegen, Inc. a perpetual, worldwide, non-exclusive, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer Your Contribution, where such license applies only to those patent claims licensable by You that are necessarily infringed by Your Contribution alone or by combination of Your Contribution with the Project to which You Submitted it. + Subject to the terms and conditions of this CLA, You hereby grant to Codegen, Inc. 
and to recipients of software distributed by Codegen, Inc.: - If any entity institutes patent litigation against You or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that Your Contribution, or the Project to which You have contributed, directly or indirectly infringes any patent, then any patent licenses granted to that entity under this CLA for that Contribution or Project shall terminate as of the date such litigation is filed. + - A perpetual, worldwide, non-exclusive, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute Your Contributions and such derivative works. -4. **Representations and Warranties** - 1. **Original Work**. You represent that each of Your Contributions is an original work of authorship and that You have the necessary rights to grant the licenses under this CLA. - 2. **Third-Party Rights**. If Your employer(s) or any third party has rights to intellectual property that You create, You represent that You have received permission to make Contributions on behalf of that employer or third party (or that such employer or third party has waived those rights for Your Contributions). - 3. **No Other Agreements**. You represent that You are not aware of any other agreement or obligation that is inconsistent with the rights granted under this CLA. -5. **Disclaimer of Warranty** +1. **Grant of Patent License** - UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING, YOU PROVIDE YOUR CONTRIBUTIONS ON AN **“AS IS”** BASIS, **WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND**, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + Subject to the terms and conditions of this CLA, You hereby grant to Codegen, Inc. and to recipients of software distributed by Codegen, Inc. a perpetual, worldwide, non-exclusive, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer Your Contribution, where such license applies only to those patent claims licensable by You that are necessarily infringed by Your Contribution alone or by combination of Your Contribution with the Project to which You Submitted it. -6. **Limitation of Liability** + If any entity institutes patent litigation against You or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that Your Contribution, or the Project to which You have contributed, directly or indirectly infringes any patent, then any patent licenses granted to that entity under this CLA for that Contribution or Project shall terminate as of the date such litigation is filed. - IN NO EVENT SHALL CODEGEN, INC. OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE), ARISING IN ANY WAY OUT OF OR IN CONNECTION WITH THIS AGREEMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +1. **Representations and Warranties** -7. **Subsequent Contributions and Updates** + 1. **Original Work**. 
You represent that each of Your Contributions is an original work of authorship and that You have the necessary rights to grant the licenses under this CLA. + 1. **Third-Party Rights**. If Your employer(s) or any third party has rights to intellectual property that You create, You represent that You have received permission to make Contributions on behalf of that employer or third party (or that such employer or third party has waived those rights for Your Contributions). + 1. **No Other Agreements**. You represent that You are not aware of any other agreement or obligation that is inconsistent with the rights granted under this CLA. - You agree that all current and future Contributions to the Project Submitted by You shall be subject to the terms of this CLA. Codegen, Inc. may publish updates to this CLA from time to time; in such case, You may need to agree to new terms before any subsequent Contributions. +1. **Disclaimer of Warranty** -8. **License Modification Rights** + UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING, YOU PROVIDE YOUR CONTRIBUTIONS ON AN **“AS IS”** BASIS, **WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND**, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - You agree that Codegen, Inc. may change the license(s) applicable to the open source project(s) to which Your Contributions relate at Codegen, Inc.’s sole discretion, including without limitation by re-licensing the project(s) and Your Contributions under any other open source or “free” software license, or a commercial or proprietary license of Codegen, Inc.’s choosing. +1. **Limitation of Liability** -9. **Governing Law** + IN NO EVENT SHALL CODEGEN, INC. OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE), ARISING IN ANY WAY OUT OF OR IN CONNECTION WITH THIS AGREEMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - This CLA shall be governed by and construed in accordance with the laws of the State of Delaware, without regard to its conflicts of laws provisions. +1. **Subsequent Contributions and Updates** -10. **Signature / Electronic Consent** + You agree that all current and future Contributions to the Project Submitted by You shall be subject to the terms of this CLA. Codegen, Inc. may publish updates to this CLA from time to time; in such case, You may need to agree to new terms before any subsequent Contributions. - By signing or otherwise indicating Your acceptance of this CLA, You acknowledge that You have read and agree to be bound by its terms. If You are signing on behalf of an entity, You represent and warrant that You have the authority to do so. +1. **License Modification Rights** + + You agree that Codegen, Inc. may change the license(s) applicable to the open source project(s) to which Your Contributions relate at Codegen, Inc.’s sole discretion, including without limitation by re-licensing the project(s) and Your Contributions under any other open source or “free” software license, or a commercial or proprietary license of Codegen, Inc.’s choosing. + +1. 
**Governing Law** + + This CLA shall be governed by and construed in accordance with the laws of the State of Delaware, without regard to its conflicts of laws provisions. + +1. **Signature / Electronic Consent** + + By signing or otherwise indicating Your acceptance of this CLA, You acknowledge that You have read and agree to be bound by its terms. If You are signing on behalf of an entity, You represent and warrant that You have the authority to do so. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index bcf6acfc9..a169cf8ad 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -7,8 +7,8 @@ Thank you for your interest in contributing to Codegen! This document outlines t By contributing to Codegen, you agree that: 1. Your contributions will be licensed under the project's license. -2. You have the right to license your contribution under the project's license. -3. You grant Codegen a perpetual, worldwide, non-exclusive, royalty-free license to use your contribution. +1. You have the right to license your contribution under the project's license. +1. You grant Codegen a perpetual, worldwide, non-exclusive, royalty-free license to use your contribution. See our [CLA](CLA.md) for more details. @@ -19,6 +19,7 @@ See our [CLA](CLA.md) for more details. UV is a fast Python package installer and resolver. To install: **macOS**: + ```bash brew install uv ``` @@ -28,6 +29,7 @@ For other platforms, see the [UV installation docs](https://github.com/astral-sh ### Setting Up the Development Environment After installing UV, set up your development environment: + ```bash uv venv source .venv/bin/activate @@ -35,6 +37,7 @@ uv sync --dev ``` > [!TIP] +> > - If sync fails with `missing field 'version'`, you may need to delete lockfile and rerun `rm uv.lock && uv sync --dev`. > - If sync fails with failed compilation, you may need to install clang and rerun `uv sync --dev`. @@ -51,10 +54,10 @@ uv run pytest tests/integration/codemod/test_codemods.py -n auto ## Pull Request Process 1. Fork the repository and create your branch from `develop`. -2. Ensure your code passes all tests. -3. Update documentation as needed. -4. Submit a pull request to the `develop` branch. -5. Include a clear description of your changes in the PR. +1. Ensure your code passes all tests. +1. Update documentation as needed. +1. Submit a pull request to the `develop` branch. +1. Include a clear description of your changes in the PR. ## Release Process diff --git a/README.md b/README.md index 01f7aab8b..7d6a7b2d3 100644 --- a/README.md +++ b/README.md @@ -24,7 +24,6 @@ [Codegen](https://docs.codegen.com) is a python library for manipulating codebases. - ```python from codegen import Codebase @@ -37,11 +36,13 @@ for function in codebase.functions: # Comprehensive static analysis for references, dependencies, etc. if not function.usages: # Auto-handles references and imports to maintain correctness - function.move_to_file('deprecated.py') + function.move_to_file("deprecated.py") ``` + Write code that transforms code. Codegen combines the parsing power of [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) with the graph algorithms of [rustworkx](https://github.com/Qiskit/rustworkx) to enable scriptable, multi-language code manipulation at scale. ## Installation and Usage + We support - Running Codegen in Python 3.12 – 3.13 @@ -50,7 +51,6 @@ We support - Windows is not supported - Python, Typescript, Javascript and React codebases - ``` # Install inside existing project uv pip install codegen diff --git a/architecture/1. 
plumbing/file-discovery.md b/architecture/1. plumbing/file-discovery.md
new file mode 100644
index 000000000..f4c3998d0
--- /dev/null
+++ b/architecture/1. plumbing/file-discovery.md
@@ -0,0 +1,19 @@
+# File Discovery
+
+The file discovery process is responsible for identifying and organizing all relevant files in a project that need to be processed by the SDK.
+
+## Initialization
+
+- We take in either a list of projects or a path to a filesystem.
+- If we get a path, we'll detect the programming language, initialize the git client based on the path, and get a Project.
+
+## File discovery
+
+- We discover files using the git client so we can respect gitignored files.
+- We then filter files based on the language and the project configuration.
+  - If specified, we filter by subdirectories.
+  - We also filter by file extensions.
+
+## Next Step
+
+After file discovery is complete, the files are passed to the [Tree-sitter Parsing](../parsing/tree-sitter.md) phase, where each file is parsed into a concrete syntax tree.
diff --git a/architecture/2. parsing/A. Tree Sitter.md b/architecture/2. parsing/A. Tree Sitter.md
new file mode 100644
index 000000000..3500b65fd
--- /dev/null
+++ b/architecture/2. parsing/A. Tree Sitter.md
@@ -0,0 +1,33 @@
+# Tree-sitter Parsing
+
+Tree-sitter is used as the primary parsing engine for converting source code into concrete syntax trees. Tree-sitter supports two modes of operation. For example:
+
+```python
+def my_function():
+    pass
+```
+
+Tree-sitter parses this as follows:
+
+```
+module [0, 0] - [3, 0]
+  function_definition [0, 0] - [1, 8]
+    name: identifier [0, 4] - [0, 15]
+    parameters: parameters [0, 15] - [0, 17]
+    body: block [1, 4] - [1, 8]
+      pass_statement [1, 4] - [1, 8]
+```
+
+- A CST mode, which includes syntax nodes (for example, the `def` keyword, spaces, or parentheses). The syntax nodes are "anonymous" and don't have any semantic meaning.
+  - You don't see these nodes in the tree-sitter output, but they are there.
+- An AST mode, where we only focus on the semantic nodes (for example, the `my_function` identifier and the `pass` statement). These are 'named nodes' and have semantic meaning.
+  - This is different from field names (like 'body'). Field names say nothing about the node itself; they indicate what role the child node ('block') plays in the parent node ('function_definition').
+
+## Implementation Details
+
+- We construct a mapping between file type and the tree-sitter grammar
+- For each file given to us (via git), we parse it using the appropriate grammar, as sketched below
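+
+The sketch below shows roughly what that per-file parsing step looks like, assuming the `tree_sitter` Python bindings and the `tree_sitter_python` grammar package (constructor signatures vary between versions of the bindings, and the real mapping covers every supported language):
+
+```python
+import tree_sitter_python as tspython
+from tree_sitter import Language, Parser
+
+# Hypothetical extension -> grammar mapping; the SDK keeps one entry per supported language.
+LANGUAGES = {".py": Language(tspython.language())}
+
+
+def parse_file(path: str):
+    ext = path[path.rfind("."):]
+    parser = Parser(LANGUAGES[ext])
+    with open(path, "rb") as f:
+        tree = parser.parse(f.read())
+    return tree.root_node  # root of the concrete syntax tree
+
+
+root = parse_file("example.py")
+# Named (semantic) nodes only; anonymous syntax nodes are skipped.
+for node in root.named_children:
+    print(node.type, node.start_point, node.end_point)
+```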
+
+## Next Step
+
+Once the concrete syntax trees are built, they are transformed into our abstract syntax tree representation in the [AST Construction](./B.%20AST%20Construction.md) phase.
diff --git a/architecture/2. parsing/B. AST Construction.md b/architecture/2. parsing/B. AST Construction.md
new file mode 100644
index 000000000..c6484aaba
--- /dev/null
+++ b/architecture/2. parsing/B. AST Construction.md
@@ -0,0 +1,77 @@
+# AST Construction
+
+The tree-sitter CST/AST is powerful, but it focuses on syntax highlighting rather than semantic meaning.
+For example, take decorators:
+
+```python
+@decorator
+def my_function():
+    pass
+```
+
+```
+module [0, 0] - [3, 0]
+  decorated_definition [0, 0] - [2, 8]
+    decorator [0, 0] - [0, 10]
+      identifier [0, 1] - [0, 10]
+    definition: function_definition [1, 0] - [2, 8]
+      name: identifier [1, 4] - [1, 15]
+      parameters: parameters [1, 15] - [1, 17]
+      body: block [2, 4] - [2, 8]
+        pass_statement [2, 4] - [2, 8]
+```
+
+You can see the decorated_definition node has a decorator and a definition. This makes sense for syntax highlighting - the decorator is highlighted separately from the function definition.
+
+However, this is not useful for semantic analysis. We need to know that the decorator is decorating the function definition - there is a single function definition which may contain multiple decorators.
+This becomes visible when we consider function call chains:
+
+```python
+a().b().c().d()
+```
+
+```
+module [0, 0] - [2, 0]
+  expression_statement [0, 0] - [0, 15]
+    call [0, 0] - [0, 15]
+      function: attribute [0, 0] - [0, 13]
+        object: call [0, 0] - [0, 11]
+          function: attribute [0, 0] - [0, 9]
+            object: call [0, 0] - [0, 7]
+              function: attribute [0, 0] - [0, 5]
+                object: call [0, 0] - [0, 3]
+                  function: identifier [0, 0] - [0, 1]
+                  arguments: argument_list [0, 1] - [0, 3]
+                attribute: identifier [0, 4] - [0, 5]
+              arguments: argument_list [0, 5] - [0, 7]
+            attribute: identifier [0, 8] - [0, 9]
+          arguments: argument_list [0, 9] - [0, 11]
+        attribute: identifier [0, 12] - [0, 13]
+      arguments: argument_list [0, 13] - [0, 15]
+```
+
+You can see that the chain of calls is represented as a deeply nested structure. This is not useful for semantic analysis or for performing edits on these nodes. Therefore, when parsing, we need to build an AST that is more useful for semantic analysis.
+
+## Implementation
+
+- For each file, we parse a file-specific AST
+- We offer two modes of parsing:
+  - Pattern based parsing: It maps a particular node type to a semantic node type. For example, we broadly map all identifiers to the `Name` node type.
+  - Custom parsing: It takes a CST and builds a custom node type. For example, we can turn a decorated_definition node into a function_definition node with decorators. This involves carefully arranging the CST nodes into a new structure.
+
+## Pattern based parsing
+
+To do this, we need to build a mapping between the tree-sitter node types and our semantic node types. These mappings are language specific and stored in node_classes. They are processed by parser.py at runtime. We can access these via many functions - child_by_field_name, \_parse_expression, etc. These methods both wrap the tree-sitter methods and parse the tree-sitter node into our semantic node.
+
+## Custom parsing
+
+These are more complex and require more work. Most symbols (classes, functions, etc.), imports, exports, and other complex constructs are parsed using custom parsing.
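+
+As an illustration of both modes, here is a heavily simplified sketch; the names (`PY_NODE_CLASSES`, `FunctionDefinition`, `parse_decorated_definition`) are hypothetical stand-ins for the real node_classes mappings and parser machinery:
+
+```python
+from dataclasses import dataclass, field
+
+# Pattern based parsing: map tree-sitter node types to semantic node types.
+PY_NODE_CLASSES = {
+    "identifier": "Name",
+    "attribute": "ChainedAttribute",
+    "call": "FunctionCall",
+}
+
+
+@dataclass
+class FunctionDefinition:
+    name: str
+    decorators: list[str] = field(default_factory=list)
+
+
+def parse_decorated_definition(ts_node) -> FunctionDefinition:
+    """Custom parsing: fold a decorated_definition CST node into a single
+    function definition that owns its decorators."""
+    decorators = [child.text.decode() for child in ts_node.named_children if child.type == "decorator"]
+    func = ts_node.child_by_field_name("definition")
+    name = func.child_by_field_name("name").text.decode()
+    return FunctionDefinition(name=name, decorators=decorators)
+```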
+
+## Statement parsing
+
+Statements have another layer of complexity. They are essentially pattern based, but the mapping and logic are defined directly in the parser.py file.
+
+## Next Step
+
+After the AST is constructed, the system moves on to [Import Resolution](../3.%20imports-exports/A.%20Imports.md) to analyze module dependencies and resolve symbols across files.
diff --git a/architecture/3. imports-exports/A. Imports.md b/architecture/3. imports-exports/A. Imports.md
new file mode 100644
index 000000000..09d70d902
--- /dev/null
+++ b/architecture/3. imports-exports/A. Imports.md
@@ -0,0 +1,7 @@
+# Import Resolution
+
+TODO
+
+## Next Step
+
+After import resolution, the system performs [Export Analysis](./B.%20Exports.md) and handles [TSConfig Support](./C.%20TSConfig.md) for TypeScript projects. This is followed by comprehensive [Type Analysis](../4.%20type-analysis/A.%20Type%20Analysis.md).
diff --git a/architecture/3. imports-exports/B. Exports.md b/architecture/3. imports-exports/B. Exports.md
new file mode 100644
index 000000000..9da67fcb4
--- /dev/null
+++ b/architecture/3. imports-exports/B. Exports.md
@@ -0,0 +1,7 @@
+# Export Analysis
+
+TODO
+
+## Next Step
+
+After export analysis is complete, for TypeScript projects, the system processes [TSConfig Support](./C.%20TSConfig.md) configurations. Then it moves on to [Type Analysis](../4.%20type-analysis/A.%20Type%20Analysis.md) to build a complete understanding of types and symbols.
diff --git a/architecture/3. imports-exports/C. TSConfig.md b/architecture/3. imports-exports/C. TSConfig.md
new file mode 100644
index 000000000..e9c77ae0c
--- /dev/null
+++ b/architecture/3. imports-exports/C. TSConfig.md
@@ -0,0 +1,7 @@
+# TSConfig Support
+
+TODO
+
+## Next Step
+
+After TSConfig processing is complete, the system proceeds to [Type Analysis](../4.%20type-analysis/A.%20Type%20Analysis.md) where it builds a complete understanding of types, symbols, and their relationships.
diff --git a/architecture/4. type-analysis/A. Type Analysis.md b/architecture/4. type-analysis/A. Type Analysis.md
new file mode 100644
index 000000000..9f2d9c28c
--- /dev/null
+++ b/architecture/4. type-analysis/A. Type Analysis.md
@@ -0,0 +1,25 @@
+# Type Analysis
+
+The type analysis system builds a complete understanding of types and symbols across the codebase.
+
+## Basic flow
+
+- Discover names that need to be resolved
+- Resolve names
+- Convert resolutions into graph edges
+
+## The resolution stack
+
+To accomplish this, we have an in-house computation engine - the ResolutionStack. Each stack frame contains a reference to its parent frame. However, a parent can have multiple child frames (e.g., union types).
+
+When we resolve types on a node, we call `resolved_type_frames` to get the resolved types. Once we know what goes in the next frame, we call `with_resolution_frame` to construct the next frame. This is a generator that yields the next frame until we've resolved all the types. `resolved_type_frames` is a property that caches the list of generated frames.
+Therefore, once you have computed type resolution on a node, you don't need to recompute it. That way, we can start at arbitrary nodes without performance overhead.
+
+This is similar to how others implement incremental computation engines, with a few weaknesses:
+
+- There is only 1 query in the query engine
+- Partial cache invalidation isn't implemented
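+
+Setting the weaknesses aside, the caching behavior itself can be pictured with the toy sketch below; the class and method names are illustrative only, not the SDK's actual implementation:
+
+```python
+from dataclasses import dataclass
+from functools import cached_property
+
+
+@dataclass
+class ResolutionFrame:
+    node: object
+    parent: "ResolutionFrame | None" = None  # each frame points back at its parent frame
+
+
+class ResolvableNode:
+    @cached_property
+    def resolved_type_frames(self) -> list["ResolutionFrame"]:
+        # Computed once per node and then reused, so resolution can start at
+        # arbitrary nodes without redoing shared work.
+        return list(self._with_resolution_frame(ResolutionFrame(self)))
+
+    def _with_resolution_frame(self, frame):
+        # The real generator keeps yielding successor frames until every
+        # candidate type is resolved; a parent may yield several children
+        # (e.g. for union types).
+        yield frame
+```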
+
+## Next Step
+
+After understanding the type analysis system overview, let's look at how we [walk the syntax tree](./B.%20Tree%20Walking.md) to analyze code structure.
diff --git a/architecture/4. type-analysis/B. Tree Walking.md b/architecture/4. type-analysis/B. Tree Walking.md
new file mode 100644
index 000000000..c0c777dc4
--- /dev/null
+++ b/architecture/4. type-analysis/B. Tree Walking.md
@@ -0,0 +1,49 @@
+# Tree Walking
+
+To compute dependencies, we have to walk the entire AST for every file.
+At a high level, the procedure is pretty simple:
+
+```python
+def compute_dependencies(self):
+    for child in self.children:
+        child.compute_dependencies()
+```
+
+We start at the root node and walk the tree until we have computed all dependencies.
+
+## Usage Kind identification
+
+We have to identify the kind of usage for each node. This is done by looking at the parent node and the child node.
+
+```python
+def foo() -> c:
+    c()
+```
+
+We will classify the usage kind of the `c` callsite differently from the return type.
+
+```python
+class PyFunction(...):
+    ...
+
+    def _compute_dependencies(self, usage_kind: UsageKind):
+        self.return_type._compute_dependencies(UsageKind.RETURN_TYPE)
+        self.body._compute_dependencies(UsageKind.BODY)
+```
+
+By default, we just pass the usage kind to the children.
+
+## Resolvable Nodes
+
+At no step in the process described so far have we actually computed any dependencies. That's because there are some special nodes ("Resolvables") that do the heavy lifting. All of the tree walking is just to identify these nodes and the context they are used in. Resolvables are anything inheriting from `Resolvable`:
+
+- [Name Resolution](./C.%20Name%20Resolution.md)
+- [Chained Attributes](./D.%20Chained%20Attributes.md)
+- [Function Calls](./E.%20Function%20Calls.md)
+- [Subscript Expression](./G.%20Subscript%20Expression.md)
+
+These are all processed using [Type Analysis](./A.%20Type%20Analysis.md) to get the definition of the node. They are then converted into [Graph Edges](./H.%20Graph%20Edges.md) and added to the graph.
+
+## Next Step
+
+After understanding how we walk the tree, let's look at how we [resolve names](./C.%20Name%20Resolution.md) in the code.
diff --git a/architecture/4. type-analysis/C. Name Resolution.md b/architecture/4. type-analysis/C. Name Resolution.md
new file mode 100644
index 000000000..bd6516708
--- /dev/null
+++ b/architecture/4. type-analysis/C. Name Resolution.md
@@ -0,0 +1,70 @@
+# Name Resolution
+
+The name resolution system handles symbol references, scoping rules, and name binding across the codebase.
+
+## What's in a name?
+
+A name is a `Name` node. It is just a string of text.
+For example, `foo` is a name.
+
+```python
+from my_module import foo
+
+foo()
+```
+
+Tree-sitter parses this into:
+
+```
+module [0, 0] - [2, 0]
+  import_from_statement [0, 0] - [0, 25]
+    module_name: dotted_name [0, 5] - [0, 14]
+      identifier [0, 5] - [0, 14]
+    name: dotted_name [0, 22] - [0, 25]
+      identifier [0, 22] - [0, 25]
+  expression_statement [1, 0] - [1, 5]
+    call [1, 0] - [1, 5]
+      function: identifier [1, 0] - [1, 3]
+      arguments: argument_list [1, 3] - [1, 5]
+```
+
+We can map the identifier nodes to `Name` nodes.
+You'll see there are actually 3 name nodes here: `foo`, `my_module`, and `foo`.
+
+- `my_module` is the module name.
+- `foo` is the name imported from the module.
+- `foo` is the name of the function being called.
+
+## Name Resolution
+
+Name resolution is the process of resolving a name to its definition. To do this, all we need to do is:
+
+1. Get the name we're looking for. (e.g. `foo`)
+1. Find the scope we're looking in. (in this case, the global file scope)
+1. Recursively search the scope for the name (which will return the node corresponding to `from my_module import foo`).
+1. Use the type engine to get the definition of the name (which will return the function definition).
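+
+A toy model of this lookup is sketched below, with purely illustrative names; the SDK's real scope handling also covers imports, wildcards, and language-specific rules:
+
+```python
+class Scope:
+    def __init__(self, declarations: dict[str, object], parent: "Scope | None" = None):
+        self.declarations = declarations
+        self.parent = parent
+
+    def resolve(self, name: str):
+        # Walk outward through the enclosing scopes until the name is found.
+        if name in self.declarations:
+            return self.declarations[name]
+        if self.parent is not None:
+            return self.parent.resolve(name)
+        raise KeyError(f"unresolved name: {name}")
+
+
+module_scope = Scope({"foo": "<import: from my_module import foo>"})
+function_scope = Scope({}, parent=module_scope)
+print(function_scope.resolve("foo"))  # falls back to the module-level import
+```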
+
+## Scoping
+
+```python
+# Local vs global scope
+from my_module import foo, bar, fuzz
+
+
+def outer():
+    def foo(): ...
+
+    foo()
+    bar()
+    fuzz()
+
+    def fuzz(): ...
+```
+
+If we wanted to resolve `foo` in this case, we would start at the name `foo`, then check its parent recursively until we arrive at the function `outer`. We would then check for the name `foo` and find there is a nested function with that name. We would then return the function definition.
+However, if we wanted to resolve `bar`, we would check for the name `bar` and find there is no nested function, variable, or parameter with that name. We would then return the import statement.
+Finally, for `fuzz`, when we check for the name `fuzz`, we would find there is a nested function with that name, but it is defined after the call to `fuzz()`. We would then return the import.
+
+## Next Step
+
+These simple cases let us build up to more complex cases. [Chained Attributes](./D.%20Chained%20Attributes.md) covers how we handle method and property access chains.
diff --git a/architecture/4. type-analysis/D. Chained Attributes.md b/architecture/4. type-analysis/D. Chained Attributes.md
new file mode 100644
index 000000000..57a3b941c
--- /dev/null
+++ b/architecture/4. type-analysis/D. Chained Attributes.md
@@ -0,0 +1,89 @@
+# Chained Attributes
+
+```python
+class Foo:
+    def foo(self): ...
+
+
+a = Foo()
+a.foo()
+```
+
+A core capability is being able to calculate that `a.foo()` is a usage of `foo` in the `Foo` class.
+To do this, we must first understand how tree-sitter parses the code.
+
+```
+module [0, 0] - [5, 0]
+  class_definition [0, 0] - [2, 11]
+    name: identifier [0, 6] - [0, 9]
+    body: block [1, 4] - [2, 11]
+      function_definition [1, 4] - [2, 11]
+        name: identifier [1, 8] - [1, 11]
+        parameters: parameters [1, 11] - [1, 17]
+          identifier [1, 12] - [1, 16]
+        body: block [2, 8] - [2, 11]
+          expression_statement [2, 8] - [2, 11]
+            ellipsis [2, 8] - [2, 11]
+  expression_statement [3, 0] - [3, 9]
+    assignment [3, 0] - [3, 9]
+      left: identifier [3, 0] - [3, 1]
+      right: call [3, 4] - [3, 9]
+        function: identifier [3, 4] - [3, 7]
+        arguments: argument_list [3, 7] - [3, 9]
+  expression_statement [4, 0] - [4, 7]
+    call [4, 0] - [4, 7]
+      function: attribute [4, 0] - [4, 5]
+        object: identifier [4, 0] - [4, 1]
+        attribute: identifier [4, 2] - [4, 5]
+      arguments: argument_list [4, 5] - [4, 7]
+```
+
+If we look at this parse tree, we can see that the `a.foo()` call's function is a node of type attribute. The object of the call is an identifier for `a`, and `foo` is an attribute of the identifier for `a`. TypeScript has a similar structure. These are the core building blocks of chained attributes.
+Chained attributes contain 2 parts:
+
+1. The object: `a`
+1. The attribute: `foo`
+
+All we must do to resolve the definition of `a.foo` is:
+
+1. Find the definition of the object `a` (the class `Foo`)
+1. Get the attribute (`foo`) on the resolved object (`Foo`) (the function `foo`)
+1. Resolve the attribute to its original definition (in this case, the function `foo`)
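+
+These three steps are easy to picture with a toy sketch (illustrative names only; the real resolution goes through `resolved_types` and `HasAttribute`, as described in the steps below):
+
+```python
+class FunctionSymbol:
+    def __init__(self, name):
+        self.name = name
+
+
+class ClassSymbol:
+    def __init__(self, name, attributes):
+        self.name = name
+        self.attributes = attributes
+
+    def resolve_attribute(self, name):
+        return self.attributes[name]
+
+
+# a = Foo(), so the name "a" resolves to the class Foo
+scope = {"a": ClassSymbol("Foo", {"foo": FunctionSymbol("foo")})}
+
+
+def resolve_chained_attribute(object_name: str, attribute_name: str):
+    obj = scope[object_name]                      # Step 1: resolve the object
+    attr = obj.resolve_attribute(attribute_name)  # Step 2: get the attribute
+    return attr                                   # Step 3: resolve it to its original definition
+
+
+print(resolve_chained_attribute("a", "foo").name)  # -> "foo"
+```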
+
+## Step 1: Resolve the object
+
+We can resolve the object by calling `resolved_types` to get the potential types of the object.
+If it is a name (like `a`), we can use name resolution to get the definition of the name.
+If it is another chained attribute, we can recursively resolve the chained attribute.
+If the original type is a union, we can operate on multiple types and return all the possible results.
+
+## Step 2: Get the attribute
+
+We can get the attribute by calling `resolve_attribute` on the resolved object. Nodes which implement this inherit from `HasAttribute`. Examples include:
+
+- Class
+- File
+- Type aliases
+- Enums
+
+## Step 3: Resolve the attribute
+
+Finally, we can resolve the attribute by calling `resolved_types` on the attribute. This is particularly useful for class attributes like the following:
+
+```python
+def fuzz(): ...
+
+
+class Foo:
+    foo = fuzz
+
+
+a = Foo()
+a.foo()
+```
+
+Calling `resolved_types` on the attribute lets us go from the attribute (`foo`) to the underlying resolved type (`fuzz`).
+
+## Next Step
+
+After handling chained attributes, the system moves on to [Function Calls](./E.%20Function%20Calls.md) analysis for handling function and method invocations.
diff --git a/architecture/4. type-analysis/E. Function Calls.md b/architecture/4. type-analysis/E. Function Calls.md
new file mode 100644
index 000000000..d4db8cd6b
--- /dev/null
+++ b/architecture/4. type-analysis/E. Function Calls.md
@@ -0,0 +1,64 @@
+# Function Calls
+
+At first glance, function calls are simple. We can resolve the function call by looking up the function name in the current scope.
+
+However, there are some complexities to consider.
+
+## Constructors
+
+In Python, we can call a class definition as if it were a function. This is known as a constructor call.
+
+```python
+class Foo:
+    def __init__(self): ...
+
+
+a = Foo()
+```
+
+This makes the function call behave differently from the plain name: the name resolves to `Foo` (the class definition), but the call resolves to the constructor's function definition.
+
+## Imports
+
+```typescript
+require('foo')
+```
+
+In this case, we need to resolve the import statement to the module definition.
+
+## Return Types
+
+```python
+class Foo:
+    def foo(self) -> int:
+        return 1
+
+
+class Bar:
+    def bar(self) -> Foo: ...
+
+
+a = Bar()
+a.bar().foo()
+```
+
+In this case, we need to resolve the return type of the function to the type of the return value. However, the function definition is not the same as the return type. This means we now have 3 different things going on with function calls:
+
+1. Resolving the function definition
+1. Resolving the return type
+1. Computing what this function call depends on (both the function definition and the arguments passed to the function)
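+
+To make the return-type step concrete, here is a toy sketch of resolving the `a.bar().foo()` chain above, with a hypothetical lookup table standing in for the type engine:
+
+```python
+# Hypothetical (class, method) -> return type table; the SDK derives this
+# information from the parsed function signatures instead.
+RETURN_TYPES = {
+    ("Bar", "bar"): "Foo",
+    ("Foo", "foo"): "int",
+}
+
+
+def resolve_call_chain(receiver_type: str, methods: list[str]) -> str:
+    # Each call's return type becomes the receiver type of the next call.
+    for method in methods:
+        receiver_type = RETURN_TYPES[(receiver_type, method)]
+    return receiver_type
+
+
+print(resolve_call_chain("Bar", ["bar", "foo"]))  # -> "int"
+```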
+
+## Generics
+
+```python
+def foo[T](a: list[T]) -> T: ...
+
+
+foo([1, 2, 3])
+```
+
+Generics depend on the types of the arguments to the function. We need to resolve the types of the arguments to the function to determine the type of the generic. [Generics](./F.%20Generics.md) covers how we handle generics.
+
+## Next Step
+
+After understanding function calls, let's look at how we handle [Generics](./F.%20Generics.md) in the type system.
diff --git a/architecture/4. type-analysis/F. Generics.md b/architecture/4. type-analysis/F. Generics.md
new file mode 100644
index 000000000..46df52bfc
--- /dev/null
+++ b/architecture/4. type-analysis/F. Generics.md
@@ -0,0 +1,7 @@
+# Generics Analysis
+
+TODO
+
+## Next Step
+
+After generics analysis, the system handles [Subscript Expressions](./G.%20Subscript%20Expression.md) for array and dictionary access.
diff --git a/architecture/4. type-analysis/G. Subscript Expression.md b/architecture/4. type-analysis/G. Subscript Expression.md
new file mode 100644
index 000000000..e2bb1a80a
--- /dev/null
+++ b/architecture/4. type-analysis/G. Subscript Expression.md
@@ -0,0 +1,7 @@
+# Subscript Expression
+
+TODO
+
+## Next Step
+
+After handling subscript expressions, the system builds [Graph Edges](./H.%20Graph%20Edges.md) to represent relationships between types and symbols.
diff --git a/architecture/4. type-analysis/H. Graph Edges.md b/architecture/4. type-analysis/H. Graph Edges.md
new file mode 100644
index 000000000..46efd3c46
--- /dev/null
+++ b/architecture/4. type-analysis/H. Graph Edges.md
@@ -0,0 +1,59 @@
+# Graph Edges
+
+The SDK contains a graph of nodes and edges.
+Nodes are the core of the graph and represent the symbols in the codebase. Examples include:
+
+- Symbols: classes, functions, assignments, etc.
+- Imports, Exports
+- Files
+- Parameters, Attributes
+
+Edges connect these nodes. Each edge contains 4 elements:
+
+- Source: The node that the edge is coming from
+- Target: The node that the edge is going to
+- Type: The type of the edge
+- Metadata: Additional information about the edge
+
+## Edge Types
+
+We have 4 types of [edges](../src/codegen/sdk/enums.py#L10):
+
+- IMPORT_SYMBOL_RESOLUTION: An edge from an import to a symbol
+- EXPORT: An edge from a symbol to an export
+- SUBCLASS: An edge from a symbol to a subclass
+- SYMBOL_USAGE: An edge from a symbol to a usage
+
+The only edges that are used in almost every API are SYMBOL_USAGE edges. They are also the only ones that have additional metadata.
+
+## Edge construction order
+
+To compute the graph, we follow a specific order:
+
+1. Import edges are added first
+   - This is completely independent of the type engine
+1. Symbol edges are added next
+   - These may export symbols that are imported from other files
+   - This is almost entirely independent of the type engine
+1. Subclass edges are added next
+   - These may reference symbols that are imported or exported from other files
+   - This is fully dependent on the type engine
+1. Usage edges are added last
+   - They reference symbols that are imported or exported from other files
+   - This is fully dependent on the type engine
+   - Subclass edges are computed beforehand as a performance optimization
+
+## Usages
+
+SYMBOL_USAGE edges contain additional [metadata](../src/codegen/sdk/core/dataclasses/usage.py):
+
+- match: The exact match of the usage
+- usage_symbol: The symbol this object is used in. Derived from the match object
+- usage_type: How this symbol was used. Derived from the resolution stack
+- imported_by: The import that imported this symbol. Derived from the resolution stack
+- kind: Where this symbol was used (e.g., in a type parameter or in the body of the class). Derived from the compute dependencies function
+
+You may notice these edges are actually between the usage symbol and the match object, but the match object is not on the graph. In effect, we have constructed triple edges:
+
+- They are technically edges between the usage symbol and the symbol contained in the match object
+- The edge metadata contains the match object
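+
+As a sketch of the shape of this data (using rustworkx directly; the SDK wraps it in its own node and edge classes, so the names below are illustrative):
+
+```python
+from dataclasses import dataclass
+
+import rustworkx as rx
+
+
+@dataclass
+class UsageMetadata:
+    edge_type: str = "SYMBOL_USAGE"
+    usage_kind: str | None = None   # where the symbol was used (body, type parameter, ...)
+    imported_by: str | None = None  # the import that brought the symbol into scope
+
+
+graph = rx.PyDiGraph()
+foo = graph.add_node("function foo")                 # the symbol being used
+usage = graph.add_node("call site inside caller()")  # where it is used
+
+# A SYMBOL_USAGE edge from a symbol to one of its usages, carrying metadata.
+graph.add_edge(foo, usage, UsageMetadata(usage_kind="BODY"))
+```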
+
+## Next Step
+
+After constructing the type graph, the system moves on to [Edit Operations](../5.%20performing-edits/A.%20Edit%20Operations.md) where it can safely modify code while preserving type relationships.
diff --git a/architecture/5. performing-edits/A. Edit Operations.md b/architecture/5. performing-edits/A. Edit Operations.md
new file mode 100644
index 000000000..850b8e103
--- /dev/null
+++ b/architecture/5. performing-edits/A. Edit Operations.md
@@ -0,0 +1,7 @@
+# Edit Operations
+
+TODO
+
+## Next Step
+
+After preparing edits, they are managed by the [Transaction Manager](./B.%20Transaction%20Manager.md) to ensure consistency and atomicity.
diff --git a/architecture/5. performing-edits/B. Transaction Manager.md b/architecture/5. performing-edits/B. Transaction Manager.md
new file mode 100644
index 000000000..a41d91270
--- /dev/null
+++ b/architecture/5. performing-edits/B. Transaction Manager.md
@@ -0,0 +1,7 @@
+# Transaction Manager
+
+TODO
+
+## Next Step
+
+After managing transactions, the system handles [Incremental Computation](../6.%20incremental-computation/A.%20Overview.md) to efficiently update the codebase graph as changes occur.
diff --git a/architecture/6. incremental-computation/A. Overview.md b/architecture/6. incremental-computation/A. Overview.md
new file mode 100644
index 000000000..b3d013c23
--- /dev/null
+++ b/architecture/6. incremental-computation/A. Overview.md
@@ -0,0 +1,7 @@
+# Incremental Computation
+
+TODO
+
+## Next Step
+
+After understanding the overview of incremental computation, let's look at how we [detect changes](./B.%20Change%20Detection.md) in the codebase.
diff --git a/architecture/6. incremental-computation/B. Change Detection.md b/architecture/6. incremental-computation/B. Change Detection.md
new file mode 100644
index 000000000..f3416385e
--- /dev/null
+++ b/architecture/6. incremental-computation/B. Change Detection.md
@@ -0,0 +1,7 @@
+# Change Detection
+
+TODO
+
+## Next Step
+
+After detecting changes, the system performs [Graph Recomputation](./C.%20Graph%20Recomputation.md) to update the dependency graph efficiently.
diff --git a/architecture/6. incremental-computation/C. Graph Recomputation.md b/architecture/6. incremental-computation/C. Graph Recomputation.md
new file mode 100644
index 000000000..72da61850
--- /dev/null
+++ b/architecture/6. incremental-computation/C. Graph Recomputation.md
@@ -0,0 +1,7 @@
+# Graph Recomputation
+
+TODO
+
+## Next Step
+
+After graph recomputation, the system is ready for the next set of operations. The cycle continues with [File Discovery](../plumbing/file-discovery.md) for any new changes.
diff --git a/architecture/architecture.md b/architecture/architecture.md
new file mode 100644
index 000000000..dd044e4dc
--- /dev/null
+++ b/architecture/architecture.md
@@ -0,0 +1,113 @@
+# Architecture of the Codegen SDK
+
+This is a technical document explaining the architecture of the Codegen SDK.
+
+## Purpose of the SDK
+
+This SDK is designed to accomplish a large set of use cases in one tool:
+
+- Parsing large, enterprise-scale codebases
+- Making syntax-aware changes to code while respecting original formatting
+- Being user-friendly and easy to use
+- Able to quickly execute large-scale refactorings against a codebase
+- Supporting multiple languages with common abstractions
+- Aware of both project structure (tsconfig.json, pyproject.toml, etc.) and language-specific structure (imports, etc.)
+- Able to perform type resolution
+- Responding to changes to the codebase and updating the graph
+
+### Performance
+
+A key problem is performance. We must be able to quickly respond to user requests on enterprise codebases (e.g., renaming a symbol). However, we don't know what those requests are in advance, and the scope of these requests can be quite massive (they may iterate over a large number of symbols and their usages). To address these problems, we introduced codegen cloud.
+We split operations into two parts:
+
+- A "parse" step that builds up a graph of the codebase
+  - This can take a long time to complete, but it only needs to be done once
+  - This computes the entire graph of the codebase
+- A "run" step that performs operations on the codebase
+  - This can be done quickly, but it needs to be done many times
+  - This uses the graph to perform operations on the codebase
+
+This allows us to perform operations on the codebase without having to parse it every time.
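+
+From the public API (shown in the README), the split looks like this: the expensive graph construction happens once when the `Codebase` is created, and subsequent queries and edits run against the prebuilt graph:
+
+```python
+from codegen import Codebase
+
+codebase = Codebase("./")  # "parse" step: builds the full codebase graph once
+
+# "run" steps: cheap queries and edits against the prebuilt graph
+for function in codebase.functions:
+    if not function.usages:
+        function.move_to_file("deprecated.py")
+```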
+
+## Existing Solutions
+
+To accomplish these goals, we can look at existing classes of solutions:
+
+### Language Server Architecture
+
+The immediate question is: why not use a language server? Language servers share many of the same goals as codegen, but they do not address several of our requirements:
+
+- Language servers can handle many of these same use cases, but they are not as performant as we need.
+- Generally, language servers compute their results lazily. This doesn't work for us because we need to perform a large number of operations on the codebase.
+- While the LSP protocol is powerful, it is not designed to be scriptable the way codegen is.
+- In Python, many of the language servers are an amalgamation of many different tools and libraries. None are very good at refactoring or offer the comprehensive set of features that codegen does.
+
+Generally, language servers parse codebases in response to user actions. This is not a good fit for us because we need to perform a large number of operations on the codebase without knowing which symbols are being changed or queried.
+
+### Compiler Architecture
+
+Many of the same goals can be accomplished with a compiler. However, compilers are not as user-friendly as we need.
+
+- They do not generally offer easy-to-use APIs
+- They do not focus on refactoring code after parsing
+- They generally don't handle graph updates
+- They aren't common or complete in Python/TypeScript
+
+Generally, compilers build up knowledge of the entire codebase in a single pass. This is a much better fit for our use case.
+
+## Architecture
+
+The codegen SDK combines aspects of both systems to accomplish our goals.
+At a high level, our architecture follows the processing steps below.
+
+## Processing Steps
+
+The SDK processes code through several distinct steps:
+
+1. \[File Discovery\](./1. plumbing/file-discovery.md)
+
+   - Project structure analysis
+   - File system traversal
+
+1. \[Tree-sitter Parsing\](./2. parsing/A. Tree Sitter.md)
+
+   - Initial syntax tree construction
+   - Language-specific parsing rules
+   - Error recovery
+
+1. \[AST Construction\](./2. parsing/B. AST Construction.md)
+
+   - Abstract syntax tree building
+   - Node type assignment
+   - Syntax validation
+
+1. \[Import & Export Resolution\](./3. imports-exports/A. Imports.md)
+
+   - Module dependency analysis
+   - \[Export Analysis\](./3. imports-exports/B. Exports.md)
+   - \[TSConfig Support\](./3. imports-exports/C. TSConfig.md)
+   - Path resolution
+
+1. \[Type Analysis\](./4. type-analysis/A. Type Analysis.md)
+
+   - \[Type Analysis\](./4. type-analysis/A. Type Analysis.md)
+   - \[Tree Walking\](./4. type-analysis/B. Tree Walking.md)
+   - \[Name Resolution\](./4. type-analysis/C. Name Resolution.md)
+   - \[Chained Attributes\](./4. type-analysis/D. Chained Attributes.md)
+   - \[Function Calls\](./4. type-analysis/E. Function Calls.md)
+   - \[Generics\](./4. type-analysis/F. Generics.md)
+   - \[Subscript Expression\](./4. type-analysis/G. Subscript Expression.md)
+   - \[Graph Edges\](./4. type-analysis/H. 
Graph Edges.md) + +1. \[Performing Edits\](./5. performing-edits/A. Edit Operations.md) + + - \[Transaction Manager\](./5. performing-edits/B. Transaction Manager.md) + - Change validation + - Format preservation + +1. \[Incremental Computation\](./6. incremental-computation/A. Overview.md) + + - \[Detecting Changes\](./6. incremental-computation/B. Change Detection.md) + - \[Recomputing Graph\](./6. incremental-computation/C. Graph Recomputation.md) + - Cache invalidation diff --git a/architecture/external/dependency-manager.md b/architecture/external/dependency-manager.md new file mode 100644 index 000000000..071a10526 --- /dev/null +++ b/architecture/external/dependency-manager.md @@ -0,0 +1,7 @@ +# Dependency Manager + +TODO + +## Next Step + +The dependency manager works closely with the [Type Engine](./type-engine.md) to ensure type compatibility across dependencies. diff --git a/architecture/external/type-engine.md b/architecture/external/type-engine.md new file mode 100644 index 000000000..54313a82b --- /dev/null +++ b/architecture/external/type-engine.md @@ -0,0 +1,7 @@ +# Type Engine + +TODO + +## Next Step + +The type engine works in conjunction with the [Dependency Manager](./dependency-manager.md) to ensure type safety across project dependencies. diff --git a/docs/README.md b/docs/README.md index fe8f68c86..31ec67d2a 100644 --- a/docs/README.md +++ b/docs/README.md @@ -3,6 +3,7 @@ ## Development From within the `docs/` subdirectory: + ```bash npm i -g mintlify mintlify dev --port 3333 diff --git a/src/codegen/cli/README.md b/src/codegen/cli/README.md index 287d79787..1f09650d8 100644 --- a/src/codegen/cli/README.md +++ b/src/codegen/cli/README.md @@ -1,7 +1,9 @@ # Codegen CLI + A codegen module that handles all `codegen` CLI commands. ### Dependencies + - [codegen.sdk](https://github.com/codegen-sh/codegen-sdk/tree/develop/src/codegen/sdk) - [codegen.shared](https://github.com/codegen-sh/codegen-sdk/tree/develop/src/codegen/shared) diff --git a/src/codegen/git/README.md b/src/codegen/git/README.md index 5dab61a69..c2e4d7e84 100644 --- a/src/codegen/git/README.md +++ b/src/codegen/git/README.md @@ -1,6 +1,8 @@ # Codegen Git + A codegen module to supports git operations on codebase. ### Dependencies + - [codegen.sdk](https://github.com/codegen-sh/codegen-sdk/tree/develop/src/codegen/sdk) - [codegen.shared](https://github.com/codegen-sh/codegen-sdk/tree/develop/src/codegen/shared) diff --git a/src/codegen/gsbuild/README.md b/src/codegen/gsbuild/README.md index 72ba434fc..f2a4c9a02 100644 --- a/src/codegen/gsbuild/README.md +++ b/src/codegen/gsbuild/README.md @@ -1,2 +1,3 @@ # Codegen GS Build + A codegen module that builds the codegen SDK. diff --git a/src/codegen/gscli/README.md b/src/codegen/gscli/README.md index ad8774b53..9bcf652c3 100644 --- a/src/codegen/gscli/README.md +++ b/src/codegen/gscli/README.md @@ -1,2 +1,3 @@ # Codegen GS CLI + This module to be moved out into `src/code_generation` diff --git a/src/codegen/runner/README.md b/src/codegen/runner/README.md index 7f9689fd4..facb14e63 100644 --- a/src/codegen/runner/README.md +++ b/src/codegen/runner/README.md @@ -1,7 +1,9 @@ # Codegen Runner + A module to run functions with managed state + lifecycle. 
### Dependencies + - [codegen.sdk](https://github.com/codegen-sh/codegen-sdk/tree/develop/src/codegen/sdk) - [codegen.git](https://github.com/codegen-sh/codegen-sdk/tree/develop/src/codegen/git) - [codegen.shared](https://github.com/codegen-sh/codegen-sdk/tree/develop/src/codegen/shared) diff --git a/src/codegen/sdk/README.md b/src/codegen/sdk/README.md index 83611b169..7aefbe289 100644 --- a/src/codegen/sdk/README.md +++ b/src/codegen/sdk/README.md @@ -1,6 +1,8 @@ # Codegen SDK + A codegen module that contains the core Codebase graph parsing and manipulation logic. ### Dependencies + - [codegen.git](https://github.com/codegen-sh/codegen-sdk/tree/develop/src/codegen/git) - [codegen.shared](https://github.com/codegen-sh/codegen-sdk/tree/develop/src/codegen/shared) diff --git a/src/codegen/shared/README.md b/src/codegen/shared/README.md index 080bd8251..8c21453f1 100644 --- a/src/codegen/shared/README.md +++ b/src/codegen/shared/README.md @@ -1,6 +1,8 @@ # Codegen Shared + A codegen module to contain a miscellaneous set of shared utilities. ### Dependencies + This module should NOT contain any high level dependencies on other codegen modules. It should only depend on standard libraries and other shared utilities. diff --git a/src/codegen/shared/compilation/README.md b/src/codegen/shared/compilation/README.md index d6913e076..236fdb209 100644 --- a/src/codegen/shared/compilation/README.md +++ b/src/codegen/shared/compilation/README.md @@ -1,6 +1,7 @@ Utils around compiling a user's codeblock into a function. This includes: + - Raising on any dangerous operations in the codeblock - Catching and logging any compilation errors - Monkey patching built-ins like print diff --git a/tests/shared/README.md b/tests/shared/README.md index 171f05cf8..391194a9d 100644 --- a/tests/shared/README.md +++ b/tests/shared/README.md @@ -1,4 +1,5 @@ # Tests/shared ### What kind of things should go in here? + Testing utilities like mocks, fixtures, test data that are used across multiple tests.