diff --git a/architecture/2. parsing/B. AST Construction.md b/architecture/2. parsing/B. AST Construction.md
index c6484aaba..06a1cd48c 100644
--- a/architecture/2. parsing/B. AST Construction.md
+++ b/architecture/2. parsing/B. AST Construction.md
@@ -74,4 +74,4 @@ Statements have another layer of complexity. They are essentially pattern based
 
 ## Next Step
 
-After the AST is constructed, the system moves on to [Import Resolution](../3.%20imports-exports/A.%20Imports.md) to analyze module dependencies and resolve symbols across files.
+After the AST is constructed, the system moves on to [Directory Parsing](./C.%20Directory%20Parsing.md) to build a hierarchical representation of the codebase's directory structure.
diff --git a/architecture/2. parsing/C. Directory Parsing.md b/architecture/2. parsing/C. Directory Parsing.md
new file mode 100644
index 000000000..f25de2e29
--- /dev/null
+++ b/architecture/2. parsing/C. Directory Parsing.md
@@ -0,0 +1,50 @@
+# Directory Parsing
+
+The Directory Parsing system is responsible for creating and maintaining a hierarchical representation of the codebase's directory structure in memory. Directories do not hold references to the files themselves; instead, they hold the file names and perform a dynamic lookup when needed.
+
+In addition to providing a more cohesive API for listing directory files, the Directory API is also used for [TSConfig](../3.%20imports-exports/C.%20TSConfig.md)-based [Import Resolution](../3.%20imports-exports/A.%20Imports.md).
+
+## Core Components
+
+The Directory Tree is constructed during the initial `build_graph` step in `codebase_context.py`, and is recreated from scratch on every re-sync. More details are below.
+
+## Directory Tree Construction
+
+The directory tree is built through the following process:
+
+1. The `build_directory_tree` method in `CodebaseContext` is called during graph initialization or when the codebase structure changes.
+1. The method iterates through all files in the repository, creating directory objects for each directory path encountered.
+1. For each file, it adds the file to its parent directory using the `_add_file` method.
+1. Directories are created recursively as needed using the `get_directory` method with `create_on_missing=True`.
+
+## Directory Representation
+
+The `Directory` class provides a rich interface for working with directories:
+
+- **Hierarchy Navigation**: Access parent directories and subdirectories
+- **File Access**: Retrieve files by name or extension
+- **Symbol Access**: Find symbols (classes, functions, etc.) within files in the directory
+- **Directory Operations**: Rename, remove, or update directories
+
+Each `Directory` instance maintains:
+
+- A reference to its parent directory
+- Lists of files and subdirectories
+- Methods to recursively traverse the directory tree
+
+## File Representation
+
+Files are represented by the `File` class and its subclasses:
+
+- `File`: Base class for all files, supporting basic operations like reading and writing content
+- `SourceFile`: Specialized class for source code files that can be parsed into an AST
+
+Files maintain references to:
+
+- Their parent directory
+- Their content (loaded dynamically to preserve the source of truth)
+- For source files, the parsed AST and symbols
+
+## Next Step
+
+After the directory structure is parsed, the system can perform [Import Resolution](../3.%20imports-exports/A.%20Imports.md) to analyze module dependencies and resolve symbols across files.
diff --git a/architecture/3. imports-exports/A. Imports.md b/architecture/3. imports-exports/A. Imports.md
index 09d70d902..cca5951ab 100644
--- a/architecture/3. imports-exports/A. Imports.md
+++ b/architecture/3. imports-exports/A. Imports.md
@@ -1,7 +1,60 @@
 # Import Resolution
 
-TODO
+Import resolution follows AST construction and directory parsing in the code analysis pipeline. It identifies dependencies between modules and builds a graph of relationships across the codebase.
+
+> NOTE: This is an actively evolving part of Codegen SDK, so some details here may be incomplete, outdated, or incorrect.
+
+## Purpose
+
+The import resolution system serves these purposes:
+
+1. **Dependency Tracking**: Maps relationships between files by resolving import statements.
+1. **Symbol Resolution**: Connects imported symbols to their definitions.
+1. **Module Graph Construction**: Builds a directed graph of module dependencies.
+1. **(WIP) Cross-Language Support**: Provides implementations for different programming languages.
+
+## Core Components
+
+### ImportResolution Class
+
+The `ImportResolution` class represents the outcome of resolving an import statement. It contains:
+
+- The source file containing the imported symbol
+- The specific symbol being imported (if applicable)
+- Whether the import references an entire file/module
+
+### Import Base Class
+
+The `Import` class is the foundation for language-specific import implementations. It:
+
+- Stores metadata about the import (module path, symbol name, alias)
+- Provides the abstract `resolve_import()` method
+- Adds symbol resolution edges to the codebase graph
+
+### Language-Specific Implementations
+
+#### Python Import Resolution
+
+The `PyImport` class extends the base `Import` class with Python-specific logic:
+
+- Handles relative imports
+- Supports module imports, named imports, and wildcard imports
+- Resolves imports using configurable resolution paths and `sys.path`
+- Handles special cases like `__init__.py` files
+
+#### TypeScript Import Resolution
+
+The `TSImport` class implements TypeScript-specific resolution:
+
+- Supports named imports, default imports, and namespace imports
+- Handles type imports and dynamic imports
+- Resolves imports using TSConfig path mappings
+- Supports file extension resolution
+
+## Implementation
+
+After files and directories are parsed, we loop through all import nodes and call `add_symbol_resolution_edge` on each. This invokes the language-specific `resolve_import` method, which converts the import statement into an `ImportResolution` object (or `None` if the import cannot be resolved). The import and its `ImportResolution` object are then used to add a symbol resolution edge to the graph, which later steps use to resolve symbols.
 
 ## Next Step
 
-After import resolution, the system analyzes [Export Analysis](./B.%20Exports.md) and handles [TSConfig Support](./C.%20TSConfig.md) for TypeScript projects. This is followed by comprehensive [Type Analysis](../4.%20type-analysis/A.%20Type%20Analysis.md).
+After import resolution, the system performs [Export Analysis](./B.%20Exports.md) and handles [TSConfig Support](./C.%20TSConfig.md) for TypeScript projects. This is followed by [Type Analysis](../4.%20type-analysis/A.%20Type%20Analysis.md).
diff --git a/architecture/3. imports-exports/B. Exports.md b/architecture/3. imports-exports/B. Exports.md
index 9da67fcb4..0e42c98c4 100644
--- a/architecture/3. imports-exports/B. Exports.md
+++ b/architecture/3. imports-exports/B. Exports.md
@@ -1,6 +1,74 @@
 # Export Analysis
 
-TODO
+Some languages contain additional metadata on "exported" symbols, specifying which symbols are made available to other modules. Export analysis follows import resolution in the code analysis pipeline. It identifies and processes exported symbols from modules, enabling the system to track what each module makes available to others.
+
+## Core Components
+
+### Export Base Class
+
+The `Export` class serves as the foundation for language-specific export implementations. It:
+
+- Stores metadata about the export (symbol name, is default, etc.)
+- Tracks the relationship between the export and its declared symbol
+- Adds export edges to the codebase graph
+
+### TypeScript Export Implementation
+
+The `TSExport` class implements TypeScript-specific export handling:
+
+- Supports various export styles (named exports, default exports, re-exports)
+- Handles export declarations with and without values
+- Processes wildcard exports (`export * from 'module'`)
+- Manages export statements with multiple exports
+
+#### Export Types and Symbol Resolution
+
+The TypeScript implementation handles several types of exports:
+
+1. **Declaration Exports**
+
+   - Function declarations (including generators)
+   - Class declarations
+   - Interface declarations
+   - Type alias declarations
+   - Enum declarations
+   - Namespace declarations
+   - Variable/constant declarations
+
+1. **Value Exports**
+
+   - Object literals with property exports
+   - Arrow functions and function expressions
+   - Classes and class expressions
+   - Assignment expressions
+   - Primitive values and expressions
+
+1. **Special Export Forms**
+
+   - Wildcard exports (`export * from 'module'`)
+   - Named re-exports (`export { name as alias } from 'module'`)
+   - Default exports with various value types
+
+#### Symbol Tracking and Dependencies
+
+The export system:
+
+- Maintains relationships between exported symbols and their declarations
+- Validates that export names match their declared symbols
+- Tracks dependencies through the codebase graph
+- Handles complex scenarios like:
+  - Shorthand property exports in objects
+  - Nested function and class declarations
+  - Re-exports from other modules
+
+#### Integration with Type System
+
+Exports are tightly integrated with the type system:
+
+- Exported type declarations are properly tracked
+- Symbol resolution considers both value and type exports
+- Re-exports preserve type information
+- Export edges in the codebase graph maintain type relationships
 
 ## Next Step
 
diff --git a/architecture/3. imports-exports/C. TSConfig.md b/architecture/3. imports-exports/C. TSConfig.md
index e9c77ae0c..b2362a7c8 100644
--- a/architecture/3. imports-exports/C. TSConfig.md
+++ b/architecture/3. imports-exports/C. TSConfig.md
@@ -1,6 +1,80 @@
 # TSConfig Support
 
-TODO
+TSConfig support is a critical component for TypeScript projects in the import resolution system. It processes TypeScript configuration files (tsconfig.json) to correctly resolve module paths and dependencies.
+
+## Purpose
+
+The TSConfig support system serves these purposes:
+
+1. **Path Mapping**: Resolves custom module path aliases defined in the tsconfig.json file.
+1. **Base URL Resolution**: Handles non-relative module imports using the baseUrl configuration.
+1. **Project References**: Manages dependencies between TypeScript projects using the references field.
+1. **Directory Structure**: Respects rootDir and outDir settings for maintaining proper directory structures.
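The first two purposes (path mapping and baseUrl resolution) can be illustrated with a small sketch. This is not the `TSConfig` class itself: `translate_import` is a hypothetical helper, it only tries the first target of each alias, and it ignores `extends` and project references.

```python
import posixpath


def translate_import(import_path: str, base_url: str, paths: dict[str, list[str]]) -> str:
    """Translate an import specifier into a file path, tsconfig-style (simplified)."""
    # Relative imports are not affected by `paths` or `baseUrl`.
    if import_path.startswith("."):
        return import_path

    for pattern, targets in paths.items():
        if not targets:
            continue
        if pattern.endswith("*"):
            prefix = pattern[:-1]  # "@app/*" -> "@app/"
            if import_path.startswith(prefix):
                rest = import_path[len(prefix):]
                target = targets[0]
                # Substitute the matched remainder for the target's wildcard.
                if target.endswith("*"):
                    target = target[:-1]
                return posixpath.normpath(posixpath.join(base_url, target + rest))
        elif pattern == import_path:
            # Exact alias with no wildcard.
            return posixpath.normpath(posixpath.join(base_url, targets[0]))

    # No alias matched: fall back to plain baseUrl resolution.
    return posixpath.normpath(posixpath.join(base_url, import_path))


aliased = translate_import("@app/utils/date", "/repo", {"@app/*": ["src/app/*"]})
plain = translate_import("lib/math", "/repo", {})
print(aliased)  # /repo/src/app/utils/date
print(plain)    # /repo/lib/math
```

The result is a path without a file extension; picking `.ts`, `.tsx`, or an `index` file would be a separate step.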
+
+## Core Components
+
+### TSConfig Class
+
+The `TSConfig` class represents a parsed TypeScript configuration file. It:
+
+- Parses and stores the configuration settings from tsconfig.json
+- Handles inheritance through the "extends" field
+- Provides methods for translating between import paths and absolute file paths
+- Caches computed values for performance optimization
+
+## Configuration Processing
+
+### Configuration Inheritance
+
+TSConfig files can extend other configuration files through the "extends" field:
+
+1. Base configurations are loaded and parsed first
+1. Child configurations inherit and can override settings from their parent
+1. Path mappings, base URLs, and other settings are merged appropriately
+
+### Path Mapping Resolution
+
+The system processes the "paths" field in tsconfig.json to create a mapping between import aliases and file paths:
+
+1. Path patterns are normalized (removing wildcards, trailing slashes)
+1. Relative paths are converted to absolute paths
+1. Mappings are stored for efficient lookup during import resolution
+
+### Project References
+
+The "references" field defines dependencies between TypeScript projects:
+
+1. Referenced projects are identified and loaded
+1. Their configurations are analyzed to determine import paths
+1. Import resolution can cross project boundaries using these references
+
+## Import Resolution Process
+
+### Path Translation
+
+When resolving an import path in TypeScript:
+
+1. Check if the path matches any path alias in the tsconfig.json
+1. If a match is found, translate the path according to the mapping
+1. Apply baseUrl resolution for non-relative imports
+1. Handle project references for cross-project imports
+
+### Optimization Techniques
+
+The system employs several optimizations:
+
+1. Caching computed values to avoid redundant processing
+1. Early path checking for common patterns (e.g., paths starting with "@" or "~")
+1. Hierarchical resolution that respects the configuration inheritance chain
+
+## Integration with Import Resolution
+
+The TSConfig support integrates with the broader import resolution system:
+
+1. Each TypeScript file is associated with its nearest tsconfig.json
+1. Import statements are processed using the file's associated configuration
+1. Path mappings are applied during the module resolution process
+1. Project references are considered when resolving imports across project boundaries
 
 ## Next Step
 
diff --git a/architecture/5. performing-edits/A. Edit Operations.md b/architecture/5. performing-edits/A. Edit Operations.md
deleted file mode 100644
index 850b8e103..000000000
--- a/architecture/5. performing-edits/A. Edit Operations.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Edit Operations
-
-TODO
-
-## Next Step
-
-After preparing edits, they are managed by the [Transaction Manager](./B.%20Transaction%20Manager.md) to ensure consistency and atomicity.
diff --git a/architecture/5. performing-edits/A. Transactions.md b/architecture/5. performing-edits/A. Transactions.md
new file mode 100644
index 000000000..c27c7e65f
--- /dev/null
+++ b/architecture/5. performing-edits/A. Transactions.md
@@ -0,0 +1,54 @@
+# Transactions
+
+Transactions represent atomic changes to files in the codebase. Each transaction defines a specific modification that can be queued, validated, and executed.
+
+## Transaction Types
+
+The transaction system is built around a base `Transaction` class with specialized subclasses:
+
+### Content Transactions
+
+- **RemoveTransaction**: Removes content between specified byte positions
+- **InsertTransaction**: Inserts new content at a specified byte position
+- **EditTransaction**: Replaces content between specified byte positions
+
+### File Transactions
+
+- **FileAddTransaction**: Creates a new file
+- **FileRenameTransaction**: Renames an existing file
+- **FileRemoveTransaction**: Deletes a file
+
+## Transaction Priority
+
+Transactions are executed in a specific order defined by the `TransactionPriority` enum:
+
+1. **Remove** (highest priority)
+1. **Edit**
+1. **Insert**
+1. **FileAdd**
+1. **FileRename**
+1. **FileRemove**
+
+This ordering ensures that content is removed before editing or inserting, and that all content operations happen before file operations.
+
+## Key Concepts
+
+### Byte-Level Operations
+
+All content transactions operate at the byte level rather than on lines or characters. This provides precise control over modifications and allows transactions to work with any file type, regardless of encoding or line ending conventions.
+
+### Content Generation
+
+Transactions support both static content (direct strings) and dynamic content (generated at execution time). This flexibility allows for complex transformations where the new content depends on the state of the codebase at execution time.
+
+Most content transactions use static content, but dynamic content is supported for rare cases where the new content depends on the state of other transactions. One common example is handling whitespace during add and remove transactions.
+
+### File Operations
+
+File transactions are used to create, rename, and delete files.
+
+> NOTE: Most file transactions, such as `FileAddTransaction`, are no-ops in the Transaction Manager (that is, they skip it) and are instead applied immediately when the `create_file` API is called. This makes created files immediately available for editing and use. File operations are still added to the Transaction Manager to help optimize graph re-parsing and diff generation, by keeping track of which files exist and which no longer exist.
+
+## Next Step
+
+Once created, transactions are managed by the [Transaction Manager](./B.%20Transaction%20Manager.md) to ensure consistency and atomicity.
diff --git a/architecture/5. performing-edits/B. Transaction Manager.md b/architecture/5. performing-edits/B. Transaction Manager.md
index a41d91270..4ed78a750 100644
--- a/architecture/5. performing-edits/B. Transaction Manager.md
+++ b/architecture/5. performing-edits/B. Transaction Manager.md
@@ -1,6 +1,92 @@
 # Transaction Manager
 
-TODO
+The Transaction Manager coordinates the execution of transactions across multiple files, handles conflict resolution, and enforces resource limits.
+
+## High-level Concept
+
+Since all node operations are on byte positions of the original file, multiple operations that change the total byte length of the file will result in offset errors and broken code.
+
+Consider this example:
+
+```
+Original: FooBar
+Operations: Remove "Foo" (bytes 0-3), Insert "Hello" (bytes 0-5)
+            Remove "Bar" (bytes 3-6), Insert "World" (bytes 3-8)
+```
+
+If these operations were applied in order, the result would be:
+
+```
+Result: FooBar
+Operation: Remove "Foo" (bytes 0-3), Insert "Hello" (bytes 0-5)
+Result: HelloBar
+Operation: Remove "Bar" (bytes 3-6), Insert "World" (bytes 3-8)
+Result: HelWorldar
+```
+
+This results in invalid output.
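The offset problem can be reproduced in a few lines of Python. This is a minimal sketch, not the actual `TransactionManager` code: `Edit`, `apply_in_order`, and `apply_backwards` are hypothetical names, and each remove-plus-insert pair is modeled as a single replace over a byte range of the original content. It also shows that applying the same edits from the highest start offset down keeps the remaining offsets valid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical stand-in for a replace over a byte range of the ORIGINAL content."""
    start: int  # inclusive byte offset
    end: int    # exclusive byte offset
    new: bytes


def apply_in_order(content: bytes, edits: list[Edit]) -> bytes:
    # Naive approach: later offsets become stale once an earlier edit
    # changes the length of the content.
    for edit in edits:
        content = content[:edit.start] + edit.new + content[edit.end:]
    return content


def apply_backwards(content: bytes, edits: list[Edit]) -> bytes:
    # Highest start offset first: bytes before the current edit are untouched,
    # so the offsets of the edits still to come remain valid.
    for edit in sorted(edits, key=lambda e: e.start, reverse=True):
        content = content[:edit.start] + edit.new + content[edit.end:]
    return content


edits = [Edit(0, 3, b"Hello"), Edit(3, 6, b"World")]
forward = apply_in_order(b"FooBar", edits)
backward = apply_backwards(b"FooBar", edits)
print(forward)   # b'HelWorldar'
print(backward)  # b'HelloWorld'
```

The sketch assumes the edits do not overlap; overlapping ranges are a separate problem.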
+
+⭐ The key with TransactionManager is that it queues up all transactions in a given Codemod run, then applies all of them ***backwards***, from the last byte range to the first. Given the same example as above but applied backwards:
+
+```
+Result: FooBar
+Operation: Remove "Bar" (bytes 3-6), Insert "World" (bytes 3-8)
+Result: FooWorld
+Operation: Remove "Foo" (bytes 0-3), Insert "Hello" (bytes 0-5)
+Result: HelloWorld
+```
+
+TransactionManager also performs some additional operations, such as detecting conflicts and coordinating (some basic) conflict resolution. Overall, the core responsibilities are as follows:
+
+1. **Transaction Queueing**: Maintains a queue of pending transactions organized by file
+1. **Conflict Resolution**: Detects and resolves conflicts between transactions
+1. **Transaction Execution**: Applies transactions in the correct order
+1. **Resource Management**: Enforces limits on transaction count and execution time
+1. **Change Tracking**: Generates diffs for applied changes
+
+## Sorting Transactions
+
+Before execution, transactions are sorted by the following keys, in priority order:
+
+1. Position in the file (higher byte positions first)
+1. Transaction type (following the priority order)
+1. User-defined priority
+1. Creation order
+
+This sorting ensures that transactions are applied in a deterministic order that minimizes conflicts. Higher byte positions are always edited first, removals happen before insertions, and older transactions are applied before newer ones.
+
+## Conflict Resolution
+
+### Conflict Types
+
+The manager identifies several types of conflicts:
+
+1. **Overlapping Transactions**: Multiple transactions affecting the same byte range
+1. **Contained Transactions**: One transaction completely contained within another
+1. **Adjacent Transactions**: Transactions affecting adjacent byte ranges
+
+In its current implementation, TransactionManager only handles Contained Transactions that are trivially solvable. (If a remove transaction is completely contained within another remove transaction, only the larger one is kept.)
+
+## Resource Management
+
+The Transaction Manager enforces two types of limits:
+
+1. **Transaction Count**: Optional maximum number of transactions
+1. **Execution Time**: Optional time limit for transaction processing
+
+These limits prevent excessive resource usage and allow for early termination of long-running operations.
+
+## Commit Process
+
+The commit process applies queued transactions to the codebase:
+
+1. Transactions are sorted according to priority rules
+1. Files are processed one by one
+1. For each file, transactions are executed in order
+1. Diffs are collected for each modified file
+1. The queue is cleared after successful commit
+
+The diffs are later used during re-sync to efficiently update the codebase graph as changes occur. See [Incremental Computation](../6.%20incremental-computation/A.%20Overview.md) for more details.
 
 ## Next Step
 
diff --git a/architecture/external/dependency-manager.md b/architecture/external/dependency-manager.md
index 071a10526..ed8e42a3d 100644
--- a/architecture/external/dependency-manager.md
+++ b/architecture/external/dependency-manager.md
@@ -1,6 +1,99 @@
 # Dependency Manager
 
-TODO
+> WARNING: Dependency manager is an experimental feature designed for Codegen Cloud! The current implementation WILL delete any existing `node_modules` folder!
+
+## Motivation
+
+A future goal of Codegen is to support resolving symbols directly from dependencies, instead of falling back to `ExternalModule`s. (In fact, some experimental Codegen features such as [Type Engine](./type-engine.md) already parse and use 3rd party dependencies from `node_modules`.)
+
+This requires us to pull and install dependencies from a repository's `package.json`. However, simply installing dependencies from `package.json` is not enough, as many projects require internal dependencies that use custom NPM registries. Others require custom post-install scripts that may not run in our codemod environments.
+
+Dependency Manager is an experimental solution to this problem. It creates a shadow tree of `package.json` files that includes all core dependencies and settings from the repository's original `package.json` without any custom registries or potentially problematic settings.
+
+> NOTE: Currently, this is only implemented for TypeScript projects.
+
+## Implementation
+
+Given this example codebase structure:
+
+```
+repo/
+├── package.json
+├── node_modules/
+├── src/
+│   ├── frontend/
+│   │   └── package.json
+│   └── backend/
+│       └── package.json
+└── tests/
+    └── package.json
+```
+
+Dependency Manager first deletes any existing `node_modules` folder in the user's repository. After this step, Dependency Manager initializes itself to use the correct version of NPM, Yarn, or PNPM for the user's repository.
+
+Dependency Manager then creates a "shadow copy" of the repository's original `package.json` file. This shadow copy is later used to revert any changes made by Codegen before running codemods. With these steps, the codebase structure now looks like this:
+
+```
+repo/
+├── package.json
+├── package.json.gs_internal.bak
+├── src/
+│   ├── frontend/
+│   │   └── package.json
+│   │   └── package.json.gs_internal.bak
+│   └── backend/
+│       └── package.json
+│       └── package.json.gs_internal.bak
+└── tests/
+    └── package.json
+    └── package.json.gs_internal.bak
+```
+
+Next, Dependency Manager iterates through all the `package.json` files and creates a "clean" version of each file. This "clean" version only includes a subset of information from the original, including:
+
+- Name
+- Version
+- Package Manager Details
+- Workspaces
+
+Most importantly, this step iterates through `dependencies` and `devDependencies` of each `package.json` file and validates them against the npm registry. If a package is not found, it is added to a list of invalid dependencies and removed from the `package.json` file.
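The cleaning step can be sketched as a pure function. This is an illustration under assumptions, not the Dependency Manager's code: `clean_package_json` and `is_published` are hypothetical names, the registry lookup is injected as a callable so the sketch needs no network access, and "Package Manager Details" is assumed to correspond to the `packageManager` field.

```python
from typing import Callable

# Assumed mapping of the kept fields onto package.json keys.
KEPT_FIELDS = ("name", "version", "packageManager", "workspaces")
DEP_FIELDS = ("dependencies", "devDependencies")


def clean_package_json(
    pkg: dict, is_published: Callable[[str], bool]
) -> tuple[dict, list[str]]:
    """Return a cleaned copy of `pkg` plus the names of dropped dependencies."""
    cleaned = {key: pkg[key] for key in KEPT_FIELDS if key in pkg}
    invalid: list[str] = []
    for field in DEP_FIELDS:
        kept = {}
        for name, version in pkg.get(field, {}).items():
            if is_published(name):
                kept[name] = version
            else:
                invalid.append(name)  # not found in the registry
        if kept:
            cleaned[field] = kept
    return cleaned, invalid


package = {
    "name": "app",
    "version": "1.0.0",
    "scripts": {"postinstall": "./setup.sh"},
    "dependencies": {"react": "^18.0.0", "@acme/internal-ui": "2.1.0"},
    "devDependencies": {"typescript": "^5.0.0"},
}
public_packages = {"react", "typescript"}  # stand-in for a registry lookup
cleaned, invalid = clean_package_json(package, lambda name: name in public_packages)
print(cleaned)
print(invalid)
```

In this example the `scripts` entry and the internal `@acme/internal-ui` package are dropped, while the public dependencies survive unchanged.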
+
+After this step, the codebase structure now looks like this:
+
+```
+repo/
+├── package.json (modified)
+├── package.json.gs_internal.bak
+├── src/
+│   ├── frontend/
+│   │   └── package.json (modified)
+│   │   └── package.json.gs_internal.bak
+│   └── backend/
+│       └── package.json (modified)
+│       └── package.json.gs_internal.bak
+└── tests/
+    └── package.json (modified)
+    └── package.json.gs_internal.bak
+```
+
+After the shadow and cleaning steps, Dependency Manager proceeds to install the user's dependencies through NPM, Yarn, or PNPM, depending on the detected installer type. Finally, Dependency Manager restores the original `package.json` files and removes the shadow copies.
+
+The final codebase structure looks like this:
+
+```
+repo/
+├── package.json
+├── node_modules/
+├── src/
+│   ├── frontend/
+│   │   └── package.json
+│   └── backend/
+│       └── package.json
+└── tests/
+    └── package.json
+```
+
+If all goes well, Dependency Manager will have successfully installed the user's dependencies and prepared the codebase for codemods.
 
 ## Next Step
 
diff --git a/architecture/external/type-engine.md b/architecture/external/type-engine.md
index 54313a82b..42b96f643 100644
--- a/architecture/external/type-engine.md
+++ b/architecture/external/type-engine.md
@@ -1,6 +1,24 @@
 # Type Engine
 
-TODO
+Type Engine is an experimental feature of Codegen that leverages the [TypeScript Compiler API](https://github.com/microsoft/TypeScript/wiki/Using-the-Compiler-API) to provide deeper insight into a user's codebase (such as resolving return types).
+
+> NOTE: Currently, this is only implemented for TypeScript projects.
+
+There are currently two experimental implementations of TypeScript's Type Engine: an external process-based implementation and a V8-based implementation.
+
+## Implementation (External Process)
+
+During codebase parsing, the Type Engine spawns a type inference subprocess (defined in `src/codegen/sdk/typescript/external/typescript_analyzer/run_full.ts`) that concurrently parses the codebase with the TypeScript API to resolve return types. The final analyzer output is placed in `/tmp/typescript-analysis.json` and is read in by Codegen to resolve return types.
+
+## Implementation (V8)
+
+The V8-based implementation is much more flexible and powerful in comparison but is currently not as stable. It uses the [PyMiniRacer](https://github.com/sqreen/py_mini_racer) package to spawn a V8-based JavaScript engine that can parse the codebase with the TypeScript API to resolve return types.
+
+The entirety of `src/codegen/sdk/typescript/external/typescript_analyzer` is compiled down using [Rollup.js](https://rollupjs.org/) into a single `index.js` file. A couple of patches are applied to the engine source to remove `require` and `export` statements, which are not supported by MiniRacer.
+
+Then, the entire `index.js` file is loaded into the MiniRacer context. To work around file read limitations with V8, an in-memory shadow filesystem is created that mimics the user's repository's filesystem. These are defined in `fsi.ts` (`FileSystemInterface`) and `fs_proxy.ts` (`ProxyFileSystem`). The TypeScript Compiler then uses the custom `ProxyFileSystem.readFile` function instead of the traditional `fs.readFile`.
+
+Once the analyzer is initialized and the codebase is parsed, the entire TypeScript Compiler API is available in the MiniRacer context. The analyzer can then be used to resolve return types for any function in the codebase or to parse the codebase and generate a full type analysis.
 
 ## Next Step