codegen-sh · bagel897 · Feb 5, 2025
@@ -1,6 +1,46 @@
 # Incremental Computation
 
-TODO
+After we make changes to the codebase, we may need to recompute the codebase graph.
+This is not trivial, because the graph must be recomputed incrementally and efficiently.
+
+## Use Cases
+
+### 1. Repeated Moves
+
+```python
+# file1.py
+def foo():
+    return bar()
+
+
+def bar():
+    return 42
+```
+
+First, we move the symbol `bar` to `file2.py`:
+
+```python
+# file2.py
+def bar():
+    return 42
+```
+
+Then we move the symbol `foo` to `file3.py`:
+
+```python
+# file3.py
+from file2 import bar
+
+
+def foo():
+    return bar()
+```
+
+Notice that the import added to `file3.py` comes from `file2`, not `file1`. This means that before we can move `foo` to `file3.py`, we need to sync the graph so that it reflects the earlier move of `bar` to `file2.py`.
+
+### 2. Branching
+
+If we want to check out a different branch, we need to update the baseline state to the git commit of the new branch and recompute the codebase graph.
 
 ## Next Step
 

@@ -1,6 +1,57 @@
 # Change Detection
 
-TODO
+## Lifecycle of an operation on the codebase graph
+
+Changes go through four states. By default, changes are applied only to the filesystem, not to the codebase graph.
+
+### Pending transactions
+
+After calling an edit or other transaction method, the changes are stored in a pending transaction. Pending transactions will be committed as described in the previous chapter.
+
+### Pending syncs
+
+After a transaction is committed, the file is marked as a pending sync. This means the filesystem state has been updated, but the codebase graph has not been updated yet.
+
+### Applied syncs
+
+When we sync the graph, we apply all the pending syncs and clear them. The codebase graph is updated to reflect the changes. We track all the applied syncs in the codebase graph.
+
+### Saved/baseline state
+
+Finally, we can set the baseline state to a git commit. This is the state we target when we reset the codebase graph. When we check out a branch, we update the baseline state.
+
+## Change Detection
+
+When we sync or build the graph, we first sort all files into three categories:
+
+- Removed files
+- Added files
+- Files to reparse
+
+For example, if we move a file, it will appear in both the added and the removed files.
+If we add a file, it will appear in the added files even if we performed edits on it later.
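
The classification above can be sketched with plain set operations. This is only an illustration under stated assumptions: the `FileChanges` type, the `classify` function, and its three inputs (paths known to the graph, paths on disk, paths edited since the last sync) are hypothetical names, not the SDK's API.

```python
from dataclasses import dataclass


@dataclass
class FileChanges:
    removed: set[str]
    added: set[str]
    reparse: set[str]


def classify(graph_files: set[str], disk_files: set[str], edited: set[str]) -> FileChanges:
    """Sort paths into the three categories by comparing graph and disk state."""
    removed = graph_files - disk_files  # known to the graph, gone from disk
    added = disk_files - graph_files    # on disk, not yet in the graph
    # Only files present on both sides can need a reparse. An edited file that
    # is new to the graph stays in `added`, matching the rule described above.
    reparse = edited & graph_files & disk_files
    return FileChanges(removed, added, reparse)


# Hypothetical scenario: a.py was moved to b.py, c.py was edited,
# and d.py was created and then edited.
changes = classify(
    graph_files={"a.py", "c.py"},
    disk_files={"b.py", "c.py", "d.py"},
    edited={"c.py", "d.py"},
)
```

In this scenario the moved file shows up once as removed (`a.py`) and once as added (`b.py`), while `d.py` is only added despite the later edit.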
+
+## Codebase.commit logic
+
+A commit follows these steps:
+
+1. Commit all pending transactions
+1. Write all buffered files to the disk
+1. Store the written files as pending changes (usually we skip the remaining steps, since we commit without syncing the graph)
+1. Build list of removed, added and modified files from pending changes
+1. For removed files, we need to remove all the edges that point to the file.
+1. For added files, we need to add all the edges that point to the file.
+1. For modified files, we remove all the edges that point to the file and add all the edges that point to the new file. This is complicated since edges may pass through the modified file and need to be intelligently updated.
+1. Mark all pending changes as applied
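
The steps above can be sketched with an in-memory stand-in. Everything here is an assumption made for illustration: `ToyCodebase`, its attributes, and the `sync_graph` flag are hypothetical, a dict replaces the real filesystem, and the "graph" only records which file contents it was built from, so edge updates (steps 5 to 7) are reduced to a single assignment.

```python
class ToyCodebase:
    def __init__(self):
        self.disk = {}                  # path -> content, stands in for the filesystem
        self.graph = {}                 # path -> content the graph was last built from
        self.pending_transactions = []  # (path, new content, or None to delete)
        self.pending_syncs = set()      # files written to disk but not yet in the graph
        self.applied_syncs = []         # files whose changes the graph has absorbed

    def edit(self, path, content):
        self.pending_transactions.append((path, content))

    def commit(self, sync_graph=False):
        # Steps 1-3: flush transactions to disk and remember the touched files.
        for path, content in self.pending_transactions:
            if content is None:
                self.disk.pop(path, None)
            else:
                self.disk[path] = content
            self.pending_syncs.add(path)
        self.pending_transactions.clear()
        if not sync_graph:
            return  # the usual case: filesystem updated, graph untouched
        # Steps 4-7: bring the graph in line with the disk, file by file.
        for path in sorted(self.pending_syncs):
            if path in self.disk:
                self.graph[path] = self.disk[path]  # added or modified
            else:
                self.graph.pop(path, None)          # removed
        # Step 8: mark the pending changes as applied.
        self.applied_syncs.extend(sorted(self.pending_syncs))
        self.pending_syncs.clear()


cb = ToyCodebase()
cb.edit("a.py", "x = 1")
cb.commit()                 # written to disk, left as a pending sync
cb.commit(sync_graph=True)  # the graph now reflects a.py
```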
+
+## Reset logic
+
+Reset is the inverse of commit. We need to:
+
+1. Cancel all pending transactions
+1. Restore the file state to that of the target git commit
+1. Clear all pending changes to the graph
+1. Reverse all applied syncs to the graph
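
A minimal sketch of the four reset steps, assuming a plain dict holds the state. The `reset` function, the key names, and the `baseline_files` mapping (path to content at the target commit) are hypothetical, and reversing a sync is simplified to restoring or dropping the file's graph entry.

```python
def reset(state, baseline_files):
    """Return `state` to the baseline described by `baseline_files`."""
    # 1. Cancel all pending transactions.
    state["pending_transactions"].clear()
    # 2. Restore the file state to that of the target commit.
    state["disk"] = dict(baseline_files)
    # 3. Clear all pending changes to the graph.
    state["pending_syncs"].clear()
    # 4. Reverse the applied syncs, newest first.
    for path in reversed(state["applied_syncs"]):
        if path in baseline_files:
            state["graph"][path] = baseline_files[path]
        else:
            state["graph"].pop(path, None)
    state["applied_syncs"].clear()


state = {
    "disk": {"a.py": "x = 2", "b.py": "y = 1", "c.py": "z = 0"},
    "graph": {"a.py": "x = 2", "b.py": "y = 1"},
    "pending_transactions": [("d.py", "w = 3")],
    "pending_syncs": {"c.py"},
    "applied_syncs": ["a.py", "b.py"],
}
reset(state, baseline_files={"a.py": "x = 1"})
```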
 
 ## Next Step
 

@@ -1,6 +1,39 @@
 # Graph Recomputation
 
-TODO
+## Node Reparsing
+
+Some limitations we encounter are:
+
+- It is non-trivial to update tree-sitter nodes, and the SDK has no method to do this.
+- Therefore, all existing nodes are invalidated and need to be recomputed every time filesystem state changes.
+
+To recompute the graph, the filesystem state must be updated first. We can then remove all nodes in the modified files and create new nodes for them.
+
+## Edge Recomputation
+
+- Nodes may either use (out edges) or be used by (in edges) other nodes.
+  - Recomputing the out-edges is straightforward: we just need to reparse the file and compute dependencies again.
+  - Recomputing the in-edges is more difficult.
+    - The basic algorithm of any incremental computation engine is to:
+      - Detect what changed
+      - Update that query with the new data
+      - If the output of the query changed, we need to update all the queries that depend on that query.
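
The three-step loop above can be sketched as a worklist. This is a generic illustration rather than the engine used here: `propagate`, `dependents`, and `recompute` are hypothetical names, and the sketch does not order queries topologically, so a query with several changed inputs may be re-run before all of them are up to date.

```python
from collections import deque


def propagate(changed, dependents, recompute):
    """Re-run affected queries, stopping wherever an output did not change.

    changed:    query names whose inputs changed
    dependents: dict mapping a query to the queries that depend on it
    recompute:  callable(name) -> True if the query's output changed
    """
    queue = deque(changed)
    seen = set(queue)
    rerun = []
    while queue:
        name = queue.popleft()
        rerun.append(name)
        if recompute(name):  # only a changed output invalidates dependents
            for dep in dependents.get(name, ()):
                if dep not in seen:
                    seen.add(dep)
                    queue.append(dep)
    return rerun


# Toy queries: b depends on a, and c depends on b.
values = {"a": 2, "b": 0, "c": 1}
formulas = {
    "a": lambda v: 4,           # the input changes from 2 to 4
    "b": lambda v: v["a"] % 2,  # still 0, so its output is unchanged
    "c": lambda v: v["b"] + 1,
}
dependents = {"a": ["b"], "b": ["c"]}


def recompute(name):
    new = formulas[name](values)
    changed = new != values[name]
    values[name] = new
    return changed


rerun = propagate(["a"], dependents, recompute)
```

Here `c` is never re-run, because the output of `b` is the same before and after the change to `a`.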
+
+### Detecting what changed
+
+A difficulty is that nodes are completely refreshed for updated files. By default, the set of changed nodes therefore includes every node in the updated files.
+
+### Updating the query
+
+To do this, we:
+
+- Wipe the entire cache of the query engine
+- Remove all existing out edges of the node
+- Recompute dependencies of that node
+
+### Update what changed
+
+This part has not been fully implemented yet. Currently, we update all the nodes that are descendants of the changed node and all the nodes in the file.
 
 ## Next Step