42 changes: 41 additions & 1 deletion architecture/6. incremental-computation/A. Overview.md
@@ -1,6 +1,46 @@
# Incremental Computation

TODO
After we make changes to the codebase, we may need to recompute the codebase graph.
This is not a trivial task, because the recomputation must be both incremental and efficient.

## Use Cases

### 1. Repeated Moves

```python
# file1.py
def foo():
    return bar()


def bar():
    return 42
```

Let's move symbol `bar` to `file2.py`:

```python
# file2.py
def bar():
    return 42
```

Then we move symbol `foo` to `file3.py`:

```python
# file3.py
from file2 import bar


def foo():
    return bar()
```

Notice that the import added to `file3.py` comes from `file2`, not `file1`. This means that before we can move `foo` to `file3.py`, we need to sync the graph so that it reflects the earlier move of `bar` to `file2.py`.

### 2. Branching

If we want to check out a different branch, we need to update the baseline state to the git commit of the new branch and recompute the codebase graph.

## Next Step

53 changes: 52 additions & 1 deletion architecture/6. incremental-computation/B. Change Detection.md
@@ -1,6 +1,57 @@
# Change Detection

TODO
## Lifecycle of an operation on the codebase graph

A change goes through four states. By default, we do not apply changes to the codebase graph, only to the filesystem.

### Pending transactions

After calling an edit or other transaction method, the changes are stored in a pending transaction. Pending transactions will be committed as described in the previous chapter.

### Pending syncs

After a transaction is committed, the file is marked as a pending sync. This means the filesystem state has been updated, but the codebase graph has not been updated yet.

### Applied syncs

When we sync the graph, we apply all the pending syncs and clear them. The codebase graph is updated to reflect the changes. We track all the applied syncs in the codebase graph.

### Saved/baseline state

Finally, we can set the baseline state to a git commit. This is the state we target when we reset the codebase graph. When we check out a branch, we update the baseline state.
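The four states above can be pictured as a small state machine. The sketch below is illustrative only: `ChangeState`, `Change`, and `NEXT_STATE` are hypothetical names, not part of the actual codebase, and it assumes that a change only ever advances one state at a time.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ChangeState(Enum):
    PENDING_TRANSACTION = auto()  # edit recorded, nothing written yet
    PENDING_SYNC = auto()         # written to the filesystem, graph is stale
    APPLIED_SYNC = auto()         # graph reflects the change
    BASELINE = auto()             # captured by the baseline git commit


# Hypothetical transition table: each state may only advance to the next one.
NEXT_STATE = {
    ChangeState.PENDING_TRANSACTION: ChangeState.PENDING_SYNC,
    ChangeState.PENDING_SYNC: ChangeState.APPLIED_SYNC,
    ChangeState.APPLIED_SYNC: ChangeState.BASELINE,
}


@dataclass
class Change:
    path: str
    state: ChangeState = ChangeState.PENDING_TRANSACTION

    def advance(self) -> ChangeState:
        """Move the change to its next lifecycle state."""
        if self.state not in NEXT_STATE:
            raise ValueError(f"{self.path} is already part of the baseline")
        self.state = NEXT_STATE[self.state]
        return self.state
```

For example, a freshly created `Change("file1.py")` starts as a pending transaction and reaches the baseline after three calls to `advance()`.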

## Change Detection

When we sync or build the graph, first we build a list of all files in 3 categories:

- Removed files
- Added files
- Files to reparse

For example, if we move a file, it will appear in both the added and the removed files.
If we add a file, it will appear in the added files even if we performed edits on it later.
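A minimal sketch of this classification, assuming changes arrive as an ordered list of `(kind, path)` events. The event format and the `classify_changes` and `move` helpers are hypothetical. How a file that is added and then removed (or removed and then re-created) should be treated is not stated above, so the handling of those cases is an assumption.

```python
def classify_changes(events):
    """Fold ordered (kind, path) events into added / removed / to-reparse sets.

    kind is one of "add", "remove", "modify" (hypothetical event names).
    """
    added, removed, to_reparse = set(), set(), set()
    for kind, path in events:
        if kind == "add":
            if path in removed:
                # Assumption: removed then re-created counts as a modified file.
                removed.discard(path)
                to_reparse.add(path)
            else:
                added.add(path)
        elif kind == "remove":
            if path in added:
                # Assumption: added then removed cancels out entirely.
                added.discard(path)
            else:
                to_reparse.discard(path)
                removed.add(path)
        elif kind == "modify":
            # An added file stays in "added" even if it is edited later.
            if path not in added:
                to_reparse.add(path)
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return added, removed, to_reparse


def move(src, dst):
    """A move is modeled as removing the source and adding the destination."""
    return [("remove", src), ("add", dst)]
```

With `move("a.py", "b.py")` followed by adding and then editing `c.py`, the result has `b.py` and `c.py` as added, `a.py` as removed, and nothing to reparse.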

## Codebase.commit logic

Commit proceeds as follows:

1. Commit all pending transactions
1. Write all buffered files to the disk
1. Record the written files as pending changes (we usually skip the remaining steps if we commit without syncing the graph)
1. Build list of removed, added and modified files from pending changes
1. For removed files, we need to remove all the edges that point to the file.
1. For added files, we need to add all the edges that point to the file.
1. For modified files, we remove all the edges that point to the file and add all the edges that point to the new file. This is complicated since edges may pass through the modified file and need to be intelligently updated.
1. Mark all pending changes as applied
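The steps above can be sketched with a toy model. Everything here is a hypothetical stand-in: `MiniCodebase` keeps files in a dictionary instead of on disk, treats each file as a single node, and derives edges from `from X import` lines with a regular expression. The real edge update for modified files is more involved than this.

```python
import re


class MiniCodebase:
    """Toy model: files are strings, edges are (importer, imported) pairs."""

    def __init__(self, files):
        self.disk = dict(files)         # stands in for the filesystem
        self.pending_transactions = {}  # path -> new content, None means delete
        self.pending_changes = set()    # written to disk, not yet in the graph
        self.applied = []               # changes that reached the graph
        self.edges = set()
        for path in self.disk:
            self._add_edges(path)

    def _add_edges(self, path):
        for module in re.findall(r"^from (\w+) import", self.disk[path], re.M):
            self.edges.add((path, module + ".py"))

    def edit(self, path, content):
        self.pending_transactions[path] = content

    def commit(self, sync_graph=True):
        # Steps 1-3: flush transactions to "disk" and record pending changes.
        for path, content in self.pending_transactions.items():
            if content is None:
                self.disk.pop(path, None)
            else:
                self.disk[path] = content
            self.pending_changes.add(path)
        self.pending_transactions.clear()
        if not sync_graph:
            return
        # Steps 4-7: update edges for removed, added, and modified files.
        for path in sorted(self.pending_changes):
            # Outgoing edges of the old version are always stale.
            self.edges = {e for e in self.edges if e[0] != path}
            if path in self.disk:
                self._add_edges(path)  # added or modified file
            else:
                # Removed file: also drop edges that point to it.
                self.edges = {e for e in self.edges if e[1] != path}
            # Step 8: mark the change as applied.
            self.applied.append(path)
        self.pending_changes.clear()
```

Calling `commit(sync_graph=False)` leaves the graph stale with the changed paths in `pending_changes`; a later `commit()` applies them.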

## Reset logic

Reset is just the inverse of commit. We need to:

1. Cancel all pending transactions
1. Restore the file state to that of the target git commit
1. Clear all pending changes to the graph
1. Reverse all applied syncs to the graph
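One way to make the last step possible is to log, for every applied sync, which edges it actually added and removed, so the log can be replayed backwards. The sketch below assumes exactly that. `GraphState` and `AppliedSync` are hypothetical names, and the baseline commit is represented by a plain dictionary rather than a real git checkout.

```python
from dataclasses import dataclass, field


@dataclass
class AppliedSync:
    added_edges: frozenset    # edges that did not exist before the sync
    removed_edges: frozenset  # edges that existed and were dropped


@dataclass
class GraphState:
    baseline_files: dict  # hypothetical snapshot of the target git commit
    files: dict
    edges: set
    pending_transactions: list = field(default_factory=list)
    pending_changes: set = field(default_factory=set)
    applied_syncs: list = field(default_factory=list)

    def apply_sync(self, added_edges, removed_edges):
        # Record only the effective difference so the sync is exactly invertible.
        added = frozenset(added_edges) - self.edges
        removed = (frozenset(removed_edges) & self.edges) - frozenset(added_edges)
        self.edges = (self.edges - removed) | added
        self.applied_syncs.append(AppliedSync(added, removed))

    def reset(self):
        self.pending_transactions.clear()       # 1. cancel pending transactions
        self.files = dict(self.baseline_files)  # 2. restore file state
        self.pending_changes.clear()            # 3. clear pending changes
        while self.applied_syncs:               # 4. undo syncs, newest first
            sync = self.applied_syncs.pop()
            self.edges = (self.edges - sync.added_edges) | sync.removed_edges
```

Undoing the syncs in reverse order matters: a later sync may remove an edge that an earlier sync added.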

## Next Step

@@ -1,6 +1,39 @@
# Graph Recomputation

TODO
## Node Reparsing

Some limitations we encounter are:

- It is non-trivial to update tree-sitter nodes in place, and the SDK has no method to do this.
- Therefore, all existing nodes are invalidated and need to be recomputed every time the filesystem state changes.

As a result, to recompute the graph we must first update the filesystem state. Then we can remove all nodes in the modified files and create new nodes for them.
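A rough sketch of this "drop and recreate" approach follows. It uses Python's standard `ast` module as a stand-in for tree-sitter, and a plain dictionary of top-level symbol names as a stand-in for graph nodes. Both are simplifying assumptions, and `reparse_files` is a hypothetical helper.

```python
import ast


def reparse_files(nodes, files, modified):
    """Replace the nodes of every modified file with freshly parsed ones.

    nodes:    path -> list of top-level symbol names (stand-in for graph nodes)
    files:    path -> current source text (the already-updated filesystem state)
    modified: paths whose nodes are stale
    """
    for path in modified:
        nodes.pop(path, None)  # all old nodes of the file are invalid
        tree = ast.parse(files[path])
        nodes[path] = [
            stmt.name
            for stmt in tree.body
            if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
    return nodes
```

Note that the function reads from `files`, so the filesystem state must already be up to date before it is called.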

## Edge Recomputation

- Nodes may either use (out edges) or be used by (in edges) other nodes.
- Recomputing the out-edges is straightforward: we just need to reparse the file and compute its dependencies again.
- Recomputing the in-edges is more difficult.
- The basic algorithm of any incremental computation engine is to:
  - Detect what changed
  - Update that query with the new data
  - If the output of the query changed, update all the queries that depend on that query.
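The basic algorithm can be illustrated with a generic, minimal engine. This is not the engine used by the codebase graph: `IncrementalEngine` and its methods are hypothetical, and the sketch propagates depth-first without topological ordering, so a query with several changed dependencies may be recomputed more than once.

```python
from collections import defaultdict


class IncrementalEngine:
    def __init__(self):
        self.inputs = {}                    # input name -> value
        self.funcs = {}                     # query name -> (function, dependencies)
        self.cache = {}                     # query name -> last output
        self.dependents = defaultdict(set)  # name -> queries that read it
        self.recomputed = []                # log of recomputed queries, for demos

    def set_input(self, name, value):
        # Detect what changed: only a different value triggers propagation.
        changed = self.inputs.get(name, object()) != value
        self.inputs[name] = value
        if changed:
            self._propagate(name)

    def define(self, name, fn, deps):
        self.funcs[name] = (fn, list(deps))
        for dep in deps:
            self.dependents[dep].add(name)
        self.cache[name] = self._evaluate(name)

    def _value(self, name):
        return self.inputs[name] if name in self.inputs else self.cache[name]

    def _evaluate(self, name):
        fn, deps = self.funcs[name]
        return fn(*(self._value(dep) for dep in deps))

    def _propagate(self, changed_name):
        for query in sorted(self.dependents[changed_name]):
            new_output = self._evaluate(query)  # update the query with new data
            self.recomputed.append(query)
            if new_output != self.cache.get(query):
                self.cache[query] = new_output
                self._propagate(query)          # output changed: update dependents
```

The key property is the early cutoff: if a recomputed query produces the same output as before, none of its dependents are touched.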

### Detecting what changed

One difficulty is that all nodes in an updated file are completely refreshed. Therefore, by default, the set of changed nodes includes every node in every updated file.

### Updating the query

To do this, we:

- Wipe the entire cache of the query engine
- Remove all existing out edges of the node
- Recompute dependencies of that node
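These three steps might look roughly like the following. The `update_node` function, the dictionary-based edge maps, and the `compute_dependencies` callback are all hypothetical. The sketch also keeps a mirrored in-edge map in step with the out-edges, which is an assumption about how the graph is stored.

```python
def update_node(node, out_edges, in_edges, cache, compute_dependencies):
    """Refresh the out-edges of a single node.

    out_edges: node -> set of nodes it uses
    in_edges:  node -> set of nodes that use it (mirror of out_edges)
    cache:     query-engine cache, wiped wholesale
    compute_dependencies: callable returning the nodes that `node` uses now
    """
    # 1. Wipe the entire cache of the query engine.
    cache.clear()
    # 2. Remove all existing out-edges of the node (and their mirrored entries).
    for target in out_edges.pop(node, set()):
        in_edges.get(target, set()).discard(node)
    # 3. Recompute the dependencies of the node.
    new_targets = set(compute_dependencies(node))
    out_edges[node] = new_targets
    for target in new_targets:
        in_edges.setdefault(target, set()).add(node)
    return new_targets
```

Wiping the whole cache is simple but coarse: every cached result is lost, including results unrelated to the updated node.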

### Update what changed

This part has not been fully implemented yet. Currently, we update all the nodes that are descendants of the changed node and all the nodes in the file.
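The current conservative strategy can be sketched as a breadth-first traversal. The text above does not say which relation defines "descendants", so the sketch assumes a hypothetical `children` mapping. `nodes_to_update` and `file_of` are likewise illustrative names.

```python
from collections import deque


def nodes_to_update(changed, children, file_of):
    """Return the descendants of `changed` plus every node in its file.

    children: node -> iterable of direct descendants (assumed relation)
    file_of:  node -> path of the file that contains it
    """
    result = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in result and child != changed:
                result.add(child)
                queue.append(child)
    # Every node in the changed node's file is refreshed as well.
    target_file = file_of[changed]
    result |= {node for node, path in file_of.items() if path == target_file}
    return result
```

This over-approximates the set of affected nodes: unrelated nodes in the same file are updated even if nothing they depend on changed.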

## Next Step
