
Why I think static tree representations tend to be less bug-prone than patches #72

@joliss

If we use in-memory data structures rather than physical directories to speed up communication between plugins, we have two options as to what exactly we exchange: (1) a list of changes, or (2) a static view of the files, like entries. The current usage of FSTree seems to favor option (1). I've alluded in the past to preferring (2) for usability reasons, but I've never properly explained why. So let me do that now.

There are no action items to take in response to this GitHub issue, but if you're working on FSTree, and particularly on getting FSTree into the official Node API, I'd simply like you to read this, please!

Here are the specifics of each option:

  1. A patch, i.e. a list of changes, looking like [['change', 'foo/a.js'], ['create', 'bar/b.js'], ['unlink', 'bar/c.js']]

  2. A static representation of the directory tree, similar to the current entries structure, but Merkle-tree-like, e.g.

    // First argument (200) is maximum last-modified time of all
    // subdirectories and files recursively, taking the role of the hash in a
    // proper Merkle tree. Current time is 200.
    Directory(200, {
      'foo': Directory(200, {
        'a.js': File(200)
      }),
      'bar': Directory(200, {
        'b.js': File(200),
        /* no c.js here */
        'd.js': File(100)
      })
    })

Now clearly these options are technically isomorphic: We can reconstruct an in-memory Merkle tree of files by applying the patch to it on each rebuild, and we can conversely reconstruct the patch by diffing the current tree against the tree we got on the previous rebuild. Performance of this diffing operation is, by the way, the reason why I'm suggesting a Merkle-tree-like structure: With a flat entry list, diffing is O(totalFiles), whereas with a Merkle tree, it is O(changedFiles) (except for pathological cases), similar to how git diff commit1..commit2 is very fast even on large trees.
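To make the O(changedFiles) claim concrete, here is a minimal sketch of such a diff, using hypothetical `File`/`Directory` shapes modeled on the example above (this is not FSTree's actual API). The key is the fast path: when two directories have the same maximum last-modified time, the whole subtree is skipped, just as git skips tree objects with identical hashes.

```javascript
function File(mtime) {
  return { type: 'file', mtime: mtime };
}

function Directory(maxMtime, entries) {
  return { type: 'directory', maxMtime: maxMtime, entries: entries };
}

function diffTrees(oldTree, newTree, prefix, patch) {
  prefix = prefix || '';
  patch = patch || [];
  // Fast path: equal max-mtimes mean nothing changed anywhere in this
  // subtree, so we never even look at its entries
  if (oldTree.maxMtime === newTree.maxMtime) return patch;

  var names = {};
  for (var a in oldTree.entries) names[a] = true;
  for (var b in newTree.entries) names[b] = true;

  for (var name in names) {
    var oldEntry = oldTree.entries[name];
    var newEntry = newTree.entries[name];
    var path = prefix + name;
    if (oldEntry && !newEntry) {
      patch.push(['unlink', path]);
    } else if (!oldEntry && newEntry) {
      patch.push(['create', path]);
    } else if (oldEntry.type !== newEntry.type) {
      // A file replaced by a directory (or vice versa) counts as both
      patch.push(['unlink', path]);
      patch.push(['create', path]);
    } else if (oldEntry.type === 'file') {
      if (oldEntry.mtime !== newEntry.mtime) patch.push(['change', path]);
    } else {
      diffTrees(oldEntry, newEntry, path + '/', patch);
    }
  }
  return patch;
}
```

Only the subtrees whose max-mtime actually changed are visited, so the work done is proportional to the number of changed files, not the total number of files.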

However, despite this isomorphism, what I want to argue here is that there is a major usability difference between the two, in that option (1) encourages a class of subtle caching bugs. To make plugin authors "fall into a pit of success" as Stef likes to say, I'd like to steer them towards working against the static representation by default, and only use diffing when it's really needed for performance.

Here's an example of the type of bug that can arise if you try to process a list of changes.

Let's say we're writing a plugin that transforms any file foo.js with an optional neighboring foo.js.map file into some output file(s).

If you were to write down an algorithm for this with caching, it would probably look like this (pseudo-code):

function build(changes) {
  changes.on(['create', 'change'], function(entry) {
    if (!/\.js$/.test(entry.relativePath)) return;

    var outputFileName = ...;
    // compile will automatically pick up .map files
    writeFileSync(outputFileName, compile(entry.relativePath));
  });

  // Must not forget to handle deletion
  changes.on(['unlink'], function(entry) {
    if (!/\.js$/.test(entry.relativePath)) return;

    var outputFileName = ...;
    unlinkSync(outputFileName);
  });
}

Can you spot the bug in the caching logic?

The bug is this: When a .js.map file is created, changed or unlinked, but the corresponding .js file is not changed, the .js file still needs to be recompiled. The code above doesn't do this. When we fix this bug, we also need to take care to deduplicate change events when both the .js and the .js.map file change.
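To illustrate how much extra bookkeeping the fix requires, here's a sketch of corrected patch handling. It turns the patch into a deduplicated work plan rather than touching the filesystem, so the caching logic is visible in isolation; `jsFileExists` is a hypothetical predicate for "does this .js file exist in the input tree" (none of these names are FSTree's actual API).

```javascript
function planRebuild(changes, jsFileExists) {
  var recompile = {};
  var remove = {};

  changes.forEach(function(change) {
    var op = change[0];
    var relativePath = change[1];
    if (/\.js\.map$/.test(relativePath)) {
      // The subtle part: any .js.map event forces a recompile of the
      // neighboring .js file, which got no event of its own
      var jsPath = relativePath.slice(0, -'.map'.length);
      if (jsFileExists(jsPath)) recompile[jsPath] = true;
    } else if (/\.js$/.test(relativePath)) {
      if (op === 'unlink') {
        remove[relativePath] = true;
        delete recompile[relativePath];
      } else {
        recompile[relativePath] = true;
      }
    }
  });

  // Object keys double as deduplication: a .js change plus a .js.map
  // change yields a single recompile
  return {
    recompile: Object.keys(recompile).sort(),
    remove: Object.keys(remove).sort()
  };
}
```

Note that this correct version needs to map .map events back to their .js neighbors, check existence, deduplicate, and order deletions against recompiles, none of which the "straightforward" version above even hints at.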

So the "straightforward" code above contains a subtle bug, and the correct implementation is considerably more involved, even for a simple case like this where you have a single static dependency, rather than a set of dynamic dependencies due to import mechanisms. To make matters worse, the bug is rarely triggered in practice and so will go unnoticed. As a result, we end up with rare cases of broken caching, eroding trust in the rebuild system and causing people to restart Broccoli whenever they switch branches "just in case".

This class of bugs isn't hypothetical at all. I sat down for a chat with Igor Minar a while back to talk about in-memory representations, and he told me about a TypeScript plugin that picked up .ts files plus optional neighboring .js files. (The .js files allow for manually defining JavaScript bridges for TypeScript code when the automatic bridge isn't good enough.) He described the plugin to me as using patches, and I said, "Can you pull up the code? I want to show you a problem that happens with patches." He pulled up the code, and within a minute of looking at it, I spotted exactly this issue. The bug was there, and I had been able to predict it without knowing the code. This was in an (otherwise) well-written and clean plugin implementation, written by an experienced JavaScript developer.

Let me contrast the previous, buggy implementation with another straightforward implementation, but this time we use a static representation of the filesystem rather than a patch:

function build(inputFileTree) {
  // oldCache/newCache is a cache-purging pattern: newCache will only contain
  // those cache entries that were needed for this rebuild
  var oldCache = this.cache || {};
  var newCache = {};

  var outputDirectoryTree = new DirectoryTree();
  for (var jsFile of inputFileTree.walkSync('**/*.js')) {
    // hashFiles hashes [fileName, mtime] for files that exist, or [fileName,
    // null] for files that don't exist.
    var cacheKey = inputFileTree.hashFiles([jsFile, jsFile + '.map']);

    // Get from cache, or compile if changed
    var outputFiles = oldCache[cacheKey] || compile(jsFile);
    outputDirectoryTree.add(outputFiles);
    newCache[cacheKey] = outputFiles;
  }

  this.cache = newCache;

  return outputDirectoryTree;
}

This implementation is quite straightforward, and it's correct! We built an ad-hoc cache in only a few lines of code on top of the file tree primitive.

Now I should mention that the disadvantage of the second implementation is that we're doing an O(n) in-memory iteration of all the input files, costing us a millisecond or so. There are several ways to get around this, and some of them involve using patches, but in those cases, we can move the (bug-prone) patch-processing code into a carefully-reviewed shared library. That library might call inputFileTree.diff(this.previousInputFileTree) and process the resulting patch. The main point I'm getting at here though is that in the static-representation case, a reasonably-fast naive implementation that doesn't use a supporting library will tend to come out correct.

Secondly, I should mention that in the static representation case, to get patches, like inputFileTree.diff(this.previousInputFileTree), we need to make sure that the producer doesn't mutate the previousInputFileTree when it rebuilds, and we need to make sure that we don't try to physically access any of the files the previousInputFileTree points to, because those files might have gone away in the meantime.
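One discipline that meets the don't-mutate-the-old-tree requirement is path copying, as used by persistent data structures: a rebuild copies only the directory nodes along the changed path and leaves the previous tree object intact. A minimal sketch, using the same hypothetical plain-object tree shapes as above:

```javascript
// Return a new tree with `file` at pathSegments, sharing all unchanged
// subtrees with `dir`; `dir` itself is never mutated, so a consumer can
// safely keep it around as previousInputFileTree for diffing
function withFile(dir, pathSegments, file) {
  var name = pathSegments[0];
  var entries = {};
  for (var key in dir.entries) entries[key] = dir.entries[key];
  entries[name] = pathSegments.length === 1
    ? file
    : withFile(dir.entries[name], pathSegments.slice(1), file);
  // A fresh Directory node with an updated max-mtime
  return { maxMtime: Math.max(dir.maxMtime, file.mtime), entries: entries };
}
```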

On balance, I still believe that steering users towards static representations, rather than patches, will result in more reliable and easier-to-maintain code.

Thanks for reading! :-)
