
Why I think static tree representations tend to be less bug-prone than patches #72

@joliss

If we use in-memory data structures rather than physical directories to speed up communication between plugins, we have two options as to what exactly we exchange: (1) a list of changes, or (2) a static view of the files, like entries. The current usage of FSTree seems to favor option (1). I've alluded in the past to preferring (2) for usability reasons, but I've never properly explained why. So let me do that now.

There are no action items to take in response to this GitHub issue, but if you're working on FSTree, and particularly on getting FSTree into the official Node API, I'd simply like you to read this, please!

Here are the specifics of each option:

  1. A patch, i.e. a list of changes, looking like [['change', 'foo/a.js'], ['create', 'bar/b.js'], ['unlink', 'bar/c.js']]

  2. A static representation of the directory tree, similar to the current entries structure, but Merkle-tree-like, e.g.

    // First argument (200) is maximum last-modified time of all
    // subdirectories and files recursively, taking the role of the hash in a
    // proper Merkle tree. Current time is 200.
    Directory(200, {
      'foo': Directory(200, {
        'a.js': File(200)
      }),
      'bar': Directory(200, {
        'b.js': File(200),
        /* no c.js here */
        'd.js': File(100)
      })
    })

Now clearly these options are technically isomorphic: We can reconstruct an in-memory Merkle tree of files by applying the patch to it on each rebuild, and we can conversely reconstruct the patch by diffing the current tree against the tree we got on the previous rebuild. Performance of this diffing operation is, by the way, the reason why I'm suggesting a Merkle-tree-like structure: With a flat entry list, diffing is O(totalFiles), whereas with a Merkle tree, it is O(changedFiles) (except for pathological cases), similar to how git diff commit1..commit2 is very fast even on large trees.
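To make the O(changedFiles) claim concrete, here is a minimal sketch of such a diff, using hypothetical `File`/`Directory` shapes modeled on the example above (this is not FSTree's actual API). The key is the fast path: when two directories have the same maximum last-modified time, the whole subtree is skipped, just as git skips tree objects with identical hashes.

```javascript
function File(mtime) {
  return { type: 'file', mtime: mtime };
}

function Directory(maxMtime, entries) {
  return { type: 'directory', maxMtime: maxMtime, entries: entries };
}

function diffTrees(oldTree, newTree, prefix, patch) {
  prefix = prefix || '';
  patch = patch || [];
  // Fast path: equal max-mtimes mean nothing changed anywhere in this
  // subtree, so we never even look at its entries
  if (oldTree.maxMtime === newTree.maxMtime) return patch;

  var names = {};
  for (var a in oldTree.entries) names[a] = true;
  for (var b in newTree.entries) names[b] = true;

  for (var name in names) {
    var oldEntry = oldTree.entries[name];
    var newEntry = newTree.entries[name];
    var path = prefix + name;
    if (oldEntry && !newEntry) {
      patch.push(['unlink', path]);
    } else if (!oldEntry && newEntry) {
      patch.push(['create', path]);
    } else if (oldEntry.type !== newEntry.type) {
      // A file replaced by a directory (or vice versa) counts as both
      patch.push(['unlink', path]);
      patch.push(['create', path]);
    } else if (oldEntry.type === 'file') {
      if (oldEntry.mtime !== newEntry.mtime) patch.push(['change', path]);
    } else {
      diffTrees(oldEntry, newEntry, path + '/', patch);
    }
  }
  return patch;
}
```

Only the subtrees whose max-mtime actually changed are visited, so the work done is proportional to the number of changed files, not the total number of files.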

However, despite this isomorphism, what I want to argue here is that there is a major usability difference between the two, in that option (1) encourages a class of subtle caching bugs. To make plugin authors "fall into a pit of success" as Stef likes to say, I'd like to steer them towards working against the static representation by default, and only use diffing when it's really needed for performance.

Here's an example of the type of bug that can arise if you try to process a list of changes.

Let's say we're writing a plugin that transforms any file foo.js with an optional neighboring foo.js.map file into some output file(s).

If you were to write down an algorithm for this with caching, it would probably look like this (pseudo-code):

function build(changes) {
  changes.on(['create', 'change'], function(entry) {
    if (!/\.js$/.test(entry.relativePath)) return;

    var outputFileName = ...;
    // compile will automatically pick up .map files
    writeFileSync(outputFileName, compile(entry.relativePath));
  });

  // Must not forget to handle deletion
  changes.on(['unlink'], function(entry) {
    if (!/\.js$/.test(entry.relativePath)) return;

    var outputFileName = ...;
    unlinkSync(outputFileName);
  });
}

Can you spot the bug in the caching logic?

The bug is this: When a .js.map file is created, changed or unlinked, but the corresponding .js file is not changed, the .js file still needs to be recompiled. The code above doesn't do this. When we fix this bug, we also need to take care to deduplicate change events when both the .js and the .js.map file change.
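To illustrate how much extra bookkeeping the fix requires, here's a sketch of corrected patch handling. It turns the patch into a deduplicated work plan rather than touching the filesystem, so the caching logic is visible in isolation; `jsFileExists` is a hypothetical predicate for "does this .js file exist in the input tree" (none of these names are FSTree's actual API).

```javascript
function planRebuild(changes, jsFileExists) {
  var recompile = {};
  var remove = {};

  changes.forEach(function(change) {
    var op = change[0];
    var relativePath = change[1];
    if (/\.js\.map$/.test(relativePath)) {
      // The subtle part: any .js.map event forces a recompile of the
      // neighboring .js file, which got no event of its own
      var jsPath = relativePath.slice(0, -'.map'.length);
      if (jsFileExists(jsPath)) recompile[jsPath] = true;
    } else if (/\.js$/.test(relativePath)) {
      if (op === 'unlink') {
        remove[relativePath] = true;
        delete recompile[relativePath];
      } else {
        recompile[relativePath] = true;
      }
    }
  });

  // Object keys double as deduplication: a .js change plus a .js.map
  // change yields a single recompile
  return {
    recompile: Object.keys(recompile).sort(),
    remove: Object.keys(remove).sort()
  };
}
```

Note that this correct version needs to map .map events back to their .js neighbors, check existence, deduplicate, and order deletions against recompiles, none of which the "straightforward" version above even hints at.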

So the "straightforward" code above contains a subtle bug, and the correct implementation is considerably more involved, even for a simple case like this where you have a single static dependency, rather than a set of dynamic dependencies due to import mechanisms. To make matters worse, the bug is rarely triggered in practice and so will go unnoticed. As a result, we end up with rare cases of broken caching, eroding trust in the rebuild system and causing people to restart Broccoli whenever they switch branches "just in case".

This class of bugs isn't hypothetical at all. I sat down for a chat with Igor Minar a while back to talk about in-memory representations, and he told me about a TypeScript plugin that picked up .ts files plus optional neighboring .js files. (The .js files allow for manually defining JavaScript bridges for TypeScript code when the automatic bridge isn't good enough.) He described the plugin to me as using patches, and I said, "Can you pull up the code? I want to show you a problem that happens with patches." He pulled up the code, and within a minute of looking at it, I spotted exactly this issue. The bug was there, and I had been able to predict it without knowing the code. This was in an (otherwise) well-written and clean plugin implementation, written by an experienced JavaScript developer.

Let me contrast the previous, buggy implementation with another straightforward implementation, but this time we use a static representation of the filesystem rather than a patch:

function build(inputFileTree) {
  // oldCache/newCache is a cache-purging pattern: newCache will only contain
  // those cache entries that were needed for this rebuild
  var oldCache = this.cache || {};
  var newCache = {};

  var outputDirectoryTree = new DirectoryTree();
  for (var jsFile of inputFileTree.walkSync('**/*.js')) {
    // hashFiles hashes [fileName, mtime] for files that exist, or [fileName,
    // null] for files that don't exist.
    var cacheKey = inputFileTree.hashFiles([jsFile, jsFile + '.map']);

    // Get from cache, or compile if changed
    var outputFiles = oldCache[cacheKey] || compile(jsFile);
    outputDirectoryTree.add(outputFiles);
    newCache[cacheKey] = outputFiles;
  }

  this.cache = newCache;

  return outputDirectoryTree;
}

This implementation is quite straightforward, and it's correct! We built an ad-hoc cache in only a few lines of code on top of the file tree primitive.

Now I should mention that the disadvantage of the second implementation is that we're doing an O(n) in-memory iteration of all the input files, costing us a millisecond or so. There are several ways to get around this, and some of them involve using patches, but in those cases, we can move the (bug-prone) patch-processing code into a carefully-reviewed shared library. That library might call inputFileTree.diff(this.previousInputFileTree) and process the resulting patch. The main point I'm getting at here though is that in the static-representation case, a reasonably-fast naive implementation that doesn't use a supporting library will tend to come out correct.

Secondly, I should mention that in the static representation case, to get patches, like inputFileTree.diff(this.previousInputFileTree), we need to make sure that the producer doesn't mutate the previousInputFileTree when it rebuilds, and we need to make sure that we don't try to physically access any of the files the previousInputFileTree points to, because those files might have gone away in the meantime.
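One discipline that meets the don't-mutate-the-old-tree requirement is path copying, as used by persistent data structures: a rebuild copies only the directory nodes along the changed path and leaves the previous tree object intact. A minimal sketch, using the same hypothetical plain-object tree shapes as above:

```javascript
// Return a new tree with `file` at pathSegments, sharing all unchanged
// subtrees with `dir`; `dir` itself is never mutated, so a consumer can
// safely keep it around as previousInputFileTree for diffing
function withFile(dir, pathSegments, file) {
  var name = pathSegments[0];
  var entries = {};
  for (var key in dir.entries) entries[key] = dir.entries[key];
  entries[name] = pathSegments.length === 1
    ? file
    : withFile(dir.entries[name], pathSegments.slice(1), file);
  // A fresh Directory node with an updated max-mtime
  return { maxMtime: Math.max(dir.maxMtime, file.mtime), entries: entries };
}
```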

On balance, I still believe that steering users towards static representations, rather than patches, will result in more reliable and easier-to-maintain code.

Thanks for reading! :-)
