Replies: 2 comments 4 replies
-
Very much so. Basically... understanding the difference between "interesting" and "uninteresting" requires a very high degree of both semantic and contextual information in the general case.
-
I agree, but I tend to think about this from the opposite side, in terms of 'necessary understanding'. Noting byte-level divergence requires little to no 'understanding'. 'Less divergent' (in your parlance) approaches require much more understanding, and are thus much easier to get wrong. Your 'protobuf generates Go code' case is a really good example. But consider Go's `stringer` as another: it generates code that carries out various 'niceties' for enum-like things in Go. If I change the name of a const, is that 'interesting' or not? You could probably make a case for 'not', but it requires fairly deep understanding of what is going on to make that determination one way or the other.
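For concreteness, `stringer` output for an enum-like type looks roughly like the sketch below (simplified, not verbatim output from the tool). Renaming a const only touches the packed name table and index array, not the method logic, so whether that byte-level diff is 'interesting' depends entirely on whether anything downstream actually consumes those strings.

```go
// Simplified sketch of what `stringer -type=Pill` emits (not verbatim output).
package painkiller

import "strconv"

type Pill int

const (
	Placebo Pill = iota
	Aspirin
	Ibuprofen
)

// Renaming the const Aspirin only changes these two tables;
// the String method body below is untouched.
const _Pill_name = "PlaceboAspirinIbuprofen"

var _Pill_index = [...]uint8{0, 7, 14, 23}

func (i Pill) String() string {
	if i < 0 || i >= Pill(len(_Pill_index)-1) {
		return "Pill(" + strconv.FormatInt(int64(i), 10) + ")"
	}
	return _Pill_name[_Pill_index[i]:_Pill_index[i+1]]
}
```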
-
"Convergence" and "divergence" are words I've been using to myself to express a property of hash-based indexing such as a merkle graph. I'm wondering if these terms resonate with others, or if not, some other way of expressing this idea.
Basically:
For example, suppose you have an IDL file (e.g. protobuf) which is fed into a code generator that produces source files implementing serialization and deserialization. If you make, say, whitespace-only changes to the IDL input, they will have no effect on the generated output. In effect, the output artifacts end up "converging" with those of all the other dependency graphs which include the same artifact. For something like a caching build system, that means it can skip actually compiling those intermediate generated source files, because it can reuse the results of a previous cached build.
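As a rough sketch of that caching behaviour (illustrative only, not any particular build tool's API): the compile step is keyed by the hash of the generated source, so two IDL inputs that differ only in whitespace converge on the same cache entry.

```go
// Sketch of a content-addressed build cache (hypothetical, not a real tool's API).
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// hashOf returns a content address for an artifact's bytes.
func hashOf(data string) string {
	sum := sha256.Sum256([]byte(data))
	return hex.EncodeToString(sum[:])
}

// generate stands in for the IDL -> source code generator; whitespace-only
// changes in the IDL produce byte-identical output.
func generate(idl string) string {
	return "// generated\n" + strings.Join(strings.Fields(idl), " ") + "\n"
}

func main() {
	cache := map[string]string{} // hash of generated source -> compiled object

	for _, idl := range []string{
		"message Foo { int32 id = 1; }",
		"message Foo {\n  int32 id = 1;\n}", // whitespace-only change
	} {
		src := generate(idl)
		key := hashOf(src)
		if _, ok := cache[key]; ok {
			fmt.Println("cache hit, compile skipped:", key[:12])
			continue
		}
		cache[key] = "object code for " + key[:12]
		fmt.Println("cache miss, compiled:", key[:12])
	}
}
```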
But if you're really concerned about how exactly those source files were generated, the convergence introduces an ambiguity - there are multiple potential build steps which could have generated that same source file. If you embed the input manifest in the generated output sources, you avoid that ambiguity by making every generated artifact distinct. But that has the downside of making the build system do redundant work.
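And a sketch of the 'embed the input manifest' variant: stamping each generated output with a hash of its exact inputs makes otherwise-identical outputs distinct, which removes the ambiguity but also removes the cache hit (a hypothetical scheme here, not OmniBOR's actual format).

```go
// Sketch: embedding an input-manifest hash in generated output makes
// otherwise-identical artifacts distinct (hypothetical scheme, not OmniBOR's format).
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func hashOf(data string) string {
	sum := sha256.Sum256([]byte(data))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Two IDL inputs that differ only in whitespace both produce this
	// byte-identical generated source, so on its own it is ambiguous
	// which input it came from.
	idlA := "message Foo { int32 id = 1; }"
	idlB := "message Foo {\n  int32 id = 1;\n}"
	generated := "// generated serialization code for Foo\n"

	// Stamping each output with a hash of the exact input that produced it
	// resolves the ambiguity, but every artifact (and everything built from
	// it) is now distinct, so the redundant downstream work comes back.
	stampedA := "// inputs: " + hashOf(idlA) + "\n" + generated
	stampedB := "// inputs: " + hashOf(idlB) + "\n" + generated

	fmt.Println(hashOf(stampedA) == hashOf(stampedB)) // false
}
```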
The key point here is that what counts as convergence or divergence depends heavily on what you consider to be "interesting differences", which in turn is very use-case dependent.
The current OmniBOR design is very much oriented towards "maximal divergence" - any change at the bit level of any artifact is encoded and propagated to every downstream derived artifact. This is conservative in that every change will be caught, so maximal information is conveyed - and specifically, if you see that two graphs share the same nodes, you can be very sure they're exactly the same. But it also means that if there's any difference, you then need to start digging into the actual artifacts to see whether it's a difference you care about.
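A rough model of that 'maximal divergence' property (simplified merkle-style IDs, not the exact OmniBOR gitoid/input-manifest encoding): each artifact's ID covers its own bytes plus the IDs of all its inputs, so a one-byte change in any leaf changes every downstream ID.

```go
// Simplified merkle-style artifact IDs (illustrative only, not the exact
// OmniBOR gitoid/input-manifest encoding).
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

type artifact struct {
	content string
	inputs  []*artifact
}

// id hashes the artifact's own bytes together with the IDs of all its
// inputs, so any bit-level change in any input propagates downstream.
func id(a *artifact) string {
	inputIDs := make([]string, 0, len(a.inputs))
	for _, in := range a.inputs {
		inputIDs = append(inputIDs, id(in))
	}
	sort.Strings(inputIDs)
	sum := sha256.Sum256([]byte(a.content + "\n" + strings.Join(inputIDs, "\n")))
	return hex.EncodeToString(sum[:])
}

func main() {
	idl := &artifact{content: "message Foo { int32 id = 1; }"}
	gen := &artifact{content: "// generated source", inputs: []*artifact{idl}}
	bin := &artifact{content: "final binary", inputs: []*artifact{gen}}

	before := id(bin)
	idl.content += " " // a single-byte change at a leaf...
	after := id(bin)
	fmt.Println(before == after) // false: the change reaches every derived ID
}
```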
On the other hand, if you can arrange for the graph to encode the right level of convergence for your use-case, then you can implement it much more efficiently. If you can be sure that "same id = no change for my use-case", then you can answer that question just by looking at the graph itself.
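A sketch of that other direction: if the ID is computed over a canonical form of the artifact - here a hypothetical whitespace-only normalization standing in for "the differences I care about" - then "same ID = no interesting change" holds by construction, and the graph alone answers the question.

```go
// Sketch: computing IDs over a canonical form so that "same ID" means
// "no change I care about" (the canonicalization here is a hypothetical
// whitespace-only normalization).
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// canonicalID hashes a whitespace-normalized view of the source, so
// formatting-only edits converge on the same ID.
func canonicalID(src string) string {
	normalized := strings.Join(strings.Fields(src), " ")
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := "func add(x, y int) int { return x + y }"
	b := "func add(x, y int) int {\n\treturn x + y\n}"
	c := "func add(x, y int) int { return x - y }" // a change that matters

	fmt.Println(canonicalID(a) == canonicalID(b)) // true: formatting-only diff converges
	fmt.Println(canonicalID(a) == canonicalID(c)) // false: semantic diff diverges
}
```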