UUID4 name tokeniser plus improvements to mixed data sets. #134
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
With this change on my local corpus:
Master:
For uuid4 see #132. The difference between uuid4 and uuid4+ is the latter has "-flowcell-10" appended to every line. The old code had explicit check for ONT UUIDs, but not bare ones without an extension, so it needed 37 chars and not 36. Hence master compresses the data with "-flowcell-10" better than without. Regardless of that however, both compress better still by using an array of CHAR tokens instead of separate string-char-string-char-... combinations (due to nuls).
We can demonstrate this easily with two names of foo: and bar: where is +1 in foo and +1-16 in bar.
Theoretically foo should have no information per record other than the initial meta-data describing the format.
Bar however has 4 bits of ddelta offset per record, so has 2.5Kb of information to encode. Mixing them should not increase the size as we are strictly alternating and the diff distance is a fixed -2.
We can do
cat _ | tokenise_namevsegrep foo _ | ...andegrep bar _ | ...to validate the mixed vs separate names.Master gives 13300, 86 and 2603 respectively.
This PR gives 2684, 86 and 2603 repsectively.
The change achieves this by improving the returned node rather than -1, which defaults to the previous record. It was previously always delta bar against foo and foo against bar.