UUID4 name tokeniser plus improvements to mixed data sets. #134

jkbonfield · 2025-05-13T15:09:55Z

UUID4 improvements.

With this change on my local corpus:

htscodecs-corpus/names/01.names     169625
htscodecs-corpus/names/02.names     417607
htscodecs-corpus/names/03.names     49868
htscodecs-corpus/names/05.names     288460
htscodecs-corpus/names/08.names     42594
htscodecs-corpus/names/09.names     255867
htscodecs-corpus/names/10.names     307270
htscodecs-corpus/names/20.names     139707
htscodecs-corpus/names/_139.names   86966   *
htscodecs-corpus/names/_shuf.names  638699  *
htscodecs-corpus/names/nv.names     329535
htscodecs-corpus/names/nv2.names    87802
htscodecs-corpus/names/rr.names     91103
htscodecs-corpus/names/uuid4+.names 153556  *
htscodecs-corpus/names/uuid4.names  153439  *

Master:

htscodecs-corpus/names/01.names     169625
htscodecs-corpus/names/02.names     417607
htscodecs-corpus/names/03.names     49981
htscodecs-corpus/names/05.names     288460
htscodecs-corpus/names/08.names     42594
htscodecs-corpus/names/09.names     255867
htscodecs-corpus/names/10.names     307270
htscodecs-corpus/names/20.names     139707
htscodecs-corpus/names/_139.names   100798  *
htscodecs-corpus/names/_shuf.names  710366  *
htscodecs-corpus/names/nv.names     329535
htscodecs-corpus/names/nv2.names    87802
htscodecs-corpus/names/rr.names     91103
htscodecs-corpus/names/uuid4+.names 187560  *
htscodecs-corpus/names/uuid4.names  242453  *

For uuid4 see #132. The difference between uuid4 and uuid4+ is the latter has "-flowcell-10" appended to every line. The old code had explicit check for ONT UUIDs, but not bare ones without an extension, so it needed 37 chars and not 36. Hence master compresses the data with "-flowcell-10" better than without. Regardless of that however, both compress better still by using an array of CHAR tokens instead of separate string-char-string-char-... combinations (due to nuls).

The other change can be seen in _139.names (a mix of 5k names from 01, 03 and 09.names) and _shuf (a mix of all of them). This still isn't optimal, but it's considerably closer.

We can demonstrate this easily with two names of foo: and bar: where is +1 in foo and +1-16 in bar.

perl -e '$b1=1;$b2=1;for ($i=0;$i<5000;$i++) {print "foo:$b1\n";print "bar:$b2\n";$b1++;$b2+=1+int(rand(16))}' > _

Theoretically foo should have no information per record other than the initial meta-data describing the format.
Bar however has 4 bits of ddelta offset per record, so has 2.5Kb of information to encode. Mixing them should not increase the size as we are strictly alternating and the diff distance is a fixed -2.

We can do cat _ | tokenise_name vs egrep foo _ | ... and egrep bar _ | ... to validate the mixed vs separate names.
Master gives 13300, 86 and 2603 respectively.
This PR gives 2684, 86 and 2603 repsectively.

The change achieves this by improving the returned node rather than -1, which defaults to the previous record. It was previously always delta bar against foo and foo against bar.

A candidate for replacement of samtools#132. We still need to do more work here on optimising name compression in a general manner, especially on mixed data sets, but this is resolves a simple case while having no impact on any other name formats.

We already had code for spotting this in search_trie, so improved that a little and use it in encode_name instead of having a second scan. Also improve the compression of mixed data sets. This still isn't optimal as we'd need to start separating the name classes and adding NOP tokens, but it's often a 10-20% compression improvement.

rhpvorderman · 2025-05-14T07:08:50Z

htscodecs/tokenise_name3.c

-               ((d[f+0] >= '0' && d[f+0] <='9') || (d[f+0] >= 'a' && d[f+0] <= 'f')) &&
-               ((d[f+35] >= '0' && d[f+35] <='9') || (d[f+35] >= 'a' && d[f+35] <= 'f'))) {
+    } else if (l >= 36
+               && d[f+8]=='-' && d[f+13]=='-' && d[f+18]=='-' && d[f+23]=='-'


d[f+14] is always 4 in uuid4. See wikipedia. So this can also be added as a check.

We tried some alternate UUID versions. This change seems to be beneficial even with UUIDs with a common prefix (like version 7), so it seems best to leave it accepting all UUID versions.

jkbonfield added 2 commits May 12, 2025 16:43

jkbonfield force-pushed the uuid4-name-tok branch from 991198f to 40b8c82 Compare May 13, 2025 15:10

jkbonfield mentioned this pull request May 13, 2025

Seperate out tokens based on punctuation. #132

Closed

rhpvorderman reviewed May 14, 2025

View reviewed changes

daviesrob self-assigned this Jul 3, 2025

daviesrob merged commit a532924 into samtools:master Jul 3, 2025
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UUID4 name tokeniser plus improvements to mixed data sets. #134

UUID4 name tokeniser plus improvements to mixed data sets. #134

Uh oh!

jkbonfield commented May 13, 2025

Uh oh!

rhpvorderman May 14, 2025

Uh oh!

daviesrob Jul 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

UUID4 name tokeniser plus improvements to mixed data sets. #134

UUID4 name tokeniser plus improvements to mixed data sets. #134

Uh oh!

Conversation

jkbonfield commented May 13, 2025

Uh oh!

rhpvorderman May 14, 2025

Choose a reason for hiding this comment

Uh oh!

daviesrob Jul 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants