Skip to content

Conversation

@jkbonfield
Copy link
Collaborator

  1. UUID4 improvements.

With this change on my local corpus:

htscodecs-corpus/names/01.names     169625
htscodecs-corpus/names/02.names     417607
htscodecs-corpus/names/03.names     49868
htscodecs-corpus/names/05.names     288460
htscodecs-corpus/names/08.names     42594
htscodecs-corpus/names/09.names     255867
htscodecs-corpus/names/10.names     307270
htscodecs-corpus/names/20.names     139707
htscodecs-corpus/names/_139.names   86966   *
htscodecs-corpus/names/_shuf.names  638699  *
htscodecs-corpus/names/nv.names     329535
htscodecs-corpus/names/nv2.names    87802
htscodecs-corpus/names/rr.names     91103
htscodecs-corpus/names/uuid4+.names 153556  *
htscodecs-corpus/names/uuid4.names  153439  *

Master:

htscodecs-corpus/names/01.names     169625
htscodecs-corpus/names/02.names     417607
htscodecs-corpus/names/03.names     49981
htscodecs-corpus/names/05.names     288460
htscodecs-corpus/names/08.names     42594
htscodecs-corpus/names/09.names     255867
htscodecs-corpus/names/10.names     307270
htscodecs-corpus/names/20.names     139707
htscodecs-corpus/names/_139.names   100798  *
htscodecs-corpus/names/_shuf.names  710366  *
htscodecs-corpus/names/nv.names     329535
htscodecs-corpus/names/nv2.names    87802
htscodecs-corpus/names/rr.names     91103
htscodecs-corpus/names/uuid4+.names 187560  *
htscodecs-corpus/names/uuid4.names  242453  *

For uuid4 see #132. The difference between uuid4 and uuid4+ is the latter has "-flowcell-10" appended to every line. The old code had explicit check for ONT UUIDs, but not bare ones without an extension, so it needed 37 chars and not 36. Hence master compresses the data with "-flowcell-10" better than without. Regardless of that however, both compress better still by using an array of CHAR tokens instead of separate string-char-string-char-... combinations (due to nuls).

  1. The other change can be seen in _139.names (a mix of 5k names from 01, 03 and 09.names) and _shuf (a mix of all of them). This still isn't optimal, but it's considerably closer.

We can demonstrate this easily with two names of foo: and bar: where is +1 in foo and +1-16 in bar.

perl -e '$b1=1;$b2=1;for ($i=0;$i<5000;$i++) {print "foo:$b1\n";print "bar:$b2\n";$b1++;$b2+=1+int(rand(16))}' > _

Theoretically foo should have no information per record other than the initial meta-data describing the format.
Bar however has 4 bits of ddelta offset per record, so has 2.5Kb of information to encode. Mixing them should not increase the size as we are strictly alternating and the diff distance is a fixed -2.

We can do cat _ | tokenise_name vs egrep foo _ | ... and egrep bar _ | ... to validate the mixed vs separate names.
Master gives 13300, 86 and 2603 respectively.
This PR gives 2684, 86 and 2603 repsectively.

The change achieves this by improving the returned node rather than -1, which defaults to the previous record. It was previously always delta bar against foo and foo against bar.

A candidate for replacement of samtools#132.

We still need to do more work here on optimising name compression in a
general manner, especially on mixed data sets, but this is resolves a
simple case while having no impact on any other name formats.
We already had code for spotting this in search_trie, so improved that
a little and use it in encode_name instead of having a second scan.

Also improve the compression of mixed data sets.  This still isn't
optimal as we'd need to start separating the name classes and adding
NOP tokens, but it's often a 10-20% compression improvement.
((d[f+0] >= '0' && d[f+0] <='9') || (d[f+0] >= 'a' && d[f+0] <= 'f')) &&
((d[f+35] >= '0' && d[f+35] <='9') || (d[f+35] >= 'a' && d[f+35] <= 'f'))) {
} else if (l >= 36
&& d[f+8]=='-' && d[f+13]=='-' && d[f+18]=='-' && d[f+23]=='-'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d[f+14] is always 4 in uuid4. See wikipedia. So this can also be added as a check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tried some alternate UUID versions. This change seems to be beneficial even with UUIDs with a common prefix (like version 7), so it seems best to leave it accepting all UUID versions.

@daviesrob daviesrob self-assigned this Jul 3, 2025
@daviesrob daviesrob merged commit a532924 into samtools:master Jul 3, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants