
@rhpvorderman

See also #131

A very simple change. Tokens are no longer determined by runs of similar characters; instead, names are split at punctuation characters, and the stretches in between are classified as numbers or strings.

This change simplifies the code. It also makes it possible to add hexadecimal tokens at a later point, if desired.
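A minimal sketch of the splitting idea (hypothetical `tokenise` and `emit` names, not the patch itself):

```c
#include <ctype.h>
#include <stddef.h>

enum tok_type { TOK_PUNCT, TOK_DIGITS, TOK_STRING };

typedef void (*emit_fn)(enum tok_type type, const char *s, size_t len);

/* Emit each punctuation character as its own token; classify every
 * stretch in between as DIGITS if it is all digits, else STRING. */
static void tokenise(const char *name, size_t len, emit_fn emit) {
    for (size_t i = 0; i < len; ) {
        if (ispunct((unsigned char)name[i])) {
            emit(TOK_PUNCT, name + i, 1);
            i++;
        } else {
            size_t start = i;
            int all_digits = 1;
            while (i < len && !ispunct((unsigned char)name[i])) {
                if (!isdigit((unsigned char)name[i]))
                    all_digits = 0;
                i++;
            }
            emit(all_digits ? TOK_DIGITS : TOK_STRING,
                 name + start, i - start);
        }
    }
}
```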

Tested on the following files.
10000_illumina_ids.txt
10000_illumina_ids_simpler.txt
10000_pacbio_revio_ids.txt
10000uuid4.txt

| file | before | after |
| --- | ---: | ---: |
| 10000_illumina_ids.txt | 38846 | 38846 |
| 10000_illumina_ids_simpler.txt | 23080 | 23080 |
| 10000uuid4.txt | 242453 | 176373 |
| 10000_pacbio_revio_ids.txt | 35484 | 35223 |

@rhpvorderman
Author

I also checked gzip, bzip2 and xz on 10000uuid4.txt and this PR performs better, which is quite nice. (Not that it matters much for ONT data, but still.)

@jkbonfield
Collaborator

jkbonfield commented May 12, 2025

Testing on the htscodecs-corpus data, I see this is a significant regression on some data sets:

```
# master
$ for i in htscodecs-corpus/names/*.names;do echo -n "$i     ";./tokenise_name3 < $i|wc -c;done
htscodecs-corpus/names/01.names	169625
htscodecs-corpus/names/02.names	417607
htscodecs-corpus/names/03.names	49981
htscodecs-corpus/names/05.names	288460
htscodecs-corpus/names/08.names	42594
htscodecs-corpus/names/09.names	255867
htscodecs-corpus/names/10.names	307270
htscodecs-corpus/names/20.names	139707
htscodecs-corpus/names/nv.names	329535
htscodecs-corpus/names/nv2.names	87802
htscodecs-corpus/names/rr.names	91103
```

```
# This PR
$ for i in htscodecs-corpus/names/*.names;do echo -n "$i     ";~/work/samtools_master/htscodecs/tests/tokenise_name3 < $i|wc -c;done
htscodecs-corpus/names/01.names	368959
htscodecs-corpus/names/02.names	417607
htscodecs-corpus/names/03.names	50785
htscodecs-corpus/names/05.names	288460
htscodecs-corpus/names/08.names	45925
htscodecs-corpus/names/09.names	255867
htscodecs-corpus/names/10.names	307270
htscodecs-corpus/names/20.names	139707
htscodecs-corpus/names/nv.names	329535
htscodecs-corpus/names/nv2.names	176644
htscodecs-corpus/names/rr.names	91103
```

01.names is NCBI-style FASTQ read names:

```
@ERR174310.1 HSQ1008_141:5:1101:1454:3564/1
@ERR174310.2 HSQ1008_141:5:1101:1485:3570/1
@ERR174310.3 HSQ1008_141:5:1101:1407:3580/1
```

I thought maybe it was the space that breaks it, but oddly, changing it to a punctuation character makes things substantially worse for both master and this PR.

nv2.names is NovaSeq data with barcode information, the sort of thing we often see in FASTQ files:

```
@VP2-06:112:H7LNDMCVY:1:1105:10845:1986 1:N:0:ATTCAGAA+AGGCGAAT
@VP2-06:112:H7LNDMCVY:1:1105:10845:1986 2:N:0:ATTCAGAA+AGGCGAAT
@VP2-06:112:H7LNDMCVY:1:1105:10845:2362 1:N:0:ATTCAGAA+AGGCGAAC
@VP2-06:112:H7LNDMCVY:1:1105:10845:2362 2:N:0:ATTCAGAA+AGGCGAAC
@VP2-06:112:H7LNDMCVY:1:1105:10845:2957 1:N:0:ATTCAGAA+AGGCGAAG
@VP2-06:112:H7LNDMCVY:1:1105:10845:2957 2:N:0:ATTCAGAA+AGGCGAAG
```

Changing the space to / there does fix this one, though.

I think what we're seeing is that optimal tokenisation is likely a "hard problem". We could perhaps try different strategies, though, for the case where every read name has the same length, as this can change how we tokenise considerably.

@rhpvorderman
Author

I changed `ispunct` to `!isalnum` and that improved the situation:

```
../htscodecs-corpus/names/01.names  169595
../htscodecs-corpus/names/02.names  417607
../htscodecs-corpus/names/03.names  50785
../htscodecs-corpus/names/05.names  288460
../htscodecs-corpus/names/08.names  42579
../htscodecs-corpus/names/09.names  255867
../htscodecs-corpus/names/10.names  307270
../htscodecs-corpus/names/20.names  139707
../htscodecs-corpus/names/nv2.names  89417
../htscodecs-corpus/names/nv.names  329535
../htscodecs-corpus/names/rr.names  91103
../htscodecs-corpus/names/uuid4.names  176373
```

It now gets very slight wins in 01.names and 08.names, while 03.names and nv2.names are somewhat poorer. (uuid4 is much better, but it is not included in the corpus and is less relevant anyway.) The rest is the same.

These names show the weakness of the strategy. Both 03.names and 08.names have this structure:

```
m130614_110304_00127_c100506142550000001823078908081323_s1_p0/100115/0_3184
```

The prefix is shared with other reads, so that is not really problematic. c100506142550000001823078908081323, however, would be better encoded as a char plus a few number tokens rather than as one string. I sorted the file and checked: both the first and the last part of the number are incrementing. This would work really well when broken up, but the leading c prevents that.

I wrote some code (not pushed) to detect alphanumeric switches within a token and count the number of switches; if exactly one switch occurs, the token is broken up. This solves a lot of prefix<number> cases (at the cost of performing worse on uuid4).
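Roughly, the detection can look like this (a sketch; the function name and shape are mine, not the unpushed code):

```c
#include <ctype.h>
#include <stddef.h>

/* Count the alpha<->digit switches in token[0..len).  If there is
 * exactly one, return the index of the first character after the
 * switch so the caller can split the token in two (e.g. "c1005..."
 * becomes "c" + "1005...").  Returns 0 if the token should stay whole. */
static size_t single_switch_point(const unsigned char *token, size_t len) {
    size_t switches = 0, split = 0;
    for (size_t i = 1; i < len; i++) {
        if (!!isdigit(token[i-1]) != !!isdigit(token[i])) {
            switches++;
            split = i;
        }
    }
    return switches == 1 ? split : 0;
}
```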

```
../htscodecs-corpus/names/01.names  169545
../htscodecs-corpus/names/02.names  417607
../htscodecs-corpus/names/03.names  49981
../htscodecs-corpus/names/05.names  288460
../htscodecs-corpus/names/08.names  42598
../htscodecs-corpus/names/09.names  255867
../htscodecs-corpus/names/10.names  307270
../htscodecs-corpus/names/20.names  139707
../htscodecs-corpus/names/nv2.names  89417
../htscodecs-corpus/names/nv.names  329535
../htscodecs-corpus/names/rr.names  91103
../htscodecs-corpus/names/uuid4.names  179954
```

nv2.names is still not fixed, though:

```
@VP2-06:112:H7LNDMCVY:1:1105:10086:4429 1:N:0:ATTCAGAA+AGGCGAAG
```

The difference is clearly in the ATTCAGAA+AGGCGAAG part, where not treating it as a single unit is disadvantageous: it creates a stream of three tokens, some of which match and some of which do not, whilst treating it as a single unit gives a single stream of match/non-match tokens.

> I think what we're seeing is that optimal tokenisation is likely a "hard problem". We could perhaps try different strategies, though, for the case where every read name has the same length, as this can change how we tokenise considerably.

Very hard indeed. Except for UUID4 names, this code is not really an improvement (unless you count 40-byte wins in some areas). It may be better to special-case certain situations rather than change the current code.

jkbonfield added a commit to jkbonfield/htscodecs that referenced this pull request May 12, 2025
A candidate for replacement of samtools#132.

We still need to do more work here on optimising name compression in a
general manner, especially on mixed data sets, but this resolves a
simple case while having no impact on any other name formats.
@jkbonfield
Collaborator

jkbonfield commented May 12, 2025

I added an alternative commit to my own branch which specifically looks for the UUID4 naming convention. I'm not particularly thrilled with more explicit format detection, but pragmatically it's easier than building a general-purpose detection that doesn't run the risk of harming any existing data set.
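For illustration, a UUID4 check can be as simple as the sketch below (hypothetical code and naming; the linked commit may differ):

```c
#include <ctype.h>

/* Return 1 if name starts with a UUID4-style string: lowercase hex
 * groups of 8-4-4-4-12 characters separated by '-', 36 bytes total. */
static int looks_like_uuid4(const char *name) {
    static const int dash_at[] = {8, 13, 18, 23};
    int d = 0;
    for (int i = 0; i < 36; i++) {
        if (d < 4 && i == dash_at[d]) {
            if (name[i] != '-') return 0;  /* separator expected here */
            d++;
        } else if (!isxdigit((unsigned char)name[i]) ||
                   isupper((unsigned char)name[i])) {
            return 0;  /* not a lowercase hex digit (nul also fails) */
        }
    }
    return 1;
}
```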

What do you think? jkbonfield@c666474

Results:

```
htscodecs-corpus/names/01.names    169625
htscodecs-corpus/names/02.names    417607
htscodecs-corpus/names/03.names    49981
htscodecs-corpus/names/05.names    288460
htscodecs-corpus/names/08.names    42594
htscodecs-corpus/names/09.names    255867
htscodecs-corpus/names/10.names    307270
htscodecs-corpus/names/20.names    139707
htscodecs-corpus/names/nv.names    329535
htscodecs-corpus/names/nv2.names   87802
htscodecs-corpus/names/rr.names    91103
htscodecs-corpus/names/uuid4.names 153449
```

@rhpvorderman
Author

Looks good to me. I mean, it is very tempting to keep trying to find the one algorithm to rule them all, but I have arguably already spent a bit too much time doing that.

@rhpvorderman
Author

Thinking a bit about the lessons learned here: the current method of tokenisation only takes into account the immediately preceding characters as context, and only a single previous name with the same prefix.

Doing two passes, where first all the reads are tokenised and then the optimal tokenisation is determined from the full context, could yield better results. This method can detect the UUID4 form, as shown in the Python script I made (although that relied on anchoring on the separator chars).

So, for instance, the algorithm could detect that the canonical form of any name in the stream is:

```
['LOWERHEXADECIMAL0', 'CHAR', 'LOWERHEXADECIMAL0', 'CHAR', 'LOWERHEXADECIMAL', 'CHAR', 'LOWERHEXADECIMAL', 'CHAR', 'LOWERHEXADECIMAL0']
```

In the second pass, for each read, all lower hexadecimal characters are taken and encoded (all as N_CHAR, but that is an implementation detail), then a CHAR, then a lower hexadecimal, etc.

For Illumina the order is:

```
['UPPERHEXADECIMAL', 'CHAR', 'DECIMAL', 'STRING', 'DECIMAL', 'CHAR', 'DECIMAL', 'CHAR', 'DECIMAL', 'CHAR', 'DECIMAL']
```

The disadvantage of this approach is that two passes are, by definition, more expensive than one. It then becomes a question of whether the gains are worth it: for UUID4 yes, for all the other cases questionable. It would also perform extremely poorly on mixed datasets. (However, datasets that are truly so mixed that the tokens don't line up are rare enough that the gzip fallback is acceptable.) That's quite some additional cost compared to just special-casing UUID4.

@jkbonfield
Collaborator

jkbonfield commented May 13, 2025

Mixed datasets can, in theory, be accommodated.

E.g. if we have FOO:<digits> and <digits>:BAR, then we're encoding the data streams with (token_number, type) as (1,STRING), (2,DIGITS) versus (1,DIGITS), (2,STRING), so there's no clash between the two naming forms. (Actually more than that, as we'd have MATCH in there too, but that's basically constant after the first name.)

Where it fails is if we have, say:

```
FOO:1
FOO:2
BAR:10
FOO:3
BAR:15
FOO:4
BAR:20
BAR:33
```

Both FOO and BAR are of the form <string>:<digits>, but the digit delta for FOO is always +1 while for BAR it's variable, meaning we're not getting the benefit of knowing that FOO always has a constant delta of +1.

Instead we could do:

```
FOO comparison: <1,string> <1,match> <2,char> <2,match> <3,ddelta>
BAR comparison: <1,string> <1,match> <2,char> <2,match> <3,nop> <4,ddelta>
```

This now means each <num,type> stream is distinct, bar the very first FOO/BAR for <1,string>, which is a one-off. It's very hard to optimise this, but naively we could do a single pass using the existing trie that tracks the previous name, allocating a class to each name. So FOO:1 matches nothing and gets class 1. FOO:2 matches FOO:1 in the trie and copies its class of 1. BAR:10 doesn't match in the trie so gets a new class of 2. Etc. This way every name has an allocated name class. We could then also track the number of tokens used to match a name class (plus a few extras maybe), and cumulatively add that many nops for all previous class numbers, so BAR becomes:

```
BAR comparison: <1,nop> <2,nop> <3,nop> <4,string> <4,match> <5,char> <5,match> <6,ddelta>
```

We're not exploiting the fact that both FOO's and BAR's second token is always the CHAR ':', but I suspect that's of minimal impact.
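A rough sketch of that class-allocation idea (my own illustration; a linear prefix lookup stands in for the real trie, and all names here are invented):

```c
#include <ctype.h>
#include <string.h>

#define MAX_CLASSES 64

struct name_class {
    char prefix[32];   /* name up to the first punctuation character */
    int  ntokens;      /* token slots this name shape consumes */
    int  offset;       /* cumulative slots of all earlier classes (nops) */
};

static struct name_class classes[MAX_CLASSES];
static int nclasses = 0;

/* Return the class index for a name, creating a new class (with its
 * token numbers offset past all earlier classes) if the prefix is new. */
static int class_for(const char *name, int ntokens) {
    size_t plen = 0;
    while (name[plen] && !ispunct((unsigned char)name[plen]) && plen < 31)
        plen++;

    for (int i = 0; i < nclasses; i++)
        if (strlen(classes[i].prefix) == plen &&
            memcmp(classes[i].prefix, name, plen) == 0)
            return i;

    if (nclasses == MAX_CLASSES)
        return -1;  /* real code would grow the table or fall back */

    struct name_class *c = &classes[nclasses];
    memcpy(c->prefix, name, plen);
    c->prefix[plen] = '\0';
    c->ntokens = ntokens;
    c->offset  = nclasses ? classes[nclasses-1].offset +
                            classes[nclasses-1].ntokens : 0;
    return nclasses++;
}
```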

A demonstration of the issue:

```
@ seq22-head2[samtools.../htscodecs]; perl -e '$b1=1;$b2=1;for ($i=0;$i<5000;$i++) {print "foo:$b1\n";print "bar:$b2\n";$b1++;$b2+=int(rand(16))}' > _
@ seq22-head2[samtools.../htscodecs]; ./tests/tokenise_name3 < _|wc -c
14000
@ seq22-head2[samtools.../htscodecs]; egrep foo _ | ./tests/tokenise_name3 |wc -c
86
@ seq22-head2[samtools.../htscodecs]; egrep bar _ | ./tests/tokenise_name3 |wc -c
2675
@ seq22-head2[samtools.../htscodecs]; egrep 'foo|bar' _ | ./tests/tokenise_name3 |wc -c
14000
```

Mostly I doubt it matters that much, but I'm unsure!

Edit: actually that's a totally different issue - it's not pairing them up in the trie. Generalising this code is hard.

@rhpvorderman
Author

```
@ seq22-head2[samtools.../htscodecs]; perl -e '$b1=1;$b2=1;for ($i=0;$i<5000;$i++) {print "foo:$b1\n";print "bar:$b2\n";$b1++;$b2+=int(rand(16))}' > _
@ seq22-head2[samtools.../htscodecs]; ./tests/tokenise_name3 < _|wc -c
14000
@ seq22-head2[samtools.../htscodecs]; egrep foo _ | ./tests/tokenise_name3 |wc -c
86
@ seq22-head2[samtools.../htscodecs]; egrep bar _ | ./tests/tokenise_name3 |wc -c
2675
@ seq22-head2[samtools.../htscodecs]; egrep 'foo|bar' _ | ./tests/tokenise_name3 |wc -c
14000
```

Interesting.

```
$ cat <(egrep 'foo' _) <(egrep 'bar' _) | ./tests/tokenise_name3 |wc -c
2786
```

> Edit: actually that's a totally different issue - it's not pairing them up in the trie.

Yes, I thought the trie was supposed to find the closest entry prefix-wise. I can imagine foo:1, foo:11 and friends causing some serious issues here...
But then why is it not an issue when only the foo names are loaded? Is the trie bypassed there?

@jkbonfield
Collaborator

See #134.

It was a flaw in search_trie where unrecognised name forms would mistakenly revert back to the previous record. It now reverts to the previous matching prefix up to a punctuation symbol. This also fixes foo:11 vs foo:1 (but not foo11 vs foo1).

@rhpvorderman
Author

rhpvorderman commented May 14, 2025

I added my own classifier instead of relying on isalpha, isdigit and ispunct. This allows !?$ to be classified as alpha characters, and also +,&, because these usually separate things within the same field and should thus not be encoded separately.
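A minimal sketch of such a classifier table (assumed shape; the actual commit may differ):

```c
#include <ctype.h>
#include <string.h>

enum char_class { CC_SEP, CC_ALPHA, CC_DIGIT };

static unsigned char cclass[256];

/* Build a 256-entry table replacing isalpha/isdigit/ispunct, so that
 * "!?$+,&" count as alpha (kept inside tokens) instead of separators. */
static void init_cclass(void) {
    for (int c = 0; c < 256; c++) {
        if (isdigit(c))
            cclass[c] = CC_DIGIT;
        else if (isalpha(c) || (c && strchr("!?$+,&", c)))
            cclass[c] = CC_ALPHA;
        else
            cclass[c] = CC_SEP;   /* everything else splits tokens */
    }
}
```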

```
../htscodecs-corpus/names/01.names   169525  // -100 bytes from master
../htscodecs-corpus/names/02.names   417607
../htscodecs-corpus/names/03.names   49981
../htscodecs-corpus/names/05.names   288460
../htscodecs-corpus/names/08.names   42587  // -7 bytes from master
../htscodecs-corpus/names/09.names   255867
../htscodecs-corpus/names/10.names   307270
../htscodecs-corpus/names/20.names   139707
../htscodecs-corpus/names/nv2.names   87823  // +21 bytes from master
../htscodecs-corpus/names/nv.names   329535
../htscodecs-corpus/names/rr.names   91103
../htscodecs-corpus/names/uuid4.names   176373 // Lots of bytes less than master.
```

Unfortunately this does not lead to optimal encoding of uuid4 names yet.

But since hex digits can now be detected, I also tested this code: rhpvorderman@40927ad, where every hex digit is simply a char.

Unfortunately that does not help, for some reason:

```
../htscodecs-corpus/names/uuid4.names   170723
```

I checked with enc_debug. Unlike the special casing, where everything is N_CHAR, here all the - characters are N_MATCH. Also, the hexadecimal is sometimes in a form like 49023acfdedf, which is encoded as a suffix. EDIT: also, sometimes a hexadecimal grouping is composed only of decimal digits and gets classified as a number rather than a hexadecimal.

So special casing yields much better results than generalization. This generalization is a lot of extra code for the 100 bytes of wins it materializes on Illumina...

@jkbonfield
Collaborator

In doing my investigations I was reminded that I already had uuid4 detection in search_trie. This sets prefix_len to 37, which basically makes it not tokenise those bytes. It gave in the ballpark of 170-180kb too. The only reason it didn't work on your case is that it was written to detect uuid4-flowcell-name, hence it looks for a uuid4 followed by '-'. Removing that requirement cured the poor compression in your uuid4 case, with a really minimal change.

However, 170kb is about the maximum you get when you add in the '-' and nul chars from ALPHA. Splitting it by column solves that, and also the fixed '4' I guess, although it's probably slower as there are many columns being compressed instead of just one string.

@rhpvorderman
Author

Yes. And indeed 170 KB is probably good enough given that UUIDs are such a small part of the ONT data anyway. I will close this PR. I learned a lot from doing this. Thank you for all the feedback!
