Separate out tokens based on punctuation. #132
Conversation
I also checked gzip, bzip2 and xz on 10000uuid4.txt and this PR performs better, which is quite nice. (Not that it matters much for ONT data, but still.)
Testing on the htscodecs-corpus data I see this is a significant regression with some data sets. 01.names is NCBI-style FASTQ read names. I thought maybe it was the space that breaks it, but oddly changing it to punctuation makes things substantially worse, for both master and this PR. nv2.names is NovaSeq data with barcode information, the sort of thing we see in FASTQ files often. Changing the space for / there does fix this one though. I think what we're seeing is that optimal tokenisation is likely a "hard problem". We could perhaps try different strategies, though, where every read name has the same length, as this can change how we tokenise considerably.
I changed the code. It gets very slight wins now in 01.names and 08.names; 03.names and nv2.names are somewhat poorer. (UUID4 is much better, but not included in the corpus, albeit that it is less relevant.) The rest is the same. These names show the weakness of the strategy: both 03.names and 08.names have this structure. The prefix is shared with other reads, so that is not really problematic. I wrote some code (not pushed) to detect an alphanumeric switch in a token and count the number of switches. If only one switch occurs, break up the token. This solves a lot; nv2.names is still not fixed though. The difference is clearly in the
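A minimal Python sketch of that switch-counting idea (the function names are hypothetical; this is not the unpushed code itself):

```python
def count_switches(token):
    """Count transitions between alphabetic and numeric characters in a token."""
    switches = 0
    for prev, cur in zip(token, token[1:]):
        if prev.isdigit() != cur.isdigit():
            switches += 1
    return switches


def split_on_single_switch(token):
    """Break a token in two if it contains exactly one alpha<->digit switch."""
    if count_switches(token) != 1:
        return [token]
    for i in range(1, len(token)):
        if token[i - 1].isdigit() != token[i].isdigit():
            return [token[:i], token[i:]]
    return [token]


# e.g. split_on_single_switch("SRR123") -> ["SRR", "123"]
# but "A1B2" has three switches and is left whole.
```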
Very hard indeed. Except for UUID4 names, this code is not really an improvement (unless you count 40-byte wins in some areas). It may be better to special-case certain situations rather than change the current code.
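For illustration, one way such special-casing might look, as a hedged Python sketch (the regex and function name are assumptions, not code from this PR or the branches discussed later):

```python
import re

# Version-4 UUID: 8-4-4-4-12 lowercase hex groups, with a fixed '4' version
# nibble and an 8/9/a/b variant nibble.
UUID4_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)


def looks_like_uuid4(name):
    """Return True if a read name follows the UUID4 naming convention."""
    return UUID4_RE.match(name) is not None
```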
A candidate for replacement of samtools#132. We still need to do more work here on optimising name compression in a general manner, especially on mixed data sets, but this resolves a simple case while having no impact on any other name formats.
I added an alternative commit to my own branch which specifically looks for the UUID4 naming convention. I'm not particularly thrilled with more explicit format detection, but pragmatically it's easier than getting a general-purpose detection that doesn't run the risk of harming any existing data set. What do you think? jkbonfield@c666474 Results:
Looks good to me. I mean, it is very tempting to keep trying to find the one algorithm to rule them all, but I have arguably already spent a bit too much time doing that.
Thinking a bit about the lessons learned here. I think that the current method of tokenisation only takes into account the immediately preceding characters as context for tokenisation, as well as only one name that has the same prefix. Doing a two-pass approach, where first all the reads are tokenised and then the most optimal tokenisation is determined from the context, has the possibility of yielding better results. This method can detect the UUID4 form, as is shown in the python script I made. (Although it relied on anchoring on the separator chars.) So, for instance, the algorithm could detect the canonical form of any name in the stream. In the second pass, for each read, all lower hexadecimal characters are taken and encoded (all as N_CHAR, but that is an implementation detail), then a CHAR, then a lower hexadecimal, etc. For Illumina the order is.

The disadvantage of this approach is that it does two passes, and that is by definition more expensive than one pass. It then becomes a question of whether the gains are worth it. For UUID4 yes; for all the other cases, questionable. This also performs extremely poorly on mixed datasets. (However, datasets that are truly so mixed that the tokens don't line up are rare enough that the gzip fallback is acceptable.) That's quite some additional cost compared to just special-casing UUID4.
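A rough Python sketch of the two-pass idea, with illustrative helper names (this is not the codec code nor the python script mentioned above):

```python
import re
import string

PUNCT = set(string.punctuation)


def tokenise(name):
    """Pass 1: split a name into alphanumeric stretches and single other chars."""
    return re.findall(r"[A-Za-z0-9]+|.", name)


def classify(token):
    """Classify a token. For brevity only CHAR / HEX / STRING are used here;
    a real classifier would also distinguish pure digit runs."""
    if token in PUNCT:
        return "CHAR"
    if all(c in "0123456789abcdef" for c in token):
        return "HEX"
    return "STRING"


def canonical_form(names):
    """Derive the shared token-type signature of a whole stream of names;
    the second pass would then encode each name against this canonical form."""
    forms = {tuple(classify(t) for t in tokenise(n)) for n in names}
    return forms.pop() if len(forms) == 1 else None
```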
Mixed datasets can, in theory, be accommodated. E.g. if we have FOO: and :BAR then in that case we're encoding the data stream with (token_number, type) as (1,STRING), (2,DIGITS) and (1,DIGITS), (2,STRING), so there's no clash between the two naming forms. (Actually more than that, as we'd have MATCH in there too, but that's basically constant after the first name.)

Where it fails is if we have, say, FOO and BAR names of the same form, but the digit delta for FOO is always +1 and for BAR it's variable, meaning we're not getting the benefit of knowing FOO is always +1 and a constant delta. Instead we could do:

FOO comparison: <1,string> <1,match> <2,char> <2,match> <3,ddelta>

This now means each <num,type> stream is distinct, bar the very first FOO/BAR for <1,string>, which is a one-off. It's very hard to optimise this, but naively we could do a single pass using the existing trie that tracks the previous name type by allocating a class to a name. So FOO:1 matches nothing and gets class 1. FOO:2 matches FOO:1 in the trie and copies its class of 1. BAR:10 doesn't match in the trie so gets a new class of 2. Etc. This way every name has an allocated name class. We could then also track the number of tokens used to match a name class (plus a few extras maybe), and cumulatively add that many nops for all previous class numbers, so BAR becomes:

BAR comparison: <1,nop> <2,nop> <3,nop> <4,string> <4,match> <5,char> <5,match> <6,ddelta>

We're not exploiting that both FOO's and BAR's second token is always a CHAR ':', but it's of minimal impact I suspect. A demonstration of the issue: Mostly I doubt it matters that much, but I'm unsure! Edit: actually that's a totally different issue - it's not pairing them up in the trie. Generalising this code is hard.
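A naive single-pass sketch of the class/nop bookkeeping described above, with illustrative assumptions (':' as the delimiter, made-up token type names, a dict standing in for the trie):

```python
def assign_classes(names):
    """Give each name a class based on its prefix before the first ':',
    standing in for the trie lookup of a previously seen similar name."""
    classes = {}
    assigned = []
    for name in names:
        prefix = name.split(":", 1)[0]
        if prefix not in classes:
            classes[prefix] = len(classes) + 1
        assigned.append((name, classes[prefix]))
    return assigned


def offset_tokens(token_types, n_nops):
    """Shift a name's token numbers by prefixing nops, so different name
    classes land in distinct <token_number, type> streams."""
    nops = [(i + 1, "NOP") for i in range(n_nops)]
    shifted = [(n_nops + i + 1, t) for i, t in enumerate(token_types)]
    return nops + shifted


# assign_classes(["FOO:1", "FOO:2", "BAR:10"])
# -> [("FOO:1", 1), ("FOO:2", 1), ("BAR:10", 2)]
#
# For a class-2 name whose class-1 names use 3 tokens:
# offset_tokens(["STRING", "CHAR", "DDELTA"], 3)
# -> [(1, 'NOP'), (2, 'NOP'), (3, 'NOP'),
#     (4, 'STRING'), (5, 'CHAR'), (6, 'DDELTA')]
```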
Interesting.
Yes, I thought the trie was supposed to get the closest entry prefix-wise. I can imagine
See #134. It was a flaw in the search_trie where unrecognised name forms would mistakenly revert back to the previous record. It now reverts to the previous matching prefix up to a punctuation symbol. This also fixes foo:11 vs foo:1 (but not foo11 vs foo1).
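A toy restatement in Python of the fallback rule described there (not the actual #134 patch, just an illustration of the behaviour):

```python
import string

PUNCT = set(string.punctuation)


def prefix_to_punct(name):
    """The leading part of a name up to and including its last punctuation symbol."""
    for i in range(len(name) - 1, -1, -1):
        if name[i] in PUNCT:
            return name[: i + 1]
    return name


def best_previous(name, previous_names):
    """Fall back to the most recent earlier name sharing the punctuation-bounded
    prefix, rather than blindly comparing against the immediately preceding record."""
    want = prefix_to_punct(name)
    for prev in reversed(previous_names):
        if prefix_to_punct(prev) == want:
            return prev
    return None


# best_previous("foo:11", ["foo:1", "bar:7"]) -> "foo:1"
```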
I added my own classifier instead of relying on the existing one. Unfortunately this does not lead to optimal encoding of uuid4 names yet. But since hexdigits can be detected now, I also tested this code: rhpvorderman@40927ad, where every hexdigit is simply a char. Unfortunately that does not help, for some reason. I checked with enc_debug, and unlike the special casing, where everything is N_CHAR, here the tokens come out differently. So special casing yields much better results than generalization. This generalization is a lot of extra code for the 100 bytes of wins it materializes in Illumina...
In doing my investigations I reminded myself I already had uuid4 detection elsewhere. However, 170 KB is about the max you get when you add in the '-' and nul chars from ALPHA. Splitting it by column solves that, and also the fixed '4' I guess, although it's probably slower as there are many columns being compressed instead of just one string.
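The column-splitting idea amounts to transposing fixed-length names into per-column streams; a sketch, assuming the standard 36-character UUID layout:

```python
def split_by_column(names):
    """Transpose fixed-length names into per-column byte streams so each
    position compresses on its own."""
    width = len(names[0])
    assert all(len(n) == width for n in names), "assumes fixed-length names"
    return ["".join(name[i] for name in names) for i in range(width)]


# For 36-character UUID4 names, columns 8, 13, 18 and 23 are always '-' and
# column 14 is always '4', so those streams compress to almost nothing,
# while the remaining columns are pure hex nibbles.
```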
Yes. And indeed 170 KB is probably good enough given that UUIDs are such a small part of the ONT data anyway. I will close this PR. I learned a lot from doing this. Thank you for all the feedback!
See also #131
A very simple change. Tokens are not determined by a sequence of similar characters but are instead separated out by punctuation characters. The stretches in between are then classified as numbers or strings.
This change simplifies the code. It also allows for adding hexadecimal tokens at a later point in time, if desired.
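A minimal Python sketch of the tokenisation rule described above (illustrative only; the actual change lives in the codec itself):

```python
def tokenise_by_punctuation(name):
    """Split a read name on punctuation characters; the stretches in between
    are classified as DIGITS or STRING (hexadecimal could be added later)."""
    tokens = []
    cur = ""
    for ch in name:
        if not ch.isalnum():  # treat anything non-alphanumeric as punctuation
            if cur:
                tokens.append(("DIGITS" if cur.isdigit() else "STRING", cur))
                cur = ""
            tokens.append(("CHAR", ch))
        else:
            cur += ch
    if cur:
        tokens.append(("DIGITS" if cur.isdigit() else "STRING", cur))
    return tokens


# e.g. tokenise_by_punctuation("read:123/part_7") ->
# [('STRING', 'read'), ('CHAR', ':'), ('DIGITS', '123'), ('CHAR', '/'),
#  ('STRING', 'part'), ('CHAR', '_'), ('DIGITS', '7')]
```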
Tested on the following files.
10000_illumina_ids.txt
10000_illumina_ids_simpler.txt
10000_pacbio_revio_ids.txt
10000uuid4.txt