Add UMI support to FASTQ input/output. #1960
Merged
vasudeva8 merged 1 commit into samtools:develop on Oct 21, 2025
The fastq_umi option (FASTQ_OPT_UMI enum) enables UMI parsing in read names.
When reading FASTQ, it converts the 8th field of the Illumina read name to an aux tag (RX by default). We may need to amend this if people require it to work with earlier Illumina naming schemes, but for now we target the current software.
When writing FASTQ, we search a series of tags and use the first one found. RX is the usual one, but users may also wish to use OX if RX potentially holds error-corrected data and they want to regenerate the FASTQ from the original uncorrected tag.
Complexities arise when dealing with /1 or /2 suffixes and #num multiplexing strings, meaning we have to use temporary buffers rather than simply truncating the read names.
Note that we convert dual-index UMIs of the form SEQ+SEQ to RX:Z:SEQ-SEQ, as per the SAMtags recommendation. However, it's a bit of a wild west out there and not everyone does this. (I've seen 10X outputs with CR using underscores, for example.) This is a best-effort approach to reduce the likelihood of incompatibility with downstream pipelines.
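To make the reading side concrete, here is a minimal standalone sketch of the name handling described above. It is not the code from this PR: split_umi is a hypothetical helper, and the sketch assumes that the UMI field is removed from the stored read name while any /1, /2 or #num suffix is kept. It shows why a separate output buffer is needed when a suffix follows the UMI, and how SEQ+SEQ becomes SEQ-SEQ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the htslib implementation.
 * Splits "a:b:c:d:e:f:g:UMI[suffix]" into the UMI (with '+' rewritten
 * to '-') and the name with the UMI field removed. Any "/1", "/2" or
 * "#num" suffix is kept on the name, which is why the result is built
 * in a separate buffer instead of truncating the input.
 * Returns 0 on success, -1 if there is no 8th field or a buffer is
 * too small.
 */
static int split_umi(const char *name,
                     char *umi, size_t umi_sz,
                     char *out, size_t out_sz)
{
    assert(name && umi && out);

    /* Step past seven ':' separators; the 8th field starts after them. */
    const char *p = name;
    for (int colons = 0; colons < 7; colons++) {
        p = strchr(p, ':');
        if (!p)
            return -1;
        p++;
    }
    const char *umi_start = p;
    /* Length of the name up to, but excluding, the 7th ':'. */
    size_t prefix_len = (size_t)(umi_start - 1 - name);

    /* The UMI ends at the first '/', '#' or at the end of the string. */
    size_t umi_len = strcspn(umi_start, "/#");
    const char *suffix = umi_start + umi_len;
    size_t suffix_len = strlen(suffix);

    if (umi_len == 0 || umi_len + 1 > umi_sz ||
        prefix_len + suffix_len + 1 > out_sz)
        return -1;

    /* Copy the UMI, converting the dual-index separator '+' to '-'. */
    for (size_t i = 0; i < umi_len; i++)
        umi[i] = (umi_start[i] == '+') ? '-' : umi_start[i];
    umi[umi_len] = '\0';

    /* Rebuild the name as prefix + suffix (the suffix copy includes NUL). */
    memcpy(out, name, prefix_len);
    memcpy(out + prefix_len, suffix, suffix_len + 1);
    return 0;
}
```

Under these assumptions, calling split_umi on "inst:1:FC:1:1101:1000:2000:ACGT+TTGA/1" yields the UMI "ACGT-TTGA" (the value that would go into RX:Z) and the name "inst:1:FC:1:1101:1000:2000/1". A name with fewer than eight fields is rejected rather than guessed at.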