-
Notifications
You must be signed in to change notification settings - Fork 22
Description
Hi Zam,
I believe you have previously been in contact with Phil Lobb from the UKHSA TB detection team surrounding our implementation of clockwork.
Thank you for helping us understand the choice to use diploid settings for Cortex and BCFTools mpileup.
Unfortunately, we are now having a different issue with cortex as part of our clockwork implementation. To give some background, our minos vcf contains variants which have an AD value of 0, which cause downstream processing errors with tb-profiler. This is an example from the minos vcf for one sample, where two G-->C variants appear with AD=0:
NC_000962.3 1414935 . G C . PASS . GT:AD:COV:DP:DPF:FRS:GT_CONF:GT_CONF_PERCENTILE 1/1:0,0:0,105:105:1.3202:1.0:674.53:77.66
NC_000962.3 1414937 . G C . PASS . GT:AD:COV:DP:DPF:FRS:GT_CONF:GT_CONF_PERCENTILE 1/1:0,0:0,105:105:1.3202:1.0:674.53:77.66
And here is the downstream error caused by this malformed vcf, which enables us to catch these instances:
2025-02-18 16:02:01,703 - DEBUG - {1414935: ['C'], 1414936: 'G', 1414937: ['C']}
2025-02-18 16:02:01,703 - DEBUG - CGC
2025-02-18 16:02:01,703 - DEBUG - <pysam.libcbcf.VariantRecordSample object at 0x2adbf5f5c3d0>
2025-02-18 16:02:01,703 - DEBUG - Reached AD
2025-02-18 16:02:01,703 - DEBUG - {'GGG': 0, 'CGC': 0}
2025-02-18 16:02:01,703 - DEBUG - Counter({'GGG': 0, 'CGC': 0})
2025-02-18 16:02:01,703 - DEBUG - dp value 0
variant.info.update({'AF':count/dp})
~~~~~^~~
ZeroDivisionError: division by zero
These variants originate from the cortex vcf prior to minos adjudication, where a large indel is called by cortex at position 1414787, an indel that spans these variants. However, upon inspection of the bam file, we see that there are no reads at this position, so we are confused as to how cortex could make this call – I’ve attached the IGV plot of this position for the sample below.
Indeed, our analysis of 49 Mycobacterium africanum samples shows 32 samples have this type of AD=0 error in the minos vcf, and by further investigating these samples, it seems to consistently be the same cortex issue where cortex calls an indel instead of a deletion at a position with no reads. These problematic indels/deletions are found at different positions in the genome for all failing samples, and consistently not in regions characterised as divergent from Mycobacterium tuberculosis (https://pmc.ncbi.nlm.nih.gov/articles/PMC497617/).
I’ve attached a couple more examples where this issue is arising below – each time, the blue bar represents an indel from cortex.
Do you have any ideas what could be causing cortex to call an indel in these positions where there are no reads? Thanks for all your help with our clockwork implementation.
Kind regards,
Matt


