The PureTarget repeat expansion panel is a targeted assay used to simultaneously resolve multiple pathogenic tandem repeats. The assay produces multiplexed high-depth sequencing libraries for dozens of samples in a single run.
To use TRGT with PureTarget data, we need to:
- Use all reads produced by a sequencing run
- Use --preset targeted when running TRGT
Below we describe the commands needed for both.
The first step, preparing the input reads, depends on which sequencer was used.
If reads were sequenced on a Revio, you should have two BAM files:
hifi_reads/movie.hifi_reads.bam and fail_reads/movie.fail_reads.bam (where
movie is a run-specific movie name). The BAM filename may also contain the
demultiplexed barcode name, but the important part is that the two files only
differ by the words "hifi" and "fail".
To prepare the input for analysis, use pbmerge to merge both BAM files as follows:
pbmerge -o movie.input.bam hifi_reads/movie.hifi_reads.bam fail_reads/movie.fail_reads.bam
(note: pbmerge is recommended over samtools merge to create proper inputs
for downstream tools such as lima or pbmm2).
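Because the two Revio files differ only by the words "hifi" and "fail", the second path can be derived from the first. The following is a minimal sketch, not taken from the pbmerge or TRGT documentation: the helper names fail_path_for and merge_cmd_for are hypothetical, and it assumes your files follow the hifi_reads/fail_reads naming pattern shown above. It only prints the pbmerge command so that the derived path can be checked before anything is run.

```shell
# Hypothetical helper: derive the fail_reads path from a hifi_reads path,
# assuming the two paths differ only by the words "hifi" and "fail".
fail_path_for() {
    printf '%s\n' "$1" | sed 's/hifi_reads/fail_reads/g'
}

# Hypothetical helper: print (do not run) the pbmerge command for one movie.
# Usage: merge_cmd_for hifi_reads/movie.hifi_reads.bam movie.input.bam
merge_cmd_for() {
    hifi_bam=$1
    out_bam=$2
    fail_bam=$(fail_path_for "$hifi_bam")
    printf 'pbmerge -o %s %s %s\n' "$out_bam" "$hifi_bam" "$fail_bam"
}
```

Once the printed command looks right, it can be executed by piping it to sh, for example: merge_cmd_for hifi_reads/movie.hifi_reads.bam movie.input.bam | sh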
If your sample was generated on a Sequel II/IIe system, you should have access to
the .subreads.bam file. In that case, run ccs in its mode-all setting by
passing the --all option as follows:
ccs --all movie.subreads.bam movie.input.bam
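In either case, the resulting movie.input.bam is the starting point of the
analysis. It should be mapped to the reference genome with pbmm2 using the
default --preset HIFI, similar to WGS reads (reference.fasta below stands for
your reference genome, which pbmm2 align requires as an input):
pbmm2 align --preset HIFI reference.fasta movie.input.bam movie.pbmm2.bam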
We strongly recommend analyzing PureTarget samples with the --preset targeted
option when running TRGT. All other parameters can be left at their
defaults.
The resulting command would look like this:
./trgt genotype \
--preset targeted \
--genome example/reference.fasta \
--repeats example/repeat.bed \
--reads example/sample.bam \
--output-prefix sample
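Since a PureTarget run typically contains many demultiplexed samples, the same command usually has to be repeated per sample. The following is a minimal sketch under stated assumptions: trgt_cmds is a hypothetical helper name, the output prefix is simply taken from the BAM basename, and paths are assumed to contain no spaces. It prints one command per BAM file using only the options shown above.

```shell
# Hypothetical helper: print one TRGT command per mapped BAM file, using the
# BAM basename (without .bam) as the output prefix.
# Usage: trgt_cmds GENOME REPEATS BAM [BAM ...]
trgt_cmds() {
    genome=$1
    repeats=$2
    shift 2
    for bam in "$@"; do
        prefix=$(basename "$bam" .bam)
        printf './trgt genotype --preset targeted --genome %s --repeats %s --reads %s --output-prefix %s\n' \
            "$genome" "$repeats" "$bam" "$prefix"
    done
}
```

For example, trgt_cmds example/reference.fasta example/repeat.bed mapped/*.bam prints the commands for every BAM in a (hypothetical) mapped/ directory; review them, then pipe the output to sh to run them.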
The PureTarget protocol produces insert sizes of about 5 kb. Large expansions of loci like FXN, C9orf72, DMPK and CNBP produce much larger molecules that may not produce reads reaching HiFi quality thresholds at typical movie times (recall that HiFi reads have an average quality of Q20 or more). Using all reads produced by the sequencer may significantly increase the coverage of expanded alleles and prevent allelic dropouts.
Below is a comparison using an extreme example of an FXN carrier sequenced on a Revio platform. On the left is the TRGT plot when only HiFi reads are used, and on the right is the same dataset when all available reads are used for the analysis:
The --preset targeted command-line option makes the following changes to TRGT:
- It disables filtration of reads based on the rq tag. By default, TRGT only uses reads with rq >= 0.98.
- It sets the --genotyper cluster flag, which assigns reads to alleles based on sequence composition, and not just the individual STR size of each read.
- It uses an alignment scoring for flanking sequences that is optimized for all types of reads (e.g., penalizing gap openings less severely).
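To get a feel for how many reads the rq-based filter mentioned above would affect, the rq values can be tallied directly from the BAM records. This is a minimal sketch and not how TRGT implements its filter: count_rq is a hypothetical helper name, and it assumes the reads carry rq as a float tag (rq:f:) in SAM text, with 0.98 as the default threshold stated above.

```shell
# Hypothetical helper: read SAM records on stdin and count how many have
# rq >= threshold (default 0.98) versus below it. Header lines (@...) and
# records without an rq tag are skipped.
count_rq() {
    awk -v thr="${1:-0.98}" -F '\t' '
        /^@/ { next }
        {
            # Optional SAM tags start at column 12.
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^rq:f:/) {
                    rq = substr($i, 6) + 0
                    if (rq >= thr) pass++; else fail++
                    break
                }
            }
        }
        END { printf "pass=%d fail=%d\n", pass, fail }
    '
}
# Example (requires samtools): samtools view movie.pbmm2.bam | count_rq
```

A large "fail" count on a merged input is expected, since the fail reads are exactly the ones that did not reach HiFi quality.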
This may occur if only HiFi reads are being used as the input. Please ensure
your input BAM file contains all sequenced reads, as described in step 1 of this
document. This may also occur if --preset targeted was not set when running
TRGT, in which case the expanded reads may be filtered out and TRGT will produce
a warning like this:
[WARN] - FXN: Quality filtered 35/106 reads
4. I see a clear motif in the allele consensus sequence that is not reported by TRGT or depicted on TRGT plots. Why?
TRGT only analyzes the motif units defined in the repeats catalog, so novel
motifs will not be quantified. Algorithms for de novo discovery of motif units
are currently under development. Until then, our suggested solution is to re-run
TRGT by adding the novel motif to the repeat catalog. For example, if you have
set MOTIFS=CAG, and you clearly see the motif CCG in your consensus, change
the BED file to MOTIFS=CAG,CCG and re-run TRGT. The resulting genotype should
be identical, but the VCF tags will contain quantification of spans of the new
CCG motif.
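The catalog edit described above can be scripted. The following is a minimal sketch under stated assumptions: add_motif is a hypothetical helper name, and the catalog is assumed to be tab-separated with a fourth column of semicolon-separated fields that includes ID= and MOTIFS= (check this against your own catalog before using it). It does not check whether the motif is already listed.

```shell
# Hypothetical helper: append a motif to the MOTIFS= list of one locus.
# Assumes column 4 holds semicolon-separated fields such as ID=...;MOTIFS=...
# Usage: add_motif LOCUS_ID NEW_MOTIF < repeats.bed > repeats.new.bed
add_motif() {
    awk -v id="$1" -v motif="$2" 'BEGIN { FS = OFS = "\t" }
    {
        n = split($4, kv, ";")
        is_target = 0
        for (i = 1; i <= n; i++) if (kv[i] == "ID=" id) is_target = 1
        if (is_target) {
            info = ""
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^MOTIFS=/) kv[i] = kv[i] "," motif
                info = info (i > 1 ? ";" : "") kv[i]
            }
            $4 = info
        }
        print
    }'
}
```

Writing to a new file rather than editing in place keeps the original catalog available, which makes it easy to confirm that the genotype is unchanged between the two runs.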
Using the targeted preset, you may see a warning like this:
[WARN] - FXN: Filtered out 5 impure reads
Internally, TRGT uses a heuristic that discards any lower-quality read that does not resemble a near-perfect tandem repeat (less than 90% similarity in terms of edit distance). This heuristic cannot be disabled, and the number of reads discarded is reported as a warning.
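When many loci and samples are processed, it can help to collect the filtering warnings into a table. The following is a minimal sketch, assuming the log contains warnings worded exactly like the two examples in this document (other log content and any prefix before "[WARN]" are ignored); summarize_filter_warnings is a hypothetical helper name, not part of TRGT.

```shell
# Hypothetical helper: read a TRGT log on stdin and print one tab-separated
# line per filtering warning: locus, warning kind, number of reads filtered.
summarize_filter_warnings() {
    awk '
    {
        p = index($0, "[WARN] - ")
        if (p == 0) next
        msg = substr($0, p + 9)          # text after the "[WARN] - " prefix
        c = index(msg, ": ")
        if (c == 0) next
        locus = substr(msg, 1, c - 1)
        text = substr(msg, c + 2)
        split(text, w, " ")
        if (text ~ /^Quality filtered [0-9]+\/[0-9]+ reads/) {
            split(w[3], frac, "/")       # e.g. 35/106 -> 35 filtered
            printf "%s\tquality\t%s\n", locus, frac[1]
        } else if (text ~ /^Filtered out [0-9]+ impure reads/) {
            printf "%s\timpure\t%s\n", locus, w[3]
        }
    }'
}
# Example: summarize_filter_warnings < sample.trgt.log   (log name is illustrative)
```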
