Commit aeb7573

v0.0.6

1 parent 6ccc6e2 commit aeb7573

23 files changed (+2293, -846 lines)

Makefile

Lines changed: 7 additions & 4 deletions

@@ -11,14 +11,17 @@ endif

 # add -fno-tree-vectorize to avoid certain vectorization errors in O3 optimization
 # right now, we are using -O3 for the best performance, and no vectorization errors were found
-EXTRA_FLAGS = -Wall -Wno-unused-function -Wno-misleading-indentation -Wno-unused-variable -Wno-alloc-size-larger-than
+EXTRA_FLAGS = -Wall -Wno-misleading-indentation -Wno-unused-function #-Wno-unused-variable -Wno-alloc-size-larger-than

 # Define the version number
-LONGCALLD_VERSION =0.0.5
+VERSION=0.0.6
+# LONGCALLD_VERSION =0.0.6
 # Get the Git commit hash
 GIT_COMMIT := $(shell git rev-parse --short HEAD 2> /dev/null)
 ifneq ($(GIT_COMMIT),)
-LONGCALLD_VERSION = 0.0.5-$(GIT_COMMIT)
+LONGCALLD_VERSION = $(VERSION)-$(GIT_COMMIT)
+else
+LONGCALLD_VERSION = $(VERSION)
 endif

 HTSLIB_DIR = ./htslib
@@ -170,6 +173,6 @@ clean_all:
 clean_hts:
 	rm -f $(HTSLIB)
 clean_abpoa:
-	rm -f $(ABPOA_LIB)
+	rm -f $(ABPOA_LIB) $(ABPOA_DIR)/src/*.o
 clean_wfa2:
 	rm -f $(WFA2_LIB)
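
The version logic in the first Makefile hunk falls back to the bare `VERSION` when no Git commit hash is available, and appends `-<hash>` otherwise. A minimal Python sketch of the same rule follows; the function names `git_short_hash` and `compose_version` are hypothetical and exist only for illustration, they are not part of longcallD.

```python
import subprocess


def git_short_hash() -> str:
    """Return the short commit hash, or "" when git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True,  # also swallows stderr, like `2> /dev/null`
            text=True,
            check=False,
        )
    except OSError:  # git is not installed
        return ""
    return result.stdout.strip() if result.returncode == 0 else ""


def compose_version(base: str, commit: str) -> str:
    """Mirror the Makefile rule: append "-<commit>" only when a hash is known."""
    return f"{base}-{commit}" if commit else base


if __name__ == "__main__":
    print(compose_version("0.0.6", git_short_hash()))
```

With this rule a build from a source tarball (no `.git` directory) still gets a non-empty version string, which the pre-change Makefile would only have provided through the first, unconditional assignment.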

README.md

Lines changed: 33 additions & 9 deletions

@@ -14,8 +14,9 @@
 ## Updates (pre-release v0.0.6)

 * Fix corrupted VCF output in v0.0.5
-* Low memory usage (especially when mosaic variant calling enabled)
-<!-- * Add [longdust](https://github.com/lh3/longdust) for long low-complexity regions -->
+* Fix missing MEI header in VCF output
+* Improved run time and memory usage (especially when mosaic variant calling enabled)
+* Add `--input-is-list` and `-X` to support multiple input BAM/CRAM files of the same sample for variant calling


 ## Getting Started
@@ -48,10 +49,12 @@ man ./longcallD.1
 - [Build from source](#build-from-source)
 - [Usage](#usage)
 - [Variant calling with PacBio HiFi/Nanopore long reads](#variant-calling-with-pacbio-hifinanopore-long-reads)
+- [Variant calling with multiple input BAM/CRAM files of the same sample](#variant-calling-with-multiple-input-bamcram-files-of-the-same-sample)
 - [Low allele-frequency mosaic variant calling](#low-allele-frequency-mosaic-variant-calling)
 - [Region-specific variant calling](#region-specific-variant-calling)
-- [Variant calling and output phased long reads](#variant-calling-and-output-phased-long-reads)
+- [Variant calling and output phased (\& refined) long-read BAM/CRAM](#variant-calling-and-output-phased--refined-long-read-bamcram)
 - [Variant calling from remote files](#variant-calling-from-remote-files)
+- [Memory usage](#memory-usage)
 - [Acknowledgements](#acknowledgements)
 - [Contact](#contact)

@@ -107,14 +110,27 @@ longcallD call -t16 ref.fa hifi.bam > hifi.vcf # default for PacBio HiFi
 longcallD call -t16 ref.fa ont.bam --ont > ont.vcf # for ONT reads
 ```

+### Variant calling with multiple input BAM/CRAM files of the same sample
+You can provide multiple BAM/CRAM files of the same sample for variant calling using `--input-is-list` or `-X`:
+```
+longcallD call -t16 --input-is-list ref.fa bam_list.txt > sample.vcf
+# where bam_list.txt contains:
+# sample_part1.bam
+# sample_part2.bam
+# sample_part3.bam
+```
+or
+```
+longcallD call -t16 ref.fa sample_part1.bam -X sample_part2.bam -X sample_part3.bam > sample.vcf
+```
+
 ### Low allele-frequency mosaic variant calling
-With `-s`, longcallD will detect both germline and somatic/mosaic variants.
+With `-s`, longcallD will detect both germline and low-frequency somatic/mosaic variants.

 For each somatic/mosaic variant, a `SOMATIC` tag will be added to the INFO field in the output VCF.
 ```
 longcallD call -s -t16 ref.fa hifi.bam > hifi.vcf
 longcallD call -s -t16 ref.fa hifi.bam -T AluY_L1_SVA_cons_noPA.fa > hifi.vcf # add MEI information in INFO field
-longcallD call -s -t16 ref.fa ont.bam --ont > ont.vcf
 ```

 ### Region-specific variant calling
@@ -126,10 +142,10 @@ longcallD call -t16 ref.fa hifi.bam --region-file reg.bed > hifi_regs.vcf
 longcallD call -t16 ref.fa hifi.bam --autosome > hifi_autosome.vcf
 ```

-### Variant calling and output phased long reads
+### Variant calling and output phased (& refined) long-read BAM/CRAM
 ```
-longcallD call -t16 ref.fa hifi.bam --hifi -b hifi_phased.bam > hifi.vcf # output phased HiFi reads (BAM tag: HP & PS)
-longcallD call -t16 ref.fa ont.bam --ont -b ont_phased.bam > ont.vcf # output phased ONT reads (BAM tag: HP & PS)
+longcallD call -t16 ref.fa hifi.bam --hifi -b hifi_phased.bam > hifi.vcf # output phased HiFi reads (BAM tag: HP & PS)
+longcallD call -t16 ref.fa ont.bam --ont --refine-aln -b ont_phased_refined.bam > ont.vcf # output phased & refined ONT reads (BAM tag: HP & PS)
 ```
 ### Variant calling from remote files
 ```
@@ -138,6 +154,14 @@ bam=https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA2438
 longcallD call -t16 $ref $bam chr11:10,229,956-10,256,221 chr12:10,576,356-10,583,438 > hifi_regs.vcf
 ```

+## Memory usage
+As longcallD performs multiple-sequence alignment/re-alignment, which are memory-intensive, it usually uses more memory than other variant callers.
+The peak memory usage mainly depends on the number of threads (`-t/--threads`), the sequencing coverage, and the read length.
+For human genome sequencing data with ~40x coverage, longcallD typically uses around **1GB** (**HiFi**) or **2GB** (**ONT R10**) memory per thread for germline variant calling.
+
+If you encounter memory issues, you can use `--region-file` to limit the genomic regions being processed.
+Human genome region list excluding centromeres are provided [here](https://github.com/yangao07/longcallD/blob/main/anno/).
+
 ## Acknowledgements
 LongcallD is dependent on the following libraries, we are grateful to all the developers/maintainers:
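
The per-thread figures in the Memory usage hunk above (about 1 GB for HiFi and 2 GB for ONT R10 at ~40x human coverage, germline calling) lend themselves to a quick back-of-envelope estimate. The sketch below assumes peak memory scales linearly with the thread count, which the README implies but does not state exactly; it ignores coverage and read length, and the function name is hypothetical.

```python
# Per-thread figures quoted in the README for ~40x human data, germline calling.
GB_PER_THREAD = {"hifi": 1.0, "ont": 2.0}


def estimate_peak_memory_gb(threads: int, platform: str = "hifi") -> float:
    """Rough peak-memory estimate in GB, assuming linear scaling with threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if platform not in GB_PER_THREAD:
        raise ValueError(f"unknown platform: {platform!r}")
    return threads * GB_PER_THREAD[platform]


if __name__ == "__main__":
    # The README examples use -t16.
    print(estimate_peak_memory_gb(16, "hifi"))
    print(estimate_peak_memory_gb(16, "ont"))
```

Treat the result as a planning figure only: higher coverage, longer reads, or mosaic calling with `-s` may push actual usage above it.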

@@ -146,7 +170,7 @@ LongcallD is dependent on the following libraries, we are grateful to all the de
 * [WFA](https://github.com/smarco/WFA2-lib): pairwise alignment
 * [edlib](https://github.com/Martinsos/edlib): fast sequence similarity calculation
 * [cgranges](https://github.com/lh3/cgranges): interval operations
-* [sdust](https://github.com/lh3/sdust) and [longdust](https://github.com/lh3/longdust): identify low-complexity regions
+* [sdust](https://github.com/lh3/sdust): identify low-complexity regions

 ## Contact
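
For the multi-file input documented in the README changes above, the two invocation styles (`--input-is-list` with a list file, or repeated `-X`) can be assembled programmatically. The sketch below only builds the list file and the argument vectors. It does not run longcallD, the helper names are hypothetical, and the flags are used exactly as they appear in the README examples.

```python
from pathlib import Path
from typing import List, Sequence


def write_bam_list(bam_paths: Sequence[str], list_path: str) -> None:
    """Write one BAM/CRAM path per line, as in the bam_list.txt example."""
    Path(list_path).write_text("".join(f"{p}\n" for p in bam_paths))


def list_mode_command(ref: str, list_path: str, threads: int = 16) -> List[str]:
    """Argument vector for the --input-is-list form."""
    return ["longcallD", "call", f"-t{threads}", "--input-is-list", ref, list_path]


def repeated_x_command(ref: str, bam_paths: Sequence[str], threads: int = 16) -> List[str]:
    """Argument vector for the form with one positional BAM plus -X for each extra file."""
    if not bam_paths:
        raise ValueError("at least one BAM/CRAM file is required")
    cmd = ["longcallD", "call", f"-t{threads}", ref, bam_paths[0]]
    for extra in bam_paths[1:]:
        cmd += ["-X", extra]
    return cmd


if __name__ == "__main__":
    parts = ["sample_part1.bam", "sample_part2.bam", "sample_part3.bam"]
    print(" ".join(repeated_x_command("ref.fa", parts)))
```

Either vector can then be passed to `subprocess.run` with standard output redirected to the VCF file, matching the `> sample.vcf` redirection in the README.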

anno/README.md

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
+## Mobile element sequences
+File `AluY_L1_SVA_cons_noPA.fa` contains consensus sequences of three major
+human active mobile element families (AluY, L1HS and SVA) excluding the polyA
+tails. It can be used with the `-T` option to add mobile element insertion
+(MEI) information in the INFO field of somatic/mosaic variant calls.
+
+## Non-centromeric regions
+
+File `chm13v2.reg.nocen.bed` *excludes* the approximate locations of centromeric
+satellite repeats and acrocentric short arms in CHM13 v2.0, which was *manually*
+constructed based on the [official satellite][cen-sat] annotation, the
+[DNA-BRNN][dna-brnn] satellite annotation and the [minigraph pangenome
+graph][HPRC-mg] from the HPRC year-1 data.
+
+File `hs38.reg.nocen.bed` was constructed similarly from DNA-BRNN annotation and
+excludes regions where minigraph alignment faded.
+
+File `hs37.reg.nocen.bed` was constructed by running DNA-BRNN on GRCh37 (hs37d5.fa).
+
+[cen-sat]: https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_censat_v2.0.bed
+[dna-brnn]: https://github.com/lh3/dna-nn
+[HPRC-mg]: https://zenodo.org/records/10693675
+[zenodo]: https://zenodo.org/records/10963019

anno/chm13v2.reg.nocen.bed

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
+chr1 10000 121569169
+chr1 142292033 248377328
+chr2 10000 92250802
+chr2 94745067 242686752
+chr3 10000 90754701
+chr3 96465026 201095948
+chr4 10000 49655154
+chr4 55353192 193564945
+chr5 10000 46780042
+chr5 51012194 182035439
+chr6 10000 58236706
+chr6 61108390 172116628
+chr7 10000 60360644
+chr7 63764499 160557428
+chr8 10000 44193546
+chr8 46375080 146249331
+chr9 10000 44888599
+chr9 76744047 150607247
+chr10 10000 39583793
+chr10 41976237 134748134
+chr11 10000 50973358
+chr11 54526419 135117769
+chr12 10000 34543492
+chr12 37252490 133314548
+chr13 17558596 113556686
+chr14 12758411 101151492
+chr15 17744466 99743195
+chr16 10000 35784066
+chr16 52269756 96320374
+chr17 10000 23383372
+chr17 27621319 84266897
+chr18 10000 15591581
+chr18 21171235 80532538
+chr19 10000 24520766
+chr19 29819351 61697364
+chr20 10000 26333658
+chr20 33019590 66200255
+chr21 11356378 45080682
+chr22 15761065 51314926
+chrX 10000 2393410
+chrX 2395410 57769763
+chrX 60977195 153924834
+chrX 153926834 154249566
+chrY 10000 2457320
+chrY 2459320 62121809
+chrY 62123809 62450029
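
Before passing a region list such as the one above to `--region-file`, it can help to check how much of the genome it covers. The following is a minimal sketch, assuming a three-column whitespace- or tab-delimited BED file with 0-based, half-open intervals (so each interval spans `end - start` bases); the helper names are hypothetical.

```python
from typing import Iterable, List, Tuple

Region = Tuple[str, int, int]


def parse_bed(lines: Iterable[str]) -> List[Region]:
    """Parse the first three BED columns, skipping blank and comment lines."""
    regions: List[Region] = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end <= start:
            raise ValueError(f"empty or inverted interval: {line.strip()!r}")
        regions.append((chrom, start, end))
    return regions


def total_bases(regions: Iterable[Region]) -> int:
    """Sum interval lengths; BED intervals are half-open, so length is end - start."""
    return sum(end - start for _, start, end in regions)


if __name__ == "__main__":
    # Two rows copied from anno/chm13v2.reg.nocen.bed.
    sample = ["chr21 11356378 45080682", "chr22 15761065 51314926"]
    print(total_bases(parse_bed(sample)))
```

The sum assumes the intervals do not overlap, which holds for the rows listed in this file.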
