@@ -85,6 +85,12 @@ When performing path-finding, this threshold limits the number of paths which ar
8585speed up runtime but may come at a cost of recall. A higher ` maxpaths ` is slower and may come at a cost to
8686specificity.
8787
88+ ### ` --maxnodes `
89+ If a neighborhood has too many variants, its graph will become large in memory and slow to traverse. This parameter
90+ will turn off path-finding in favor of ` --one-to-one ` haplotype to variant comparison (see Experimental Parameters
91+ below), reducing runtime and memory usage. This may reduce recall in regions with many SVs, but these regions are
92+ problematic anyway.
93+
8894### ` --hapsim `
8995After performing kmeans clustering on reads to determine the two haplotypes, if the two haplotypes have a size similarity
9096above ` hapsim ` , they are consolidated into a homozygous allele.
@@ -121,29 +127,29 @@ Details of `FT`
121127# 🔌 Compute Resources
122128
123129Kanpig is highly parallelized and will fully utilize all threads it is given. However, hyperthreading doesn't seem to
124- help and therefore the number of threads should probably be limited to the number of physical processors available.
130+ help and therefore the number of threads should probably be limited to the number of physical processors available. For
131+ memory, giving kanpig 2GB per core is usually more than enough.
132+
133+ The actual runtime and memory usage of a kanpig run will depend on the read coverage and the number of SVs in the input
134+ VCF. As an example of kanpig's resource usage with 16 cores available, genotyping a 30x long-read bam against a 2,199
135+ sample VCF (4.3 million SVs) took 13 minutes with a maximum memory usage of 12GB. Converting the bam to a plup file took
136+ 4 minutes (8GB of memory) and genotyping with this plup file took 3 minutes (12GB memory).
125137
126- For memory, a general rule is kanpig will need about 20x the size of the compressed ` .vcf.gz ` . The minimum required
127- memory is also dependent on the number of threads running as each will need space for its processing. For example,
128- a 1.6Gb vcf ( ~ 5 million SVs) using 16 cores needs at least 32Gb of RAM. That same vcf with 8 or 4 cores needs at least
129- 24Gb and 20Gb of RAM, respectively.
138+ While genotyping against a plup file is usually faster, bam to plup conversion is most useful for:
139+ * genotyping a large VCF or super-high (>50x) coverage bam
140+ * a sample that will be genotyped multiple times (e.g. N+1 pipelines)
141+ * long-term access to reads (a plup file is up to ~ 2,000x smaller than a bam)
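
The sizing guidance above (threads limited to physical cores, roughly 2GB of memory per core) can be turned into a quick planning calculation. The following is a minimal sketch, not part of kanpig: the function name `plan_resources` and its arguments are hypothetical, and the 2GB default simply restates the rule of thumb from the text.

```python
def plan_resources(physical_cores: int, gb_per_core: float = 2.0) -> dict:
    """Estimate a thread count and memory request for a kanpig job.

    Hypothetical helper: it only encodes the README's rule of thumb
    (one thread per physical core, about 2GB of memory per core).
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    return {
        # Hyperthreads reportedly do not help, so use physical cores only.
        "threads": physical_cores,
        # Memory request in GB; described as usually more than enough.
        "memory_gb": physical_cores * gb_per_core,
    }


if __name__ == "__main__":
    print(plan_resources(16))
```

For the 16-core example in the text, this requests 32GB, which leaves headroom over the reported 12GB peak.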
130142
131143# 🔬 Experimental Parameter Details
132144
133145These parameters have a varying effect on the results and are not guaranteed to be stable across releases.
134146
135- ### ` --try-exact `
136- Before performing the path-finding algorithm that applies a haplotype to the variant graph, perform a 1-to-1 comparison
137- of the haplotype to each node in the variant graph. If a single node matches above ` sizesim ` and ` seqsim ` , the
138- path-finding is skipped and haplotype applied to the node.
139-
140- This parameter will boost the specificity and speed of kanpig at the cost of recall.
141-
142- ### ` --prune `
143- Similar to ` try-exact ` , a 1-to-1 comparison is performed before path-finding. If any matches are found, all paths
144- which do not traverse the matching nodes are pruned from the variant graph.
147+ ### ` --one-to-one `
148+ Instead of performing the path-finding algorithm that applies a haplotype to the variant graph, perform a 1-to-1
149+ comparison of the haplotype to each node in the variant graph. If a single node matches above ` sizesim ` and ` seqsim ` ,
150+ the haplotype is applied to it.
145151
146- This parameter will boost the specificity and speed of kanpig at the cost of recall.
152+ This parameter will boost the specificity, increase speed, and lower memory usage of kanpig at the cost of recall.
147153
148154### ` --maxhom `
149155