Commit 80e0cae

Update README.md
1 parent b7f8dca commit 80e0cae

1 file changed

+30
-7
lines changed

README.md

Lines changed: 30 additions & 7 deletions
@@ -115,14 +115,14 @@ Here is an overview of the algorithm:
 
 >- Download or process input genomes.
 >- Predict proteins using pyrodigal.
->- Comprehensive clustering of all proteins using CD-HIT (default options: )
+>- Comprehensive clustering of all proteins using CD-HIT
 >- Select genome with the largest number of distinct protein clusters as the initial representative.
 >- Iteratively add more representative genomes one at a time, selecting the next based on maximized addition of novel protein clusters to the current representative set.
 >- End addition of representative genomes if one of three criteria is met: (i) Next genome adds less than X number of distinct protein clusters (X is by default 0), (ii) over Y% of the total distinct protein clusters across all genomes are found in the so-far selected representative genomes (Y is by default 90%), or (iii) over Z% of the total distinct multi-genome protein clusters across all genomes are found in the so-far selected representative genomes (Z is by default 100%). Thus, by default, only Y is used for representative genome selection.
 
 ### Using the Dynamic Programming Dereplication Approach (skDER)
 
-Unlike dRep, which implements a greedy approach for selecting representative genomes, the default dereplication method in skDER approximates selection of a single representative for coarser clusters of genomes using a dynamic programming approach in which a set of genomes deemed as redundant is kept track of, avoiding the need to actually cluster genomes.
+The dynamic dereplication method in skDER approximates selection of a single representative for coarser clusters of genomes using a dynamic programming approach in which a set of genomes deemed as redundant is kept track of, avoiding the need to actually cluster genomes.
 
 Here is an overview of the algorithm:
 
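
Reviewer note on the CiDDER steps listed in the hunk above: the selection loop and its three stopping criteria can be sketched in Python as below. This is a minimal illustration, not the tool's actual code. The function name `select_representatives`, its parameter names, the input shape (a dict of genome name to set of cluster IDs), the tie-breaking by sorted name, and the order in which the criteria are checked are all assumptions made for the example. The thresholds mirror the documented defaults (X = 0, Y = 90%, Z = 100%) and use strict comparisons, so by default only the Y criterion can trigger, as the README states.

```python
from collections import Counter


def select_representatives(genome_clusters, min_new=0, max_total_frac=0.90,
                           max_multi_frac=1.00):
    """Greedily pick genomes that add the most not-yet-covered protein clusters.

    genome_clusters maps a genome name to the set of cluster IDs it contains
    (hypothetical input shape, chosen for this sketch).
    """
    if not genome_clusters:
        return []
    all_clusters = set().union(*genome_clusters.values())
    if not all_clusters:
        return []
    # Clusters present in more than one genome ("multi-genome" clusters).
    counts = Counter(c for clusters in genome_clusters.values() for c in clusters)
    multi = {c for c, n in counts.items() if n > 1}

    names = sorted(genome_clusters)  # sorted so that ties break deterministically
    # Initial representative: the genome with the most distinct clusters.
    first = max(names, key=lambda g: len(genome_clusters[g]))
    reps = [first]
    covered = set(genome_clusters[first])

    while len(reps) < len(names):
        # Criterion (ii): over Y of all distinct clusters already covered.
        if len(covered) / len(all_clusters) > max_total_frac:
            break
        # Criterion (iii): over Z of all multi-genome clusters already covered.
        if multi and len(covered & multi) / len(multi) > max_multi_frac:
            break
        candidates = [g for g in names if g not in reps]
        best = max(candidates, key=lambda g: len(genome_clusters[g] - covered))
        gain = len(genome_clusters[best] - covered)
        # Criterion (i): the next genome adds fewer than X new clusters.
        if gain < min_new:
            break
        reps.append(best)
        covered |= genome_clusters[best]
    return reps


demo = {"A": {1, 2, 3, 4, 5, 6}, "B": {5, 6, 7, 8}, "C": {1, 2}, "D": {9, 10}}
print(select_representatives(demo))  # ['A', 'B', 'D']
```

In the demo, genome C is never selected because everything it contains is already covered by A, and selection stops once coverage of all distinct clusters exceeds 90%.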

@@ -203,11 +203,23 @@ usage: skder [-h] [-g GENOMES [GENOMES ...]] [-t TAXA_NAME] [-r GTDB_RELEASE] -o
 before MGE filtering to not penalize genomes of high quality that simply have many
 MGEs and enable them to still be selected as representatives.
 
+If you use skDER for your research, please kindly cite both:
+
+Fast and robust metagenomic sequence comparison through sparse chaining with skani.
+Nature Methods. Shaw and Yu, 2023.
+
+and
+
+skDER & CiDDER: microbial genome dereplication approaches for comparative genomic
+and metagenomic applications. Salamzade, Kottapalli, and Kalan, 2024
+
+
 
 options:
 -h, --help show this help message and exit
 -g GENOMES [GENOMES ...], --genomes GENOMES [GENOMES ...]
-Genome assembly files in (gzipped) FASTA format
+Genome assembly file paths or paths to containing
+directories. Files should be in FASTA format and can be gzipped
 (accepted suffixes are: *.fasta,
 *.fa, *.fas, or *.fna) [Optional].
 -t TAXA_NAME, --taxa-name TAXA_NAME
213225
-t TAXA_NAME, --taxa-name TAXA_NAME
@@ -221,7 +233,7 @@ options:
 -d DEREPLICATION_MODE, --dereplication-mode DEREPLICATION_MODE
 Whether to use a "dynamic" (more concise) or "greedy" (more
 comprehensive) approach to selecting representative genomes.
-[Default is "dynamic"]
+[Default is "greedy"]
 -i PERCENT_IDENTITY_CUTOFF, --percent-identity-cutoff PERCENT_IDENTITY_CUTOFF
 ANI cutoff for dereplication [Default is 99.0].
 -tc, --test-cutoffs Assess clustering using various pre-selected cutoffs.
@@ -235,7 +247,7 @@ options:
 -p SKANI_TRIANGLE_PARAMETERS, --skani-triangle-parameters SKANI_TRIANGLE_PARAMETERS
 Options for skani triangle. Note ANI and AF cutoffs
 are specified separately and the -E parameter is always
-requested. [Default is ""].
+requested. [Default is "-s 90.0"].
 -s, --sanity-check Confirm each FASTA file provided or downloaded is actually
 a FASTA file. Makes it slower, but generally
 good practice.
@@ -297,12 +309,23 @@ usage: cidder [-h] [-g GENOMES [GENOMES ...]] [-t TAXA_NAME] [-r GTDB_RELEASE] -
 of proteins overlapping, ANI) in the clustering reports will all be based on processed
 (MGE filtered) genomes. However, the final representative genomes in the
 Dereplicated_Representative_Genomes/ folder will be the original unprocessed genomes.
-
+
+If you use CiDDER for your research, please kindly cite both:
+
+CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics.
+Fu et al. 2012
+
+and
+
+skDER & CiDDER: microbial genome dereplication approaches for comparative genomic and
+metagenomic applications. Salamzade, Kottapalli, and Kalan, 2024.
+
 
 options:
 -h, --help show this help message and exit
 -g GENOMES [GENOMES ...], --genomes GENOMES [GENOMES ...]
-Genome assembly files in (gzipped) FASTA format
+Genome assembly file paths or paths to containing
+directories. Files should be in FASTA format and can be gzipped
 (accepted suffixes are: *.fasta,
 *.fa, *.fas, or *.fna) [Optional].
 -t TAXA_NAME, --taxa-name TAXA_NAME
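
Reviewer note on the updated `-g/--genomes` help text: the new wording says the option accepts either genome file paths or directories containing them, with suffixes `*.fasta`, `*.fa`, `*.fas`, or `*.fna`, optionally gzipped. A rough Python sketch of that kind of input expansion follows. It is only an illustration of the described behavior. The helper names `is_fasta_name` and `expand_genome_inputs` are hypothetical, and the non-recursive directory scan, the `.gz` handling, and the case-insensitive suffix match are assumptions, not something the diff confirms about skDER or CiDDER.

```python
from pathlib import Path

# Suffixes named in the help text; a trailing ".gz" is assumed to mark gzipped files.
FASTA_SUFFIXES = (".fasta", ".fa", ".fas", ".fna")


def is_fasta_name(path):
    """Return True if the file name ends in an accepted suffix (optionally + .gz)."""
    name = path.name.lower()  # case-insensitive matching is an assumption
    if name.endswith(".gz"):
        name = name[:-3]
    return name.endswith(FASTA_SUFFIXES)


def expand_genome_inputs(inputs):
    """Expand a mix of file and directory paths into a list of FASTA file paths."""
    found = []
    for item in inputs:
        p = Path(item)
        if p.is_dir():
            # Assumption: only the top level of each directory is scanned.
            found.extend(sorted(c for c in p.iterdir()
                                if c.is_file() and is_fasta_name(c)))
        elif p.is_file() and is_fasta_name(p):
            found.append(p)
    return found
```

Calling `expand_genome_inputs(["genomes_dir", "extra.fna.gz"])` (both paths hypothetical) would return the matching files inside `genomes_dir` followed by `extra.fna.gz`, skipping anything without an accepted suffix.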
