-> In v1.3.3, we introduced a `low_mem_greedy` option for low-memory dereplication for the top 20 taxa that are particularly well sequenced (e.g. those with >10k or >20k genomes available). As we showed in the manuscript, dereplication by skDER/cidder or other methods is typically not very memory-intensive for an input set of <5,000 genomes, but memory needs can grow once you go beyond this. The `low_mem_greedy` mode was not included in the manuscript and is still being benchmarked; I plan to update the wiki with details on how its representative selection compares to the standard greedy approach. I expect the quality of the selected representatives to be slightly worse because the mode does not account for "connectivity" when prioritizing them. However, on large datasets it is considerably faster and more computationally efficient: it leverages skani's `search` function in a greedy/iterative approach that prioritizes genomes based only on N50. As an example, we were able to dereplicate >20,000 *Staphylococcus* genomes from GTDB R220 in around 2.25 hours using 20 threads and ~1 GB of memory with the command: `skder -t Staphylococcus -c 20 -r R220 -o Staph_R220_skDER_LMG_Results/ -auto -d low_mem_greedy`. For those interested in running this on a laptop, genomes can still add up in size, so make sure you have enough disk space available for the number of genomes you plan to dereplicate.
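
To make the N50-prioritized greedy idea concrete, here is a minimal Python sketch. It is not skDER's actual implementation: `n50`, `greedy_dereplicate`, and the `find_similar` callback are hypothetical names, and `find_similar` merely stands in for a skani `search` query that returns the genomes passing whatever similarity cutoffs are in use. Genomes are visited from highest to lowest N50; each genome not yet covered becomes a representative, and everything similar to it is marked as covered.

```python
from typing import Callable, Dict, Iterable, List, Set


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50: the contig length at which the running total of
    lengths (sorted longest first) reaches at least half the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def greedy_dereplicate(
    n50_by_genome: Dict[str, int],
    find_similar: Callable[[str], Set[str]],
) -> List[str]:
    """Pick representatives greedily, highest N50 first.

    `find_similar` is a hypothetical stand-in for a similarity search
    (e.g. one query against an indexed genome database); it returns the
    IDs of genomes considered redundant with the query genome.
    """
    representatives: List[str] = []
    covered: Set[str] = set()
    # Sort by descending N50; break ties by genome ID for determinism.
    for genome in sorted(n50_by_genome, key=lambda g: (-n50_by_genome[g], g)):
        if genome in covered:
            continue  # already represented by a higher-N50 genome
        representatives.append(genome)
        covered.add(genome)
        covered |= find_similar(genome)
    return representatives


# Toy example: A/B are near-identical, and so are C/D.
toy_n50 = {"A": 500, "B": 400, "C": 300, "D": 200}
toy_similar = {"A": {"B"}, "B": {"A"}, "C": {"D"}, "D": {"C"}}
reps = greedy_dereplicate(toy_n50, lambda g: toy_similar.get(g, set()))
print(reps)
```

In this sketch only one query per selected representative is issued and no all-vs-all similarity matrix is held in memory, which is one plausible reason an approach of this shape would stay lightweight on very large genome sets.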