Sylph overestimates percentage of unknown reads? #73

@pjoubert-git

Hello,

I tried sylph profile (v0.9.0) with the -u flag, which estimates the percentage of unknown sequence, on samples generated from cultures that were expected to be pure. The database contained the previously generated genomes of the bacteria of interest.

For all samples I got adjusted ANI values of 100%, but for some of them the sequence abundance is as low as 75%. The genomes are quite complete (BUSCO completeness of 98%+). The samples do contain a fair number of unmapped reads, but the fraction that bwa mem leaves unmapped is much lower than the unknown fraction estimated by sylph (8% vs 25%). A quick BLAST suggests that these unmapped reads most likely come from parts of the genome that weren't properly assembled (possibly plasmids?) rather than from a contaminant.

Is this a known issue? I understand that sylph wasn't designed for this application, but I would be curious to hear the authors' thoughts and any advice on how to improve this behavior.

Thank you for your help.
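
For reference, the comparison above comes down to the following arithmetic. This is only a minimal sketch: the function name is hypothetical, the figures are the rounded values quoted above, and it assumes a single genome is detected per sample, so that the unknown percentage is simply 100 minus the reported sequence abundance.

```python
def unknown_percent(sequence_abundance_percent: float) -> float:
    """Percent of sequence not assigned to the detected genome.

    Assumption: one genome per sample, so everything that is not
    counted in its sequence abundance is treated as unknown.
    """
    return 100.0 - sequence_abundance_percent


# Rounded values quoted in the description above.
sylph_unknown = unknown_percent(75.0)  # lowest sequence abundance observed
bwa_unmapped = 8.0                     # percent of reads bwa mem left unmapped

# Difference between the two estimates, in percentage points.
gap = sylph_unknown - bwa_unmapped

print(f"sylph unknown: {sylph_unknown:.0f}%")
print(f"bwa mem unmapped: {bwa_unmapped:.0f}%")
print(f"gap: {gap:.0f} percentage points")
```

The gap between the two numbers is the discrepancy this issue is asking about.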
