Is it suitable for de-redundancy in transcriptome assembly? #1064

@zhangwenda0518

Hello, I am having trouble removing redundancy from my transcriptome assemblies. Could you give me some advice?

I am doing transcriptome-based virus mining. After removing host sequences, I assemble each dataset downloaded from the SRA separately. My plan is to pool these assemblies and then align the pooled contigs to identify candidate viruses. I found two MMseqs2 workflows for removing redundancy, easy-cluster and easy-linclust, and I also came across cd-hit-est for the same purpose. I am not sure whether MMseqs2 is suitable for removing redundancy when merging my transcriptome assemblies. If I want a stricter clustering threshold, which parameters should I pay attention to? I would appreciate your help.

As a first test I tried mmseqs easy-linclust, which is much faster than cd-hit-est.

mmseqs easy-linclust virus.candidate.fasta mmseqs.cluster ./mmseqs.tmp --threads 60

This produced mmseqs.cluster_all_seqs.fasta, mmseqs.cluster_cluster.tsv, and mmseqs.cluster_rep_seq.fasta. I understand that mmseqs.cluster_rep_seq.fasta is the de-redundant result, but I also want the cluster membership so that I can see how each virus sequence is distributed across samples. Which file should I look at, or which parameters should I set?
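For context, the `_cluster.tsv` output is a two-column, tab-separated table with the cluster representative in the first column and a member sequence in the second. A minimal Python sketch of the per-sample summary I have in mind is below. It assumes, hypothetically, that each contig ID was prefixed with its sample name as `<sample>|<contig>` before pooling; the helper names `sample_of` and `clusters_by_sample` and the demo IDs are made up for illustration.

```python
import csv
from collections import defaultdict
from io import StringIO


def sample_of(seq_id, sep="|"):
    # Assumed (hypothetical) naming convention: "<sample><sep><contig>",
    # e.g. "SRR001|c1". Adjust to however the contigs were renamed.
    return seq_id.split(sep, 1)[0]


def clusters_by_sample(tsv_handle, sep="|"):
    """Map each representative ID to {sample: number of member contigs}."""
    table = defaultdict(lambda: defaultdict(int))
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        rep, member = row[0], row[1]
        table[rep][sample_of(member, sep)] += 1
    # Convert nested defaultdicts to plain dicts for easier printing/comparison
    return {rep: dict(counts) for rep, counts in table.items()}


# Made-up rows in the same two-column layout (representative, member)
demo = StringIO(
    "SRR001|c1\tSRR001|c1\n"
    "SRR001|c1\tSRR002|c7\n"
    "SRR003|c4\tSRR003|c4\n"
)
result = clusters_by_sample(demo)
for rep, counts in result.items():
    print(rep, counts)
```

On real data the same function could be called on the file from the run above, for example `with open("mmseqs.cluster_cluster.tsv", newline="") as fh: clusters_by_sample(fh)`. A representative whose members come from several samples would then indicate a sequence shared across those samples.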
