
Use of bambu with large plant genomes, split chromosomes #478

@gianmore


Hi, I'm using bambu as part of the nanoseq Nextflow pipeline (https://nf-co.re/nanoseq/3.1.0/) to analyse samples from wheat. Because the chromosomes are longer than 500MB, which causes problems during alignment and samtools indexing, I have split the FASTA and GTF into chunks of 500MB. My question is about the output I get. Both count tables contain the transcript_id, but that should only be present in counts_transcript, right? In that case I would understand the difference in the counts. Can you clarify what is happening? Is it a problem with the split reference?
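For reference, the coordinate shift involved in this kind of split can be sketched as below. This is only a minimal illustration under stated assumptions, not the script that was actually used: the function name `to_chunk_coords`, the fixed chunk length, and the rule that the chunk name is the chromosome name plus a 1-based chunk index (as in `Chr1A_1`) are all hypothetical.

```python
CHUNK_SIZE = 500_000_000  # assumed chunk length in bases (hypothetical)


def to_chunk_coords(chrom, start, end, chunk_size=CHUNK_SIZE):
    """Map a 1-based, inclusive interval on a full chromosome to a chunk.

    Returns (chunk_name, new_start, new_end). Raises ValueError when the
    interval crosses a chunk boundary, because such a feature cannot be
    placed on a single chunk without being cut.
    """
    index = (start - 1) // chunk_size          # 0-based chunk index
    if (end - 1) // chunk_size != index:
        raise ValueError(f"{chrom}:{start}-{end} spans a chunk boundary")
    offset = index * chunk_size                # bases before this chunk
    return f"{chrom}_{index + 1}", start - offset, end - offset


# A feature inside the first chunk keeps its coordinates.
print(to_chunk_coords("Chr1A", 7741677, 7744803))
```

Features that span a boundary raise an error here on purpose, so they can be inspected by hand instead of being silently truncated.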

Here are the result files:

grep TrturSVE1A02G00001630 counts_gene.txt

transcript_id TrturSVE1A02G00001630.1; TrturSVE1A02G00001630          64   60   21   37   22   36   39   20   29   30     31

transcript_id TrturSVE1A02G00001630.2; TrturSVE1A02G00001630          0    0    0    0    0    0    0    0    0    0     0

grep TrturSVE1A02G00001630 counts_transcript.txt

TrturSVE1A02G00001630.1    transcript_id TrturSVE1A02G00001630.1; TrturSVE1A02G00001630        58   50     16   31   18   30   31   14   24   25   25

TrturSVE1A02G00001630.2    transcript_id TrturSVE1A02G00001630.2; TrturSVE1A02G00001630        0    0     0    0    0    0    0    0    0    0    0

My gtf is in this format:

Chr1A_1 TrturSVE_EIv2.0 transcript 7741677 7744803 75 - . transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7741677 7742800 . - . transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7742897 7742957 . - . transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7743693 7743991 . - . transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7744542 7744803 . - . transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7742528 7742800 . - 0 transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7742897 7742957 . - 1 transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7743693 7743958 . - 0 transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 transcript 7741677 7744803 68 - . transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7741677 7742800 . - . transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7742897 7742957 . - . transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7743684 7743991 . - . transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7744542 7744803 . - . transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7742528 7742800 . - 0 transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7742897 7742957 . - 1 transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7743684 7743958 . - 0 transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
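One way to check, independently of bambu, whether the attribute column of these records splits cleanly into a `gene_id` and a `transcript_id` is a small parser like the one below. It is a rough sketch and says nothing about how bambu itself reads the file: the helper names `split_record` and `parse_attributes` are hypothetical, and the fallback to whitespace splitting is an assumption made because the records are pasted here with spaces, while GTF normally uses tabs between the nine columns.

```python
import re

# Matches one `key "value";` pair in a GTF attribute column.
ATTR_RE = re.compile(r'\s*(\S+)\s+"([^"]*)"\s*;?')


def split_record(line):
    """Split a GTF line into its 9 columns (tab-separated by convention)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        # Fallback for records pasted with spaces instead of tabs.
        parts = line.strip().split(None, 8)
    return parts


def parse_attributes(field):
    """Return the attribute column as a dict of key -> unquoted value."""
    return dict(ATTR_RE.findall(field))


record = ('Chr1A_1 TrturSVE_EIv2.0 exon 7741677 7742800 . - . '
          'transcript_id "TrturSVE1A02G00001630.1"; '
          'gene_id "TrturSVE1A02G00001630";')

attrs = parse_attributes(split_record(record)[8])
print(attrs["gene_id"], attrs["transcript_id"])
```

If this returns a clean `gene_id` for every record, the attribute column itself is probably well-formed, and the odd gene names in the count table would more likely come from how the annotation is read downstream.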

Could it be that bambu is parsing the whole attribute field as the gene_id, while parsing the transcript_id correctly?
Thanks.
