Hi, I'm using bambu as part of the nanoseq Nextflow pipeline (https://nf-co.re/nanoseq/3.1.0/) to analyse wheat samples. Because the chromosomes are longer than 500 MB, which causes problems during alignment and samtools indexing, I split the FASTA and GTF into chunks of 500 MB. My question is about the output I get. Both count tables contain the transcript_id, but that should only be present in counts_transcript.txt, right? If so, I understand the difference in the counts. Can you clarify what is happening? Is it a problem with the split reference?
Here are the relevant lines from the result files:
grep TrturSVE1A02G00001630 counts_gene.txt
transcript_id TrturSVE1A02G00001630.1; TrturSVE1A02G00001630 64 60 21 37 22 36 39 20 29 30 31
transcript_id TrturSVE1A02G00001630.2; TrturSVE1A02G00001630 0 0 0 0 0 0 0 0 0 0 0
grep TrturSVE1A02G00001630 counts_transcript.txt
TrturSVE1A02G00001630.1 transcript_id TrturSVE1A02G00001630.1; TrturSVE1A02G00001630 58 50 16 31 18 30 31 14 24 25 25
TrturSVE1A02G00001630.2 transcript_id TrturSVE1A02G00001630.2; TrturSVE1A02G00001630 0 0 0 0 0 0 0 0 0 0 0
My GTF is in this format:
Chr1A_1 TrturSVE_EIv2.0 transcript 7741677 7744803 75 - . transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7741677 7742800 . - . transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7742897 7742957 . - . transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7743693 7743991 . - . transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7744542 7744803 . - . transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7742528 7742800 . - 0 transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7742897 7742957 . - 1 transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7743693 7743958 . - 0 transcript_id "TrturSVE1A02G00001630.1"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 transcript 7741677 7744803 68 - . transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7741677 7742800 . - . transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7742897 7742957 . - . transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7743684 7743991 . - . transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 exon 7744542 7744803 . - . transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7742528 7742800 . - 0 transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7742897 7742957 . - 1 transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Chr1A_1 TrturSVE_EIv2.0 CDS 7743684 7743958 . - 0 transcript_id "TrturSVE1A02G00001630.2"; gene_id "TrturSVE1A02G00001630";
Is bambu perhaps parsing the whole attribute field as the gene_id, and then the correct value as the transcript_id?
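One way to test this would be to rewrite the GTF so that gene_id comes before transcript_id in the attribute column, then rerun. Below is a minimal Python sketch of that rewrite. It assumes that attribute order is what confuses the parser (this is only a guess, not confirmed), that the GTF is tab-separated with nine columns, and that every attribute value is double-quoted. The function names `parse_attributes` and `gene_id_first` are hypothetical helpers, not part of bambu or nanoseq.

```python
import re

# Matches one GTF attribute of the form: key "value";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_attributes(field):
    """Return the GTF attributes as an ordered list of (key, value) pairs."""
    return ATTR_RE.findall(field)


def gene_id_first(line):
    """Rewrite one GTF line so that gene_id precedes the other attributes.

    Comment lines, blank lines, and lines that are not nine tab-separated
    columns are returned unchanged.
    """
    if line.startswith("#") or not line.strip():
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line
    attrs = parse_attributes(cols[8])
    if not attrs:
        return line
    # Stable sort: gene_id moves to the front, the rest keep their order.
    attrs.sort(key=lambda kv: kv[0] != "gene_id")
    cols[8] = " ".join(f'{key} "{value}";' for key, value in attrs)
    return "\t".join(cols) + "\n"


if __name__ == "__main__":
    example = (
        "Chr1A_1\tTrturSVE_EIv2.0\texon\t7741677\t7742800\t.\t-\t.\t"
        'transcript_id "TrturSVE1A02G00001630.1"; '
        'gene_id "TrturSVE1A02G00001630";\n'
    )
    print(gene_id_first(example), end="")
```

If the gene-level table looks correct after running with the reordered GTF, that would point to attribute parsing rather than to the split reference.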
Thanks.