Support GenBank GFF3s with multiple gene lines #716
Description
This example is taken from GCA_900002375 on NCBI. If you download both the GenBank and RefSeq GFF3 files, you will find that some features are represented quite differently. Currently Apollo fails when loading the GenBank version with the error "GFF3 features has multiple locations but is not a CDS", and if I manually bypass that error, it then fails with "Features with multiple locations may not have children". Both of those errors come from trying to import a feature that looks like this:
LK023116.2 EMBL gene 440063 440094 . - . ID=gene-PBANKA_0111300.1;Name=PBANKA_0111300.1;gbkey=Gene;gene_biotype=protein_coding;is_ordered=true;locus_tag=PBANKA_0111300.1
LK023116.2 EMBL gene 439840 439935 . - . ID=gene-PBANKA_0111300.1;Name=PBANKA_0111300.1;gbkey=Gene;gene_biotype=protein_coding;is_ordered=true;locus_tag=PBANKA_0111300.1
LK023116.2 EMBL gene 439582 439691 . - . ID=gene-PBANKA_0111300.1;Name=PBANKA_0111300.1;gbkey=Gene;gene_biotype=protein_coding;is_ordered=true;locus_tag=PBANKA_0111300.1
LK023116.2 EMBL gene 439389 439439 . - . ID=gene-PBANKA_0111300.1;Name=PBANKA_0111300.1;gbkey=Gene;gene_biotype=protein_coding;is_ordered=true;locus_tag=PBANKA_0111300.1
LK023116.2 EMBL gene 438265 438320 . - . ID=gene-PBANKA_0111300.1;Name=PBANKA_0111300.1;gbkey=Gene;gene_biotype=protein_coding;is_ordered=true;locus_tag=PBANKA_0111300.1
LK023116.2 EMBL CDS 440063 440094 . - 0 ID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1;Dbxref=NCBI_GP:VUC53849.1;Name=VUC53849.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=VUC53849.1
LK023116.2 EMBL CDS 439840 439935 . - 1 ID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1;Dbxref=NCBI_GP:VUC53849.1;Name=VUC53849.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=VUC53849.1
LK023116.2 EMBL CDS 439582 439691 . - 1 ID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1;Dbxref=NCBI_GP:VUC53849.1;Name=VUC53849.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=VUC53849.1
LK023116.2 EMBL CDS 439389 439439 . - 2 ID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1;Dbxref=NCBI_GP:VUC53849.1;Name=VUC53849.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=VUC53849.1
LK023116.2 EMBL CDS 438265 438320 . - 2 ID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1;Dbxref=NCBI_GP:VUC53849.1;Name=VUC53849.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=VUC53849.1
In this example, the gene has multiple locations, each with a corresponding CDS. The RefSeq version of the gene looks like this (I think the GenBank gene corresponds to just the first transcript of the RefSeq gene):
NC_036159.2 RefSeq gene 438265 440094 . - . ID=gene-PBANKA_0111300.1;Dbxref=GeneID:55147732;Name=PBANKA_0111300.1;end_range=440094,.;gbkey=Gene;gene_biotype=protein_coding;locus_tag=PBANKA_0111300.1;old_locus_tag=PBANKA_0111300.2%2CPBANKA_011130.1;partial=true;start_range=.,438265
NC_036159.2 RefSeq mRNA 438265 440094 . - . ID=rna-XM_034563897.1;Parent=gene-PBANKA_0111300.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;Name=XM_034563897.1;end_range=440094,.;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;start_range=.,438265;transcript_id=XM_034563897.1
NC_036159.2 RefSeq exon 440063 440094 . - . ID=exon-XM_034563897.1-1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;end_range=440094,.;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563897.1
NC_036159.2 RefSeq exon 439840 439935 . - . ID=exon-XM_034563897.1-2;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563897.1
NC_036159.2 RefSeq exon 439582 439691 . - . ID=exon-XM_034563897.1-3;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563897.1
NC_036159.2 RefSeq exon 439389 439439 . - . ID=exon-XM_034563897.1-4;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563897.1
NC_036159.2 RefSeq exon 438265 438320 . - . ID=exon-XM_034563897.1-5;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;start_range=.,438265;transcript_id=XM_034563897.1
NC_036159.2 RefSeq CDS 440063 440094 . - 0 ID=cds-XP_034419713.1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XP_034419713.1;Name=XP_034419713.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419713.1
NC_036159.2 RefSeq CDS 439840 439935 . - 1 ID=cds-XP_034419713.1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XP_034419713.1;Name=XP_034419713.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419713.1
NC_036159.2 RefSeq CDS 439582 439691 . - 1 ID=cds-XP_034419713.1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XP_034419713.1;Name=XP_034419713.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419713.1
NC_036159.2 RefSeq CDS 439389 439439 . - 2 ID=cds-XP_034419713.1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XP_034419713.1;Name=XP_034419713.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419713.1
NC_036159.2 RefSeq CDS 438265 438320 . - 2 ID=cds-XP_034419713.1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XP_034419713.1;Name=XP_034419713.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419713.1
NC_036159.2 RefSeq mRNA 438290 440094 . - . ID=rna-XM_034563908.1;Parent=gene-PBANKA_0111300.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;Name=XM_034563908.1;end_range=440094,.;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;start_range=.,438290;transcript_id=XM_034563908.1
NC_036159.2 RefSeq exon 440063 440094 . - . ID=exon-XM_034563908.1-1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;end_range=440094,.;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563908.1
NC_036159.2 RefSeq exon 439840 439935 . - . ID=exon-XM_034563908.1-2;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563908.1
NC_036159.2 RefSeq exon 439582 439691 . - . ID=exon-XM_034563908.1-3;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563908.1
NC_036159.2 RefSeq exon 439389 439439 . - . ID=exon-XM_034563908.1-4;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563908.1
NC_036159.2 RefSeq exon 438639 438672 . - . ID=exon-XM_034563908.1-5;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563908.1
NC_036159.2 RefSeq exon 438290 438320 . - . ID=exon-XM_034563908.1-6;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;start_range=.,438290;transcript_id=XM_034563908.1
NC_036159.2 RefSeq CDS 440063 440094 . - 0 ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1
NC_036159.2 RefSeq CDS 439840 439935 . - 1 ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1
NC_036159.2 RefSeq CDS 439582 439691 . - 1 ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1
NC_036159.2 RefSeq CDS 439389 439439 . - 2 ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1
NC_036159.2 RefSeq CDS 438639 438672 . - 2 ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1
NC_036159.2 RefSeq CDS 438290 438320 . - 1 ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1
The RefSeq GFF3 loads fine in Apollo. I think that on import we need to convert the GenBank gene into the same shape as the RefSeq gene: a single gene line spanning the full range, with the transcript structure underneath it.
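To make the proposed conversion concrete, here is a rough Python sketch of one way it could work on raw GFF3 lines. It is not Apollo's importer code. It assumes well-formed, tab-separated, nine-column lines, assumes that each GenBank gene segment can be treated as one exon, and invents the `<gene ID>-mRNA` and `<gene ID>-mRNA-exon-N` ID scheme purely for illustration. The function name `merge_multiline_genes` is also hypothetical.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> raw value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if field:
            key, _, value = field.partition("=")
            attrs[key] = value
    return attrs


def format_attributes(attrs):
    """Serialize an attribute dict back into GFF3 column 9."""
    return ";".join(f"{key}={value}" for key, value in attrs.items())


def merge_multiline_genes(lines):
    """Collapse gene rows that share an ID into one spanning gene.

    A synthesized mRNA (hypothetical ID: '<gene ID>-mRNA') is placed under the
    merged gene, with one exon per original gene segment (an assumption), and
    the gene's former children are re-parented to that mRNA.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines
            if line.strip() and not line.startswith("#")]

    # Collect the (start, end) segments of every gene ID.
    segments = defaultdict(list)
    for cols in rows:
        if cols[2] == "gene":
            gene_id = parse_attributes(cols[8]).get("ID")
            if gene_id:
                segments[gene_id].append((int(cols[3]), int(cols[4])))
    split_ids = {gid for gid, segs in segments.items() if len(segs) > 1}

    out, emitted = [], set()
    for cols in rows:
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene" and attrs.get("ID") in split_ids:
            gene_id = attrs["ID"]
            if gene_id in emitted:
                continue  # later segments are already covered by the merged gene
            emitted.add(gene_id)
            segs = segments[gene_id]
            start = str(min(s for s, _ in segs))
            end = str(max(e for _, e in segs))
            rna_id = f"{gene_id}-mRNA"  # hypothetical ID scheme
            shared = cols[5:8]  # score, strand, phase of the first gene row
            out.append(cols[:2] + ["gene", start, end] + shared + [cols[8]])
            out.append(cols[:2] + ["mRNA", start, end] + shared
                       + [f"ID={rna_id};Parent={gene_id}"])
            for n, (s, e) in enumerate(segs, 1):
                out.append(cols[:2] + ["exon", str(s), str(e)] + shared
                           + [f"ID={rna_id}-exon-{n};Parent={rna_id}"])
        else:
            parents = attrs["Parent"].split(",") if "Parent" in attrs else []
            if any(p in split_ids for p in parents):
                # Point children of the split gene at the synthesized mRNA.
                attrs["Parent"] = ",".join(
                    f"{p}-mRNA" if p in split_ids else p for p in parents)
                cols = cols[:8] + [format_attributes(attrs)]
            out.append(cols)
    return ["\t".join(cols) for cols in out]
```

Applied to the GenBank lines above, this would yield one gene line spanning 438265 to 440094, one mRNA, five exons, and the five CDS lines re-parented to the mRNA, which mirrors the first RefSeq transcript. Whether treating each gene segment as an exon is correct in general is an open question for the real implementation.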