Support GenBank GFF3s with multiple gene lines #716

@garrettjstevens

This example is taken from GCA_900002375 on NCBI. If you download both the GenBank and RefSeq GFF3 files, you find that some features are represented quite differently. Currently Apollo fails when loading the GenBank version with the error "GFF3 features has multiple locations but is not a CDS". If I manually bypass that error, it then fails with "Features with multiple locations may not have children". Both of those errors come from trying to import a feature that looks like this:

LK023116.2	EMBL	gene	440063	440094	.	-	.	ID=gene-PBANKA_0111300.1;Name=PBANKA_0111300.1;gbkey=Gene;gene_biotype=protein_coding;is_ordered=true;locus_tag=PBANKA_0111300.1
LK023116.2	EMBL	gene	439840	439935	.	-	.	ID=gene-PBANKA_0111300.1;Name=PBANKA_0111300.1;gbkey=Gene;gene_biotype=protein_coding;is_ordered=true;locus_tag=PBANKA_0111300.1
LK023116.2	EMBL	gene	439582	439691	.	-	.	ID=gene-PBANKA_0111300.1;Name=PBANKA_0111300.1;gbkey=Gene;gene_biotype=protein_coding;is_ordered=true;locus_tag=PBANKA_0111300.1
LK023116.2	EMBL	gene	439389	439439	.	-	.	ID=gene-PBANKA_0111300.1;Name=PBANKA_0111300.1;gbkey=Gene;gene_biotype=protein_coding;is_ordered=true;locus_tag=PBANKA_0111300.1
LK023116.2	EMBL	gene	438265	438320	.	-	.	ID=gene-PBANKA_0111300.1;Name=PBANKA_0111300.1;gbkey=Gene;gene_biotype=protein_coding;is_ordered=true;locus_tag=PBANKA_0111300.1
LK023116.2	EMBL	CDS	440063	440094	.	-	0	ID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1;Dbxref=NCBI_GP:VUC53849.1;Name=VUC53849.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=VUC53849.1
LK023116.2	EMBL	CDS	439840	439935	.	-	1	ID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1;Dbxref=NCBI_GP:VUC53849.1;Name=VUC53849.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=VUC53849.1
LK023116.2	EMBL	CDS	439582	439691	.	-	1	ID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1;Dbxref=NCBI_GP:VUC53849.1;Name=VUC53849.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=VUC53849.1
LK023116.2	EMBL	CDS	439389	439439	.	-	2	ID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1;Dbxref=NCBI_GP:VUC53849.1;Name=VUC53849.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=VUC53849.1
LK023116.2	EMBL	CDS	438265	438320	.	-	2	ID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1;Dbxref=NCBI_GP:VUC53849.1;Name=VUC53849.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=VUC53849.1

In this example, the gene has multiple locations, each with a corresponding CDS. The RefSeq version of the gene looks like this (I think the GenBank gene corresponds to just the first transcript of the RefSeq gene):

NC_036159.2	RefSeq	gene	438265	440094	.	-	.	ID=gene-PBANKA_0111300.1;Dbxref=GeneID:55147732;Name=PBANKA_0111300.1;end_range=440094,.;gbkey=Gene;gene_biotype=protein_coding;locus_tag=PBANKA_0111300.1;old_locus_tag=PBANKA_0111300.2%2CPBANKA_011130.1;partial=true;start_range=.,438265
NC_036159.2	RefSeq	mRNA	438265	440094	.	-	.	ID=rna-XM_034563897.1;Parent=gene-PBANKA_0111300.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;Name=XM_034563897.1;end_range=440094,.;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;start_range=.,438265;transcript_id=XM_034563897.1
NC_036159.2	RefSeq	exon	440063	440094	.	-	.	ID=exon-XM_034563897.1-1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;end_range=440094,.;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563897.1
NC_036159.2	RefSeq	exon	439840	439935	.	-	.	ID=exon-XM_034563897.1-2;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563897.1
NC_036159.2	RefSeq	exon	439582	439691	.	-	.	ID=exon-XM_034563897.1-3;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563897.1
NC_036159.2	RefSeq	exon	439389	439439	.	-	.	ID=exon-XM_034563897.1-4;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563897.1
NC_036159.2	RefSeq	exon	438265	438320	.	-	.	ID=exon-XM_034563897.1-5;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XM_034563897.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;start_range=.,438265;transcript_id=XM_034563897.1
NC_036159.2	RefSeq	CDS	440063	440094	.	-	0	ID=cds-XP_034419713.1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XP_034419713.1;Name=XP_034419713.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419713.1
NC_036159.2	RefSeq	CDS	439840	439935	.	-	1	ID=cds-XP_034419713.1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XP_034419713.1;Name=XP_034419713.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419713.1
NC_036159.2	RefSeq	CDS	439582	439691	.	-	1	ID=cds-XP_034419713.1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XP_034419713.1;Name=XP_034419713.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419713.1
NC_036159.2	RefSeq	CDS	439389	439439	.	-	2	ID=cds-XP_034419713.1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XP_034419713.1;Name=XP_034419713.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419713.1
NC_036159.2	RefSeq	CDS	438265	438320	.	-	2	ID=cds-XP_034419713.1;Parent=rna-XM_034563897.1;Dbxref=GeneID:55147732,GenBank:XP_034419713.1;Name=XP_034419713.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419713.1
NC_036159.2	RefSeq	mRNA	438290	440094	.	-	.	ID=rna-XM_034563908.1;Parent=gene-PBANKA_0111300.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;Name=XM_034563908.1;end_range=440094,.;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;start_range=.,438290;transcript_id=XM_034563908.1
NC_036159.2	RefSeq	exon	440063	440094	.	-	.	ID=exon-XM_034563908.1-1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;end_range=440094,.;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563908.1
NC_036159.2	RefSeq	exon	439840	439935	.	-	.	ID=exon-XM_034563908.1-2;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563908.1
NC_036159.2	RefSeq	exon	439582	439691	.	-	.	ID=exon-XM_034563908.1-3;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563908.1
NC_036159.2	RefSeq	exon	439389	439439	.	-	.	ID=exon-XM_034563908.1-4;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563908.1
NC_036159.2	RefSeq	exon	438639	438672	.	-	.	ID=exon-XM_034563908.1-5;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;transcript_id=XM_034563908.1
NC_036159.2	RefSeq	exon	438290	438320	.	-	.	ID=exon-XM_034563908.1-6;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XM_034563908.1;gbkey=mRNA;locus_tag=PBANKA_0111300.1;partial=true;product=conserved Plasmodium protein%2C unknown function;start_range=.,438290;transcript_id=XM_034563908.1
NC_036159.2	RefSeq	CDS	440063	440094	.	-	0	ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1
NC_036159.2	RefSeq	CDS	439840	439935	.	-	1	ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1
NC_036159.2	RefSeq	CDS	439582	439691	.	-	1	ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1
NC_036159.2	RefSeq	CDS	439389	439439	.	-	2	ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1
NC_036159.2	RefSeq	CDS	438639	438672	.	-	2	ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1
NC_036159.2	RefSeq	CDS	438290	438320	.	-	1	ID=cds-XP_034419714.1;Parent=rna-XM_034563908.1;Dbxref=GeneID:55147732,GenBank:XP_034419714.1;Name=XP_034419714.1;gbkey=CDS;locus_tag=PBANKA_0111300.1;product=conserved Plasmodium protein%2C unknown function;protein_id=XP_034419714.1

Apollo handles the RefSeq GFF3 without problems. I think we essentially need to convert the GenBank gene into a RefSeq-style gene when importing into Apollo.
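
As a rough illustration of that conversion, here is a minimal Python sketch. It is not Apollo's importer, and it makes several assumptions: each location of a multi-line gene is treated as one exon, the synthesized IDs (`rna-<gene ID>`, `exon-<gene ID>-<n>`) are hypothetical naming choices, every child has a single `Parent`, and the function name `merge_multiline_genes` is made up for this example. It collapses the repeated gene lines into one gene spanning all locations, adds one mRNA with an exon per original location, and re-parents the CDS lines to that mRNA.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    return dict(pair.split("=", 1) for pair in column9.split(";") if "=" in pair)


def format_attributes(attrs):
    """Join a dict of attributes back into a GFF3 attribute column."""
    return ";".join(f"{key}={value}" for key, value in attrs.items())


def merge_multiline_genes(lines):
    """Rewrite genes that span several lines as gene -> mRNA -> exon/CDS.

    Hypothetical helper: the synthesized mRNA and exon IDs are assumptions,
    and each gene location is assumed to correspond to one exon.
    """
    rows = [
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("#")
    ]

    # Collect every line of each gene, keyed by its ID attribute.
    gene_rows = defaultdict(list)
    for row in rows:
        gene_id = parse_attributes(row[8]).get("ID")
        if row[2] == "gene" and gene_id:
            gene_rows[gene_id].append(row)
    split_ids = {gid for gid, group in gene_rows.items() if len(group) > 1}

    out = []
    emitted = set()
    for row in rows:
        attrs = parse_attributes(row[8])
        if row[2] == "gene" and attrs.get("ID") in split_ids:
            gene_id = attrs["ID"]
            if gene_id in emitted:
                continue  # later locations were folded into the first line
            emitted.add(gene_id)
            segments = gene_rows[gene_id]
            start = str(min(int(seg[3]) for seg in segments))
            end = str(max(int(seg[4]) for seg in segments))
            rna_id = f"rna-{gene_id}"  # hypothetical ID scheme
            # One gene line covering all of the original locations.
            out.append(row[:3] + [start, end] + row[5:])
            # One synthesized transcript under that gene.
            out.append(
                row[:2] + ["mRNA", start, end] + row[5:8]
                + [f"ID={rna_id};Parent={gene_id}"]
            )
            # One exon per original gene location (an assumption).
            for n, seg in enumerate(segments, 1):
                out.append(
                    seg[:2] + ["exon"] + seg[3:8]
                    + [f"ID=exon-{gene_id}-{n};Parent={rna_id}"]
                )
        elif attrs.get("Parent") in split_ids:
            # Children of the split gene now hang off the new mRNA.
            attrs["Parent"] = f"rna-{attrs['Parent']}"
            out.append(row[:8] + [format_attributes(attrs)])
        else:
            out.append(row)
    return ["\t".join(row) for row in out]


if __name__ == "__main__":
    # The first two locations of the GenBank example, with attributes shortened.
    example = [
        "LK023116.2\tEMBL\tgene\t440063\t440094\t.\t-\t.\tID=gene-PBANKA_0111300.1",
        "LK023116.2\tEMBL\tgene\t439840\t439935\t.\t-\t.\tID=gene-PBANKA_0111300.1",
        "LK023116.2\tEMBL\tCDS\t440063\t440094\t.\t-\t0\tID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1",
        "LK023116.2\tEMBL\tCDS\t439840\t439935\t.\t-\t1\tID=cds-VUC53849.1;Parent=gene-PBANKA_0111300.1",
    ]
    for line in merge_multiline_genes(example):
        print(line)
```

The result has the same gene, mRNA, exon, CDS hierarchy as the RefSeq record for its first transcript, so neither of the two errors above should be triggered by it. Whether the synthesized mRNA and exons are acceptable, or whether the real importer should do this differently, is an open design question.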
