Skip to content

retrieve_longest_isoforms not working on *most* genbank (NOT refseq) annotations? #50

@madeleineaaseremedios

Description

@madeleineaaseremedios

Hello,

I am using the orthologr function retrieve_longest_isoforms() on both refseq annotations and submitter/non-refseq annotations available on NCBI.

I have lists of species IDs, e.g., Dmel, and their corresponding accessions, e.g., GCF_000001215.4; one list is refseqs, the other is genbank IDs (GCA_###.#).

Using the code below, I was able to retrieve all the proteomes, gffs, and longest isoforms for all my refseqs:

for(i in 1:length(accs)) {
accession <- accs[i]
species <- spp[i]

#get data from GenBank
getProteome(db="refseq", accession, path=paste("./OFdata/", species, "/", sep=""))
getGFF(db="refseq", accession, path=paste("./OFdata/", species, "/", sep=""))

#get longest isoforms
retrieve_longest_isoforms(proteome_file=paste("./OFdata/", species, "/", accession, "_protein_refseq.faa.gz", sep=""),
                          annotation_file=paste("./OFdata/", species, "/", accession, "_genomic_refseq.gff.gz", sep=""),
                          new_file=paste("./of_proteomes/", species, "_protein.faa", sep=""))
}

but this did not work for the genbank retrieval:

for(i in 1:length(accsGB)) {
accession <- accsGB[i]
species <- sppGB[i]

#get data from GenBank
getProteome(db="genbank", accession, path=paste("./OFdata/", species, "/", sep=""), reference=F)
getGFF(db="genbank", accession, path=paste("./OFdata/", species, "/", sep=""), reference=F)

#get longest isoforms
retrieve_longest_isoforms(proteome_file=paste("./OFdata/", species, "/", accession, "_protein_genbank.faa.gz", sep=""),
                          annotation_file=paste("./OFdata/", species, "/", accession,  "_genomic_genbank.gff.gz", sep=""),
                          new_file=paste("./of_proteomes/", species, "_protein.faa", sep=""))
}

I managed to download the proteomes and gffs for the species, but retrieve_longest_isoforms() failed with this error:

Extracting longest isoform from GCA_000611955.2_protein_genbank.faa.gz ...
Importing proteome file GCA_000611955.2_protein_genbank.faa.gz ...
Importing './OFdata/Smim/GCA_000611955.2_genomic_genbank.gff.gz' ...
Importing gff file GCA_000611955.2_genomic_genbank.gff.gz ...                                                                                  
Retrieving longest isoforms ...
Error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
✖ Column `gene` is not found.
Run `rlang::last_trace()` to see where the error occurred.

> rlang::last_trace(drop = FALSE)
<error/rlang_error>
Error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
✖ Column `gene` is not found.
---
Backtrace:
    ▆
 1. └─orthologr::retrieve_longest_isoforms(...)
 2.   ├─dplyr::do(dplyr::group_by(pep_file_tibble_joined, gene), extract_longest_transcript(.))
 3.   ├─dplyr::group_by(pep_file_tibble_joined, gene)
 4.   └─dplyr:::group_by.data.frame(pep_file_tibble_joined, gene)
 5.     └─dplyr::group_by_prepare(.data, ..., .add = .add, error_call = current_env())
 6.       └─rlang::abort(bullets, call = error_call)

> list.files("./OFdata/Smim/")
[1] "doc_GCA_000611955.2_db_genbank.tsv"     "doc_GCA_000611955.2_db_genbank.txt"     "GCA_000611955.2_genomic_genbank.gff.gz"
[4] "GCA_000611955.2_protein_genbank.faa.gz"

Why does this not work on non-reference assemblies? The GFF formats seem identical between refseq and genbank assemblies.


EDIT: it somehow worked on one of my non-refseq assemblies, GCA_019974015.1, when I tried it independently but none of the others in my list.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions