retrieve_longest_isoforms not working on *most* genbank (NOT refseq) annotations?

Hello,

I am using the orthologr function `retrieve_longest_isoforms()` on both refseq annotations and submitter/non-refseq annotations available on NCBI.

I have lists of species IDs, e.g., Dmel, and their corresponding accessions, e.g., GCF_000001215.4; one list is refseqs, the other is genbank IDs (GCA_###.#).


Using the code below, I was able to retrieve all the proteomes, gffs, and longest isoforms for all my refseqs:

```r
for(i in 1:length(accs)) {
accession <- accs[i]
species <- spp[i]

#get data from GenBank
getProteome(db="refseq", accession, path=paste("./OFdata/", species, "/", sep=""))
getGFF(db="refseq", accession, path=paste("./OFdata/", species, "/", sep=""))

#get longest isoforms
retrieve_longest_isoforms(proteome_file=paste("./OFdata/", species, "/", accession, "_protein_refseq.faa.gz", sep=""),
                          annotation_file=paste("./OFdata/", species, "/", accession, "_genomic_refseq.gff.gz", sep=""),
                          new_file=paste("./of_proteomes/", species, "_protein.faa", sep=""))
}
```

but this did not work for the genbank retrieval:

```r
for(i in 1:length(accsGB)) {
accession <- accsGB[i]
species <- sppGB[i]

#get data from GenBank
getProteome(db="genbank", accession, path=paste("./OFdata/", species, "/", sep=""), reference=F)
getGFF(db="genbank", accession, path=paste("./OFdata/", species, "/", sep=""), reference=F)

#get longest isoforms
retrieve_longest_isoforms(proteome_file=paste("./OFdata/", species, "/", accession, "_protein_genbank.faa.gz", sep=""),
                          annotation_file=paste("./OFdata/", species, "/", accession,  "_genomic_genbank.gff.gz", sep=""),
                          new_file=paste("./of_proteomes/", species, "_protein.faa", sep=""))
}
```

I managed to download the  proteomes and gffs for the species, but `retrieve_longest_isoforms()` failed with this error:

```
Extracting longest isoform from GCA_000611955.2_protein_genbank.faa.gz ...
Importing proteome file GCA_000611955.2_protein_genbank.faa.gz ...
Importing './OFdata/Smim/GCA_000611955.2_genomic_genbank.gff.gz' ...
Importing gff file GCA_000611955.2_genomic_genbank.gff.gz ...                                                                                  
Retrieving longest isoforms ...
Error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
✖ Column `gene` is not found.
Run `rlang::last_trace()` to see where the error occurred.

> rlang::last_trace(drop = FALSE)
<error/rlang_error>
Error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
✖ Column `gene` is not found.
---
Backtrace:
    ▆
 1. └─orthologr::retrieve_longest_isoforms(...)
 2.   ├─dplyr::do(dplyr::group_by(pep_file_tibble_joined, gene), extract_longest_transcript(.))
 3.   ├─dplyr::group_by(pep_file_tibble_joined, gene)
 4.   └─dplyr:::group_by.data.frame(pep_file_tibble_joined, gene)
 5.     └─dplyr::group_by_prepare(.data, ..., .add = .add, error_call = current_env())
 6.       └─rlang::abort(bullets, call = error_call)

> list.files("./OFdata/Smim/")
[1] "doc_GCA_000611955.2_db_genbank.tsv"     "doc_GCA_000611955.2_db_genbank.txt"     "GCA_000611955.2_genomic_genbank.gff.gz"
[4] "GCA_000611955.2_protein_genbank.faa.gz"
```

Why does this not work on non-reference assemblies? The GFF formats seem identical between refseq and genbank assemblies.

---------------------
EDIT: it somehow worked on one of my non-refseq assemblies, GCA_019974015.1, when I tried it independently but none of the others in my list.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

retrieve_longest_isoforms not working on most genbank (NOT refseq) annotations? #50

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

retrieve_longest_isoforms not working on *most* genbank (NOT refseq) annotations? #50

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

retrieve_longest_isoforms not working on most genbank (NOT refseq) annotations? #50