-
Notifications
You must be signed in to change notification settings - Fork 30
Description
Hello,
I am using the orthologr function retrieve_longest_isoforms() on both refseq annotations and submitter/non-refseq annotations available on NCBI.
I have lists of species IDs, e.g., Dmel, and their corresponding accessions, e.g., GCF_000001215.4; one list is refseqs, the other is genbank IDs (GCA_###.#).
Using the code below, I was able to retrieve all the proteomes, gffs, and longest isoforms for all my refseqs:
for(i in 1:length(accs)) {
accession <- accs[i]
species <- spp[i]
#get data from GenBank
getProteome(db="refseq", accession, path=paste("./OFdata/", species, "/", sep=""))
getGFF(db="refseq", accession, path=paste("./OFdata/", species, "/", sep=""))
#get longest isoforms
retrieve_longest_isoforms(proteome_file=paste("./OFdata/", species, "/", accession, "_protein_refseq.faa.gz", sep=""),
annotation_file=paste("./OFdata/", species, "/", accession, "_genomic_refseq.gff.gz", sep=""),
new_file=paste("./of_proteomes/", species, "_protein.faa", sep=""))
}but this did not work for the genbank retrieval:
for(i in 1:length(accsGB)) {
accession <- accsGB[i]
species <- sppGB[i]
#get data from GenBank
getProteome(db="genbank", accession, path=paste("./OFdata/", species, "/", sep=""), reference=F)
getGFF(db="genbank", accession, path=paste("./OFdata/", species, "/", sep=""), reference=F)
#get longest isoforms
retrieve_longest_isoforms(proteome_file=paste("./OFdata/", species, "/", accession, "_protein_genbank.faa.gz", sep=""),
annotation_file=paste("./OFdata/", species, "/", accession, "_genomic_genbank.gff.gz", sep=""),
new_file=paste("./of_proteomes/", species, "_protein.faa", sep=""))
}I managed to download the proteomes and gffs for the species, but retrieve_longest_isoforms() failed with this error:
Extracting longest isoform from GCA_000611955.2_protein_genbank.faa.gz ...
Importing proteome file GCA_000611955.2_protein_genbank.faa.gz ...
Importing './OFdata/Smim/GCA_000611955.2_genomic_genbank.gff.gz' ...
Importing gff file GCA_000611955.2_genomic_genbank.gff.gz ...
Retrieving longest isoforms ...
Error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
✖ Column `gene` is not found.
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace(drop = FALSE)
<error/rlang_error>
Error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
✖ Column `gene` is not found.
---
Backtrace:
▆
1. └─orthologr::retrieve_longest_isoforms(...)
2. ├─dplyr::do(dplyr::group_by(pep_file_tibble_joined, gene), extract_longest_transcript(.))
3. ├─dplyr::group_by(pep_file_tibble_joined, gene)
4. └─dplyr:::group_by.data.frame(pep_file_tibble_joined, gene)
5. └─dplyr::group_by_prepare(.data, ..., .add = .add, error_call = current_env())
6. └─rlang::abort(bullets, call = error_call)
> list.files("./OFdata/Smim/")
[1] "doc_GCA_000611955.2_db_genbank.tsv" "doc_GCA_000611955.2_db_genbank.txt" "GCA_000611955.2_genomic_genbank.gff.gz"
[4] "GCA_000611955.2_protein_genbank.faa.gz"
Why does this not work on non-reference assemblies? The GFF formats seem identical between refseq and genbank assemblies.
EDIT: it somehow worked on one of my non-refseq assemblies, GCA_019974015.1, when I tried it independently but none of the others in my list.