Each of the character vectors in this object contains unique IDs for records in the named databases. These functions try to make the most useful bits of the returned files available to users, but they also return the original file in case you want to dive into the XML yourself.
In this case we'll get the protein sequences as fasta files, using `entrez_fetch`:
-# Entrez search result with 235255 hits (object contains 20 IDs and no web_history object)
+# Entrez search result with 234154 hits (object contains 20 IDs and no web_history object)
 # Search term (as translated): (Y[CHR] AND "Homo"[Organism]) NOT 10001[CHRPOS] : ...
When I wrote this, that was a little over 200 000 SNPs. It's probably not a good idea to set `retmax` to 200 000 and just download all of those identifiers. Instead, we could store this list of IDs on the NCBI's server and refer to them in later calls to functions like `entrez_link` and `entrez_fetch` that accept a web history object.
-# Entrez search result with 235255 hits (object contains 20 IDs and a web_history object)
+# Entrez search result with 234154 hits (object contains 20 IDs and a web_history object)
 # Search term (as translated): (Y[CHR] AND "Homo"[Organism]) NOT 10001[CHRPOS] : ...
As you can see, the result of the search now includes a `web_history` object. We can use that object to refer to these IDs in later calls. Here we will just fetch the complete records of the first 5 SNPs.
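For orientation, the web-history workflow described here might look roughly like the following sketch. It assumes the `rentrez` package is installed and NCBI is reachable. The search term is copied from the translated term in the output, and the `rettype` value is an assumption, so check which formats the `snp` database currently returns before relying on it.

```r
library(rentrez)

# Keep the full result set on NCBI's servers rather than downloading every ID.
# The term is taken from the "as translated" line in the search output.
snp_search <- entrez_search(
  db = "snp",
  term = "(Y[CHR] AND \"Homo\"[Organism]) NOT 10001[CHRPOS]",
  use_history = TRUE
)

# Pass the stored history instead of a vector of IDs, and cap the
# download at the first 5 records with retmax.
# rettype = "xml" is an assumption about what the snp database serves.
recs <- entrez_fetch(
  db = "snp",
  web_history = snp_search$web_history,
  retmax = 5,
  rettype = "xml"
)
```

Because only the `web_history` object travels with the request, the second call does not depend on how many IDs the first search matched.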
@@ -202,21 +214,20 @@ entrez_dbs()
 # [1] "pubmed" "protein" "nuccore"
 # [4] "nucleotide" "nucgss" "nucest"
-# [7] "structure" "genome" "gpipe"
-# [10] "annotinfo" "assembly" "bioproject"
-# [13] "biosample" "blastdbinfo" "books"
-# [16] "cdd" "clinvar" "clone"
-# [19] "gap" "gapplus" "grasp"
-# [22] "dbvar" "epigenomics" "gene"
-# [25] "gds" "geoprofiles" "homologene"
-# [28] "medgen" "mesh" "ncbisearch"
-# [31] "nlmcatalog" "omim" "orgtrack"
-# [34] "pmc" "popset" "probe"
-# [37] "proteinclusters" "pcassay" "biosystems"
-# [40] "pccompound" "pcsubstance" "pubmedhealth"
-# [43] "seqannot" "snp" "sra"
-# [46] "taxonomy" "unigene" "gencoll"
-# [49] "gtr"
+# [7] "structure" "genome" "annotinfo"
+# [10] "assembly" "bioproject" "biosample"
+# [13] "blastdbinfo" "books" "cdd"
+# [16] "clinvar" "clone" "gap"
+# [19] "gapplus" "grasp" "dbvar"
+# [22] "epigenomics" "gene" "gds"
+# [25] "geoprofiles" "homologene" "medgen"
+# [28] "mesh" "ncbisearch" "nlmcatalog"
+# [31] "omim" "orgtrack" "pmc"
+# [34] "popset" "probe" "proteinclusters"
+# [37] "pcassay" "biosystems" "pccompound"
+# [40] "pcsubstance" "pubmedhealth" "seqannot"
+# [43] "snp" "sra" "taxonomy"
+# [46] "unigene" "gencoll" "gtr"
Some of the names are a little opaque, so you can get more descriptive information about each with `entrez_db_summary()`:
@@ -229,7 +240,7 @@ entrez_db_summary("cdd")
 # Description: Conserved Domain Database
 # DbBuild: Build150814-1106.1
 # Count: 50648
-# LastUpdate: 2015/08/14 18:35
+# LastUpdate: 2015/08/14 18:42
`entrez_db_searchable()` lets you discover the fields available for search terms for a given database. You get back a named list whose names are the search fields. Each element has additional information about that field (you can also use `as.data.frame` to create a data frame with one search field per row):
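A minimal sketch of that call is shown here, assuming the `rentrez` package is installed and NCBI is reachable. The database name `"pubmed"` is an assumption, since the original call is not visible in this hunk. `GRNT` is the field whose details appear in the output further down.

```r
library(rentrez)

# List the searchable fields for one database.
# "pubmed" is an assumed choice for illustration.
search_fields <- entrez_db_searchable("pubmed")

# Each element is addressed by its field name.
search_fields$GRNT

# Flatten the list into a data frame with one search field per row.
fields_df <- as.data.frame(search_fields)
head(fields_df)
```

The data frame form is convenient when you want to filter or sort the fields, while the list form is handier for looking up a single field by name.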
@@ -241,7 +252,7 @@ search_fields$GRNT
 # Name: GRNT
 # FullName: Grant Number
 # Description: NIH Grant Numbers
-# TermCount: 2230658
+# TermCount: 2272841
 # IsDate: N
 # IsNumerical: N
 # SingleToken: Y
@@ -259,8 +270,8 @@ entrez_db_links("omim")
 # [6] gene genetests geoprofiles gtr homologene
 # [11] mapview medgen medgen nuccore nucest
 # [16] nucgss omim pcassay pccompound pcsubstance
-# [21] pmc protein pubmed pubmed snp
-# [26] snp snp sra structure unigene
+# [21] pmc protein pubmed pubmed sra
+# [26] structure unigene
### Trendy topics in genetics
@@ -276,7 +287,7 @@ Let's start by making a function that finds the number of records matching a giv
With that we can fetch the data for each term and, by searching with no term, find the total number of papers published in each year: