Commit 181213e

add 2015 data to README buzzword eg
1 parent e551a2f commit 181213e

2 files changed: 41 additions and 30 deletions

README.Rmd

Lines changed: 1 addition & 1 deletion
@@ -283,7 +283,7 @@ find the total number of papers published in each year:
 
 
 ```r
-years <- 1990:2014
+years <- 1990:2015
 total_papers <- papers_by_year(years, "")
 omics <- c("genomic", "epigenomic", "metagenomic", "proteomic", "transcriptomic", "pharmacogenomic", "connectomic" )
 trend_data <- sapply(omics, function(t) papers_by_year(years, t))

README.md

Lines changed: 40 additions & 29 deletions
@@ -56,20 +56,25 @@ hox_data$links
 ```
 
 # elink result with information from 14 databases:
-# [1] pubmed_medgen pubmed_mesh_major
-# [3] pubmed_nuccore pubmed_nucleotide
-# [5] pubmed_pmc_refs pubmed_protein
-# [7] pubmed_pubmed pubmed_pubmed_alsoviewed
-# [9] pubmed_pubmed_citedin pubmed_pubmed_combined
-# [11] pubmed_pubmed_five pubmed_pubmed_reviews
-# [13] pubmed_pubmed_reviews_five pubmed_taxonomy_entrez
+# [1] pubmed_medgen pubmed_pmc_refs
+# [3] pubmed_pubmed pubmed_pubmed_alsoviewed
+# [5] pubmed_pubmed_citedin pubmed_pubmed_combined
+# [7] pubmed_pubmed_five pubmed_pubmed_reviews
+# [9] pubmed_pubmed_reviews_five pubmed_mesh_major
+# [11] pubmed_nuccore pubmed_nucleotide
+# [13] pubmed_protein pubmed_taxonomy_entrez
 
 Each of the character vectors in this object contain unique IDs for records in the named databases. These functions try to make the most useful bits of the returned files available to users, but they also return the original file in case you want to dive into the XML yourself.
 
 In this case we'll get the protein sequences as fasta files, using ' `entrez_fetch`:
 
 ``` r
 hox_proteins <- entrez_fetch(db="protein", id=hox_data$links$pubmed_protein, rettype="fasta")
+```
+
+# No encoding supplied: defaulting to UTF-8.
+
+``` r
 cat(substr(hox_proteins, 1, 237))
 ```
 
@@ -127,9 +132,16 @@ Let's just get the two mitochondrial loci (COI and trnL), using `entrez_fetch`:
 COI_ids <- katipo_search$ids[c(2,6)]
 trnL_ids <- katipo_search$ids[5]
 COI <- entrez_fetch(db="popset", id=COI_ids, rettype="fasta")
+```
+
+# No encoding supplied: defaulting to UTF-8.
+
+``` r
 trnL <- entrez_fetch(db="popset", id=trnL_ids, rettype="fasta")
 ```
 
+# No encoding supplied: defaulting to UTF-8.
+
 The "fetched" results are fasta formatted characters, which can be written to disk easily:
 
 ``` r
@@ -164,7 +176,7 @@ snp_search <- entrez_search(db="snp",
 snp_search
 ```
 
-# Entrez search result with 235255 hits (object contains 20 IDs and no web_history object)
+# Entrez search result with 234154 hits (object contains 20 IDs and no web_history object)
 # Search term (as translated): (Y[CHR] AND "Homo"[Organism]) NOT 10001[CHRPOS] : ...
 
 When I wrote this that was a little over 200 000 SNPs. It's probably not a good idea to set `retmax` to 200 000 and just download all of those identifiers. Instead, we could store this list of IDs on the NCBI's server and refer to them in later calles to functions like `entrez_link` and `entrez_fetch` that accept a web history object.
@@ -176,7 +188,7 @@ snp_search <- entrez_search(db="snp",
 snp_search
 ```
 
-# Entrez search result with 235255 hits (object contains 20 IDs and a web_history object)
+# Entrez search result with 234154 hits (object contains 20 IDs and a web_history object)
 # Search term (as translated): (Y[CHR] AND "Homo"[Organism]) NOT 10001[CHRPOS] : ...
 
 As you can see, the result of the search now includes a `web_history` object. We can use that object to refer to these IDs in later calls. Heree we will just fetch complete records of the first 5 SNPs.
@@ -202,21 +214,20 @@ entrez_dbs()
 
 # [1] "pubmed" "protein" "nuccore"
 # [4] "nucleotide" "nucgss" "nucest"
-# [7] "structure" "genome" "gpipe"
-# [10] "annotinfo" "assembly" "bioproject"
-# [13] "biosample" "blastdbinfo" "books"
-# [16] "cdd" "clinvar" "clone"
-# [19] "gap" "gapplus" "grasp"
-# [22] "dbvar" "epigenomics" "gene"
-# [25] "gds" "geoprofiles" "homologene"
-# [28] "medgen" "mesh" "ncbisearch"
-# [31] "nlmcatalog" "omim" "orgtrack"
-# [34] "pmc" "popset" "probe"
-# [37] "proteinclusters" "pcassay" "biosystems"
-# [40] "pccompound" "pcsubstance" "pubmedhealth"
-# [43] "seqannot" "snp" "sra"
-# [46] "taxonomy" "unigene" "gencoll"
-# [49] "gtr"
+# [7] "structure" "genome" "annotinfo"
+# [10] "assembly" "bioproject" "biosample"
+# [13] "blastdbinfo" "books" "cdd"
+# [16] "clinvar" "clone" "gap"
+# [19] "gapplus" "grasp" "dbvar"
+# [22] "epigenomics" "gene" "gds"
+# [25] "geoprofiles" "homologene" "medgen"
+# [28] "mesh" "ncbisearch" "nlmcatalog"
+# [31] "omim" "orgtrack" "pmc"
+# [34] "popset" "probe" "proteinclusters"
+# [37] "pcassay" "biosystems" "pccompound"
+# [40] "pcsubstance" "pubmedhealth" "seqannot"
+# [43] "snp" "sra" "taxonomy"
+# [46] "unigene" "gencoll" "gtr"
 
 Some of the names are a little opaque, so you can get some more descriptive information about each with `entrez_db_summary()`
 
@@ -229,7 +240,7 @@ entrez_db_summary("cdd")
 # Description: Conserved Domain Database
 # DbBuild: Build150814-1106.1
 # Count: 50648
-# LastUpdate: 2015/08/14 18:35
+# LastUpdate: 2015/08/14 18:42
 
 `entrez_db_searchable()` lets you discover the fields available for search terms for a given database. You get back a named-list, with names are fields. Each element has additional information about each named search field (you can also use `as.data.frame` to create a dataframe, with one search-field per row):
 
@@ -241,7 +252,7 @@ search_fields$GRNT
 # Name: GRNT
 # FullName: Grant Number
 # Description: NIH Grant Numbers
-# TermCount: 2230658
+# TermCount: 2272841
 # IsDate: N
 # IsNumerical: N
 # SingleToken: Y
@@ -259,8 +270,8 @@ entrez_db_links("omim")
 # [6] gene genetests geoprofiles gtr homologene
 # [11] mapview medgen medgen nuccore nucest
 # [16] nucgss omim pcassay pccompound pcsubstance
-# [21] pmc protein pubmed pubmed snp
-# [26] snp snp sra structure unigene
+# [21] pmc protein pubmed pubmed sra
+# [26] structure unigene
 
 ### Trendy topics in genetics
 
@@ -276,7 +287,7 @@ Let's start by making a function that finds the number of records matching a giv
 With that we can fetch the data for each term and, by searching with no term, find the total number of papers published in each year:
 
 ``` r
-years <- 1990:2014
+years <- 1990:2015
 total_papers <- papers_by_year(years, "")
 omics <- c("genomic", "epigenomic", "metagenomic", "proteomic", "transcriptomic", "pharmacogenomic", "connectomic" )
 trend_data <- sapply(omics, function(t) papers_by_year(years, t))
