Commit 5a43a39

update the documents

Author: chaodengusc
1 parent 686d74b, commit 5a43a39

30 files changed, +134 -130 lines changed

DESCRIPTION

Lines changed: 3 additions & 3 deletions
@@ -1,11 +1,11 @@
 Package: preseqR
 Type: Package
-Title: Predicting the Number of Species in a Random Sample
+Title: Predicting Species Accumulation Curves
 Version: 4.0.0
-Date: 2017-12-26
+Date: 2018-06-27
 Author: Chao Deng, Timothy Daley and Andrew D. Smith
 Maintainer: Chao Deng <[email protected]>
-Description: The relation between the number of species and the number of individuals in a random sample is a classic problem back to Fisher (1943) <doi:10.2307/1411>. We generalize this problem to predict the number of species represented at least r times in a random sample. In particular when r=1, it becomes the classic problem. We use a mixture of Poisson processes to model sampling procedures and apply an empirical Bayes approach to obtain a rational function estimator. The approach can be applied to assess the quality of DNA sequencing libraries and optimize depths of sequencing experiments. For more information on 'preseqR', see Deng C, Daley T and Smith AD (2015) <doi:10.1007/s40484-015-0049-7> and Deng C and Smith AD (2016) <arXiv:1607.02804v2>.
+Description: Originally as an R version of Preseq <doi:10.1038/nmeth.2375>, the package has extended its functionality to predict the r-species accumulation curve (r-SAC), which is the number of species represented at least r times as a function of the sampling effort. When r = 1, the curve is known as the species accumulation curve, or the library complexity curve in high-throughput genomic sequencing. The package includes both parametric and nonparametric methods, as described by Deng C, et al. (2018) <arXiv:1607.02804v3>.
 License: GPL-3
 Imports:
     polynom, graphics, stats
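
To make the r-SAC definition in the new Description concrete, here is a minimal sketch in base R. It assumes the two-column histogram format described in the man pages of this commit (frequency j, then the number of species seen exactly j times). The helper name `observed.at.least` and the toy counts are hypothetical and not part of preseqR; the helper only evaluates the curve at the initial sample and makes no prediction.

```r
## hypothetical helper, not a preseqR function:
## number of species represented at least r times in the initial sample
observed.at.least <- function(n, r = 1) {
  ## keep rows whose frequency j is >= r, then add up their species counts
  sum(n[n[, 1] >= r, 2])
}

## toy histogram: 50 species seen once, 20 twice, 10 three times, 5 four times
n <- cbind(1:4, c(50, 20, 10, 5))

at.least.1 <- observed.at.least(n, r = 1)  # all observed species
at.least.2 <- observed.at.least(n, r = 2)  # species seen two or more times
```

With r = 1 the result is the classic count of distinct species; larger r gives the starting point of the corresponding r-SAC.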

NAMESPACE

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ export(preseqR.interpolate.rSAC)
 export(preseqR.rSAC)
 export(preseqR.rSAC.bootstrap)
 export(ds.rSAC)
+export(ds.rSAC.bootstrap)
 export(ztnb.rSAC)
 export(ztp.rSAC)
 export(bbc.rSAC)

R/kmer.R

Lines changed: 10 additions & 10 deletions
@@ -26,18 +26,18 @@ kmer.frac <- function(n, r=2, mt=20) {
 
 ## the fraction of k-mers represented at least r times as a function of
 ## sample sizes
-kmer.frac.curve <- function(n, k, read.len, seq.gb, r=2, mt=20) {
+kmer.frac.curve <- function(n, k, read.len, seq, r=2, mt=20) {
   f <- kmer.frac(n, r=r, mt=mt)
   if (is.null(f))
     return(NULL)
   n[, 2] <- as.numeric(n[, 2])
   N <- n[, 1] %*% n[, 2]
   ## average number of k-mers per read
   m <- read.len - k + 1
-  unit.gb <- N / m * read.len / 1e9
-  seq.effort <- seq.gb / unit.gb
-  result <- matrix(c(seq.gb, f(seq.effort)), ncol=2, byrow=FALSE)
-  colnames(result) <- c("bases(GB)", paste("frac(X>=", r, ")", sep=""))
+  unit <- N / m * read.len
+  seq.effort <- seq / unit
+  result <- matrix(c(seq, f(seq.effort)), ncol=2, byrow=FALSE)
+  colnames(result) <- c("bases", paste("frac(X>=", r, ")", sep=""))
   return(result)
 }
 
@@ -50,7 +50,7 @@ kmer.frac.bootstrap <- function(n, r=2, mt=20, times=30, conf=0.95) {
 
 ## the fraction of k-mers represented at least r times as a function of
 ## sample sizes
-kmer.frac.curve.bootstrap <- function(n, k, read.len, seq.gb, r=2, mt=20,
+kmer.frac.curve.bootstrap <- function(n, k, read.len, seq, r=2, mt=20,
                                       times=30, conf=0.95)
 {
   f <- kmer.frac.bootstrap(n, r=r, mt=mt, times=times, conf=conf)
@@ -60,11 +60,11 @@ kmer.frac.curve.bootstrap <- function(n, k, read.len, seq.gb, r=2, mt=20,
   N <- n[, 1] %*% n[, 2]
   ## average number of k-mers per read
   m <- read.len - k + 1
-  unit.gb <- N / m * read.len / 1e9
-  seq.effort <- seq.gb / unit.gb
-  result <- matrix(c(seq.gb, f$f(seq.effort), f$lb(seq.effort),
+  unit <- N / m * read.len
+  seq.effort <- seq / unit
+  result <- matrix(c(seq, f$f(seq.effort), f$lb(seq.effort),
                      f$ub(seq.effort)), ncol=4, byrow=FALSE)
-  colnames(result) <- c("bases(GB)", paste("frac(X>=", r, ")", sep=""),
+  colnames(result) <- c("bases", paste("frac(X>=", r, ")", sep=""),
                         "lb", "ub")
   return(result)
 }
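
The unit change in R/kmer.R (gigabases to bases) can be illustrated with a small base-R sketch of the conversion from sequenced bases to relative sampling effort. The helper name `relative.effort` and the toy histogram are hypothetical and not part of preseqR. The sketch uses `sum()` instead of `%*%` so that the total is a plain number, which is a simplification rather than the package's code.

```r
## hypothetical helper, not part of preseqR: convert a number of sequenced
## bases into sampling effort relative to the initial experiment
relative.effort <- function(n, k, read.len, seq) {
  ## total number of k-mer occurrences in the initial experiment
  N <- sum(as.numeric(n[, 1]) * as.numeric(n[, 2]))
  ## number of k-mers contributed by one read of length read.len
  m <- read.len - k + 1
  ## bases in the initial experiment: (N / m) reads of read.len bases each
  unit <- N / m * read.len
  seq / unit
}

## toy k-mer histogram: 100 k-mers seen once, 50 twice, 10 three times
n <- cbind(1:3, c(100, 50, 10))

## bases in the toy initial experiment (230 k-mer occurrences, 70 k-mers per read)
unit <- 230 / (100 - 31 + 1) * 100

## sequencing the same amount again, or twice as much, in bases
effort <- relative.effort(n, k = 31, read.len = 100, seq = c(unit, 2 * unit))
```

Reading the diff, a caller who previously passed `seq.gb` in gigabases would now pass the same quantity multiplied by 1e9 as `seq`.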

R/sequencing.R

Lines changed: 4 additions & 5 deletions
@@ -20,8 +20,7 @@
 
 ## predict the optimal number of sequenced bases using cost-benefit ratio
 preseqR.optimal.sequencing <- function(
-  n, efficiency=0.05, bin=1e8, r=1, mt=20, size=SIZE.INIT,
-  mu=MU.INIT, times=30, conf=0.95)
+  n, efficiency=0.05, bin=1e8, r=1, mt=20, times=30, conf=0.95)
 {
   find.start <- function(f, N, bin, efficiency) {
     y = sapply(1:100, function(x) (f(x + bin / N) - f(x)) / bin - efficiency)
@@ -36,8 +35,8 @@ preseqR.optimal.sequencing <- function(
   N <- n[, 1] %*% n[, 2]
 
   ## r-species accumulation curve as a function of relative sample size
-  f.rSAC <- preseqR.rSAC.bootstrap(
-    n=n, r=r, mt=mt, size=size, mu=mu,times=times, conf=conf)
+  f.rSAC <- ds.rSAC.bootstrap(
+    n=n, r=r, mt=mt, times=times, conf=conf)
 
   ## hint: using r-SAC as a function of the number of sequenced bases
   f <- f.rSAC$f
@@ -73,7 +72,7 @@ preseqR.optimal.sequencing <- function(
 ## the function is designed for EXOME sequencing, where aligned reads that
 ## map to the same location are removed to avoid potential duplicate
 preseqR.rSAC.sequencing.rmdup <- function(
-  n_base, n_read, r=1, mt=20, times=100, conf=0.95)
+  n_base, n_read, r=1, mt=20, times=30, conf=0.95)
 {
   checking.hist(n_read)
   checking.hist(n_base)
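
The cost-benefit expression inside `find.start`, `(f(x + bin / N) - f(x)) / bin - efficiency`, can be explored with a toy curve. The sketch below is an illustration under stated assumptions: the saturating curve `f`, the value of `N`, and the use of `uniroot` are all hypothetical stand-ins, since the diff does not show the full search procedure of `preseqR.optimal.sequencing`.

```r
## toy accumulation curve (an assumption, not a preseqR estimator):
## expected number of distinct species at t times the initial effort
f <- function(t) 1e9 * (1 - exp(-0.5 * t))

N <- 1e9            # bases in the initial experiment (hypothetical)
bin <- 1e8          # size of one additional sequencing step, in bases
efficiency <- 0.05  # minimum acceptable gain per additional base

## gain per base from sequencing one more bin at relative effort t,
## the same expression that appears in find.start
marginal <- function(t) (f(t + bin / N) - f(t)) / bin

## relative effort at which the marginal gain drops to the threshold
t.opt <- uniroot(function(t) marginal(t) - efficiency,
                 interval = c(0, 100))$root

## the same point expressed as a number of sequenced bases
optimal.bases <- t.opt * N
```

Beyond `t.opt`, each further bin of sequencing yields fewer new species per base than `efficiency`, which is the sense in which the stopping point is "optimal" in this sketch.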

inst/CITATION

Lines changed: 5 additions & 5 deletions
@@ -19,14 +19,14 @@ citEntry(entry = "article",
 
 citEntry(entry = "article",
   title = "Estimating the number of species to attain sufficient representation in a random sample",
-  author = personList(as.person("Chao Deng"), as.person("Andrew D. Smith")),
+  author = personList(as.person("Chao Deng"), as.person("Timothy Daley"), as.person("Peter Calabrese"), as.person("Jie Ren"), as.person("Andrew D. Smith")),
   journal = "arXiv",
-  year = "2016",
-  url = "https://arxiv.org/abs/1607.02804v2",
+  year = "2018",
+  url = "https://arxiv.org/abs/1607.02804v3",
 
   textVersion =
-  paste("Deng C and Smith AD (2016).",
+  paste("Deng C, Daley T, Calabrese P, Ren J and Smith AD (2018).",
         "Estimating the number of species to attain sufficient representation in a random sample.",
         "arXiv preprint.",
-        "URL https://arxiv.org/abs/1607.02804v2.")
+        "URL https://arxiv.org/abs/1607.02804v3.")
 )

man/Dickens.Rd

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
 \details{
 A two-column matrix.
 The first column is the frequency \eqn{j = 1,2,\dots}; and the second column
-is \eqn{N_j}, the number of unique words appeared \eqn{j}
+is \eqn{N_j}, the number of unique words appeared exactly \eqn{j}
 times in a collection of Charles Dickens.
 }
 

man/Twitter.Rd

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 \details{
 A two-column matrix.
 The first column is the frequency \eqn{j = 1,2,\dots}; and the second column
-is \eqn{n_j}, the number of users with \eqn{j} followers.
+is \eqn{n_j}, the number of users with exactly \eqn{j} followers.
 }
 
 \references{

man/WillButterfly.Rd

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ Animal Population, Journal of Animal Ecology, 12, 42-58, Table 3.
 \details{
 A two-column matrix.
 The first column is the frequency \eqn{j = 1,2,\dots}; and the second column
-is \eqn{n_j}, the number of butterflies captured \eqn{j}
+is \eqn{n_j}, the number of butterflies captured exactly \eqn{j}
 times in the sample.
 }
 

man/bbc.rSAC.Rd

Lines changed: 5 additions & 1 deletion
@@ -18,7 +18,7 @@ bbc.rSAC(n, r=1)
 \item{n}{
 A two-column matrix.
 The first column is the frequency \eqn{j = 1,2,\dots}; and the second column
-is \eqn{N_j}, the number of species with each species represented \eqn{j}
+is \eqn{N_j}, the number of species with each species represented exactly \eqn{j}
 times in the initial sample. The first column must be sorted in an
 ascending order.
 }
@@ -41,6 +41,10 @@ bbc.rSAC(n, r=1)
 Boneh, S., Boneh, A., & Caron, R. J. (1998). Estimating the prediction function
 and the number of unseen species in sampling with replacement.
 Journal of the American Statistical Association, 93(441), 372-379.
+
+Deng, C., Daley, T., Calabrese, P., Ren, J., & Smith, A.D. (2016). Estimating
+the number of species to attain sufficient representation in a random sample.
+arXiv preprint arXiv:1607.02804v3.
 }
 
 \examples{
