Commit 3b17571

release v1.3
1 parent fcb6ce4, commit 3b17571

21 files changed: +222 -85 lines changed

DESCRIPTION

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 Package: msImpute
 Type: Package
 Title: Peptide imputation in label-free proteomics
-Version: 1.2.0
+Version: 1.3.0
 Authors@R:
     person(given = "Soroor",
            family = "Hediyeh-zadeh",

NAMESPACE

Lines changed: 3 additions & 0 deletions
@@ -5,8 +5,11 @@ export(KNC)
 export(KNN)
 export(betweenness)
 export(computeStructuralMetrics)
+export(findVariableFeatures)
 export(gromov_wasserstein)
 export(msImpute)
 export(scaleData)
 export(selectFeatures)
 export(withinness)
+importFrom(scran,decomposeVar)
+importFrom(scran,trendVar)

R/CPD.R

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@
 #' CPD quantifies preservation of the global structure after imputation.
 #' Requires complete datasets - for developers/use in benchmark studies only.
 #'
-#' @param xorigin numeric matrix. The original data. Can not contain missing values.
-#' @param ximputed numeric matrix. The imputed data. Can not contain missing values.
+#' @param xorigin numeric matrix. The original log-intensity data. Can not contain missing values.
+#' @param ximputed numeric matrix. The imputed log-intensity data. Can not contain missing values.
 #'
 #' @return numeric
 #'

R/KNC.R

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@
 #' quantifies preservation of the mesoscopic structure after imputation.
 #' Requires complete datasets - for developers/use in benchmark studies only.
 #'
-#' @param xorigin numeric matrix. The original data. Can contain missing values.
-#' @param ximputed numeric matrix. The imputed data.
+#' @param xorigin numeric matrix. The original log-intensity data. Can contain missing values.
+#' @param ximputed numeric matrix. The imputed log-intensity data.
 #' @param class factor. A vector of length number of columns (samples) in the data specifying the class/label (i.e. experimental group) of each sample.
 #' @param k number of nearest class means. default to k=3.
 #'

R/KNN.R

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@
 #' KNN quantifies preservation of the local, or microscopic structure.
 #' Requires complete datasets - for developers/use in benchmark studies only.
 #'
-#' @param xorigin numeric matrix. The original data. Can not contain missing values.
-#' @param ximputed numeric matrix. The imputed data. Can not contain missing values.
+#' @param xorigin numeric matrix. The original log-intensity data. Can not contain missing values.
+#' @param ximputed numeric matrix. The imputed log-intensity data. Can not contain missing values.
 #' @param k number of nearest neighbours. default to k=3.
 #'
 #' @return numeric The proportion of preserved k-nearest neighbours in imputed data.

R/computeStructuralMetrics.R

Lines changed: 61 additions & 22 deletions
@@ -1,12 +1,13 @@
 #' Metrics for the assessment of post-imputation structural preservation
 #'
 #' For an imputed dataset, it computes within phenotype/experimental condition similarity (i.e. preservation of local structures),
-#' between phenotype distances (preservation of global structures), and the Gromov-Wasserstein (GW) distance between original and
+#' between phenotype distances (preservation of global structures), and the Gromov-Wasserstein (GW) distance between original (source) and
 #' imputed data.
 #'
-#' @param x numeric matrix. An imputed data matrix.
+#' @param x numeric matrix. An imputed data matrix of log-intensity.
 #' @param group factor. A vector of biological groups, experimental conditions or phenotypes (e.g. control, treatment).
-#' @param xna numeric matrix. Data matrix with missing values (i.e. the original intensity matrix with NAs)
+#' @param y numeric matrix. The source data (i.e. the original log-intensity matrix), preferably subsetted on highly variable peptides (see \code{findVariableFeatures}).
+#' @param k numeric. Number of Principal Components used to compute the GW distance. default to 2.
 #'
 #' @details For each group of experimental conditions (e.g. treatment and control), the group centroid is calculated as the average
 #' of observed peptide intensities. Withinness for each group is computed as sum of the squared distances between samples in that group and
@@ -16,15 +17,21 @@
 #' The GW metric considers preservation of both local and global structures simultaneously. A small GW distance suggests that
 #' imputation has introduced small distortions to global and local structures overall, whereas a large distance implies significant
 #' distortions. When comparing two or more imputation methods, the optimal method is the method with smallest GW distance.
-#' To compute the GW distance, the missing values in each column of \code{xna} are replaced by mean of observed values in that column.
-#' This is equivalent to imputation by KNN, where k is set to the total number of identified peptides (i.e. number of rows in the input matrix).
-#' GW distance estimation requires \code{python}. See example.
-#' All metrics are on log scale.
+#' The GW distance is computed on Principal Components (PCs) of the source and imputed data, instead of peptides. Principal components capture the
+#' geometry of the data, hence GW computed on PCs is a better measure of preservation of local and global structures. The PCs in the source data are
+#' recommended to be computed on peptides with high biological variance. Hence, users are recommended to subset the source data only on highly variable peptides (hvp)
+#' (see \code{findVariableFeatures}). Since the hvp peptides have high biological variance, they are likely to have enough information to discriminate samples
+#' from different experimental groups. Hence, PCs computed on those peptides should be representative of the original source data with missing values.
+#' If the samples cluster by experimental group in the first couple of PCs, then a choice of k=2 is reasonable. If the desired separation/clustering of samples
+#' occurs in later PCs (i.e. the first few PCs are dominated by batches or unwanted variability), then it is recommended to use a larger number of PCs to compute the
+#' GW metric. If you are interested in how well the imputed data represent the original data in all possible dimensions, then set k to the number of samples
+#' in the data (i.e. the number of columns in the intensity matrix).
+#' GW distance estimation requires \code{python}. See example. All metrics are on log scale.
 #'
 #'
 #' @return list of three metrics: withinness (sum of squared distances within a phenotype group),
 #' betweenness (sum of squared distances between the phenotypes), and gromov-wasserstein distance (if \code{xna} is not NULL).
-#' All metrics are on log scale.
+#' if \code{group} is NULL only the GW distance is returned. All metrics are on log scale.
 #'
 #'
 #' @examples
@@ -49,28 +56,35 @@
 #' # you can then run the computeStructuralMetrics() function.
 #' # Note that the reticulate package should be loaded before loading msImpute.
 #' set.seed(101)
-#' n=200
-#' p=100
-#' J=50
+#' n=12000
+#' p=10
+#' J=5
 #' np=n*p
 #' missfrac=0.3
-#' x=matrix(rnorm(n*J),n,J)%*%matrix(rnorm(J*p),J,p)+matrix(rnorm(np),n,p)/5
+#' x=matrix(rnorm(n*J,mean = 5,sd = 0.2),n,J)%*%matrix(rnorm(J*p, mean = 5,sd = 0.2),J,p)+
+#'   matrix(rnorm(np,mean = 5,sd = 0.2),n,p)/5
 #' ix=seq(np)
 #' imiss=sample(ix,np*missfrac,replace=FALSE)
 #' xna=x
 #' xna[imiss]=NA
+#' keep <- (rowSums(!is.na(xna)) >= 4)
+#' xna <- xna[keep,]
+#' rownames(xna) <- 1:nrow(xna)
 #' y <- xna
 #' xna <- scaleData(xna)
 #' xcomplete <- msImpute(object=xna)
-#' G <- as.factor(sample(1:5, 100, replace = TRUE))
-#' computeStructuralMetrics(xcomplete, G, y)
+#' G <- as.factor(sample(1:3, p, replace = TRUE))
+#' top.hvp <- findVariableFeatures(y)
+#' computeStructuralMetrics(xcomplete, G, y[rownames(top.hvp)[1:50],], k = 2)
 #' @export
-computeStructuralMetrics <- function(x, group, xna = NULL){
-  out <- list(withinness = log(withinness(x, group)),
-              betweenness = log(betweenness(x,group)))
+computeStructuralMetrics <- function(x, group=NULL, y = NULL, k=2){
+  out <- list()  # initialise so the GW branch also works when group is NULL
+  if(!is.null(group)){
+    out <- list(withinness = log(withinness(x, group)), betweenness = log(betweenness(x,group)))
+  }

-  if(!is.null(xna)){
-    GW <- gromov_wasserstein(xna, x)
+  if(!is.null(y)){
+    GW <- gromov_wasserstein(x, y, k=k)
     out[['gw_dist']] <- GW[[2]]$gw_dist
   }
   return(out)
@@ -101,8 +115,33 @@ betweenness <- function(x, class_label){


 #' @export
-gromov_wasserstein <- function(xna, ximputed){
+gromov_wasserstein <- function(x, y, k, min.mean = 0.1){
+  if (k > ncol(x)) stop("Number of Principal Components cannot be greater than number of columns (samples) in the data.")
+  if (any(!is.finite(x))) stop("Non-finite values (NA, Inf, NaN) encountered in imputed data")
+  if (any(!is.finite(y))) stop("Non-finite values (NA, Inf, NaN) encountered in source data")
+
+  means <- rowMeans(x)
+  vars <- matrixStats::rowSds(x)
+
+  # Filtering out zero-variance and low-abundance peptides
+  is.okay <- !is.na(vars) & vars > 1e-8 & means >= min.mean
+
+  xt <- t(x)
+  yt <- t(y)
+
+  # compute PCA
+  xt_pca <- prcomp(xt[,is.okay], scale. = TRUE, center = TRUE)
+  yt_pca <- prcomp(yt, scale. = TRUE, center = TRUE)
+
+  C1 <- yt_pca$x[,1:k]
+  C2 <- xt_pca$x[,1:k]
+
+
+  cat("Computing GW distance using k=", k, "Principal Components\n")
   reticulate::source_python(system.file("python", "gw.py", package = "msImpute"))
-  xna <- apply(xna, 2, FUN=function(x) {x[is.na(x)] <- mean(x, na.rm=TRUE); return(x)})
-  return(gw(t(xna), t(ximputed), ncol(xna)))
+  return(gw(C1,C2, ncol(x)))
 }
+
+
+
+
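
Note on the PCA step: the projection that the new gromov_wasserstein() performs before calling the Python routine can be sketched in base R. The helper name top_pcs and the toy matrices are assumptions made for illustration only. Unlike the commit, which filters only the imputed matrix, this sketch applies the same filter to both inputs, and it stops short of computing the GW distance itself.

```r
# Illustrative only: toy matrices stand in for the source and imputed data.
set.seed(1)
n_pep <- 200; n_samp <- 8; k <- 2
source_mat  <- matrix(rnorm(n_pep * n_samp, mean = 20, sd = 2), n_pep, n_samp)
imputed_mat <- source_mat + matrix(rnorm(n_pep * n_samp, sd = 0.1), n_pep, n_samp)

# Hypothetical helper: scores of the samples on the first k principal components.
# Input has peptides in rows and samples in columns.
top_pcs <- function(m, k, min.mean = 0.1) {
  stopifnot(k <= ncol(m), all(is.finite(m)))
  sds  <- apply(m, 1, sd)
  keep <- sds > 1e-8 & rowMeans(m) >= min.mean   # drop flat or low-abundance rows
  pca  <- prcomp(t(m[keep, , drop = FALSE]), center = TRUE, scale. = TRUE)
  pca$x[, seq_len(k), drop = FALSE]
}

C_source  <- top_pcs(source_mat, k)
C_imputed <- top_pcs(imputed_mat, k)
dim(C_source)   # samples x k
```

Both score matrices have one row per sample, which is the shape the GW routine receives in the diff above.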

R/findVariableFeatures.R

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+#' Find highly variable peptides
+#'
+#' For each peptide, the total variance is decomposed into biological and technical variance using package \code{scran}
+#' @param y numeric matrix giving log-intensity. Can contain NA values.
+#'
+#' @return A data frame where rows are peptides and columns contain estimates of biological and technical variances. Peptides are ordered by biological variance.
+#'
+#' @details A loess trend is fitted to total sample variances and mean intensities. For each peptide, the biological variance is then
+#' computed by subtracting the estimated technical variance from the loess fit from the total sample variance.
+#'
+#' @seealso computeStructuralMetrics
+#'
+#' @export
+#' @importFrom scran trendVar decomposeVar
+findVariableFeatures <- function(y){
+  fit <- trendVar(y)
+  results <- decomposeVar(y, fit)
+  plot(results$mean, results$total)
+  o <- order(results$mean)
+  lines(results$mean[o], results$tech[o], col="red", lwd=2)
+  results <- as.data.frame(results)
+  top.dec <- results[order(results$bio, decreasing=TRUE), ]
+  return(top.dec)
+
+}
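
Note on the decomposition: the idea described in the @details text (fit a mean-variance trend, treat the fitted value as technical variance, and take the remainder as biological variance) can be illustrated with base R alone. This is a simplified sketch using stats::loess on simulated data; it is not the scran implementation, and the variable names are assumptions. Whether trendVar() and decomposeVar() are available should be checked against the installed scran version.

```r
# Illustrative only: simulated log-intensities with peptide-specific means.
set.seed(2)
n_pep <- 500; n_samp <- 6
mu <- runif(n_pep, 15, 25)
y <- matrix(rnorm(n_pep * n_samp, mean = mu, sd = 0.5), n_pep, n_samp)
rownames(y) <- paste0("pep", seq_len(n_pep))
y[sample(length(y), 300)] <- NA
y <- y[rowSums(!is.na(y)) >= 3, , drop = FALSE]   # variance needs enough observations

pep_mean  <- rowMeans(y, na.rm = TRUE)
pep_total <- apply(y, 1, var, na.rm = TRUE)

# Mean-variance trend; the fitted value plays the role of technical variance.
trend    <- loess(pep_total ~ pep_mean)
pep_tech <- predict(trend, newdata = data.frame(pep_mean = pep_mean))

dec <- data.frame(mean = pep_mean, total = pep_total, tech = pep_tech,
                  bio = pep_total - pep_tech, row.names = rownames(y))
dec <- dec[order(dec$bio, decreasing = TRUE), ]   # most variable peptides first
head(dec, 3)
```

The row names of the leading rows of dec can then be used to subset the source matrix, in the same way the roxygen example uses rownames(top.hvp).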

R/msImpute.R

Lines changed: 11 additions & 7 deletions
@@ -8,7 +8,7 @@
 #' \code{msImpute} operates on the softImpute-ALS algorithm.
 #' For more details on the underlying algorithm, please see \code{\link[softImpute]{softImpute}} package.
 #'
-#' @param object Numeric matrix where missing values are denoted by NA. Rows are peptides, columns are samples.
+#' @param object Numeric matrix giving log-intensity where missing values are denoted by NA. Rows are peptides, columns are samples.
 #' @param rank.max Numeric. This restricts the rank of the solution. is set to min(dim(\code{object})-1) by default.
 #' @param lambda Numeric. Nuclear-norm regularization parameter. Controls the low-rank property of the solution
 #' to the matrix completion problem. By default, it is determined at the scaling step. If set to zero
@@ -24,16 +24,19 @@
 #'
 #' @examples
 #' set.seed(101)
-#' n=200
-#' p=100
-#' J=50
+#' n=12000
+#' p=10
+#' J=5
 #' np=n*p
 #' missfrac=0.3
-#' x=matrix(rnorm(n*J),n,J)%*%matrix(rnorm(J*p),J,p)+matrix(rnorm(np),n,p)/5
+#' x=matrix(rnorm(n*J,mean = 5,sd = 0.2),n,J)%*%matrix(rnorm(J*p, mean = 5,sd = 0.2),J,p)+
+#'   matrix(rnorm(np,mean = 5,sd = 0.2),n,p)/5
 #' ix=seq(np)
 #' imiss=sample(ix,np*missfrac,replace=FALSE)
 #' xna=x
 #' xna[imiss]=NA
+#' keep <- (rowSums(!is.na(xna)) >= 4)
+#' xna <- xna[keep,]
 #' xna <- scaleData(xna)
 #' xcomplete <- msImpute(object=xna)
 #' @seealso selectFeatures, scaleData
@@ -50,20 +53,21 @@ msImpute <- function(object, rank.max = NULL, lambda = NULL, thresh = 1e-05,
   if(is(object, "matrix")) {
     x <- object
     xnas <- x
+    warning("Input is not scaled. Data scaling is recommended for msImpute optimal performance.")
   }
   # MAList object
   # or \code{MAList} object from \link{limma}
   # if(is(object,"MAList")) x <- object$E

-
+  if(any(is.nan(x) | is.infinite(x))) stop("Inf or NaN values encountered.")
   if(any(rowSums(!is.na(x)) <= 3)) stop("Peptides with excessive NAs are detected. Please revisit your fitering step. At least 4 non-missing measurements are required for any peptide.")
   if(any(x < 0, na.rm = TRUE)){
     warning("Negative values encountered in imputed data. Please consider revising filtering and/or normalisation steps.")
   }
   if(is.null(rank.max)) rank.max <- min(dim(x) - 1)
   cat("maximum rank is", rank.max, "\n")
   cat("computing lambda0 ... \n")
-  if(is.null(lambda)) lambda <- softImpute::lambda0(x)
+  if(is.null(lambda)) lambda <- softImpute::lambda0(xnas)
   cat("lambda0 is", lambda, "\n")
   cat("fit the low-rank model ... \n")
   fit <- softImpute::softImpute(xnas,rank=rank.max,lambda=lambda, type = "als", thresh = thresh,
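
Note on the revised examples: with only p=10 samples and 30% of entries removed at random, some simulated peptides end up with three or fewer observed values, which the rowSums check in msImpute() rejects. That is presumably why the examples now filter rows before scaling. A standalone sketch of the simulation and filter, using a smaller n than the roxygen example to keep it quick (an assumption for illustration) and base R only:

```r
# Illustrative only: low-rank signal plus noise, then entries removed at random.
set.seed(101)
n <- 1000; p <- 10; J <- 5; missfrac <- 0.3
x <- matrix(rnorm(n * J, mean = 5, sd = 0.2), n, J) %*%
     matrix(rnorm(J * p, mean = 5, sd = 0.2), J, p) +
     matrix(rnorm(n * p, mean = 5, sd = 0.2), n, p) / 5

n_miss <- round(n * p * missfrac)
xna <- x
xna[sample(n * p, n_miss)] <- NA

# Keep peptides with at least 4 observed values, as msImpute() requires.
keep <- rowSums(!is.na(xna)) >= 4
xna  <- xna[keep, , drop = FALSE]
c(before = n, after = nrow(xna))
```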

R/scaleData.R

Lines changed: 10 additions & 6 deletions
@@ -1,7 +1,7 @@
 #' Standardize a matrix to have optionally row means zero and variances one, and/or column means zero and variances one.
 #'
 #'
-#' @param object numeric matrix where missing values are denoted by NA. Rows are peptides, columns are samples.
+#' @param object numeric matrix giving log-intensity where missing values are denoted by NA. Rows are peptides, columns are samples.
 #' @param maxit numeric. maximum iteration for the algorithm to converge (default to 20). When both row and column centering/scaling is requested, iteration may be necessary.
 #' @param thresh numeric. Convergence threshold (default to 1e-09).
 #' @param row.center logical. if row.center==TRUE (the default), row centering will be performed resulting in a matrix with row means zero. If row.center is a vector, it will be used to center the rows. If row.center=FALSE nothing is done.
@@ -12,23 +12,27 @@
 #'
 #' @details
 #' Standardizes rows and/or columns of a matrix with missing values, according to the \code{biScale} algorithm in Hastie et al. 2015.
+#' Data is assumed to be normalised and log-transformed.
 #'
 #' @return
 #' A list of two components: E and E.scaled. E contains the input matrix, E.scaled contains the scaled data
 #'
 #'
 #' @examples
 #' set.seed(101)
-#' n=200
-#' p=100
-#' J=50
+#' n=12000
+#' p=10
+#' J=5
 #' np=n*p
 #' missfrac=0.3
-#' x=matrix(rnorm(n*J),n,J)%*%matrix(rnorm(J*p),J,p)+matrix(rnorm(np),n,p)/5
+#' x=matrix(rnorm(n*J,mean = 5,sd = 0.2),n,J)%*%matrix(rnorm(J*p, mean = 5,sd = 0.2),J,p)+
+#'   matrix(rnorm(np,mean = 5,sd = 0.2),n,p)/5
 #' ix=seq(np)
 #' imiss=sample(ix,np*missfrac,replace=FALSE)
 #' xna=x
 #' xna[imiss]=NA
+#' keep <- (rowSums(!is.na(xna)) >= 4)
+#' xna <- xna[keep,]
 #' xna <- scaleData(xna)
 #' @seealso selectFeatures, msImpute
 #' @export
@@ -39,7 +43,7 @@ scaleData <- function(object, maxit = 20, thresh = 1e-09, row.center = TRUE, row
   }else{
     x <- object
   }
-
+  if(any(is.nan(x) | is.infinite(x))) stop("Inf or NaN values encountered.")
   if(any(rowSums(!is.na(x)) <= 3)) stop("Peptides with excessive NAs are detected. Please revisit your fitering step. At least 4 non-missing measurements are required for any peptide.")
   if(any(x < 0, na.rm = TRUE)){
     warning("Negative values encountered in imputed data. Please consider revisting the filtering and/or normalisation steps, if appropriate.")

R/selectFeatures.R

Lines changed: 10 additions & 5 deletions
@@ -4,7 +4,7 @@
 #' used to determine if data is Missing Not At Random (MNAR). Users should note that \code{msImpute} assumes peptides
 #' are Missing At Random (MAR).
 #'
-#' @param object Numeric matrix where missing values are denoted by NA.
+#' @param object Numeric matrix giving log-intensity where missing values are denoted by NA.
 #' Rows are peptides, columns are samples.
 #' @param n_features Numeric, number of features with high dropout rate. 500 by default.
 #' @param suppress_plot Logical show plot of dropouts vs abundances.
@@ -13,16 +13,19 @@
 #'
 #' @examples
 #' set.seed(101)
-#' n=800
-#' p=100
-#' J=50
+#' n=12000
+#' p=10
+#' J=5
 #' np=n*p
 #' missfrac=0.3
-#' x=matrix(rnorm(n*J),n,J)%*%matrix(rnorm(J*p),J,p)+matrix(rnorm(np),n,p)/5
+#' x=matrix(rnorm(n*J,mean = 5,sd = 0.2),n,J)%*%matrix(rnorm(J*p, mean = 5,sd = 0.2),J,p)+
+#'   matrix(rnorm(np,mean = 5,sd = 0.2),n,p)/5
 #' ix=seq(np)
 #' imiss=sample(ix,np*missfrac,replace=FALSE)
 #' xna=x
 #' xna[imiss]=NA
+#' keep <- (rowSums(!is.na(xna)) >= 4)
+#' xna <- xna[keep,]
 #' rownames(xna) <- 1:nrow(xna)
 #' hdp <- selectFeatures(xna, n_features=500, suppress_plot=FALSE)
 #' # construct matrix M to capture missing entries
@@ -59,6 +62,8 @@ selectFeatures <- function(object, n_features=500, suppress_plot = FALSE) {
   }

   if(is.null(rownames(x))) stop("No row names in input. Please provide input with named rows.")
+  if(any(is.nan(x) | is.infinite(x))) stop("Inf or NaN values encountered.")
+
   AveExpr <- rowMeans(x, na.rm = TRUE)
   dropout <- rowMeans(is.na(x))
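
Note on the repeated checks: this commit adds the same Inf/NaN guard to msImpute(), scaleData() and selectFeatures(), next to the existing check for peptides with too few observations. One possible way to keep such checks in a single place is a small shared validator. The function name check_intensity_matrix and its min_obs argument are hypothetical and not part of the package; this is a sketch of the idea rather than a proposed patch.

```r
# Hypothetical helper, not part of msImpute: one place for the input checks.
check_intensity_matrix <- function(x, min_obs = 4) {
  if (!is.matrix(x) || !is.numeric(x)) stop("Input must be a numeric matrix.")
  # NA is allowed (it marks missing values); NaN and Inf are not.
  if (any(is.nan(x) | is.infinite(x))) stop("Inf or NaN values encountered.")
  if (any(rowSums(!is.na(x)) < min_obs)) {
    stop("At least ", min_obs, " non-missing measurements are required for any peptide.")
  }
  if (any(x < 0, na.rm = TRUE)) warning("Negative values encountered.")
  invisible(x)
}

m <- matrix(c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10), nrow = 2)
check_intensity_matrix(m)            # passes silently

bad <- m
bad[1, 1] <- Inf
res <- tryCatch(check_intensity_matrix(bad),
                error = function(e) conditionMessage(e))
res
```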
