-
Notifications
You must be signed in to change notification settings - Fork 11
Description
Hello!
I ran upon a strange behavior of the DModX() function when projecting new data into my PCA model. I've pasted some code to reproduce this below. First, I train a PCA model on the majority of the observations in the mtcars dataset:
# load package and demo data
set.seed(32)
data("mtcars")
library(pcaMethods)
# split dataset
ind <- sample(1:nrow(mtcars), 5)
test <- mtcars[ind,]
train <- mtcars[-ind,]
# train pca model
pca.mod <- pca(train, scale = "uv", center = TRUE, nPcs = 3)When evaluating the DModX values of new observations, I noticed that I get different results depending on how many new observations I evaluate:
> DModX(pca.mod, dat = test, newdata = TRUE, type = "normalized")
Maserati Bora Valiant Merc 280C Toyota Corolla Merc 230
3.6666773 0.9110368 1.7617564 1.7419626 1.6084993
> DModX(pca.mod, dat = test[1:3,], newdata = TRUE, type = "normalized")
Maserati Bora Valiant Merc 280C
4.218257 1.048084 2.026778 You also get the same value for a single new observation (no matter which observation):
> DModX(pca.mod, dat = test[1,], newdata = TRUE, type = "normalized")
Maserati Bora
4.795832
> DModX(pca.mod, dat = test[3,], newdata = TRUE, type = "normalized")
Merc 280C
4.795832 Finally, it only seems to occur for normalized DModX values (type = "normalized"), which is the default setting. The example below show that the same DModX values are obtained when using absolute DModX values (type = "absolute").
> DModX(pca.mod, dat = test, newdata = TRUE, type = "absolute")
Maserati Bora Valiant Merc 280C Toyota Corolla Merc 230
27.268176 6.775156 13.101749 12.954547 11.962013
> DModX(pca.mod, dat = test[1:3,], newdata = TRUE, type = "absolute")
Maserati Bora Valiant Merc 280C
27.268176 6.775156 13.101749In the documentation you write that the "Normalized values are adjusted to the total RSD of the model". Hence, as I am providing the same PCA model, the generated DModX value of a single new observation should not vary depending on the set of new observations that are projected into the model space.
When looking at the function code (pasted below), I have a guess on where this issue might arise. In the calculation of the normalization factor s0 it looks like the squared residuals of the new observations are summed (sum(E2)). Should this not be the sum of the squared residuals of the training data (i.e., sum(resid(object, object@completeObs)^2))?
setMethod("DModX", "pcaRes",
function(object, dat, newdata=FALSE, type=c("normalized","absolute"), ...) {
type <- match.arg(type)
if(missing(dat)) {
if(!is.null(completeObs(object)))
dat <- completeObs(object)
else stop("missing data when calculating DModX")
}
A0 <- as.integer(centered(object))
ny <- ifelse(newdata, 1,
sqrt(nObs(object) /
(nObs(object) - nP(object) - A0)))
E2 <- resid(object, dat)^2
s <- sqrt(rowSums(E2) / (nVar(object) - nP(object))) * ny
if(type == "absolute")
return(s)
s0 <- sqrt(sum(E2) / ((nObs(object) - nP(object) - A0) *
(nVar(object) - nP(object))))
s / s0
})