Skip to content

User data should be rCLR transformed before projection #35

@rabdill

Description

@rabdill

User data will be loaded as a samples x taxon table of read counts, which we then rarefy to 3000 reads per sample see #34):

Sample Tax 1 Tax 2 Tax 3 Tax 4 Tax 5 READ TOTAL
s123 300 150 600 650 1300 3000
s124 0 0 1400 1300 300 3000
s125 250 1000 1700 50 0 3000

At that point, we need to perform the robust centered log-ratio transformation. This is how they define it:

$rclr = log\frac{x}{g(x > 0)}$

The g here is a value for each sample indicating the geometric mean of the read counts found in all taxa. For s123 above would be

$$\sqrt[5]{(300 \times 150 \times 600 \times 650 \times 1300)} = 469.5092$$

This is where the controversial step comes in. Zeroes are a problem here, for both the geometric mean (which uses multiplication, so all rows with a zero would have a mean of zero) and for the logarithms. There are a few options to replace zeroes with other numbers, but the rCLR transformation we're going with skips the zeroes:

  • The geometric mean calculated for s124 above would be (1400*1300*300)^(1/3), for example.
  • For the subsequent steps using logs, the zeroes in the matrix (s124, taxa 1 and 2; s125 taxon 5) would remain unchanged.

Once each sample has its geometric mean, that number is used to "center" the data in its row by dividing each read count by the geometric mean, then taking the log. So the read counts for samples mentioned here would be transformed into:

Sample Tax 1 Tax 2 Tax 3 Tax 4 Tax 5
s123 log(300/469.5) log(150/469.5) log(600/469.5) log(650/469.5) log(1300/469.5)
s124 0 0 log(1400/817.3) log(1300/817.3) log(300/817.3)

Here are two helpers in R that do this. In both cases, the input is an array of read counts for a single sample:

gm_mean = function(x){
  exp(mean(log(x[x > 0])))
}

rclr <- function(a) {
  answer <- log(a/gm_mean(a))
  answer[] <- lapply(answer, function(i) if(is.numeric(i)) ifelse(is.infinite(i), 0, i) else i)
  return(answer)
}

In case it's relevant, I'll also note that the final rCLR-transformed matrix will have negative values—taxa with fewer reads than the sample's average (s123, taxa 1 and 2; s124, taxon 4) will have negative values here. In the preliminary ordinations, the axes we visualized have generally been limited to something like (-10, 10) overall.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions