User data will be loaded as a samples x taxon table of read counts, which we then rarefy to 3000 reads per sample see #34):
| Sample |
Tax 1 |
Tax 2 |
Tax 3 |
Tax 4 |
Tax 5 |
READ TOTAL |
| s123 |
300 |
150 |
600 |
650 |
1300 |
3000 |
| s124 |
0 |
0 |
1400 |
1300 |
300 |
3000 |
| s125 |
250 |
1000 |
1700 |
50 |
0 |
3000 |
At that point, we need to perform the robust centered log-ratio transformation. This is how they define it:
$rclr = log\frac{x}{g(x > 0)}$
The g here is a value for each sample indicating the geometric mean of the read counts found in all taxa. For s123 above would be
$$\sqrt[5]{(300 \times 150 \times 600 \times 650 \times 1300)} = 469.5092$$
This is where the controversial step comes in. Zeroes are a problem here, for both the geometric mean (which uses multiplication, so all rows with a zero would have a mean of zero) and for the logarithms. There are a few options to replace zeroes with other numbers, but the rCLR transformation we're going with skips the zeroes:
- The geometric mean calculated for
s124 above would be (1400*1300*300)^(1/3), for example.
- For the subsequent steps using logs, the zeroes in the matrix (s124, taxa 1 and 2; s125 taxon 5) would remain unchanged.
Once each sample has its geometric mean, that number is used to "center" the data in its row by dividing each read count by the geometric mean, then taking the log. So the read counts for samples mentioned here would be transformed into:
| Sample |
Tax 1 |
Tax 2 |
Tax 3 |
Tax 4 |
Tax 5 |
| s123 |
log(300/469.5) |
log(150/469.5) |
log(600/469.5) |
log(650/469.5) |
log(1300/469.5) |
| s124 |
0 |
0 |
log(1400/817.3) |
log(1300/817.3) |
log(300/817.3) |
Here are two helpers in R that do this. In both cases, the input is an array of read counts for a single sample:
gm_mean = function(x){
exp(mean(log(x[x > 0])))
}
rclr <- function(a) {
answer <- log(a/gm_mean(a))
answer[] <- lapply(answer, function(i) if(is.numeric(i)) ifelse(is.infinite(i), 0, i) else i)
return(answer)
}
In case it's relevant, I'll also note that the final rCLR-transformed matrix will have negative values—taxa with fewer reads than the sample's average (s123, taxa 1 and 2; s124, taxon 4) will have negative values here. In the preliminary ordinations, the axes we visualized have generally been limited to something like (-10, 10) overall.
User data will be loaded as a samples x taxon table of read counts, which we then rarefy to 3000 reads per sample see #34):
At that point, we need to perform the robust centered log-ratio transformation. This is how they define it:
The
ghere is a value for each sample indicating the geometric mean of the read counts found in all taxa. Fors123above would beThis is where the controversial step comes in. Zeroes are a problem here, for both the geometric mean (which uses multiplication, so all rows with a zero would have a mean of zero) and for the logarithms. There are a few options to replace zeroes with other numbers, but the rCLR transformation we're going with skips the zeroes:
s124above would be(1400*1300*300)^(1/3), for example.Once each sample has its geometric mean, that number is used to "center" the data in its row by dividing each read count by the geometric mean, then taking the log. So the read counts for samples mentioned here would be transformed into:
Here are two helpers in R that do this. In both cases, the input is an array of read counts for a single sample:
In case it's relevant, I'll also note that the final rCLR-transformed matrix will have negative values—taxa with fewer reads than the sample's average (s123, taxa 1 and 2; s124, taxon 4) will have negative values here. In the preliminary ordinations, the axes we visualized have generally been limited to something like
(-10, 10)overall.