User data should be rCLR transformed before projection

User data will be loaded as a samples x taxon table of read counts, which we then rarefy to 3000 reads per sample see https://github.com/blekhmanlab/compendium_website/issues/34):

| Sample | Tax 1 | Tax 2 | Tax 3 | Tax 4 | Tax 5 | READ TOTAL |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| s123  | 300 | 150 | 600 | 650 | 1300 | 3000 |
| s124  | 0 | 0 | 1400 | 1300 | 300 | 3000 |
| s125 | 250 | 1000 | 1700 | 50 | 0 | 3000 |

At that point, we need to perform the [robust centered log-ratio transformation](https://journals.asm.org/doi/10.1128/msystems.00016-19). This is how they define it:

$rclr = log\frac{x}{g(x > 0)}$

The `g` here is a value for each sample indicating the geometric mean of the read counts found in all taxa. For `s123` above would be

$$\sqrt[5]{(300 \times 150 \times 600 \times 650 \times 1300)} = 469.5092$$

This is where the controversial step comes in. Zeroes are a problem here, for both the geometric mean (which uses multiplication, so all rows with a zero would have a mean of zero) and for the logarithms. There are a few options to replace zeroes with other numbers, but the rCLR transformation we're going with **skips the zeroes**:
* The geometric mean calculated for `s124` above would be `(1400*1300*300)^(1/3)`, for example.
* For the subsequent steps using logs, the zeroes in the matrix (s124, taxa 1 and 2; s125 taxon 5) would **remain unchanged**.

Once each sample has its geometric mean, that number is used to "center" the data in its row by dividing each read count by the geometric mean, then taking the log. So the read counts for samples mentioned here would be transformed into:

| Sample | Tax 1 | Tax 2 | Tax 3 | Tax 4 | Tax 5 |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| s123  | log(300/469.5) | log(150/469.5)  | log(600/469.5)  | log(650/469.5)  | log(1300/469.5) | 
| s124  | **0** | **0** | log(1400/817.3) | log(1300/817.3) | log(300/817.3) | log(3000/817.3)) |

Here are two helpers in R that do this. In both cases, the input is an array of read counts for a single sample:

```r
gm_mean = function(x){
  exp(mean(log(x[x > 0])))
}

rclr <- function(a) {
  answer <- log(a/gm_mean(a))
  answer[] <- lapply(answer, function(i) if(is.numeric(i)) ifelse(is.infinite(i), 0, i) else i)
  return(answer)
}
```

In case it's relevant, I'll also note that the final rCLR-transformed matrix will have negative values—taxa with fewer reads than the sample's average (s123, taxa 1 and 2; s124, taxon 4) will have negative values here. In the preliminary ordinations, the axes we visualized have generally been limited to something like `(-10, 10)` overall.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

User data should be rCLR transformed before projection #35

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Sample	Tax 1	Tax 2	Tax 3	Tax 4	Tax 5	READ TOTAL
s123	300	150	600	650	1300	3000
s124	0	0	1400	1300	300	3000
s125	250	1000	1700	50	0	3000

Sample	Tax 1	Tax 2	Tax 3	Tax 4	Tax 5
s123	log(300/469.5)	log(150/469.5)	log(600/469.5)	log(650/469.5)	log(1300/469.5)
s124	0	0	log(1400/817.3)	log(1300/817.3)	log(300/817.3)

User data should be rCLR transformed before projection #35

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions