Why remove cells with 0 reads during LSI projection? #1345
-
I think ArchR's implementation of iterative LSI is the greatest thing since sliced bread for feature selection in an ATAC dataset. The LSI itself is constructed using the top features, which leads to the (probably rare) situation where a valid cell has no reads in any of the top, say, 25,000 most variable features. Such cells are removed during projection (lines 1269 to 1273 in 968e442).

This is probably sensible, but does it mean that valid cells may not get incorporated into the LSI projection (and thus fall out of downstream steps like UMAP and clustering)?

This came up because I modified the code to run better on peak matrices. My hackery is well outside the range of what is supported, but I'm curious whether I am correct in this observation, or whether I've managed to flub something up.
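To make the scenario concrete, here is a minimal toy sketch in plain Python. It is not ArchR code: the count matrix, the cell names, and the use of plain per-feature variance as the ranking are all assumptions made for illustration (ArchR's own feature selection is more involved). It only shows how a cell with reads can end up with zero counts in the selected top features.

```python
# Toy cell-by-feature counts (rows = cells, columns = features).
# All names and numbers are illustrative and not taken from ArchR.
counts = {
    "cell_A": [3, 0, 1, 0, 2, 0],
    "cell_B": [0, 4, 0, 0, 1, 0],
    "cell_C": [0, 0, 0, 1, 0, 1],  # has reads, but only in low-variance features
}


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def top_variable_features(counts, n_top):
    """Indices of the n_top features with the highest variance across cells."""
    n_features = len(next(iter(counts.values())))
    columns = [[row[j] for row in counts.values()] for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variance(columns[j]), reverse=True)
    return sorted(ranked[:n_top])


def split_cells(counts, features):
    """Split cells into those with and without reads in the given features."""
    kept, dropped = [], []
    for cell, row in counts.items():
        (kept if any(row[j] > 0 for j in features) else dropped).append(cell)
    return kept, dropped


top = top_variable_features(counts, n_top=3)
kept, dropped = split_cells(counts, top)
print("top features:", top)
print("kept:", kept, "dropped:", dropped)
```

In this toy case `cell_C` has two reads in total, yet none of them fall in the three selected features, so it is the one cell that would be excluded.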
Replies: 5 comments 2 replies
-
agreed
This seems totally possible but I'm not sure how it would best be addressed. We could throw a warning saying how many cells are excluded and provide a recommendation to use more features?
If you ever want to contribute to making ArchR better, just let me know.
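The warning suggested above could look roughly like the following sketch. This is plain Python rather than ArchR's R code, and `drop_uncovered_cells` is a hypothetical helper invented for this example, not an existing ArchR function. It reports how many cells have zero reads in the selected features and recommends using more features.

```python
import warnings


def drop_uncovered_cells(counts, selected_features):
    """Keep cells with at least one read in selected_features; warn about the rest.

    Hypothetical helper for illustration only.
    """
    kept = {
        cell: row
        for cell, row in counts.items()
        if any(row[j] > 0 for j in selected_features)
    }
    n_dropped = len(counts) - len(kept)
    if n_dropped > 0:
        pct = 100.0 * n_dropped / len(counts)
        warnings.warn(
            f"{n_dropped} of {len(counts)} cells ({pct:.2f}%) have 0 reads in the "
            f"{len(selected_features)} selected features and were excluded; "
            "consider using more features.",
            UserWarning,
        )
    return kept


# Toy usage: cell c2 only has reads in feature 2, which is not selected.
counts = {"c1": [1, 0, 0], "c2": [0, 0, 5], "c3": [0, 2, 0]}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = drop_uncovered_cells(counts, selected_features=[0, 1])

for w in caught:
    print(w.message)
```

Reporting both the count and the percentage lets the user judge whether the loss is negligible or large enough to justify increasing the number of features.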
-
In a way, I think you already dealt with it (in a non-obvious way) -- there is a slot in the LSI object for
As for contributing, we should probably have that talk... probably when we invite you to give a talk, as most of us here at GNE are now using ArchR for our scATAC. More soon.
-
I am curious about this. If we use fewer varFeatures, some cells may be left out of the LSI projection. Even if we can list these cells, we still need to handle them in downstream analysis, and I am not sure whether that would cause errors. So maybe the recommendation should be to set varFeatures high enough that each cell has reads in at least one top feature? I tried lowering this parameter to see how it affects batch effects, since decreasing varFeatures is really helpful for reducing them. If it causes cell loss, I should try other parameters and addHarmony instead.
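One way to read the suggestion above is: given features already ranked by variability, find the smallest number of top features such that every cell has a read in at least one of them. The sketch below is plain Python with made-up data; `min_features_for_full_coverage` and the ranking are assumptions for illustration, not part of ArchR.

```python
def min_features_for_full_coverage(counts, ranked_features):
    """Smallest n such that every cell has a read in one of ranked_features[:n].

    Returns None if some cell has no reads in any ranked feature.
    Hypothetical helper for illustration only.
    """
    uncovered = set(counts)
    for n, j in enumerate(ranked_features, start=1):
        # A cell stays uncovered only while it has no reads in feature j either.
        uncovered = {cell for cell in uncovered if counts[cell][j] == 0}
        if not uncovered:
            return n
    return None


# Toy data: feature indices are assumed to be pre-ranked, most variable first.
counts = {
    "c1": [2, 0, 0, 0],
    "c2": [0, 1, 0, 0],
    "c3": [0, 0, 0, 3],
}
n_needed = min_features_for_full_coverage(counts, ranked_features=[0, 1, 2, 3])
n_needed_alt = min_features_for_full_coverage(counts, ranked_features=[3, 0, 1, 2])

# A cell with no reads anywhere can never be covered.
counts_with_empty = dict(counts, c4=[0, 0, 0, 0])
n_needed_empty = min_features_for_full_coverage(counts_with_empty, [0, 1, 2, 3])
print(n_needed, n_needed_alt, n_needed_empty)
```

Note the trade-off raised in this comment: the number returned here may be larger than the value that best reduces batch effects, so it is a lower bound for full coverage rather than a recommended setting.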
-
I have yet to actually encounter a case in which I lost cells during "normal" use of LSI (e.g. on the tile matrix). It only happened when I was messing about with edge cases and peak-call-level signals. My own suggestions would be:
For context for anyone reading this, my own messing about with an edge case led to the removal of 321 cells out of ~250,000 cells. This is very much in the category of no big deal.
-
I had the same problem: running on 211,571 cells, I got only 207,209 in the LSI when I used the top 35,000 variable tiles. The removed cells could be low-quality cells, but it could be problematic for downstream analysis when the dimensions of the reducedDim and the matrix do not match. Maybe one idea is to automatically use more tiles so that all cells are covered?
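For scale, the numbers above work out to 4,362 missing cells, or roughly 2% of the input. The sketch below (plain Python, not ArchR code) does that arithmetic and then shows one hedged way to handle the dimension mismatch: subset the matrix to the cells that actually have an embedding, matching by cell name. `align_to_embedding` and the cell names are hypothetical.

```python
# Figures reported in the comment above.
n_input, n_in_lsi = 211_571, 207_209
n_missing = n_input - n_in_lsi
pct_missing = 100 * n_missing / n_input
print(f"{n_missing} cells missing ({pct_missing:.2f}%)")


def align_to_embedding(matrix_cells, embedding_cells):
    """Return (shared, missing): cells present in both, in embedding order,
    and matrix cells that have no embedding.

    Hypothetical helper for illustration only.
    """
    in_matrix = set(matrix_cells)
    in_embedding = set(embedding_cells)
    shared = [cell for cell in embedding_cells if cell in in_matrix]
    missing = [cell for cell in matrix_cells if cell not in in_embedding]
    return shared, missing


# Toy usage with made-up cell names: cell "s1#c" was dropped from the LSI.
matrix_cells = ["s1#a", "s1#b", "s1#c", "s1#d"]
embedding_cells = ["s1#d", "s1#a", "s1#b"]
shared, missing = align_to_embedding(matrix_cells, embedding_cells)
print("shared:", shared, "missing:", missing)
```

Matching by name rather than by position avoids silently pairing the wrong cell with the wrong embedding row when the two objects differ in size or order.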