UMAP cluster separation with bins vs. peaks #1116

AnjaliC4 · 2021-10-28T17:31:04Z

AnjaliC4
Oct 28, 2021

I compared 500 bp bins (estimated LSI: iterations = 2, firstSelection = "top", varFeatures = 50,000, dimsToUse = 1:30) with Signac top features (q0, i.e. all features) after calling peaks with macs2. Even after tweaking several parameters, such as increasing the iterations or decreasing varFeatures, I get more separated clusters with peaks than with bins. Here are the plots (bins first, peaks second; cluster resolution = 0.3 in both). I really like the idea of using bins, and I understand that separation between UMAP clusters is more technical than biological. However, could you suggest which parameters to tweak to get more separated clusters?

In addition, I am having some difficulty understanding the parameters provided for LSI.
First, when selecting top features in the first iteration, how many top features are selected? Is there a way to change this number?
Second, which parameter should I change to perform LSI on all the cells instead of an estimated LSI? I think it has to be one of sampleCellsPre, projectCellsPre, and sampleCellsFinal, but I am not sure which one exactly.

Thank you very much!!

Answered by rcorces

rcorces · 2021-10-29T14:42:41Z

rcorces
Oct 29, 2021
Maintainer

To clarify, clusters are called in the LSI space, regardless of how they are plotted in UMAP space.
I think you would have better luck tweaking the UMAP parameters, for example minDist, if you are looking to get more separation.

You can also force addIterativeLSI() to mimic your results from Signac by tweaking the parameters. In our hands, using defined metrics for assessing clustering (i.e. not subjective measures like how separate the clusters are), ArchR performs very well.
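
As a rough illustration of the UMAP suggestion above, the sketch below recomputes only the embedding with a smaller minDist. It assumes an existing ArchRProject named `proj` on which addIterativeLSI() and addClusters() have already been run with their default names; the embedding name `UMAP_tight` and the specific values are illustrative assumptions, not settings recommended in this thread.

```r
library(ArchR)

# Assumption: `proj` is an existing ArchRProject that already contains
# reducedDims "IterativeLSI" and a "Clusters" column in cellColData.
proj <- addUMAP(
  ArchRProj   = proj,
  reducedDims = "IterativeLSI",
  name        = "UMAP_tight",  # hypothetical name; leaves the original "UMAP" untouched
  nNeighbors  = 30,            # illustrative value
  minDist     = 0.1,           # smaller values pack neighbouring cells more tightly
  metric      = "cosine",
  force       = TRUE
)

# The cluster labels are unchanged (they were called in LSI space);
# only the 2D layout used for plotting differs.
p <- plotEmbedding(
  ArchRProj = proj,
  colorBy   = "cellColData",
  name      = "Clusters",
  embedding = "UMAP_tight"
)
```

Because clustering is done in LSI space, changing minDist here alters how separated the groups look but not which cells belong to which cluster.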

2 replies

AnjaliC4 Oct 29, 2021
Author

Thank you Ryan, I totally agree that it is subjective to just say how separated the clusters are. I noticed that with Signac, I was using all the features (about 300,000 peaks) for LSI. Perhaps it is partly a matter of how many bins/peaks you provide for dimensionality reduction. I will now look into more defined metrics for assessing clusters.

Could you please clarify my other questions?
First, when selecting top features in the first iteration, how many top features are selected? Is there a way to change this number?
Second, which parameter should I change to perform LSI on all the cells instead of an estimated LSI? Is it projectCellsPre = TRUE? I am not sure if I am using it correctly.
Whenever I change any of these parameters (sampleCellsPre, projectCellsPre, or sampleCellsFinal), I get an error:
<simpleError in .logThis(sampledCellNames, "cellNames supplied", logFile = logFile): object 'logFile' not found>
I can report this issue separately if you'd prefer that.

Thank you very much.

rcorces Oct 29, 2021
Maintainer

> First, in selecting top features in the first iteration - how many top features are selected? Is there a way to change this number?

varFeatures specifies this (see the parameter definitions). However, settings such as filterQuantile also affect it, so if you really want all features you would set filterQuantile = 1.

> varFeatures: The number of N variable features to use for LSI. The top N features will be used based on the selectionMethod.

> Second, by tweaking which parameter can I perform LSI on all the cells instead of an estimated LSI. Basically, I think it has to be between these options - sampleCellsPre, projectCellsPre, and sampleCellsFinal- not sure which one exactly.

Correct. Setting projectCellsPre = FALSE and both sampleCellsPre and sampleCellsFinal to NULL should do this.
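
Put together, the settings described in this reply might look like the following sketch. It assumes an existing ArchRProject named `proj` with a TileMatrix; the reduced-dimensions name `IterativeLSI_full` is a hypothetical label, and the remaining values simply echo the ones mentioned in the original question. Note that the thread reports a `logFile` error when changing the sampling arguments, so whether this runs cleanly may depend on the installed ArchR version.

```r
library(ArchR)

# Assumption: `proj` is an existing ArchRProject with a TileMatrix.
proj <- addIterativeLSI(
  ArchRProj        = proj,
  useMatrix        = "TileMatrix",
  name             = "IterativeLSI_full",  # hypothetical name for this run
  iterations       = 2,
  firstSelection   = "top",
  varFeatures      = 50000,   # number of features kept for LSI
  filterQuantile   = 1,       # per the reply: avoid dropping features by quantile
  sampleCellsPre   = NULL,    # do not subsample cells in the early iterations
  sampleCellsFinal = NULL,    # do not subsample cells in the final iteration
  projectCellsPre  = FALSE,   # no projection of left-out cells
  dimsToUse        = 1:30,
  force            = TRUE
)
```

Running LSI on all cells rather than a sample will generally take more time and memory on large projects.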

Brawni · 2022-02-10T17:09:48Z

Brawni
Feb 10, 2022

Hi @AnjaliC4! I have made similar observations with our data. Did you figure out how to get more separated clusters, as with Signac?
Thank you!

3 replies

AnjaliC4 Feb 13, 2022
Author

Hi @Brawni, since Signac uses peaks and ArchR uses bins for clustering, I believe the main differences come from this. 500 bp bins are more likely to be similar across clusters than enriched peak regions. If you wish to use bins for clustering, there are several things you can try to get more separated clusters: 1. Use a larger bin size (this will affect your gene activity and other downstream results). 2. If you wish to keep 500 bp bins, you can decrease the number of varFeatures, reduce the number of dimsToUse, or change the UMAP parameters (especially spread and minDist). Even then, you will notice that peaks tend to give slightly more separated clusters than bins; however, this does not mean that the clusters are necessarily more distinct. Hope this helps!
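
The options listed above could be tried roughly as in the sketch below. It assumes an existing ArchRProject named `proj`; the tile size, feature count, dimension range, and the name `IterativeLSI_5kb` are illustrative assumptions rather than values tested in this thread. Re-running addTileMatrix() with force = TRUE overwrites the existing TileMatrix in the Arrow files, so it is safer to try this on a copy of the project.

```r
library(ArchR)

# Assumption: `proj` is an existing ArchRProject backed by Arrow files.

# Option 1: rebuild the tile matrix with larger bins (overwrites the 500 bp TileMatrix).
proj <- addTileMatrix(
  input    = proj,
  tileSize = 5000,   # illustrative bin size in bp
  force    = TRUE
)

# Option 2: use fewer variable features and fewer LSI dimensions.
proj <- addIterativeLSI(
  ArchRProj   = proj,
  useMatrix   = "TileMatrix",
  name        = "IterativeLSI_5kb",  # hypothetical name for this run
  varFeatures = 15000,               # illustrative value
  dimsToUse   = 1:20,                # illustrative value
  force       = TRUE
)
```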

Brawni Feb 13, 2022

Thanks for your feedback! I played a bit with the number of iterations and var features (decreasing them seems to give worse separation), but I will give it a try with UMAP settings and bins. Do you know how to change the bin size? Also, I did try to input PeakMatrix in addIterativeLSI, but it generated an almost identical UMAP to the one with TileMatrix. Did you try this yourself? Ultimately, what strategy worked best for you?
Thanks a lot!

rcorces Feb 14, 2022
Maintainer

Thank you both for having this dialogue. Just adding a few things:
As @AnjaliC4 mentioned, ArchR uses bins and Signac uses peaks. Unless something has changed since the last time I looked at Signac, Signac also uses the peaks from the input peak x cell matrix, which are peaks called on the bulk-ified sample. It's hard to imagine how this approach would be "better" at discriminating true cell types, as it would be heavily biased towards peaks in the most prominent cell types. I think the analysis that we did in the paper comparing dimensionality reduction and clustering was quite fair and unbiased.

@Brawni - I'm not too surprised that TileMatrix and PeakMatrix give similar results, since the iterative LSI procedure selects a subset of bins, typically the most accessible elements for the first round.
