-
This is an error in the documentation that is slated to be fixed in the next release. Iterative LSI is actually deterministic. You can confirm this for yourself:
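For example, something along these lines (a rough sketch, assuming a loaded ArchRProject called `proj`; the reduction names "LSI_run1"/"LSI_run2" are just placeholders) — run `addIterativeLSI()` twice with identical parameters and compare the embeddings:

```r
library(ArchR)

# Run iterative LSI twice with identical parameters and an explicit seed,
# storing the results under two different names.
proj <- addIterativeLSI(ArchRProj = proj, useMatrix = "TileMatrix",
                        name = "LSI_run1", seed = 1, force = TRUE)
proj <- addIterativeLSI(ArchRProj = proj, useMatrix = "TileMatrix",
                        name = "LSI_run2", seed = 1, force = TRUE)

# If iterative LSI is deterministic, the two embeddings are identical.
identical(getReducedDims(proj, reducedDims = "LSI_run1"),
          getReducedDims(proj, reducedDims = "LSI_run2"))
```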
-
Hello, thank you for the great software + documentation!
I wanted to ask about the non-deterministic behaviour of iterative LSI (https://www.archrproject.com/bookdown/iterative-latent-semantic-indexing-lsi.html). Although the run-to-run differences are small when I keep all parameters the same, the randomness introduced here means that downstream analysis is not reproducible. We (and, I imagine, other groups) are very keen to build fully reproducible analysis pipelines, and that does not seem possible with this way of running dimensionality reduction. The problem is all the more prominent because dimensionality reduction is one of the first steps in the workflow.
My understanding is that TF-IDF normalisation, SVD dimensionality reduction, and graph-based clustering are not inherently non-deterministic (I believe Signac runs these steps deterministically). Is the subsampling of cells the step that introduces randomness into iterative LSI, or is it another step? And is there any way to circumvent or fix this randomness so that the results are reproducible?
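To illustrate what I mean, here is a toy sketch (my own illustration, not ArchR's actual implementation) showing that plain TF-IDF followed by SVD has no hidden randomness — two runs on the same input produce identical embeddings:

```r
# Toy feature-by-cell binary matrix (the seed is only used to build
# reproducible input data, not by the TF-IDF/SVD steps themselves).
set.seed(42)
mat <- matrix(rbinom(200 * 50, 1, 0.3), nrow = 200)

tfidf_svd <- function(m, k = 10) {
  tf  <- t(t(m) / colSums(m))           # term frequency per cell
  idf <- log(1 + ncol(m) / rowSums(m))  # up-weight rarer features
  svd(tf * idf, nu = 0, nv = k)$v       # base-R (LAPACK) SVD, deterministic
}

identical(tfidf_svd(mat), tfidf_svd(mat))  # TRUE -- no hidden RNG use
```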
Iterative LSI is really fantastic and my UMAP visualisations look much better using ArchR's implementation as opposed to Signac's dimensionality reduction, so I am really keen to understand how it works and try to implement it in a reproducible workflow.
I appreciate any insights, thank you!