The original Scissor implementation can become memory-intensive on large-scale single-cell datasets, particularly in the network-regularized workflow, because several steps materialize large dense objects; as a result, Out-of-Memory (OOM) errors may occur when cell numbers become very large.
LargeScissor

I spent half of the Chinese New Year holiday on LargeScissor and did essentially three things: first, Seurat v5 compatibility; second, preventing dense gene expression matrices; third, keeping the network matrix strictly sparse. It is only a small amount of incremental work. Thank you, everyone!
Sincere thanks to Sun and others, the original authors of Scissor.
LargeScissor is now maintained with AI-assisted updates to preserve compatibility with the evolving upstream Scissor software environment, including Seurat and related tools.
LargeScissor v1.2 is a Scissor-compatible fork focused on three practical goals:
- making large-scale runs more memory-aware
- improving compatibility with modern Seurat objects
- fixing several result-affecting issues in the original workflow
The package keeps the original phenotype-guided cell selection interface while making the implementation safer for current single-cell analysis pipelines.
In one practical sc_use benchmark on a host with 1.0 TiB RAM and dual AMD EPYC 7402 CPUs (96 hardware threads), the single-cell input contained 151,837 cells, the bulk cohort contained 479 samples, and the bulk-single-cell overlap contained 2,160 genes. On this task, the upstream Scissor code with only a minimal Seurat v5 access patch completed in 3 h 36 min with a peak RSS of about 786.4 GiB, whereas the current LargeScissor code completed the same run using 24 threads in 14 min 19 s with a peak RSS of about 19.2 GiB.
Install the current GitHub release with:
remotes::install_github("lxpsxx/LargeScissor@v1.2.0")

library(LargeScissor)
library(Seurat)
# sc_dataset: a preprocessed Seurat object with an RNA_snn graph
# bulk_dataset: gene x sample bulk expression matrix
# phenotype: numeric vector, binary phenotype, or survival matrix
sc_dataset <- NormalizeData(sc_dataset, verbose = FALSE)
sc_dataset <- FindVariableFeatures(sc_dataset, verbose = FALSE)
sc_dataset <- ScaleData(sc_dataset, verbose = FALSE)
sc_dataset <- RunPCA(sc_dataset, features = VariableFeatures(sc_dataset), verbose = FALSE)
sc_dataset <- FindNeighbors(sc_dataset, dims = 1:10, verbose = FALSE)
fit <- Scissor(
  bulk_dataset = bulk_dataset,
  sc_dataset = sc_dataset,
  phenotype = phenotype,
  family = "cox",
  alpha = 0.05,
  Save_file = "Scissor_inputs.RData",
  Mthread = TRUE,
  Mcore = 8
)
str(fit$Scissor_pos)
str(fit$Scissor_neg)
head(fit$Coefs)

Key changes in LargeScissor v1.2:

- Added compatibility with Seurat v5 / Assay5 by reading RNA expression from the `data` layer when appropriate
- Preserved the cell-cell graph as a sparse matrix instead of forcing an early dense conversion
- Reduced avoidable dense operations in the preprocessing path where possible
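The layer-aware access pattern can be sketched as follows. `get_rna_data()` is an illustrative helper, not the package's internal function, and it assumes SeuratObject (v5) and Matrix are installed:

```r
library(Matrix)

# Hypothetical helper mirroring the change described above: read normalized
# RNA expression from an Assay5 layer (Seurat v5) or a legacy Assay slot
# (Seurat v3/v4), and keep the result sparse throughout.
get_rna_data <- function(object) {
  assay <- object[["RNA"]]
  if (inherits(assay, "Assay5")) {
    m <- SeuratObject::LayerData(assay, layer = "data")    # Seurat v5 path
  } else {
    m <- SeuratObject::GetAssayData(assay, slot = "data")  # Seurat v3/v4 path
  }
  as(m, "CsparseMatrix")  # avoid as.matrix(): a dense genes x cells copy is what causes OOM
}
```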
- Corrected network penalty alignment for `gaussian` and `cox`
  - In the original implementation, the R wrapper padded the network matrix with an extra intercept-like dimension even though the solver optimized only the original cell coefficients
  - LargeScissor v1.2 keeps the network penalty on the same cell dimension as the optimized coefficient vector
  - `binomial` still retains one explicit intercept dimension because the logistic solver uses an augmented design matrix
- Corrected binomial phenotype preprocessing
  - Two-level factor input is now converted to `0/1` instead of `1/2`
  - Logical and binary numeric inputs are also normalized to a stable binary response
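A minimal sketch of the binary normalization described above (the helper name is hypothetical, not the package's internal function):

```r
# Hypothetical helper: map two-level factors, logicals, and 0/1 numerics
# onto a stable 0/1 integer response for the binomial family.
normalize_binary <- function(phenotype) {
  if (is.factor(phenotype)) {
    stopifnot(nlevels(phenotype) == 2)
    as.integer(phenotype) - 1L            # factor codes 1/2 -> 0/1
  } else if (is.logical(phenotype)) {
    as.integer(phenotype)                 # FALSE/TRUE -> 0/1
  } else {
    stopifnot(all(phenotype %in% c(0, 1)))
    as.integer(phenotype)                 # already binary numeric
  }
}

normalize_binary(factor(c("tumor", "normal", "tumor")))  # 1 0 1
```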
- Added proper continuous phenotype handling for `gaussian`
  - Continuous phenotypes are now treated as numeric responses rather than grouped categories
  - `tag` is optional for `gaussian` and is used only for summary reporting when appropriate
- Normalized `family` handling in both `Scissor()` and `reliability.test()`
  - This removes the invalid default-vector behavior inherited from the original package
- Added checkpoint safety for `Load_file`
  - Saved preprocessing inputs now record the original regression family
  - Reloading the same checkpoint with a mismatched `family` now fails early with a clear message
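The fail-early behavior amounts to the following pattern (the function name is hypothetical; in the package, the saved family would be stored alongside the checkpointed inputs):

```r
# Hypothetical sketch of the Load_file consistency check: compare the family
# recorded at save time against the family requested at reload time.
check_family <- function(saved_family, requested_family) {
  if (!identical(saved_family, requested_family)) {
    stop("Checkpoint was saved with family = '", saved_family,
         "' but this run requested family = '", requested_family, "'.",
         call. = FALSE)
  }
  invisible(TRUE)
}

check_family("cox", "cox")         # passes silently
# check_family("cox", "binomial")  # stops early with a clear message
```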
- Added optional multiprocessing
  - `Scissor()` and `reliability.test()` now accept `Mthread` and `Mcore`
  - Cross-validation folds and permutation loops can run through `parallel::mclapply()` on Unix-like systems
  - Unsupported platforms or disabled multiprocessing fall back to serial execution
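The dispatch logic is roughly the following pattern (the dispatcher name is hypothetical):

```r
# Hypothetical dispatcher: fork-based parallelism via parallel::mclapply()
# on Unix-like systems, plain lapply() on Windows or when disabled.
run_parallel <- function(items, fun, Mthread = FALSE, Mcore = 2L) {
  if (isTRUE(Mthread) && .Platform$OS.type == "unix") {
    parallel::mclapply(items, fun, mc.cores = Mcore)
  } else {
    lapply(items, fun)  # serial fallback
  }
}

unlist(run_parallel(1:4, function(i) i^2))  # 1 4 9 16
```

Because `mclapply()` forks the parent process, each worker inherits a view of large objects such as `X` and `network`; copy-on-write keeps this cheap until a worker modifies them.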
- Preserved cell names in returned coefficients
  - `Coefs` now keeps the column identity from `X`, which makes downstream comparison and visualization more reliable
LargeScissor v1.2 was validated with focused installation checks and smoke tests.
- `R CMD INSTALL` completed successfully
- `family` normalization in `Scissor()` and `reliability.test()` was verified
- factor-to-`0/1` conversion for `binomial` phenotypes was verified
- continuous phenotype handling for `gaussian` was verified
- `Load_file` family consistency checking was verified
- serial and parallel smoke tests matched for: `APML1` `gaussian`, `APML1` `binomial`, `APML1` `cox`, and `reliability.test()` with `gaussian`
LargeScissor v1.2 improves the original package substantially, but it is still an incremental fork rather than a full redesign of the Scissor pipeline.
- Quantile normalization still requires a dense expression block and can remain memory-intensive on very large inputs
- Multiprocessing is process-based rather than thread-based, so large `X` and `network` objects can still increase memory pressure in worker processes
- Alpha-grid parallelization has not been added in this release
- The original graph-regularized design in Scissor remains a key strength for stabilizing related cells. On atlas-scale data, however, many near-redundant cells accumulate within a much denser SNN graph, and the binarized neighborhood structure can encourage smoother coefficient sharing across broad local neighborhoods rather than sharply isolated single-cell selection. In such settings, a metacell-first workflow is often a pragmatic strategy before running Scissor or LargeScissor. This is consistent with upstream discussions of large matrices and high RAM usage, and with the original author's public suggestion to merge cells into pseudo-cells or metacells before running Scissor.
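One pragmatic pre-aggregation sketch using standard Seurat calls; the resolution value is purely illustrative and not a recommendation from the upstream authors:

```r
# Illustrative metacell-style pre-aggregation: over-cluster at high resolution
# so each cluster is a small group of near-redundant cells, then aggregate
# counts per cluster into pseudo-cells.
sc_meta <- FindClusters(sc_dataset, resolution = 20, verbose = FALSE)
sc_meta <- AggregateExpression(sc_meta, assays = "RNA", return.seurat = TRUE)
# Re-run normalization, PCA, and FindNeighbors on sc_meta before Scissor().
```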
LargeScissor is built on top of the original Scissor project and should be understood as a compatibility-focused, memory-aware fork rather than a new method.
- Original Scissor repository: https://github.com/sunduanchen/Scissor
- Original tutorial: https://sunduanchen.github.io/Scissor/vignettes/Scissor_Tutorial.html
If you use LargeScissor in academic work, please cite the original Scissor paper and the LargeScissor repository as follows:
- Scissor: Sun, D., Guan, X., Moran, A. E., Wu, L. Y., Qian, D. Z., Schedin, P., Dai, M. S., Danilov, A. V., Alumkal, J. J., Adey, A. C., Spellman, P. T., & Xia, Z. (2021). Identifying phenotype-associated subpopulations by integrating bulk and single-cell sequencing data. Nature Biotechnology. https://doi.org/10.1038/s41587-021-01091-3
- LargeScissor: lxpsxx. LargeScissor: Optimized Scissor for Large-Scale Single-Cell Data. GitHub repository, version v1.2.0, 2026. https://github.com/lxpsxx/LargeScissor
LargeScissor is released under the GNU General Public License v3.0, consistent with the upstream Scissor project.