Integrating single-cell and single-nucleus datasets improves bulk RNA-seq deconvolution

This project evaluates transformation strategies that allow single-nucleus RNA-seq (snRNA-seq) data to serve as a reference for bulk RNA-seq deconvolution, and compares the resulting performance against single-cell RNA-seq (scRNA-seq) references.

Because snRNA-seq captures primarily nuclear RNA, it differs systematically from scRNA-seq, which contains both nuclear and cytoplasmic RNA. We test whether these differences degrade deconvolution performance and whether specific transformations can mitigate them. Comparisons are performed using both simulated and real bulk datasets.

Reproducing the results

1. Clone the repository

git clone https://github.com/greenelab/deconvolution_sc_sn_comparison.git
cd deconvolution_sc_sn_comparison

2. Create the conda environments

Create the required Python and R environments:

bash environments/create_envs.sh

This creates:

env_deconv (Python)
env_deconv_R (R / Bioconductor)

3. Download the data

Download each dataset into its own subdirectory of data/, named after its dataset identifier:

data/
 └── DATASET_ID/

All datasets are publicly available. Links, preprocessing steps, and metadata are provided in:

data/details/Data_Details.xlsx
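
Before launching the pipeline, it can help to confirm that every dataset directory is in place. The following is a minimal sketch, not part of the repository: the function name missing_dataset_dirs is hypothetical, and the dataset IDs are copied from the example array shown under "Adding new datasets".

```python
from pathlib import Path

# Dataset IDs copied from the example array in this README (assumption:
# these are the IDs your copy of the pipeline expects; extend as needed).
DATASET_IDS = ["ADP", "PBMC", "MBC", "MSB"]


def missing_dataset_dirs(data_root, dataset_ids=DATASET_IDS):
    """Return the dataset IDs that have no directory under data_root."""
    root = Path(data_root)
    return [ds for ds in dataset_ids if not (root / ds).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs("data")
    if missing:
        print("Missing dataset directories:", ", ".join(missing))
    else:
        print("All dataset directories found.")
```

Run it from the repository root so that the relative path data/ resolves correctly.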

4. Run the analysis pipeline

Run the following scripts in order (typically using sbatch on an HPC system).

Data preprocessing

scripts/0_preprocess_data.sh

Runs preprocessing and QC notebooks for all datasets.

Simulation pipeline

scripts/1_train_scvi_models_sim.sh

Trains scVI models (conditional and non-conditional; with and without differentially expressed (DE) genes).

scripts/2_prepare_deconvolution_sim.sh

Prepares transformed references and pseudobulks for simulations.
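
For readers unfamiliar with pseudobulks, the general idea can be sketched as follows: sample cells according to known cell-type proportions and sum their counts, so the true proportions are available for evaluation. This is a generic illustration under stated assumptions, not the code in scripts/prepare_deconvolution_sim.py; the function make_pseudobulk and its parameters are hypothetical, and the repository may sample or normalize differently.

```python
import random


def make_pseudobulk(counts, cell_types, proportions, n_cells, seed=0):
    """Sum the counts of cells sampled to match target cell-type proportions.

    counts:      list of per-cell count vectors (one list of gene counts per cell)
    cell_types:  cell-type label for each cell, in the same order as counts
    proportions: dict mapping cell type -> target fraction (should sum to 1)
    n_cells:     total number of cells to draw (sampling with replacement)
    """
    rng = random.Random(seed)
    n_genes = len(counts[0])

    # Group cell indices by their cell-type label.
    by_type = {}
    for idx, label in enumerate(cell_types):
        by_type.setdefault(label, []).append(idx)

    pseudobulk = [0] * n_genes
    for label, fraction in proportions.items():
        n_draw = round(fraction * n_cells)
        # Draw cells of this type with replacement and add up their counts.
        for idx in rng.choices(by_type[label], k=n_draw):
            for g in range(n_genes):
                pseudobulk[g] += counts[idx][g]
    return pseudobulk
```

Because the proportions are chosen up front, the deconvolution estimates for each pseudobulk can later be compared directly against them.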

scripts/3_run_bayesprism_sim.sh

Runs BayesPrism / InstaPrism deconvolution.

scripts/4_process_results_sim.sh

Processes simulation results and computes the root-mean-square error (RMSE) and Pearson correlation.
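
The two metrics named above can be written out as a short sketch. It assumes they are computed between true and estimated cell-type proportions, which this README does not state explicitly; the function names are illustrative and the repository's own implementation may differ (for example, by using NumPy or SciPy).

```python
import math


def rmse(true_vals, est_vals):
    """Root-mean-square error between two equal-length sequences."""
    n = len(true_vals)
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true_vals, est_vals)) / n)


def pearson(true_vals, est_vals):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(true_vals)
    mean_t = sum(true_vals) / n
    mean_e = sum(est_vals) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_vals, est_vals))
    var_t = sum((t - mean_t) ** 2 for t in true_vals)
    var_e = sum((e - mean_e) ** 2 for e in est_vals)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_t * var_e)


# Hypothetical proportions for three cell types in one sample.
true_props = [0.5, 0.3, 0.2]
est_props = [0.4, 0.4, 0.2]
print(rmse(true_props, est_props), pearson(true_props, est_props))
```

Lower RMSE and higher Pearson correlation both indicate estimates closer to the true proportions.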

scripts/5_results_notebook_sim.sh

Generates notebooks and figures for simulation results.

Real bulk comparison

scripts/6_comparison_with_sc_and_bulks.sh

Trains models on real bulk data and generates the comparison notebooks.
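
On a Slurm cluster, one way to enforce the order above is to chain the jobs so that each starts only after the previous one succeeds. The sketch below is an assumption about how you might do this, not something the repository provides: build_sbatch_command and submit_chain are hypothetical helpers, and the scripts may need their own #SBATCH resource settings. It relies only on the standard sbatch options --parsable and --dependency=afterok.

```python
import subprocess

# Script order as listed in this README.
PIPELINE = [
    "scripts/0_preprocess_data.sh",
    "scripts/1_train_scvi_models_sim.sh",
    "scripts/2_prepare_deconvolution_sim.sh",
    "scripts/3_run_bayesprism_sim.sh",
    "scripts/4_process_results_sim.sh",
    "scripts/5_results_notebook_sim.sh",
    "scripts/6_comparison_with_sc_and_bulks.sh",
]


def build_sbatch_command(script, after_job_id=None):
    """Build the sbatch argument list, optionally waiting on a prior job."""
    cmd = ["sbatch", "--parsable"]
    if after_job_id is not None:
        # Start only if the previous job finished with exit code 0.
        cmd.append(f"--dependency=afterok:{after_job_id}")
    cmd.append(script)
    return cmd


def submit_chain(scripts, run=subprocess.run):
    """Submit scripts in order, each depending on the one before it."""
    job_ids = []
    previous = None
    for script in scripts:
        result = run(build_sbatch_command(script, previous),
                     capture_output=True, text=True, check=True)
        # With --parsable, sbatch prints "jobid" or "jobid;cluster".
        previous = result.stdout.strip().split(";")[0]
        job_ids.append(previous)
    return job_ids
```

Calling submit_chain(PIPELINE) from the repository root submits all seven jobs at once; Slurm then holds each one until its predecessor completes successfully.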

5. Inspect results

All figures used in the manuscript are generated in the result notebooks located in:

notebooks/

Adding your own method

Adding a new transformation

Preprocess data using:

scripts/0_preprocess_data.sh

If training is required, add your code to:

scripts/train_scvi_models_allgenes.py

Add your transformation to:

scripts/prepare_deconvolution_sim.py

(see the section marked “Add your transformation here”).

Adding new datasets

Add a preprocessing notebook to:

notebooks/

Add the dataset under:

data/YOUR_DATASET_ID/

Register the new dataset ID in the relevant shell scripts, for example:

datasets=("ADP" "PBMC" "MBC" "MSB")

Change to:

datasets=("ADP" "PBMC" "MBC" "MSB" "YOUR_DATASET_ID")

For real bulk analyses, also add the dataset ID to the scripts that reference Real_ADP.

Data access and processing

Detailed information on all datasets used in this study, including download links, filtering steps, and preprocessing details, is available in:

data/details/Data_Details.xlsx
