Replicating analysis by So et al. 2025
- snRNAseq data was made available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE241987
- Note: I use Rye for package management.
- Click me to see what step I'm working on now.
.
└── snRNAseq-analysis
    ├── .venv #gitignored
    ├── data #gitignored
    │   ├── cellbender
    │   │   ├── experiment_set1
    │   │   │   ├── experiment1-ANNOTATION-sample1
    │   │   │   │   ├── output_filtered.h5
    │   │   │   │   ├── output_filtered_seurat.h5
    │   │   │   │   └── #cellbender outputs...
    │   │   │   └── #...
    │   │   └── #...
    │   ├── cellranger
    │   │   ├── experiment_set1
    │   │   │   ├── experiment1
    │   │   │   │   ├── sample1
    │   │   │   │   │   └── #cellranger outputs...
    │   │   │   │   └── #...
    │   │   │   └── #...
    │   │   └── #...
    │   ├── raw
    │   │   ├── experiment1
    │   │   │   ├── sample1.fastq.gz
    │   │   │   └── #...
    │   │   └── #...
    │   └── preprocessed
    │       ├── object.pickle
    │       ├── object.Rds
    │       ├── object.h5ad
    │       ├── object.h5
    │       └── #...
    ├── plots # plots if desired
    ├── references # ref genomes, databases, etc.
    ├── src # Anything starting with a # is not yet implemented
    │   ├── Python
    │   │   ├── preprocess.py
    │   │   ├── analysis.py
    │   │   └── R.py
    │   ├── R
    │   │   ├── preprocesss.R
    │   │   ├── preprocesssing.Rmd
    │   │   ├── #analysis.R
    │   │   ├── analysis.Rmd
    │   │   └── pipeline.Rmd
    │   └── scripts
    │       ├── download_data.sh
    │       ├── cellcounting.sh
    │       ├── cellbending.sh
    │       ├── #preprocess-py.sh
    │       ├── #analyze-py.sh
    │       ├── preprocess-R.sh
    │       └── #analyze-R.sh
    ├── snRNAseq_analysis.Rproj
    ├── .gitignore
    ├── pyproject.toml # Rye project management
    ├── requirements-dev.lock # Rye project management
    └── requirements.lock # Rye project management

In this example, I have raw data from 4 experiments, GSM7747185 through GSM7747188 (each is its own condition and has 1-2 samples). These 4 experiments are grouped into 1 experiment set called paper_processed.
- PREPROCESSED DATA:
  - Download .tsv & .mtx data. This is a CellRanger output, as described here.
  - Unzip data as necessary using `gzip -d [filename].[ext].gz`. Rename the corresponding files to `barcodes.tsv`, `genes.tsv`, and `matrix.mtx`, and save to a desired directory (using the repo structure above, I used `./data/CellRanger/[filename]/`).
  - Skip to Step 3!
- UNPROCESSED DATA:
  - Pull data from the European Nucleotide Archive. Search by accession to find this site. The downloaded script is in `./data/preprocessed/PRJNA1010853-fastq.sh`, which you can run to pull the ~170GB FASTQ dataset.
  - You'll also need CellRanger to align the sequencing data.
  - You'll also need the relevant transcriptomes (authors used mm10/GRCm38, but I'm using mm39/GRCm39 as of May 2025). You can also build a custom one here following these instructions.
- Python Dependencies:
  - CellBender; follow this tutorial for more details.
    - There's an issue documented on the README of the GitHub page regarding checkpoint/saving issues on v0.3.1. A discussion on this can be found here. The suggested solution was to pull from this recent commit: `4334e8966217c3591bf7c545f31ab979cdc6590d`.
    - For rye: `rye add cellbender --git=https://github.com/broadinstitute/CellBender.git@4334e8966217c3591bf7c545f31ab979cdc6590d`
    - or normal pip installation: `pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@4334e8966217c3591bf7c545f31ab979cdc6590d`
  - PyTables
- R Dependencies:
  - anything in the scripts
  - Seurat (and other Seurat packages, see here)
  - DoubletFinder (use `chris-mcginnis-ucsf/DoubletFinder`)
  - presto (use `immunogenomics/presto`)

Example of how to install a remote GitHub package:
library(remotes)
remotes::install_github('chris-mcginnis-ucsf/DoubletFinder')
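
For the PREPROCESSED DATA route described above, the unzip-and-rename step could also be scripted. The sketch below is a minimal Python version; it assumes the downloaded files end in `barcodes.tsv.gz`, `genes.tsv.gz`, and `matrix.mtx.gz` (the actual GEO file names may differ), and the function name and directory arguments are hypothetical.

```python
import gzip
import shutil
from pathlib import Path

# Assumed file-name endings for the downloaded archives (adjust to the real names).
TARGET_NAMES = {
    "barcodes.tsv.gz": "barcodes.tsv",
    "genes.tsv.gz": "genes.tsv",
    "matrix.mtx.gz": "matrix.mtx",
}


def unpack_sample(download_dir: Path, out_dir: Path) -> list[Path]:
    """Decompress each recognised .gz file in download_dir into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for gz_path in sorted(download_dir.glob("*.gz")):
        for ending, target in TARGET_NAMES.items():
            if gz_path.name.endswith(ending):
                dest = out_dir / target
                # Stream the bytes so a large matrix is never fully in memory.
                with gzip.open(gz_path, "rb") as src, open(dest, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                written.append(dest)
    return written
```

Calling `unpack_sample(Path("downloads/sample"), Path("data/CellRanger/sample"))` would leave the three renamed files in the output directory; both paths here are placeholders.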
`cellranger count` is used to align/map FASTQ reads.
- See `cellcounting.sh`, which searches `./data/raw` for various experiments (each of which should have the fastq.gz files inside).
- Authors use CellBender to call cells, but CellRanger also has this functionality. See here for a discussion of differences.

To run `cellcounting.sh`, see the example below:
./src/scripts/cellcounting.sh -i GSM7747185-LFD_eWAT,GSM7747186-LFD_iWAT,GSM7747187-HFD_eWAT,GSM7747188-HFD_iWAT -o cellcounting.out
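
To illustrate the kind of lookup described above, here is a rough Python sketch of finding experiment folders under `./data/raw` and building one `cellranger count` call per experiment. It is not a port of `cellcounting.sh`: the function names are hypothetical, only the basic `--id`, `--transcriptome`, and `--fastqs` options are used, and the commands are built but never run. Depending on the Cell Ranger version, extra options (for example `--create-bam`) may be required.

```python
from pathlib import Path


def find_experiments(raw_dir: Path, names: list[str]) -> dict[str, list[Path]]:
    """Map each requested experiment name to the FASTQ files in its folder."""
    found = {}
    for name in names:
        fastqs = sorted((raw_dir / name).glob("*.fastq.gz"))
        if fastqs:  # skip experiments that have no FASTQ files
            found[name] = fastqs
    return found


def count_command(name: str, fastq_dir: Path, transcriptome: Path) -> list[str]:
    """Build (but do not run) a cellranger count invocation for one experiment."""
    return [
        "cellranger",
        "count",
        f"--id={name}",
        f"--transcriptome={transcriptome}",
        f"--fastqs={fastq_dir}",
    ]
```

The comma-separated `-i` argument from the example above would map to `names` via `"...".split(",")`, and each resulting list can be passed to `subprocess.run` once the transcriptome path is known.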
`cellbender remove-background` is used to clean technical artifacts from sequencing data.
- The script checks for the cellbender env. If you have one, just change the name in the script. Otherwise, the script will create one using `mamba` (change `mamba` -> `conda` if you don't have mamba).
- See `cellbending.sh`, which searches the provided directory (e.g. experiment_set1) for various experiments, each of which should have at least one sample in cellranger output format.
- To complete analysis using Seurat in R (see the tutorial), the command `ptrepack` must also be run for compatibility, which requires PyTables. This is implemented in the shell script.

To run `cellbending.sh`, see the example below:
./src/scripts/cellbending.sh -i ./data/cellranger/paper_processed -o cellbending.out
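
The two commands that the steps above describe for each sample can be sketched in Python as follows. This is only an illustration of the order of operations, not the contents of `cellbending.sh`: the function name is hypothetical, the input is assumed to be CellRanger's `raw_feature_bc_matrix.h5`, and the output names follow the `output_filtered.h5` / `output_filtered_seurat.h5` files shown in the repo structure.

```python
from pathlib import Path


def cellbender_commands(raw_h5: Path, out_dir: Path) -> list[list[str]]:
    """Return the two commands for one sample, in the order they would run."""
    output = out_dir / "output.h5"
    filtered = out_dir / "output_filtered.h5"
    seurat = out_dir / "output_filtered_seurat.h5"
    return [
        # Background removal; the filtered file is written next to output.h5.
        ["cellbender", "remove-background",
         "--input", str(raw_h5), "--output", str(output)],
        # Repack the filtered matrix so that Seurat can read it (needs PyTables).
        ["ptrepack", "--complevel", "5",
         f"{filtered}:/matrix", f"{seurat}:/matrix"],
    ]
```

Each inner list can be handed to `subprocess.run(..., check=True)` inside the cellbender environment; nothing is executed by the sketch itself.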
See the .Rmd/.ipynb file for details. Broadly, the steps are:
- Process each dataset:
  - Pull the CellBender data into Seurat.
  - Filter cells based on UMIs (counts), genes (features), mitochondrial gene ratio, and UMIs/gene.
  - Run DoubletFinder without ground truth.
  - If using Seurat, can run cell-cycle scoring & regress out those genes (and mt/rb genes).
- Pool datasets together and integrate.
- Identify graph-based clusters & visualize (PCA, UMAP, tSNE).
- Using marker genes (e.g. markers provided in the paper or from other databases), annotate clusters. May also use a cell atlas as reference.
- For improved granularity, select a `subset` of cells within a cluster, then cluster again.
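
As a concrete picture of the filtering step in the list above, the sketch below computes the four QC metrics for a single cell in plain Python and applies cutoffs to them. It is a toy illustration, not the Seurat or Scanpy code from the notebooks: the class and function names are hypothetical, the `mt-` prefix assumes mouse gene symbols, and the default thresholds are placeholders rather than values from the paper.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    n_umis: int           # total counts in the cell
    n_genes: int          # genes with at least one count
    frac_mito: float      # fraction of counts from mitochondrial genes
    umis_per_gene: float  # n_umis / n_genes


def qc_metrics(counts: dict[str, int], mito_prefix: str = "mt-") -> CellQC:
    """Compute per-cell QC metrics from a gene -> UMI count mapping."""
    n_umis = sum(counts.values())
    n_genes = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))
    return CellQC(
        n_umis=n_umis,
        n_genes=n_genes,
        frac_mito=mito / n_umis if n_umis else 0.0,
        umis_per_gene=n_umis / n_genes if n_genes else 0.0,
    )


def passes_qc(qc: CellQC, min_umis: int = 500, min_genes: int = 250,
              max_frac_mito: float = 0.10, max_umis_per_gene: float = 5.0) -> bool:
    """Apply the cutoffs; the defaults are illustrative placeholders only."""
    return (
        qc.n_umis >= min_umis
        and qc.n_genes >= min_genes
        and qc.frac_mito <= max_frac_mito
        and qc.umis_per_gene <= max_umis_per_gene
    )
```

In practice the same metrics would be computed for all cells at once on the sparse count matrix, and the cutoffs chosen per dataset after looking at the distributions.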
- Pseudobulking
  - more robust, but requires sufficient biological replicates
  - compare within cell types only
  - Options:
    - edgeR/DESeq2 for R
    - decoupler + PyDESeq2 for Python (see the decoupler tutorial)
  - Identify DEGs based on an FDR of 0.05 and an absolute fold change of 2.
- Single-cell methods
  - treats each cell as an independent replicate, even though cells from the same sample are not actually independent
  - Options:
    - Seurat's `FindMarkers`, typically with "wilcox" or "t"; check the V3 tutorial
    - MAST, see tutorial
    - Scanpy's `tl.rank_genes_groups`, see this online example
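
To make the pseudobulk idea above concrete, here is a small plain-Python sketch of the two bookkeeping steps around the statistical test: summing counts per (sample, cell type) and then applying the FDR and fold-change cutoffs to a results table. The test itself (edgeR, DESeq2, or PyDESeq2) is not reproduced; the function names and input layouts are hypothetical.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum counts over cells that share a (sample, cell_type) pair.

    `cells` is an iterable of (sample, cell_type, {gene: count}) tuples.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for sample, cell_type, counts in cells:
        for gene, count in counts.items():
            totals[(sample, cell_type)][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}


def select_degs(results, max_fdr=0.05, min_abs_log2fc=1.0):
    """Keep genes that pass both cutoffs.

    `results` maps gene -> (log2 fold change, FDR-adjusted p-value).
    An absolute fold change of 2 corresponds to |log2FC| >= 1.
    """
    return sorted(
        gene
        for gene, (log2fc, fdr) in results.items()
        if fdr < max_fdr and abs(log2fc) >= min_abs_log2fc
    )
```

Each pseudobulk profile then acts as one replicate in the downstream test, which is why enough biological replicates per condition are needed.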
The paper uses these steps, but I do my own thing kinda...
- Cluster genes within each condition with K-means using Morpheus, per the example paper.
- Pathway analysis for each gene cluster using Metascape, per the example paper.
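
Morpheus runs K-means through its web interface, so nothing below calls it. As a rough illustration of what the gene-clustering step does, here is a minimal pure-Python K-means (Lloyd's algorithm) in which each point would be one gene's expression profile. The function name and the choice of the first `k` points as starting centroids are assumptions made to keep the sketch short and deterministic.

```python
import math


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm using the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids
```

The genes sharing a label form one cluster, and each cluster's gene list is what would then go into the pathway analysis.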
I will create a .bib at some point, but until then here are some useful reads:
- SCVerse Best Practices (Python)
- BioConductor Intro to SComics (R, but not Seurat-specific)
- SeuratV3 tutorial (R) & matching paper
- CellBender (Fleming et al. 2023)
- Normalization review, well-known (Ahlmann-Eltze & Huber 2017)
- Normalization review, newer (Lytal et al. 2020)
- Doublet detection review (Xi & Li 2021)
- DoubletDetection (Gayoso et al. 2020) + documentation
- DoubletFinder (McGinnis et al. 2019) + GitHub
- scDblFinder (Germain et al. 2022)
- Integration review (Luecken et al. 2021)
- Integration metric (Lyu et al. 2024)
- Harmony integration (Korsunsky et al. 2019)
- SCTransform (regularized NB GLM) in Seurat (Hafemeister & Satija 2019)
- SCVI integration (Xu et al. 2021)
Other useful papers from the paper I won't be using:
- Issue: can't add CellBender into Rye
  - Solution: Install cellbender into a conda environment. `conda_cellbender.yml` attached for convenience.
- Issue: Rye installation of `package` led to a clang-not-found error:
  - Notes: probably occurs because Presidio doesn't have clang
  - Solved by: astral-sh/rye#836
  - Solution: `CC=$(which gcc) CXX=$(which g++) rye add [package]`
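
If the same workaround is needed from a Python script rather than the shell, a sketch along these lines could build the environment. The helper name is hypothetical, and it only prepares the variables; it does not run Rye.

```python
import os
import shutil


def compiler_env(base_env=None, which=shutil.which):
    """Return a copy of the environment with CC/CXX set to gcc/g++ when found."""
    env = dict(os.environ if base_env is None else base_env)
    for var, tool in (("CC", "gcc"), ("CXX", "g++")):
        path = which(tool)
        if path:  # leave the variable untouched if the compiler is missing
            env[var] = path
    return env
```

The result can be passed as `env=compiler_env()` to `subprocess.run(["rye", "add", "[package]"], check=True)`, with `[package]` replaced by the real package name.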