snRNAseq-analysis

Replicating analysis by So et al. 2025

snRNAseq data was made available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE241987
Note: I use Rye for package management.
Click me to see which step I'm working on now.

Repo Structure

.
└── snRNAseq-analysis
    ├── .venv #gitignored
    ├── data #gitignored
    │   ├── cellbender
    │   │   ├── experiment_set1
    │   │   │   ├── experiment1-ANNOTATION-sample1
    │   │   │   │   ├── output_filtered.h5
    │   │   │   │   ├── output_filtered_seurat.h5
    │   │   │   │   └── #cellbender outputs...
    │   │   │   └── #...
    │   │   └── #...
    │   ├── cellranger
    │   │   ├── experiment_set1
    │   │   │   ├── experiment1
    │   │   │   │   ├── sample1
    │   │   │   │   │   └── #cellranger outputs...
    │   │   │   │   └── #...
    │   │   │   └── #...
    │   │   └── #...
    │   ├── raw
    │   │   ├── experiment1
    │   │   │   ├── sample1.fastq.gz
    │   │   │   └── #...
    │   │   └── #...
    │   └── preprocessed
    │       ├── object.pickle
    │       ├── object.Rds
    │       ├── object.h5ad
    │       ├── object.h5
    │       └── #...
    ├── plots                   # plots if desired
    ├── references              # ref genomes, databases, etc.
    ├── src                     # Anything starting with a # is not yet implemented
    │   ├── Python
    │   │   ├── preprocess.py
    │   │   ├── analysis.py
    │   │   └── R.py
    │   ├── R
    │   │   ├── preprocesss.R
    │   │   ├── preprocesssing.Rmd
    │   │   ├── #analysis.R
    │   │   ├── analysis.Rmd
    │   │   └── pipeline.Rmd
    │   └── scripts
    │       ├── download_data.sh
    │       ├── cellcounting.sh
    │       ├── cellbending.sh
    │       ├── #preprocess-py.sh
    │       ├── #analyze-py.sh
    │       ├── preprocess-R.sh
    │       └── #analyze-R.sh
    ├── snRNAseq_analysis.Rproj
    ├── .gitignore
    ├── pyproject.toml          # Rye project management
    ├── requirements-dev.lock   # Rye project management
    └── requirements.lock       # Rye project management

Steps

In this example, I have raw data from 4 experiments, GSM7747185 through GSM7747188 (each is its own condition and has 1-2 samples). These 4 experiments are grouped into 1 experiment set called paper_processed.

Step 0a. Access Data

PREPROCESSED DATA:
- Download .tsv & .mtx data. This is a CellRanger output, as described here.
- Unzip data as necessary using gzip -d [filename].[ext].gz. Rename the corresponding files to barcodes.tsv, genes.tsv, and matrix.mtx, and save them to a desired directory (following the repo structure above, I used ./data/cellranger/[filename]/).
- Skip to Step 3!
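
The unzip-and-rename step above can also be scripted. This is a minimal Python sketch, not part of the repo: the function name `unpack_sample` and the suffix table are my own, and it assumes each download ends in `barcodes.tsv.gz`, `genes.tsv.gz` (or `features.tsv.gz`), or `matrix.mtx.gz`. Check the actual file names before relying on it.

```python
import gzip
import shutil
from pathlib import Path

# Assumed download suffixes mapped to the standard names used in this README.
SUFFIX_TO_TARGET = {
    "barcodes.tsv.gz": "barcodes.tsv",
    "genes.tsv.gz": "genes.tsv",
    "features.tsv.gz": "genes.tsv",  # assumption: some exports name this file "features"
    "matrix.mtx.gz": "matrix.mtx",
}


def unpack_sample(src_dir, dest_dir):
    """Decompress the three matrix files in src_dir into dest_dir under standard names."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in sorted(src_dir.glob("*.gz")):
        target = next(
            (name for suffix, name in SUFFIX_TO_TARGET.items()
             if archive.name.endswith(suffix)),
            None,
        )
        if target is None:
            continue  # not one of the three matrix files; leave it alone
        # Stream the decompressed bytes so large matrices never sit fully in memory.
        with gzip.open(archive, "rb") as fin, open(dest_dir / target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(target)
    return written
```

Calling `unpack_sample(download_dir, "./data/cellranger/[filename]")` once per sample leaves one directory per sample, each holding the three files.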
UNPROCESSED DATA:
- Pull data from the European Nucleotide Archive. Search by accession to find this site. The download script is in ./data/preprocessed/PRJNA1010853-fastq.sh, which you can run to pull the ~170GB FASTQ dataset.
- You'll also need CellRanger to align the sequencing data.
  - You'll also need the relevant transcriptomes (authors used mm10/GRCm38, but I'm using mm39/GRCm39 as of May 2025). You can also build a custom one here following these instructions.
- Python Dependencies:
  - CellBender; follow this tutorial for more details.
    - There's an issue documented in the README of the GitHub page regarding checkpoint/saving problems in v0.3.1. A discussion of this can be found here. The suggested solution was to pull from this recent commit: 4334e8966217c3591bf7c545f31ab979cdc6590d:
      For rye:
```
rye add cellbender --git=https://github.com/broadinstitute/CellBender.git@4334e8966217c3591bf7c545f31ab979cdc6590d
```
    or normal pip installation:
```
pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@4334e8966217c3591bf7c545f31ab979cdc6590d
```
  - PyTables

Step 0b: Prepare Analysis Dependencies

R Dependencies:
- Anything loaded in the scripts, including:
  - Seurat (and other Seurat packages, see here)
  - DoubletFinder (use chris-mcginnis-ucsf/DoubletFinder)
  - presto (use immunogenomics/presto)

Example of how to install a remote GitHub package:

library(remotes)
remotes::install_github('chris-mcginnis-ucsf/DoubletFinder')

Step 1. Generate Count Matrix (CellRanger)

cellranger count is used to align/map the FASTQ reads.
See cellcounting.sh, which searches ./data/raw for the listed experiments (each of which should have its fastq.gz files inside).
The authors use CellBender to call cells, but CellRanger also has this functionality. See here for a discussion of the differences. To run cellcounting.sh, see the example below:

./src/scripts/cellcounting.sh -i GSM7747185-LFD_eWAT,GSM7747186-LFD_iWAT,GSM7747187-HFD_eWAT,GSM7747188-HFD_iWAT -o cellcounting.out
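
For readers who prefer to see the loop spelled out, here is a hedged Python sketch of what a wrapper like cellcounting.sh might do: find experiments under ./data/raw that contain FASTQ files and build one `cellranger count` call for each. The function name `build_count_commands` is hypothetical and the real script may differ. Only the `--id`, `--transcriptome`, and `--fastqs` flags are shown, so add any other flags your CellRanger version requires.

```python
from pathlib import Path


def build_count_commands(raw_dir, experiments, transcriptome):
    """Return one `cellranger count` argument list per experiment that has FASTQs."""
    commands = []
    for name in experiments:
        fastq_dir = Path(raw_dir) / name
        # Skip experiments whose directory is missing or holds no FASTQ files.
        if not fastq_dir.is_dir() or not any(fastq_dir.glob("*.fastq.gz")):
            continue
        commands.append([
            "cellranger", "count",
            f"--id={name}",                      # output folder name
            f"--transcriptome={transcriptome}",  # path to the reference (e.g. a GRCm39 build)
            f"--fastqs={fastq_dir}",             # directory with this experiment's reads
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Building the commands separately from running them makes it easy to print and inspect them first.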

Step 2. Call Cells (CellBender)

cellbender remove-background is used to clean technical artifacts from the sequencing data.
The script checks for the cellbender env. If you have one, just change the name in the script. Otherwise, the script will create one using mamba (change mamba -> conda if you don't have mamba).
See cellbending.sh, which searches the provided directory (e.g. experiment_set1) for experiments, each of which should have at least one sample in CellRanger output format.
To complete the analysis using Seurat in R (see the tutorial), the command ptrepack must also be run for compatibility, which requires PyTables. This is implemented in the shell script. To run cellbending.sh, see the example below:

./src/scripts/cellbending.sh -i ./data/cellranger/paper_processed -o cellbending.out
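
The two commands involved per sample can be sketched as below. This is an illustration under assumptions, not the contents of cellbending.sh: the function name `build_cellbender_commands` is hypothetical, the input is assumed to be CellRanger's raw .h5 matrix, and the ptrepack call follows the form given in the CellBender tutorial for Seurat compatibility.

```python
from pathlib import Path


def build_cellbender_commands(raw_h5, out_dir, use_cuda=False):
    """Return [remove-background command, ptrepack command] for one sample."""
    out_dir = Path(out_dir)
    remove_background = [
        "cellbender", "remove-background",
        "--input", str(raw_h5),
        "--output", str(out_dir / "output.h5"),
    ]
    if use_cuda:
        remove_background.append("--cuda")  # run on a GPU if one is available
    # CellBender writes output_filtered.h5 next to output.h5; ptrepack rewrites
    # it into a layout that Seurat can read.
    repack = [
        "ptrepack", "--complevel", "5",
        f"{out_dir / 'output_filtered.h5'}:/matrix",
        f"{out_dir / 'output_filtered_seurat.h5'}:/matrix",
    ]
    return [remove_background, repack]
```

The resulting file names match the output_filtered.h5 and output_filtered_seurat.h5 entries in the repo structure above.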

Step 3. Data Pre-Processing

See the .Rmd/.ipynb file for details. Broadly, the steps are:

Process each dataset:
1. Pull the CellBender data into Seurat.
2. Filter cells based on UMIs (counts), genes (features), mitochondrial gene ratio, and UMIs/gene.
3. Run DoubletFinder without ground truth.
4. If using Seurat, you can run cell-cycle scoring & regress out those genes (and mt/rb genes).
Pool datasets together and integrate.
Identify graph-based clusters & visualize (PCA, UMAP, tSNE).
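
To make the filtering step concrete, here is a small pure-Python sketch of the four per-cell QC metrics and a threshold check. It is not the code in the .Rmd. The function names and every default threshold are placeholders I made up for illustration, and "UMIs/gene" is taken literally as total UMIs divided by detected genes. The "mt-" prefix is the usual naming for mouse mitochondrial genes.

```python
def qc_metrics(counts, mito_prefix="mt-"):
    """Compute QC metrics for one cell from a {gene: UMI count} mapping."""
    n_umis = sum(counts.values())
    n_genes = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items() if gene.lower().startswith(mito_prefix))
    return {
        "n_umis": n_umis,
        "n_genes": n_genes,
        "mito_ratio": mito / n_umis if n_umis else 0.0,
        "umis_per_gene": n_umis / n_genes if n_genes else 0.0,
    }


def passes_qc(metrics, min_umis=500, min_genes=250,
              max_mito_ratio=0.10, max_umis_per_gene=5.0):
    """Placeholder thresholds; tune them per dataset after plotting the distributions."""
    return (
        metrics["n_umis"] >= min_umis
        and metrics["n_genes"] >= min_genes
        and metrics["mito_ratio"] <= max_mito_ratio
        and metrics["umis_per_gene"] <= max_umis_per_gene
    )
```

In practice these metrics come from the Seurat or Scanpy object rather than from plain dictionaries; the sketch only shows what each filter measures.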

Step 4a. Cluster Identification (Seurat)

Using marker genes (e.g. markers provided in the paper or from other databases), annotate clusters. You may also use a cell atlas as a reference.
For improved granularity, select a subset of cells within a cluster, then cluster again.
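
As a rough illustration of marker-based annotation, the sketch below scores each cluster by the mean expression of each cell type's markers and assigns the highest-scoring type. It is a simplification, not the Seurat workflow: the function name `annotate_clusters` is hypothetical, and the marker lists in the usage example are common illustrative choices, not the paper's marker set.

```python
def annotate_clusters(cluster_means, markers):
    """Label each cluster with the cell type whose markers have the highest mean expression.

    cluster_means: {cluster: {gene: mean expression in that cluster}}
    markers:       {cell_type: [marker genes]}
    """
    labels = {}
    for cluster, means in cluster_means.items():
        scores = {
            cell_type: sum(means.get(gene, 0.0) for gene in genes) / len(genes)
            for cell_type, genes in markers.items()
        }
        labels[cluster] = max(scores, key=scores.get)
    return labels


# Illustrative usage with made-up expression values.
example_markers = {"adipocyte": ["Adipoq", "Plin1"], "immune": ["Ptprc"]}
example_means = {
    "0": {"Adipoq": 3.0, "Plin1": 2.0, "Ptprc": 0.1},
    "1": {"Adipoq": 0.2, "Ptprc": 2.5},
}
example_labels = annotate_clusters(example_means, example_markers)
```

A score like this is only a first pass. Ambiguous clusters are the ones worth sub-clustering as described above.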

Step 4b. Differentially Expressed Genes (DEGs) & GSEA/Pathway Analysis

Pseudobulking
- more robust, but requires sufficient biological replicates
- compare within cell types only
- Options:
  - edgeR/DESeq2 for R
  - decoupler + PyDESeq2 for Python (see the decoupler tutorial)
- Identify DEGs using an FDR cutoff of 0.05 and an absolute fold change (FC) of 2.
Single-cell methods
- treat each cell as an independent sample, even though cells from the same sample are not truly independent
- Options:
  - Seurat's FindMarkers, typically with "wilcox" or "t"; check the V3 tutorial
  - MAST, see tutorial
  - Scanpy's tl.rank_genes_groups, see this online example
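
The DEG cutoff listed under pseudobulking (FDR of 0.05, absolute fold change of 2) can be sketched as follows. This is a minimal illustration rather than what edgeR, DESeq2, or PyDESeq2 do internally: the function names are my own, the Benjamini-Hochberg adjustment is a plain implementation, and I assume "2 FC" means a linear fold change, i.e. |log2FC| >= 1.

```python
import math


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (FDR) in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(results, fdr=0.05, min_abs_fc=2.0):
    """results: list of (gene, log2 fold change, raw p-value) tuples."""
    padj = benjamini_hochberg([p for _, _, p in results])
    min_abs_log2fc = math.log2(min_abs_fc)  # a fold change of 2 is a log2FC of 1
    return [
        gene
        for (gene, log2fc, _), q in zip(results, padj)
        if q < fdr and abs(log2fc) >= min_abs_log2fc
    ]
```

The R and Python packages above already report adjusted p-values, so in practice only the final filter is needed.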

The paper uses these steps, but I partly do my own thing:

Cluster genes within each condition with K-means using Morpheus, per the example paper.
Pathway analysis for each gene cluster using Metascape, per the example paper.
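
Morpheus is a web tool, so there is nothing to script there, but the idea behind K-means clustering of genes can be shown in a few lines. The sketch below is a bare-bones Lloyd's algorithm over gene expression profiles (one row per gene, one column per sample or condition). It is not Morpheus's implementation: the function name `kmeans` is hypothetical, and the deterministic "first k rows" initialization is a simplification chosen for reproducibility.

```python
import math


def kmeans(profiles, k, n_iter=50):
    """Cluster rows of `profiles` (lists of floats) into k groups; returns one label per row."""
    centroids = [list(p) for p in profiles[:k]]  # simplistic init: first k rows
    labels = None
    for _ in range(n_iter):
        # Assignment step: each gene goes to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # converged: no gene changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Each resulting gene cluster is then the input list for a pathway tool such as Metascape.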

Step 4c. Cell-Cell Interactions (NicheNet & CellChat)

Note

I'm here!

Bibliography & Readings

I will create a .bib at some point, but until then here are some useful reads:

Overviews / Tutorials

Cell Calling

CellRanger (Lun et al. 2019)

Preprocessing

DEGs

GSEA

C2C

Other useful papers from the paper I won't be using:

A list of other problems I ran into:

Issue: can't add CellBender into Rye
- Solution: Install cellbender into a conda environment. conda_cellbender.yml attached for convenience.
Issue: Rye installation of package led to clang not found error:
- Notes: probably occurs because Presidio doesn't have clang
- Solved by: astral-sh/rye#836
- Solution: CC=$(which gcc) CXX=$(which g++) rye add [package]

ethanhung11/snRNAseq_analysis