ethanhung11/snRNAseq_analysis

snRNAseq-analysis

Replicating analysis by So et al. 2025

Repo Structure

.
└── snRNAseq-analysis
    ├── .venv #gitignored
    ├── data #gitignored
    │   ├── cellbender
    │   │   ├── experiment_set1
    │   │   │   ├── experiment1-ANNOTATION-sample1
    │   │   │   │   ├── output_filtered.h5
    │   │   │   │   ├── output_filtered_seurat.h5
    │   │   │   │   └── #cellbender outputs...
    │   │   │   └── #...
    │   │   └── #...
    │   ├── cellranger
    │   │   ├── experiment_set1
    │   │   │   ├── experiment1
    │   │   │   │   ├── sample1
    │   │   │   │   │   └── #cellranger outputs...
    │   │   │   │   └── #...
    │   │   │   └── #...
    │   │   └── #...
    │   ├── raw
    │   │   ├── experiment1
    │   │   │   ├── sample1.fastq.gz
    │   │   │   └── #...
    │   │   └── #...
    │   └── preprocessed
    │       ├── object.pickle
    │       ├── object.Rds
    │       ├── object.h5ad
    │       ├── object.h5
    │       └── #...
    ├── plots                   # plots if desired
    ├── references              # ref genomes, databases, etc.
    ├── src                     # Anything starting with a # is not yet implemented
    │   ├── Python
    │   │   ├── preprocess.py
    │   │   ├── analysis.py
    │   │   └── R.py
    │   ├── R
    │   │   ├── preprocesss.R
    │   │   ├── preprocesssing.Rmd
    │   │   ├── #analysis.R
    │   │   ├── analysis.Rmd
    │   │   └── pipeline.Rmd
    │   └── scripts
    │       ├── download_data.sh
    │       ├── cellcounting.sh
    │       ├── cellbending.sh
    │       ├── #preprocess-py.sh
    │       ├── #analyze-py.sh
    │       ├── preprocess-R.sh
    │       └── #analyze-R.sh
    ├── snRNAseq_analysis.Rproj
    ├── .gitignore
    ├── pyproject.toml          # Rye project management
    ├── requirements-dev.lock   # Rye project management
    └── requirements.lock       # Rye project management

Steps

In this example, I have raw data from 4 experiments, GSM7747185 through GSM7747188 (each is its own condition and has 1-2 samples). These 4 experiments are grouped into 1 experiment set called paper_processed.

Step 0a. Access Data

  • PREPROCESSED DATA:

    • Download .tsv & .mtx data. This is a CellRanger output, as described here.
    • Unzip data as necessary using gzip -d [filename].[ext].gz. Rename the corresponding files to barcodes.tsv, genes.tsv, and matrix.mtx, and save to a desired directory (using the repo structure above, I used ./data/CellRanger/[filename]/).
    • Skip to Step 3!
  • UNPROCESSED DATA:

    • Pull the data from the European Nucleotide Archive, searching by accession to find the study page. The download script is at ./data/preprocessed/PRJNA1010853-fastq.sh; run it to pull the ~170GB FASTQ dataset.
    • You'll also need CellRanger to align the sequencing data.
      • You'll also need the relevant transcriptomes (authors used mm10/GRCm38, but I'm using mm39/GRCm39 as of May 2025). You can also build a custom one here following these instructions.
    • Python Dependencies:
      • CellBender; see this tutorial for more details.
        rye add cellbender --git=https://github.com/broadinstitute/CellBender.git@4334e8966217c3591bf7c545f31ab979cdc6590d
        
        or normal pip installation:
        pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@4334e8966217c3591bf7c545f31ab979cdc6590d
        
      • PyTables
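
For the preprocessed-data route above, the unzip-and-rename step can also be scripted. This is a minimal sketch, not part of the repo: the function name `unpack_matrix_files` is hypothetical, and it assumes each downloaded filename contains one of the words "barcodes", "genes", or "matrix" and that the source directory holds a single sample (otherwise files would overwrite each other).

```python
import gzip
import shutil
from pathlib import Path


def unpack_matrix_files(src_dir: Path, dest_dir: Path) -> list[Path]:
    """Decompress *.gz files in src_dir and save them under the expected names."""
    # Assumption: a keyword in the downloaded filename identifies the file type.
    targets = {
        "barcodes": "barcodes.tsv",
        "genes": "genes.tsv",
        "matrix": "matrix.mtx",
    }
    dest_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for gz_path in sorted(src_dir.glob("*.gz")):
        for keyword, target_name in targets.items():
            if keyword in gz_path.name.lower():
                out_path = dest_dir / target_name
                # Stream the decompressed bytes to the renamed file.
                with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                written.append(out_path)
                break
    return written
```

Calling `unpack_matrix_files(Path("downloads"), Path("data/cellranger/sample1"))` (both paths are placeholders) would leave the three renamed files in the destination directory; the original .gz files are left in place.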

Step 0b: Prepare Analysis Dependencies

  • R Dependencies:
    • Anything used in the scripts, including:
      • Seurat (and other Seurat packages, see here)
      • DoubletFinder (use chris-mcginnis-ucsf/DoubletFinder)
      • presto (use immunogenomics/presto)

Example of how to install a remote GitHub package:
library(remotes)
remotes::install_github('chris-mcginnis-ucsf/DoubletFinder')

Step 1. Generate Count Matrix (CellRanger)

  • cellranger count is used to align/map FASTQ reads.
  • See cellcounting.sh, which searches ./data/raw for various experiments (each of which should have its fastq.gz files inside).
  • The authors use CellBender to call cells, but CellRanger also has this functionality. See here for a discussion of the differences.

To run cellcounting.sh, see the example below:
./src/scripts/cellcounting.sh -i GSM7747185-LFD_eWAT,GSM7747186-LFD_iWAT,GSM7747187-HFD_eWAT,GSM7747188-HFD_iWAT -o cellcounting.out
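
To illustrate the kind of search described above, here is a rough sketch that looks in a raw-data directory for the named experiments and builds one `cellranger count` command per experiment that contains FASTQ files. It is not the logic of cellcounting.sh itself: the function name `build_count_commands` is hypothetical, the commands are only built (not run), and recent Cell Ranger releases may require additional flags (for example `--create-bam`), so check the documentation for your version.

```python
from pathlib import Path


def build_count_commands(
    raw_dir: Path, experiments: list[str], transcriptome: Path
) -> list[list[str]]:
    """Return one `cellranger count` argument list per experiment with FASTQs."""
    commands = []
    for experiment in experiments:
        fastq_dir = raw_dir / experiment
        # Skip experiments that are missing or contain no fastq.gz files.
        if not fastq_dir.is_dir() or not any(fastq_dir.glob("*.fastq.gz")):
            continue
        commands.append(
            [
                "cellranger",
                "count",
                f"--id={experiment}",
                f"--transcriptome={transcriptome}",
                f"--fastqs={fastq_dir}",
            ]
        )
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the flags have been checked against the installed Cell Ranger version.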

Step 2. Call Cells (CellBender)

  • cellbender remove-background is used to clean technical artifacts from the sequencing data.
  • The script checks for the cellbender env. If you have one, just change the name in the script. Otherwise, the script will create one using mamba (change mamba -> conda if you don't have mamba).
  • See cellbending.sh, which searches the provided directory (e.g. experiment_set1) for various experiments, each of which should have at least one sample in cellranger output format.
  • To complete the analysis using Seurat in R (see the tutorial), the command ptrepack must also be run for compatibility, which requires PyTables. This is implemented in the shell script.

To run cellbending.sh, see the example below:
./src/scripts/cellbending.sh -i ./data/cellranger/paper_processed -o cellbending.out
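
The two commands described above (background removal, then the ptrepack step for Seurat compatibility) can be sketched as follows. This is an illustration rather than the contents of cellbending.sh: the function name `build_cellbender_commands` is hypothetical, the input path assumes the usual Cell Ranger `outs/raw_feature_bc_matrix.h5` layout, and the commands are built but not executed. The ptrepack form follows the one shown in the CellBender tutorial.

```python
from pathlib import Path


def build_cellbender_commands(sample_dir: Path, out_dir: Path) -> list[list[str]]:
    """Build the CellBender and ptrepack argument lists for one sample."""
    # Assumption: the sample directory is a standard Cell Ranger output folder.
    raw_h5 = sample_dir / "outs" / "raw_feature_bc_matrix.h5"
    output = out_dir / "output.h5"
    # CellBender writes the filtered matrix alongside the main output file.
    filtered = out_dir / "output_filtered.h5"
    seurat = out_dir / "output_filtered_seurat.h5"
    return [
        [
            "cellbender",
            "remove-background",
            "--input",
            str(raw_h5),
            "--output",
            str(output),
        ],
        # Repack the filtered file so it can be read from Seurat.
        ["ptrepack", "--complevel", "5", f"{filtered}:/matrix", f"{seurat}:/matrix"],
    ]
```

The output names match the `output_filtered.h5` and `output_filtered_seurat.h5` files shown in the repo structure.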

Step 3. Data Pre-Processing

See the .Rmd/.ipynb file for details. Broadly, the steps are:

  1. Process each dataset:
    1. Pull the CellBender data into Seurat.
    2. Filter samples based on UMIs (counts), genes (features), mitochondrial gene ratio, and UMIs/gene.
    3. Run DoubletFinder without ground truth.
    4. If using Seurat, optionally run cell-cycle scoring and regress out those genes (and mt/rb genes).
  2. Pool datasets together and integrate.
  3. Identify graph-based clusters & visualize (PCA, UMAP, tSNE).
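
The per-cell filtering step above can be sketched in plain Python to show which metrics are involved. This is only an illustration, not the repo's Seurat code: the names `QCThresholds`, `cell_metrics`, and `passes_qc` are hypothetical, the threshold values are placeholders to be tuned per dataset, and the `mt-` prefix assumes mouse gene symbols.

```python
from dataclasses import dataclass


@dataclass
class QCThresholds:
    # Placeholder values for illustration only; tune these per dataset.
    min_umis: int = 500
    min_genes: int = 250
    max_mito_ratio: float = 0.10
    max_umis_per_gene: float = 5.0


def cell_metrics(counts: dict[str, int], mito_prefix: str = "mt-") -> dict[str, float]:
    """Compute the four QC metrics for one cell from its gene -> UMI counts."""
    n_umis = sum(counts.values())
    n_genes = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))
    return {
        "n_umis": n_umis,
        "n_genes": n_genes,
        "mito_ratio": mito / n_umis if n_umis else 0.0,
        "umis_per_gene": n_umis / n_genes if n_genes else 0.0,
    }


def passes_qc(counts: dict[str, int], thresholds: QCThresholds) -> bool:
    """Return True when the cell satisfies every threshold."""
    m = cell_metrics(counts)
    return (
        m["n_umis"] >= thresholds.min_umis
        and m["n_genes"] >= thresholds.min_genes
        and m["mito_ratio"] <= thresholds.max_mito_ratio
        and m["umis_per_gene"] <= thresholds.max_umis_per_gene
    )
```

In practice the same cutoffs would be applied to the Seurat object's metadata columns; the sketch only makes the four criteria explicit.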

Step 4a. Cluster Identification (Seurat)

  • Using marker genes (e.g. markers provided in the paper or from other databases), annotate clusters. A cell atlas may also be used as a reference.
  • For improved granularity, select a subset of cells within a cluster, then cluster again.
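
As a rough illustration of marker-based annotation, the sketch below scores each cluster by the mean expression of each cell type's marker genes and assigns the best-scoring label. It is a simplification and not the method used in the paper or in the repo's Seurat code: the function name `annotate_clusters` is hypothetical, and the inputs (per-cluster mean expression and a marker dictionary) are assumed to have been computed elsewhere.

```python
def annotate_clusters(
    cluster_means: dict[str, dict[str, float]],
    markers: dict[str, list[str]],
) -> dict[str, str]:
    """Assign each cluster the cell type whose markers have the highest mean expression."""
    labels = {}
    for cluster, means in cluster_means.items():
        scores = {}
        for cell_type, genes in markers.items():
            if not genes:
                continue  # Skip cell types with no marker genes listed.
            # Genes absent from the cluster's table count as zero expression.
            scores[cell_type] = sum(means.get(g, 0.0) for g in genes) / len(genes)
        if scores:
            labels[cluster] = max(scores, key=scores.get)
    return labels
```

A label picked this way is only a starting point; ambiguous clusters still need manual inspection or sub-clustering as described above.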

Step 4b. Differentially Expressed Genes (DEGs) & GSEA/Pathway Analysis

The paper uses the steps below; my own analysis deviates from them somewhat.

  • Cluster genes within each condition with K-means using Morpheus, per the example paper.
  • Pathway analysis for each gene cluster using Metascape, per the example paper.

Step 4c. Cell-Cell Interactions (NicheNet & CellChat)

Note

I'm here!

Bibliography & Readings

I will create a .bib at some point, but until then here are some useful reads:

Overviews / Tutorials

Cell Calling

Preprocessing

DEGs

GSEA

C2C

Other useful papers from the paper I won't be using:

A list of other problems I ran into:

  • Issue: can't add CellBender into Rye
    • Solution: Install cellbender into a conda environment. conda_cellbender.yml attached for convenience.
  • Issue: Rye installation of package led to clang not found error:
    • Notes: probably occurs because Presidio doesn't have clang
    • Solved by: astral-sh/rye#836
    • Solution: CC=$(which gcc) CXX=$(which g++) rye add [package]
