University of Bristol – MSc Data Science 2025
Group Project: Orestas Dulinskas, Adrian Dinulescu, Elena Bettison, Elizabeth Williams
This project investigates the transcriptional dynamics of mouse gastrulation using Single-Cell RNA-Sequencing (scRNA-seq) data from two public datasets. It aims to:
- Uncover the gene expression changes that drive early embryonic development
- Integrate datasets from different technologies using SCVI
- Identify differentially expressed genes (DEGs) across time points
- Provide a no-code interface using Chatmol, enabling domain experts to run the pipeline using natural language
Covers:
- Biological background and motivation
- Data acquisition and preprocessing
- Dimensionality reduction and batch correction
- Differential expression analysis
- Chatmol integration and interface testing
- Results, insights, and future work
This Python script integrates the project's core functionality into the Chatmol framework. It defines callable functions for:
- Data preprocessing
- Exploratory dimensionality reduction
- Dataset integration using SCVI
- Differential gene expression (DEG) analysis
Chatmol interprets natural language prompts and triggers these functions.
See the technical report for more detail.
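The idea of exposing pipeline steps as callable functions can be sketched without any dependencies. The example below is a minimal illustration, not Chatmol's actual mechanism: Chatmol relies on an LLM to choose the function, whereas this sketch substitutes simple keyword matching. The names `REGISTRY`, `register`, `dispatch` and the four stub functions are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: step name -> (trigger keywords, function).
# Keyword matching stands in for the LLM-based function selection.
REGISTRY: Dict[str, Tuple[List[str], Callable[[], str]]] = {}


def register(name: str, keywords: List[str]):
    """Decorator that records a pipeline step and the keywords that trigger it."""
    def wrap(func: Callable[[], str]) -> Callable[[], str]:
        REGISTRY[name] = (keywords, func)
        return func
    return wrap


# Stub steps: a real implementation would run the analysis and return
# figures or tables instead of a status string.
@register("preprocess", ["preprocess", "quality control"])
def preprocess() -> str:
    return "preprocessing done"


@register("explore", ["exploratory", "umap"])
def explore() -> str:
    return "exploratory analysis done"


@register("integrate", ["integrate", "batch"])
def integrate() -> str:
    return "integration done"


@register("deg", ["differentially expressed", "deg"])
def deg() -> str:
    return "DEG analysis done"


def dispatch(prompt: str) -> List[str]:
    """Run every registered step whose keywords appear in the prompt."""
    text = prompt.lower()
    results = []
    for keywords, func in REGISTRY.values():
        if any(keyword in text for keyword in keywords):
            results.append(func())
    return results
```

Calling `dispatch("Perform exploratory analysis on the dataset")` runs only the `explore` stub. Keeping the registry separate from the step functions means new steps can be added without touching the dispatch logic.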
Contains four Jupyter notebooks, one from each team member. Each notebook performs:
- Dataset-specific quality control (mitochondrial content, UMI thresholds, doublet detection)
- Normalization and log transformation
- Highly variable gene (HVG) selection
- UMAP visualizations of raw batches
These notebooks explore different subsets of the data (e.g. ARG, PJ1, PJ2) and help ensure robustness across preprocessing workflows.
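The quality-control and normalization steps listed above can be illustrated on toy data. This is a library-free sketch of the arithmetic only; the notebooks use Scanpy, where `sc.pp.normalize_total` and `sc.pp.log1p` perform the equivalent normalization. The thresholds, the toy counts, and the helper names `passes_qc` and `normalize_log1p` are assumptions for illustration, not the values used in the project.

```python
import math
from typing import Dict, List, Set

Cell = Dict[str, int]  # gene symbol -> raw UMI/read count


def passes_qc(cell: Cell, mito_genes: Set[str],
              min_umis: int, max_mito_frac: float) -> bool:
    """Keep a cell if it has enough counts and a low mitochondrial fraction."""
    total = sum(cell.values())
    if total == 0 or total < min_umis:
        return False
    mito = sum(count for gene, count in cell.items() if gene in mito_genes)
    return mito / total <= max_mito_frac


def normalize_log1p(cell: Cell, target_sum: float = 1e4) -> Dict[str, float]:
    """Scale counts to a common library size, then apply log(1 + x).

    Assumes the cell has already passed QC, so its total count is non-zero.
    """
    total = sum(cell.values())
    return {gene: math.log1p(count / total * target_sum)
            for gene, count in cell.items()}


# Toy data: mouse mitochondrial gene symbols start with "mt-".
cells: List[Cell] = [
    {"T": 30, "Sox2": 60, "mt-Co1": 10},  # 100 counts, 10% mitochondrial
    {"T": 1, "Sox2": 1, "mt-Co1": 8},     # too few counts, 80% mitochondrial
    {"T": 2, "Sox2": 1, "mt-Co1": 0},     # too few counts
]
mito_genes = {gene for cell in cells for gene in cell if gene.startswith("mt-")}

# Illustrative thresholds only (assumed, not taken from the notebooks).
kept = [c for c in cells if passes_qc(c, mito_genes, min_umis=50, max_mito_frac=0.2)]
normalized = [normalize_log1p(c) for c in kept]
```

Only the first toy cell survives filtering. After normalization, undoing the log transform recovers values that sum to `target_sum`, which is what makes cells with different sequencing depths comparable.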
Also contains four Jupyter notebooks, each focusing on:
- Concatenating preprocessed datasets
- Configuring and training SCVI (Single-Cell Variational Inference)
- Removing batch effects
- Comparing latent space representations (UMAPs)
- Performing DEG analysis between stages
Each notebook explores different SCVI configurations (e.g., dispersion settings, likelihood functions, latent dimensions).
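One way to organise such a comparison is to enumerate the configurations up front. The sketch below builds a small grid over `dispersion`, `gene_likelihood` and `n_latent`, which are keyword arguments of `scvi.model.SCVI` in scvi-tools. The specific values and the `expand_grid` helper are assumptions for illustration and are not the settings used in the notebooks.

```python
from itertools import product
from typing import Any, Dict, List

# Illustrative grid (assumed values, not the project's actual settings).
GRID: Dict[str, List[Any]] = {
    "dispersion": ["gene", "gene-batch"],
    "gene_likelihood": ["zinb", "nb"],
    "n_latent": [10, 30],
}


def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one dict per combination of the grid's values."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]


configs = expand_grid(GRID)

# With scvi-tools installed and an AnnData object registered through
# scvi.model.SCVI.setup_anndata, each entry could be passed as keyword
# arguments, e.g. scvi.model.SCVI(adata, **cfg), and trained in turn.
for cfg in configs:
    print(cfg)
```

This grid yields eight configurations. Recording each one as a plain dict makes it straightforward to label the resulting UMAPs with the settings that produced them.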
| Dataset | Study | Method | Stages | Cells |
|---|---|---|---|---|
| ARG | Argelaguet et al. (2019) | Plate-based (Smart-seq2) | E4.5–E7.5 | ~2,500 |
| PJ | Pijuan-Sala et al. (2019) | Droplet-based (10x) | E6.5–E8.5 | ~116,000 |
Datasets were quality-checked and annotated with metadata including embryonic day and cell type labels.
- Python (Scanpy, AnnData, Seaborn, Matplotlib)
- scVI-tools (GPU-accelerated batch integration)
- Scrublet (doublet detection)
- Chatmol (LLM interface)
- Llama 3.2 via Ollama for local LLM inference
- Google Colab / Kaggle for GPU experimentation
- UMAP visualizations show clear lineage progression from epiblast to germ layers
- SCVI successfully removes batch effects across protocols (Smart-seq2 vs 10x)
- DEGs identified at E4.5–E7.5 align with biological transitions: pre-gastrulation, streak formation, and germ layer specification
- Chatmol allows users to run the entire pipeline with commands like:
"Perform exploratory analysis on the dataset"
Chatmol enables non-coders (e.g., lab biologists) to analyze scRNA-seq data via natural language. This project adds support for:
- preprocess the data
- perform exploratory analysis
- integrate batches using SCVI
- analyze differentially expressed genes
The assistant runs selected functions and returns figures or tables without code from the user.
- Migrate from SCVI to scANVI for semi-supervised integration (leverages known labels)
- Switch Chatmol's function calling to Model Context Protocol (MCP) for broader LLM support
- Add Chatmol features for:
  - Custom DEG comparisons
  - Marker gene lookups
  - Cluster-specific visualizations
| Name |
|---|
| Orestas Dulinskas |
| Adrian Dinulescu |
| Elena Bettison |
| Elizabeth Williams |