AlexMitsis/seabream-thesis

Seabream Population Genomics Analysis Pipeline

A bioinformatics pipeline for population-genetic analysis of seabream from pooled sequencing (Pool-seq) data. It processes genomic data with several complementary methods to characterize population structure and genetic diversity.

🔬 Project Overview

This pipeline analyzes Minor Allele Frequency (MAF) data from 24 seabream chromosomes (LR537121-LR537144) to study population genetics using three complementary methods:

  • PCA Analysis - Visualizes population structure and genetic relationships
  • BayPass - Detects selection signatures and population differentiation using Bayesian methods
  • PoPoolation - Specialized analysis toolkit for pooled sequencing data

🚀 Technical Features

  • 4,000+ lines of Python code with modular, reusable architecture
  • High-performance data processing: NumPy vectorization in place of standard pandas operations, roughly a 3x speedup by the timings reported under Key Implementation Highlights (3 minutes vs 9+ minutes)
  • Scalable design: Processes 25 datasets (24 chromosomes + combined analysis) in batch
  • Memory optimization: Custom 3D array operations for efficient genomic data handling
  • Interactive visualizations with Plotly for exploratory data analysis
  • Containerized deployment using Docker for reproducible research environments

📁 Project Structure

seabream-thesis/
├── main.py                    # Main pipeline orchestrator
├── preprocessing/             # Core analysis modules
│   ├── BayPass.py            # Bayesian population analysis
│   ├── PCA.py                # Principal component analysis
│   └── PoPoolation.py        # Pool-seq methods
├── src/                      # Extended analysis workflows  
├── data/                     # Raw MAF files
├── *.ipynb                   # Jupyter notebooks for interactive analysis
└── dockerfile                # Container configuration

🛠️ Technology Stack

Core Libraries:

  • Python 3.10+
  • NumPy/Pandas for data manipulation and analysis
  • Scikit-learn for PCA and statistical methods
  • Matplotlib/Plotly for data visualization
  • tqdm for progress tracking during long operations

Development Tools:

  • Jupyter Notebooks for interactive analysis
  • Docker for containerization and reproducibility
  • Git version control

🧬 Data Processing Workflow

  1. Data Import & Validation

    • Loads MAF files with allele count data (A, T, C, G) per population
    • Detects and filters positions with zero allele sums
    • Validates data integrity across all samples
  2. Population Genetics Analysis

    • PCA: Calculates major allele frequencies and performs dimensionality reduction
    • BayPass: Prepares genotype data for Bayesian analysis of population differentiation
    • PoPoolation: Formats data for specialized pool-seq statistical methods
  3. Visualization & Results

    • Interactive PCA plots showing population clustering
    • Statistical summaries and quality control metrics
    • Export-ready datasets for downstream analysis
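
The filtering and PCA steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, a position is dropped when any population has zero total counts (the README does not say whether "any" or "all" is meant), the major allele is taken to be the base with the highest count pooled across populations, and the PCA is done with a plain NumPy SVD in place of the scikit-learn PCA the project lists, to keep the sketch dependency-light.

```python
import numpy as np

def filter_zero_sum(counts):
    """Drop positions where any population has zero total allele counts.

    counts: integer array of shape (n_positions, n_populations, 4);
    the last axis holds the A, T, C, G counts (assumed layout).
    """
    depth = counts.sum(axis=2)              # (n_positions, n_populations)
    keep = (depth > 0).all(axis=1)          # every population must be covered
    return counts[keep], keep

def major_allele_frequency(counts):
    """Per-population frequency of the allele most common across all populations."""
    pooled = counts.sum(axis=1)             # (n_positions, 4)
    major = pooled.argmax(axis=1)           # major-allele index per position
    major_counts = np.take_along_axis(counts, major[:, None, None], axis=2)[..., 0]
    return major_counts / counts.sum(axis=2)    # (n_positions, n_populations)

def pca_populations(freqs, n_components=2):
    """Project populations onto principal components via an SVD."""
    X = freqs.T                             # rows: populations, columns: positions
    X = X - X.mean(axis=0)                  # center each position
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / (S ** 2).sum()   # fraction of variance per component
    return scores, explained[:n_components]

# Toy data: 4 positions, 3 populations (values are made up for illustration).
counts = np.array([
    [[10, 0, 0, 0], [8, 2, 0, 0], [5, 5, 0, 0]],
    [[0, 0, 0, 0], [3, 0, 1, 0], [2, 0, 2, 0]],   # population 0 has no counts
    [[0, 6, 0, 4], [0, 9, 0, 1], [0, 2, 0, 8]],
    [[1, 0, 9, 0], [2, 0, 8, 0], [0, 0, 10, 0]],
])
filtered, keep = filter_zero_sum(counts)
freqs = major_allele_frequency(filtered)
scores, explained = pca_populations(freqs)
```

Each row of `scores` is one population, so the first two columns can be passed directly to a scatter plot to inspect clustering.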

🔧 Key Implementation Highlights

  • Performance Optimization: Converted DataFrame operations to NumPy arrays, cutting runtime from 9+ minutes to 3 minutes (roughly a 3x speedup)
  • Memory Efficiency: 3D tensor reshaping for handling multi-population genomic data
  • Error Handling: Robust exception handling throughout the pipeline with detailed logging
  • Modular Design: Separate modules for each analysis type enabling easy extension
  • Data Quality Control: Automated detection and removal of problematic genomic positions
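
As a minimal sketch of the 3D reshaping idea, the example below turns a wide table of counts into a (positions, populations, bases) array and checks that a vectorized sum matches an explicit loop. The column layout (four consecutive A, T, C, G columns per population) and the function names are assumptions made for illustration; the repository's actual layout may differ.

```python
import numpy as np

N_BASES = 4  # A, T, C, G

def to_tensor(wide, n_populations):
    """Reshape (n_positions, n_populations * 4) counts to (n_positions, n_populations, 4)."""
    wide = np.asarray(wide)
    if wide.shape[1] != n_populations * N_BASES:
        raise ValueError("column count does not match n_populations * 4")
    return wide.reshape(wide.shape[0], n_populations, N_BASES)

def depth_loop(wide, n_populations):
    """Reference implementation: per-population depth with explicit Python loops."""
    out = np.zeros((len(wide), n_populations), dtype=np.int64)
    for i, row in enumerate(wide):
        for p in range(n_populations):
            out[i, p] = sum(row[p * N_BASES:(p + 1) * N_BASES])
    return out

def depth_vectorized(wide, n_populations):
    """Same result as depth_loop, computed with one reduction over the base axis."""
    return to_tensor(wide, n_populations).sum(axis=2)

# Random toy counts: 1000 positions, 3 populations.
rng = np.random.default_rng(0)
wide = rng.integers(0, 50, size=(1000, 3 * N_BASES))
same = np.array_equal(depth_loop(wide, 3), depth_vectorized(wide, 3))
```

The reshape does not copy the data when the input is contiguous, so the 3D view adds little memory overhead while letting per-population operations run as single array reductions.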

🐳 Getting Started

Using Docker (Recommended)

# Build the container
docker build -t seabream-pipeline .

# Run with Jupyter notebook interface
docker run -p 8888:8888 seabream-pipeline

Local Installation

# Install dependencies
pip install -r requirements.txt

# Run the main pipeline
python main.py

📊 Data Requirements

  • Input: CSV files containing Minor Allele Frequency data
  • Format: Chromosome, position, reference, and allele counts (A,T,C,G) per population
  • Size: Handles multi-gigabyte datasets efficiently
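
To make the expected input shape concrete, here is a small reader that parses and validates rows of this kind using only the standard library. The column names (`chromosome`, `position`, `reference`, and `pop1_A` style count columns) are hypothetical; the README specifies which fields exist but not how they are named.

```python
import csv
import io

BASES = ("A", "T", "C", "G")
# Accessions LR537121 .. LR537144, as listed in the project overview.
CHROMOSOMES = {f"LR{n}" for n in range(537121, 537145)}

def read_counts(handle, populations):
    """Yield (chromosome, position, reference, counts) for each row.

    counts maps population name -> (A, T, C, G) tuple of ints.
    Column names such as 'pop1_A' are an assumed convention.
    """
    reader = csv.DictReader(handle)
    for line_no, row in enumerate(reader, start=2):   # header is line 1
        chrom = row["chromosome"]
        if chrom not in CHROMOSOMES:
            raise ValueError(f"line {line_no}: unknown chromosome {chrom!r}")
        counts = {}
        for pop in populations:
            values = tuple(int(row[f"{pop}_{base}"]) for base in BASES)
            if min(values) < 0:
                raise ValueError(f"line {line_no}: negative count for {pop}")
            counts[pop] = values
        yield chrom, int(row["position"]), row["reference"].upper(), counts

# Made-up sample with two populations.
sample = (
    "chromosome,position,reference,"
    "pop1_A,pop1_T,pop1_C,pop1_G,pop2_A,pop2_T,pop2_C,pop2_G\n"
    "LR537121,1042,a,12,0,3,0,9,0,5,0\n"
)
rows = list(read_counts(io.StringIO(sample), ["pop1", "pop2"]))
```

Because `read_counts` is a generator, rows are validated as they are consumed, which avoids loading a multi-gigabyte file into memory at once.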

🎯 Applications

This pipeline is designed for:

  • Population genetics research
  • Conservation genomics studies
  • Aquaculture breeding programs
  • Evolutionary biology investigations
  • Marine biodiversity assessments

This project demonstrates practical experience in bioinformatics pipeline development, high-performance data processing, statistical analysis, and reproducible research practices.
