EarlhamInst/earlham_cellxgene


CellXGene Explorer

A self-contained Docker-based environment for exploring curated single-cell datasets using CellXGene. Researchers can browse available datasets through a web interface and launch CellXGene for interactive visualization.

Features

  • 🔬 Dataset Catalog: Browse curated single-cell datasets with metadata
  • 🚀 One-Click Launch: Launch CellXGene viewer for any dataset with smart status polling
  • Progress Tracking: Real-time loading progress bar with estimated completion times
  • ⏱️ Smart Estimates: File size-based loading time predictions
  • 🟢 Status Indicators: Visual badges showing running/stopped container status
  • ⏹️ Manual Control: Stop button to close containers on-demand
  • 🔧 Admin Panel: Monitor and manage all active containers with memory estimates
  • 🔄 Auto-Retry: Automatic retry mechanism for failed launches (OOM/timeout)
  • 💬 Better Errors: Context-aware error messages with actionable recovery hints
  • 🐳 Docker-Based: Fully containerized for easy deployment
  • 📦 Volume-Mounted Storage: Add datasets without rebuilding containers
  • 🔌 Extensible: Add additional services via Docker Compose
  • ⚡ High Concurrency: Dynamic container spawning supports multiple concurrent users
  • ⏰ Auto-Cleanup: Containers automatically close after 48 hours of inactivity
  • 🎨 Earlham Institute Branding: Custom styling with institutional brand colors
  • 🛡️ Memory Management: 4GB per-container limits prevent OOM crashes

Quick Start

Prerequisites

  • Docker 20.10+ and Docker Compose 2.0+
  • 48GB+ available RAM and 16 cores recommended for the 10-worker production configuration; the 2-worker development configuration runs with 8GB RAM
  • Linux host (Ubuntu 20.04+, CentOS 8+) or macOS with Docker Desktop

Installation

  1. Clone the repository:
git clone <repository-url>
cd cellxgene_stack
  2. Copy the environment template:
cp .env.example .env
  3. Add your datasets to the data directory:
mkdir -p data/datasets data/logs
# Copy your .h5ad files to data/datasets/
# Metadata is read directly from the h5ad files
  4. Start the services:
docker-compose up -d
  5. Access the landing page at http://localhost (or your configured port)

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                         Nginx                               │
│         (Reverse Proxy, Routing & Error Handling)           │
└─────────┬─────────────────┬───────────────┬─────────────────┘
          │                 │               │
    ┌─────▼──────┐  ┌───────▼────────┐  ┌───▼────────────┐
    │  Landing   │  │  Static        │  │  Dynamic       │
    │   Page     │  │  CellXGene     │  │  CellXGene     │
    │  (Flask +  │  │  Service       │  │  Containers    │
    │   APSched) │  │  (Optional)    │  │  (On-demand)   │
    │            │  │                │  │                │
    │ - Catalog  │  │ - Port 5005    │  │ - Ports 5006+  │
    │ - API      │  └────────────────┘  │ - Per dataset  │
    │ - Container│                      │ - Auto-cleanup │
    │   Manager  │                      │   48h timeout  │
    └─────┬──────┘                      └────┬───────────┘
          │                                  │
          │        ┌─────────────────────────┘
          │        │
    ┌─────▼────────▼─────┐
    │   Docker Socket    │
    │  (Container Mgmt)  │
    └─────────┬──────────┘
              │
    ┌─────────▼──────────┐
    │   Volume Mount     │
    │  data/datasets/    │
    │  - *.h5ad files    │
    └────────────────────┘

Components

  • Nginx: Reverse proxy with intelligent routing and error handling

    • / → Landing page web interface
    • /api/ → Landing page REST API
    • /cellxgene-{dataset_id}/ → Dynamic per-dataset containers
    • Custom error pages for closed containers
  • Landing Page Service: Python Flask application with container orchestration

    • Scans data directory for h5ad files using memory-mapped reading
    • Extracts embedded metadata from each file
    • Provides REST API for dataset catalog and container status
    • Manages dynamic CellXGene container lifecycle
    • Background scheduler for automatic cleanup (48-hour inactivity)
    • Status polling endpoint for smooth container startup
    • Admin panel for monitoring active containers
  • Dynamic CellXGene Containers: On-demand instances

    • Spawned automatically when dataset is launched
    • Each dataset gets isolated container on unique port (5006-5100)
    • 4GB memory limit per container (configurable via CELLXGENE_MEMORY_PER_WORKER_GB)
    • Production spec: 10 workers on 16-core, 48GB RAM VM
    • Automatic cleanup after 48 hours of inactivity
    • Health checks confirm the container is ready before users are given access
    • 180-second startup timeout for large files (4.5GB+)
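
The 48-hour inactivity cleanup described above can be sketched as a pure function over last-access times. This is only an illustration, not the project's scheduler code: the `last_access` mapping and the `find_idle_containers` name are hypothetical, and the real service may track activity differently.

```python
from datetime import datetime, timedelta, timezone

# Inactivity window taken from the README (48 hours).
INACTIVITY_LIMIT = timedelta(hours=48)


def find_idle_containers(last_access, now=None, limit=INACTIVITY_LIMIT):
    """Return dataset IDs whose last access is older than `limit`.

    `last_access` is a hypothetical mapping of dataset ID to a
    timezone-aware datetime of the most recent launch or keepalive.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(ds for ds, seen in last_access.items() if now - seen > limit)


if __name__ == "__main__":
    now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
    registry = {
        "pbmc3k": now - timedelta(hours=49),     # idle longer than the limit
        "lung_atlas": now - timedelta(hours=2),  # recently used
    }
    print(find_idle_containers(registry, now=now))
```

A background job could call such a function periodically and stop every container it returns.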

User Interface

Main Landing Page

  • Dataset Cards: Grid view with metadata (organism, tissue, assay, cell/gene counts)
  • Search & Filter: Find datasets by name, organism, tissue, or assay
  • Sort Options: By name, cell count, or file size
  • Launch Button: One-click launch with progress bar
  • Stop Button: Appears after launch to manually close containers
  • Status Badge: Green "Running" indicator for active containers
  • Loading Progress: Real-time progress bar with estimated completion time
  • Estimated Times: File size-based predictions (< 100MB: ~30s, > 3GB: ~3 mins)
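
A size-based estimate like the one above could be computed as follows. Only the two endpoints (under 100MB: about 30s, over 3GB: about 3 minutes) come from this README; the linear interpolation in between and the function name are assumptions made for illustration.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def estimate_load_seconds(size_bytes):
    """Rough loading-time estimate in seconds from the file size.

    Endpoints follow the README; values in between are linearly
    interpolated, which is an assumption rather than the project's formula.
    """
    small, large = 100 * MB, 3 * GB
    if size_bytes < small:
        return 30
    if size_bytes > large:
        return 180
    fraction = (size_bytes - small) / (large - small)
    return round(30 + fraction * 150)


if __name__ == "__main__":
    for size in (50 * MB, 1 * GB, 4 * GB):
        print(size, estimate_load_seconds(size))
```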

Admin Panel

Access at /admin to:

  • View all active containers with status
  • See dataset names, ports, file sizes
  • Monitor last accessed time and inactive duration
  • Estimate total memory usage
  • Stop individual containers
  • Auto-refreshes every 30 seconds

API Endpoints

  • GET /api/datasets - List all datasets
  • GET /api/datasets/{id} - Get dataset details
  • POST /api/datasets/{id}/launch - Launch container (returns URL and timeout info)
  • GET /api/datasets/{id}/status - Check container status (ready/starting)
  • POST /api/datasets/{id}/keepalive - Update access time
  • POST /api/datasets/{id}/stop - Stop running container
  • GET /api/admin/containers - List active containers (admin)
  • GET /api/health - Health check
  • GET /api/statistics - Get catalog statistics
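
A client can combine the launch and status endpoints by polling until the container reports that it is ready. The sketch below uses only the Python standard library. The response shape is an assumption: it presumes the status endpoint returns JSON with a "status" field whose value is "ready" or "starting", which this README does not specify.

```python
import json
import time
from urllib.request import urlopen


def fetch_status(base_url, dataset_id):
    """GET /api/datasets/{id}/status and return the reported state.

    Assumption: the response body is JSON with a "status" field.
    """
    url = f"{base_url}/api/datasets/{dataset_id}/status"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp).get("status")


def wait_until_ready(get_status, timeout=180.0, interval=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `get_status()` until it returns "ready" or `timeout` seconds pass.

    Returns True when ready, False on timeout. `clock` and `sleep` are
    injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if get_status() == "ready":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, `wait_until_ready(lambda: fetch_status("http://localhost", "pbmc3k"))` would poll a dataset with the hypothetical ID `pbmc3k`, using the 180-second startup timeout as the default.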

Configuration

Edit .env to customize:

  • Ports: Change NGINX_PORT, LANDING_PAGE_PORT, CELLXGENE_PORT
  • Workers: Adjust CELLXGENE_WORKERS (default: 10 for production)
  • Memory: Modify CELLXGENE_MEMORY_PER_WORKER_GB (default: 4GB per worker)
  • Host Paths: Set HOST_DATA_DIRECTORY and HOST_LOG_DIRECTORY to absolute paths on your host machine
  • Container Paths: Set DATA_DIRECTORY, LOG_DIRECTORY (internal container paths)

Production Configuration (16 cores, 48GB RAM):

  • CELLXGENE_WORKERS=10
  • CELLXGENE_MEMORY_PER_WORKER_GB=4

Development Configuration (8GB RAM):

  • CELLXGENE_WORKERS=2
  • CELLXGENE_MEMORY_PER_WORKER_GB=2
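
As a quick sanity check on these settings, the memory reserved for workers is the worker count times the per-worker limit, and whatever remains is left for the host, Nginx, and the landing page. The helper below is a hypothetical illustration of that arithmetic, assuming every worker can reach its full limit at the same time.

```python
def headroom_gb(total_ram_gb, workers, memory_per_worker_gb):
    """RAM left over (in GB) if every worker uses its full memory limit."""
    return total_ram_gb - workers * memory_per_worker_gb


if __name__ == "__main__":
    # Production: 10 workers x 4GB = 40GB of 48GB.
    print("production headroom:", headroom_gb(48, 10, 4), "GB")
    # Development: 2 workers x 2GB = 4GB of 8GB.
    print("development headroom:", headroom_gb(8, 2, 2), "GB")
```

A negative result would indicate that the configuration can overcommit the machine.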

Important: Before deploying, copy .env.example to .env and update HOST_DATA_DIRECTORY and HOST_LOG_DIRECTORY with your actual paths.

Adding Datasets

  1. Place your .h5ad file in data/datasets/
  2. Ensure your h5ad file has metadata embedded in the .uns attribute
  3. Restart the landing page service so it rescans the data directory: docker-compose restart landing-page

The system automatically extracts metadata from the h5ad file's .uns attribute. Store the required metadata fields under adata.uns['metadata']:

  • name: Dataset name
  • description: Dataset description
  • organism: Organism name
  • tissue: Tissue type
  • assay: Assay technology

Cell and gene counts are automatically extracted from the data dimensions.

Example of adding metadata to an h5ad file:

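import anndata

# Load the existing dataset.
adata = anndata.read_h5ad("your_dataset.h5ad")

# Attach the catalog metadata under the 'metadata' key of .uns.
adata.uns['metadata'] = {
    "name": "PBMC 3k Dataset",
    "description": "3k PBMCs from a Healthy Donor",
    "organism": "Homo sapiens",
    "tissue": "peripheral blood",
    "assay": "10x 3' v2"
}

# Write the file back so the metadata is embedded in it.
adata.write_h5ad("your_dataset.h5ad")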
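
Before writing the file, it can help to check that the metadata dictionary contains every field listed above. The helper below is a hypothetical sketch and not part of this project; it works on a plain dictionary, so it can be run on `adata.uns['metadata']` or on its own.

```python
# Field names taken from the list of required metadata fields in this README.
REQUIRED_FIELDS = ("name", "description", "organism", "tissue", "assay")


def missing_metadata_fields(metadata):
    """Return the required fields that are absent or blank in `metadata`."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


if __name__ == "__main__":
    metadata = {"name": "PBMC 3k Dataset", "organism": "Homo sapiens", "tissue": " "}
    print(missing_metadata_fields(metadata))
```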

Documentation

All documentation is located in the docs/ directory.

Development

Run tests with:

# Unit tests
pytest services/landing-page/tests/unit/

# Integration tests
pytest services/landing-page/tests/integration/

# End-to-end tests
pytest tests/e2e/

Troubleshooting

Common Issues

Out of Memory (OOM) Errors

  • Symptom: Containers exit with code 137 or crash after ~20 seconds
  • Cause: Dataset too large for available RAM
  • Solutions:
    1. Close other containers via Admin Panel (/admin)
    2. Increase Docker memory limit in Docker Desktop settings
    3. Reduce number of concurrent containers
    4. Increase the per-container memory limit (CELLXGENE_MEMORY_PER_WORKER_GB in .env, or in container_manager.py)
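
Exit code 137 is 128 + 9, meaning the process was terminated by signal 9 (SIGKILL), which is what the kernel's OOM killer sends. The small helper below decodes such exit codes on a POSIX host. It is an illustrative sketch with a hypothetical function name, and a SIGKILL can also come from other sources, such as a manual docker kill.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short description.

    Codes above 128 conventionally mean "terminated by signal (code - 128)".
    """
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            # Not a known signal number on this platform.
            return f"exit code {code}"
        return f"killed by {name}"
    return f"exit code {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
    print(describe_exit_code(1))
```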

Slow Loading / Timeouts

  • Large files (>4GB) may take 2-3 minutes to load
  • The system will retry automatically (up to 2 retries)
  • Progress bar shows estimated time
  • Check Admin Panel to see if containers are stuck
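
The retry behaviour described above (one initial attempt plus up to 2 retries) can be sketched as below. The `launch` callable and the `LaunchError` exception are hypothetical stand-ins; the project's own retry logic may distinguish OOM failures from timeouts or wait between attempts.

```python
class LaunchError(Exception):
    """Hypothetical error raised when a container fails to start."""


def launch_with_retries(launch, max_retries=2):
    """Call `launch()` until it succeeds or the retries are used up.

    Makes at most `max_retries + 1` attempts and re-raises the last error.
    """
    last_error = None
    for _attempt in range(max_retries + 1):
        try:
            return launch()
        except LaunchError as exc:
            last_error = exc
    raise last_error


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_launch():
        # Fails twice, then succeeds on the third attempt.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise LaunchError("startup timed out")
        return "http://localhost/cellxgene-example/"

    print(launch_with_retries(flaky_launch), "after", attempts["count"], "attempts")
```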

Container Not Starting

  • Check docker logs cellxgene-landing-page for errors
  • Verify dataset file exists and is valid h5ad format
  • Ensure Docker has sufficient resources
  • Check if port range (5006-5100) is available
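
One rough way to check the port range is to try binding each port on the host. The sketch below uses only the standard library; the function names are hypothetical. It reports a port as busy when the bind fails, so it may miss ports that Docker publishes purely through firewall rules without a listening process.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def free_ports(start=5006, end=5100, host="127.0.0.1"):
    """List the ports in [start, end] that can currently be bound."""
    return [p for p in range(start, end + 1) if port_is_free(p, host)]


if __name__ == "__main__":
    available = free_ports()
    print(f"{len(available)} of 95 ports in 5006-5100 appear to be free")
```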

Can't Stop Container

  • Container may have already stopped automatically
  • Check Admin Panel for current status
  • Use docker ps | grep cellxgene to verify
  • Restart landing-page service if manager state is inconsistent

See docs/troubleshooting.md for more detailed solutions.

Constitutional Compliance

This project follows the constitution defined in .specify/memory/constitution.md:

  • ✅ Unit Testing: 80%+ test coverage with pytest
  • ✅ Modular Architecture: Containerized services with clear boundaries
  • ✅ Code Clarity: Comprehensive documentation and comments
  • ✅ Fail-Fast: Startup validation with explicit error messages
  • ✅ Documentation: README, API docs, deployment guides, troubleshooting
  • ✅ Accessibility: Designed for users with varying technical expertise

License

[Specify your license here]

Support

For issues or questions, please open an issue or consult the documentation in the docs/ directory.

About

reusable dockerised instance of cellxgene with custom landing page to select between single cell datasets
