# Single-Cell RNA-seq Analysis Tutorial

This comprehensive tutorial demonstrates how to perform complete single-cell RNA-seq analysis using Lobster AI, from data acquisition to biological interpretation.

## Overview

In this tutorial, you will learn to:
- Download single-cell datasets from GEO using natural language
- Perform quality control and filtering
- Normalize and cluster cells
- Identify cell types and marker genes
- Create publication-ready visualizations
- Export and interpret results

## Prerequisites

- Lobster AI installed and configured (see [Installation Guide](02-installation.md))
- API keys set up in your `.env` file
- Basic understanding of single-cell RNA-seq concepts

## Tutorial Dataset

We'll use **GSE109564**, a well-characterized single-cell dataset of immune cells from the tumor microenvironment. This dataset contains:
- ~5,000 cells
- ~20,000 genes
- Multiple immune cell types
- Excellent for demonstrating clustering and annotation

## Step 1: Starting Lobster AI

Start the interactive chat interface with the enhanced CLI:

```bash
# Start Lobster AI with Rich CLI interface
lobster chat
```

You'll see the welcome screen with orange branding:

```
🦞 lobster chat

┌─ Welcome ────────────────────────────────────────────────────────────┐
│  🦞 LOBSTER by Omics-OS                                             │
│  Multi-Agent Bioinformatics Analysis System v0.2                    │
│                                                                      │
│  🧬 Key Tasks:                                                       │
│  • Analyze RNA-seq data                                  │
│  • Generate visualizations and plots                                │
│  • Extract insights from bioinformatics datasets                    │
│  • Access GEO & literature databases                               │
└──────────────────────────────────────────────────────────────────────┘

🦞 ~/projects ▸
```

## Step 2: Download and Load Data

Use natural language to download the dataset:

```bash
🦞 You: "Download GSE109564 from GEO - it's a single-cell RNA-seq dataset of immune cells from the tumor microenvironment"
```

**Expected Output:**
```
🦞 Lobster: I'll help you download the GSE109564 dataset from GEO...

✓ Found GSE109564: Single-cell RNA sequencing of tumor-infiltrating immune cells
✓ Dataset contains: 5,063 cells × 20,729 genes
✓ Downloaded and processed count matrix
✓ Added sample metadata and experimental design
✓ Stored as modality: 'geo_gse109564'

Dataset successfully loaded! This appears to be a high-quality single-cell dataset
from tumor-infiltrating immune cells. Ready for analysis.
```

**Verify the data was loaded:**
```bash
🦞 You: "/data"
```

This shows current dataset information including cell counts, gene counts, and basic statistics.

## Step 3: Quality Assessment

Assess the quality of the loaded dataset:

```bash
🦞 You: "Assess the quality of the loaded single-cell data and show me basic statistics"
```

**Expected Output:**
```
🦞 Lobster: I'll perform comprehensive quality assessment of your single-cell data...

✓ Quality metrics calculated for 5,063 cells
✓ Mitochondrial gene percentage: mean 8.2%, max 45%
✓ Ribosomal gene percentage: mean 12.4%
✓ Total UMI count: mean 2,847 per cell
✓ Genes detected: mean 1,234 per cell
✓ Generated quality control visualizations

Quality Assessment Summary:
- 287 cells (5.7%) have high mitochondrial content (>25%)
- 134 cells (2.6%) have very low gene counts (<200 genes)
- 89 cells (1.8%) have extremely high UMI counts (potential doublets)
- Overall data quality: EXCELLENT
- Recommended filtering: Remove 510 low-quality cells
```

**Check the generated plots:**
```bash
🦞 You: "/plots"
```

You'll see quality control plots including:
- `qc_violin_plot.html` - Distribution of QC metrics
- `qc_scatter_plot.html` - Gene count vs UMI count correlation
- `mitochondrial_gene_plot.html` - Mitochondrial gene content analysis

## Step 4: Data Preprocessing and Filtering

Filter out low-quality cells and prepare data for analysis:

```bash
🦞 You: "Filter low-quality cells based on the QC metrics, normalize the data, and identify highly variable genes"
```

**Expected Output:**
```
🦞 Lobster: I'll preprocess your single-cell data with quality-based filtering...

✓ Filtered cells: 5,063 → 4,553 cells (510 cells removed)
✓ Filtered genes: 20,729 → 18,345 genes (2,384 low-expression genes removed)
✓ Log1p normalization applied
✓ Highly variable genes identified: 2,000 genes selected
✓ Data scaled and centered for downstream analysis
✓ Stored filtered data as: 'geo_gse109564_filtered_normalized'

Preprocessing complete! Your data is now ready for clustering and dimensional reduction.
Quality-filtered dataset: 4,553 high-quality cells × 18,345 genes
```

## Step 5: Clustering and Dimensionality Reduction

Perform clustering to identify cell populations:

```bash
🦞 You: "Perform PCA, compute neighbors, run Leiden clustering, and generate UMAP visualization"
```

**Expected Output:**
```
🦞 Lobster: I'll perform clustering and dimensional reduction analysis...

✓ Principal component analysis: 50 components computed
✓ Neighbor graph constructed (15 neighbors)
✓ Leiden clustering completed: 12 clusters identified
✓ UMAP embedding generated (2D projection)
✓ Cluster validation metrics calculated
✓ Stored clustered data as: 'geo_gse109564_clustered'

Clustering Results:
- 12 distinct cell clusters identified
- Silhouette score: 0.73 (excellent separation)
- Largest cluster: 847 cells (18.6%)
- Smallest cluster: 89 cells (2.0%)
- Generated interactive UMAP visualization
```

**View the clustering results:**
```bash
🦞 You: "/plots"
```

New visualizations include:
- `umap_clusters.html` - Interactive UMAP with cluster colors
- `cluster_composition.html` - Cell count per cluster
- `pca_variance_plot.html` - Principal component variance explained

## Step 6: Marker Gene Identification

Find genes that characterize each cluster:

```bash
🦞 You: "Find marker genes for each cluster using differential expression analysis and create a heatmap"
```

**Expected Output:**
```
🦞 Lobster: I'll identify marker genes for each cluster...

✓ Differential expression analysis completed (Wilcoxon rank-sum test)
✓ 2,847 significant marker genes identified (FDR < 0.05)
✓ Top markers per cluster: 25 genes selected
✓ Expression heatmap generated
✓ Marker gene table exported
✓ Stored results as: 'geo_gse109564_markers'

Top Marker Genes by Cluster:
- Cluster 0: CD8A, CD8B, GZMB (CD8+ T cells)
- Cluster 1: CD4, IL7R, CCR7 (CD4+ T cells)
- Cluster 2: CD14, LYZ, S100A9 (Monocytes)
- Cluster 3: CD79A, MS4A1, CD19 (B cells)
- Cluster 4: KLRF1, NCR1, NKG7 (NK cells)
[... additional clusters]
```

## Step 7: Cell Type Annotation

### ⚠️ Important: Marker Validation Required

**CRITICAL: Built-in marker gene templates are preliminary and not scientifically validated.**

Before using automatic annotation, you will be prompted to:
1. **Provide custom validated markers** for your specific tissue/context, OR
2. **Explicitly acknowledge** the limitations of built-in preliminary markers

**Recommended workflow for production analysis:**

```bash
# Option A: Provide custom markers (RECOMMENDED)
🦞 You: "I want to annotate cell types using custom markers. Here are my validated markers for PBMC:
- CD8+ T cells: CD3D, CD3E, CD8A, CD8B, GZMK
- CD4+ T cells: CD3D, CD3E, CD4, IL7R, CCR7
- B cells: CD19, MS4A1, CD79A, CD79B
- NK cells: GNLY, NKG7, KLRD1, NCR1, PRF1
- Monocytes: CD14, LYZ, S100A8, S100A9
Please annotate the clusters using these markers."

# Option B: Use reference-based tools (RECOMMENDED)
🦞 You: "Use Azimuth reference-based annotation for PBMC cell types"

# Option C: Acknowledge limitations and use built-in (NOT RECOMMENDED for production)
# The agent will warn you and ask for explicit confirmation
```

**Why custom markers matter:**
- Built-in templates lack evidence scoring (AUC, logFC, specificity)
- Not validated against reference atlases (Azimuth, CellTypist, HCA)
- May contain mouse genes or activation/injury markers
- SASP/senescence and tumor detection are not reliable

**See [Manual Annotation Guide](35-manual-annotation-service.md) for details on providing custom markers.**

### Example: Using Built-in Markers (After Acknowledgment)

If you proceed with built-in markers after acknowledging limitations:

```bash
🦞 You: "Based on the marker genes, annotate the clusters with immune cell types. Use standard immune cell markers for annotation."
```

**Expected Output:**
```
🦞 Lobster: I'll annotate clusters based on marker gene expression patterns...

✓ Cell type annotation completed using marker genes
✓ 12 clusters annotated with immune cell types
✓ Annotation confidence scores calculated
✓ Updated UMAP with cell type labels
✓ Stored annotated data as: 'geo_gse109564_annotated'

Cell Type Annotations:
- Cluster 0 (847 cells): CD8+ T cells - Cytotoxic
- Cluster 1 (623 cells): CD4+ T cells - Helper
- Cluster 2 (543 cells): Monocytes - Classical
- Cluster 3 (421 cells): B cells - Mature
- Cluster 4 (298 cells): NK cells - Activated
- Cluster 5 (287 cells): CD8+ T cells - Memory
- Cluster 6 (234 cells): Dendritic cells - Conventional
- Cluster 7 (198 cells): CD4+ T cells - Regulatory
- Cluster 8 (156 cells): Macrophages - M1
- Cluster 9 (123 cells): Plasma cells
- Cluster 10 (89 cells): Dendritic cells - Plasmacytoid
- Cluster 11 (67 cells): Neutrophils

Annotation confidence: 94% (high confidence based on canonical markers)
```

## Step 8: Advanced Visualization

Create comprehensive visualizations:

```bash
🦞 You: "Create a comprehensive visualization dashboard showing UMAP with cell types, marker gene expression, and cluster statistics"
```

**Expected Output:**
```
🦞 Lobster: I'll create a comprehensive visualization dashboard...

✓ Multi-panel UMAP visualization created
✓ Marker gene expression overlays generated
✓ Cell type composition plots created
✓ Interactive dashboard with filtering options
✓ Cluster statistics summary generated

Generated Visualizations:
- umap_celltypes.html: UMAP colored by cell type
- marker_expression_umap.html: Top marker genes overlaid on UMAP
- celltype_composition.html: Cell type proportions and statistics
- expression_heatmap.html: Marker genes × cell types heatmap
- qc_by_celltype.html: Quality metrics by cell type
- comprehensive_dashboard.html: Interactive multi-panel dashboard
```

## Step 9: Export Results

Export your analysis for further use:

```bash
🦞 You: "/export results"
```

**Expected Output:**
```
🦞 Lobster: Exporting analysis results...

✓ Exported annotated data (H5AD format): geo_gse109564_annotated.h5ad
✓ Exported marker genes: marker_genes_table.csv
✓ Exported cell metadata: cell_annotations.csv
✓ Exported cluster statistics: cluster_summary.csv
✓ Exported visualizations: plots/ directory
✓ Analysis provenance: analysis_history.json

Export complete! Files saved to: /path/to/workspace/exports/
```

## Step 10: Biological Interpretation

Ask Lobster to interpret the results:

```bash
🦞 You: "Interpret the biological significance of these immune cell populations in the tumor microenvironment context"
```

**Expected Output:**
```
🦞 Lobster: Based on the single-cell analysis, here's the biological interpretation...

🧬 Tumor Microenvironment Analysis:

Key Findings:
1. **Diverse Immune Infiltration**: 11 distinct immune cell types identified
2. **T Cell Dominance**: CD8+ and CD4+ T cells comprise 32% of total cells
3. **Active Immune Response**: High expression of cytotoxic markers (GZMB, PRF1)
4. **Regulatory Elements**: Presence of Tregs (198 cells) may suppress immunity

Clinical Implications:
- High CD8+ T cell infiltration suggests active anti-tumor response
- Presence of M1 macrophages indicates pro-inflammatory environment
- Treg population may limit therapeutic efficacy
- NK cell activation (NKG7+) supports innate immunity

Recommended Follow-up:
- Pseudobulk analysis for population-level statistics
- Trajectory analysis for T cell activation states
- Ligand-receptor analysis for cell-cell communication
```

## Working with the Results

### Accessing Your Data

```bash
# Check available datasets
🦞 You: "/files"

# Read specific files
🦞 You: "/read marker_genes_table.csv"

# View workspace structure
🦞 You: "/tree"
```

### Advanced Analysis Options

```bash
# Convert to pseudobulk for statistical analysis
🦞 You: "Convert the annotated single-cell data to pseudobulk format for differential expression analysis between cell types"

# Perform pathway analysis
🦞 You: "Run pathway enrichment analysis on the marker genes for each cell type"

# Export for external tools
🦞 You: "Export the data in Seurat format for R analysis"
```

## Troubleshooting Common Issues

### Issue 1: Download Fails
```bash
🦞 You: "The GEO download failed with a timeout error"
```
**Solution**: Check internet connection and try smaller datasets first.

### Issue 2: Poor Clustering
```bash
🦞 You: "The clustering results don't look good - I see poorly separated clusters"
```
**Solution**: Adjust resolution parameter or filtering thresholds.

### Issue 3: Missing Cell Types
```bash
🦞 You: "Some clusters don't have clear cell type annotations"
```
**Solution**: Check additional marker genes or use reference-based annotation.

## Best Practices

1. **Quality Control**: Always inspect QC metrics before filtering
2. **Parameter Testing**: Try different clustering resolutions for optimal results
3. **Marker Validation**: Verify cell type annotations with literature
4. **Visualization**: Use interactive plots to explore data thoroughly
5. **Documentation**: Export analysis history for reproducibility

## Next Steps

After completing this tutorial, consider:

1. **[Bulk RNA-seq Tutorial](24-tutorial-bulk-rnaseq.md)** - Convert to pseudobulk and perform population-level analysis
2. **[Proteomics Tutorial](25-tutorial-proteomics.md)** - Integrate with proteomics data
3. **[Advanced Analysis](27-examples-cookbook.md)** - Trajectory analysis, cell-cell communication
4. **[Custom Workflows](26-tutorial-custom-agent.md)** - Create specialized analysis agents

## Summary

You have successfully:
- ✅ Downloaded and loaded a single-cell dataset from GEO
- ✅ Performed comprehensive quality control
- ✅ Filtered and normalized the data
- ✅ Identified 12 distinct immune cell populations
- ✅ Annotated clusters with biological cell types
- ✅ Generated publication-ready visualizations
- ✅ Exported results for further analysis
- ✅ Interpreted biological significance

This complete workflow demonstrates Lobster AI's power for single-cell RNA-seq analysis using natural language interactions and professional-grade bioinformatics algorithms.