# Data Analysis Workflows ## Overview This guide provides step-by-step workflows for analyzing different types of biological data using Lobster AI. Each workflow combines natural language interaction with specialized AI agents to perform publication-quality analysis. ## Single-Cell RNA-seq Analysis Workflow ### Workflow Overview **Goal**: Analyze single-cell RNA-seq data to identify cell types, find marker genes, and understand cellular heterogeneity. **Agent**: Single-Cell Expert handles all aspects of scRNA-seq analysis. **Time**: 15-30 minutes for a typical dataset (10K-50K cells) ### Step 1: Data Loading and Initial Assessment ```bash # Load your single-cell data /read my_singlecell_data.h5ad # Alternative: Load from multiple formats /read counts_matrix.csv /read filtered_feature_bc_matrix/ # 10X format /read *.h5 # Multiple files ``` **Natural Language Alternative**: ``` "Load my single-cell RNA-seq data from the h5ad file" ``` **Expected Output**: - Data shape (cells × genes) - File format confirmation - Initial data structure summary ### Step 2: Data Quality Assessment ```bash # Check data overview /data # Request quality control analysis "Perform quality control analysis on this single-cell data" ``` **Quality Control Includes**: - **Mitochondrial Gene Percentage**: Cell viability indicator - **Ribosomal Gene Percentage**: Translation activity - **Total Gene Counts**: Library complexity - **Total UMI Counts**: Sequencing depth - **Doublet Detection**: Multi-cell artifacts **Expected Results**: - Quality control metrics for each cell - Distribution plots for QC metrics - Recommendations for filtering thresholds ### Step 3: Data Filtering and Preprocessing ``` "Filter low-quality cells and normalize the data using standard parameters" ``` **Or specify custom parameters**: ``` "Filter cells with fewer than 200 genes and more than 20% mitochondrial content, then normalize using log1p transformation" ``` **Processing Steps**: 1. **Cell Filtering**: Remove low-quality cells 2. **Gene Filtering**: Remove rarely expressed genes 3. **Normalization**: Library size normalization + log1p 4. **Highly Variable Genes**: Identify most informative features **Expected Output**: - Filtered dataset dimensions - Normalization parameters used - Quality metrics after filtering ### Step 4: Dimensionality Reduction and Clustering ``` "Perform PCA, compute neighbors, and cluster the cells using the Leiden algorithm" ``` **Or request comprehensive analysis**: ``` "Run the complete single-cell workflow: PCA, UMAP, clustering, and find marker genes" ``` **Analysis Steps**: 1. **Principal Component Analysis (PCA)**: Reduce dimensionality 2. **Neighborhood Graph**: Build cell-cell similarity network 3. **Leiden Clustering**: Identify cell communities 4. **UMAP Embedding**: 2D visualization **Expected Results**: - UMAP plot with colored clusters - Cluster statistics and cell counts - Quality assessment of clustering ### Step 5: Cell Type Annotation ``` "Identify the cell types in each cluster using marker genes" ``` **For specific tissue**: ``` "Annotate cell types in this liver single-cell data using known liver cell markers" ``` **Annotation Methods**: 1. **Marker Gene Analysis**: Find top genes per cluster 2. **Reference Mapping**: Compare to cell atlases 3. **Manual Annotation**: User-guided cell type assignment 4. 
**Automated Annotation**: ML-based cell type prediction **Expected Results**: - Marker genes table for each cluster - Cell type annotations - UMAP plot with cell type labels - Confidence scores for annotations ### Step 6: Differential Expression Analysis ``` "Find differentially expressed genes between cell types" ``` **For specific comparison**: ``` "Compare hepatocytes and stellate cells to find differentially expressed genes" ``` **Or condition-based analysis**: ``` "Find genes differentially expressed between control and treatment conditions in each cell type" ``` **Analysis Features**: - **Statistical Testing**: Wilcoxon rank-sum test - **Multiple Testing Correction**: Benjamini-Hochberg FDR - **Effect Size Filtering**: Log fold change thresholds - **Visualization**: Volcano plots and heatmaps ### Step 7: Advanced Analysis (Optional) #### Trajectory Analysis ``` "Perform trajectory analysis to identify developmental paths" ``` #### Pseudobulk Analysis ``` "Aggregate cells by type and perform bulk RNA-seq differential expression" ``` #### Gene Set Enrichment ``` "Perform pathway enrichment analysis on the differentially expressed genes" ``` ### Complete Workflow Example ```bash # 1. Load data /read liver_scrnaseq.h5ad # 2. Comprehensive analysis request "Analyze this liver single-cell RNA-seq data: perform quality control, filter low-quality cells, normalize, cluster cells, identify cell types, and find marker genes for each cluster" # 3. Specific follow-up "Compare hepatocytes between control and fibrotic conditions" # 4. Visualization /plots # View all generated plots # 5. Save results /save ``` ## Bulk RNA-seq Analysis Workflow ### Workflow Overview **Goal**: Analyze bulk RNA-seq data to identify differentially expressed genes between conditions. **Agent**: Bulk RNA-seq Expert specializes in count-based differential expression analysis. **Time**: 10-20 minutes for typical experiment ### Step 1: Data Preparation #### Option A: Load Kallisto/Salmon Quantification Files (Recommended) **⚠️ NEW in v0.2+**: Use CLI `/read` command directly for quantification files. 
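For readers who want to see what this loading step amounts to, the sketch below approximates the per-sample merge behind the `/read` examples that follow. It is illustrative only: standard Kallisto `abundance.tsv` files with an `est_counts` column are assumed, the paths are hypothetical, and the actual loader additionally auto-detects Salmon vs Kallisto output and validates file integrity.

```python
from pathlib import Path

import pandas as pd


def merge_kallisto(quant_dir: str) -> pd.DataFrame:
    """Merge per-sample Kallisto abundance.tsv files into a samples x transcripts matrix."""
    counts = {}
    for sample_dir in sorted(Path(quant_dir).iterdir()):
        abundance = sample_dir / "abundance.tsv"
        if not abundance.exists():
            continue  # skip files and non-sample directories
        table = pd.read_csv(abundance, sep="\t", index_col="target_id")
        counts[sample_dir.name] = table["est_counts"]  # sample name taken from the subdirectory
    # Columns are samples at this point; transpose to samples x transcripts (bulk RNA-seq convention).
    return pd.DataFrame(counts).T


# matrix = merge_kallisto("quantification_output/")  # hypothetical path matching the layout shown below
```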
```bash # Load Kallisto quantification files /read /path/to/kallisto_output # Or load Salmon quantification files /read /path/to/salmon_output ``` **Expected Directory Structure**: ``` quantification_output/ ├── sample1/ │ └── abundance.tsv (Kallisto) or quant.sf (Salmon) ├── sample2/ │ └── abundance.tsv (Kallisto) or quant.sf (Salmon) └── sample3/ └── abundance.tsv (Kallisto) or quant.sf (Salmon) ``` **Features**: - **Direct CLI Loading**: Use `/read` command - no agent interaction needed - **Automatic Tool Detection**: CLI detects Kallisto vs Salmon from file patterns - **Per-Sample Merging**: Merges quantification from all sample subdirectories - **Correct Orientation**: Transposes to samples × genes (bulk RNA-seq standard) - **Sample Names**: Extracted from subdirectory names - **Quality Validation**: Verifies file integrity and consistency #### Option B: Load Count Matrix (Traditional) ```bash # Load count matrix /read counts_matrix.csv # Load with metadata /read counts.csv "Load the sample metadata file to define experimental conditions" ``` **Expected Data Format**: - Rows: Genes/transcripts - Columns: Samples - Raw or normalized counts ### Step 2: Experimental Design Setup ``` "Set up differential expression analysis comparing treatment vs control groups" ``` **For complex designs**: ``` "Analyze differential expression using the formula: ~condition + batch + gender" ``` **Features**: - **R-style Formulas**: Support complex experimental designs - **Batch Effect Handling**: Automatic detection and correction - **Multiple Factors**: Age, gender, batch, treatment interactions - **Contrasts**: Flexible comparison specifications ### Step 3: Quality Control ``` "Generate quality control plots and assess data distribution" ``` **QC Analysis Includes**: - **Count Distribution**: Library size assessment - **PCA Plots**: Sample clustering and batch effects - **Correlation Heatmaps**: Sample relationships - **Dispersion Plots**: Model fitting quality ### Step 4: Differential Expression with pyDESeq2 ``` "Perform differential expression analysis using DESeq2" ``` **Analysis Features**: - **Normalization**: Size factor estimation - **Dispersion Modeling**: Gene-wise and fitted dispersions - **Statistical Testing**: Wald test or likelihood ratio test - **Shrinkage**: Effect size shrinkage for better estimates **Results Include**: - Log2 fold changes with confidence intervals - P-values and adjusted P-values (FDR) - Base means and dispersion estimates - Convergence diagnostics ### Step 5: Results Visualization ``` "Create volcano plots and heatmaps for the differential expression results" ``` **Visualization Options**: - **Volcano Plots**: Effect size vs significance - **MA Plots**: Mean expression vs fold change - **Heatmaps**: Top differentially expressed genes - **PCA Plots**: Sample relationships ### Step 6: Downstream Analysis ``` "Perform pathway enrichment analysis on the upregulated genes" ``` **Advanced Analysis**: - Gene set enrichment analysis (GSEA) - Pathway over-representation analysis - Gene ontology analysis - KEGG pathway mapping ### Complete Workflow Example ```bash # 1. Load data /read rnaseq_counts.csv # 2. Define experimental setup "Analyze differential expression between high-fat diet and control mice, accounting for batch effects and gender differences" # 3. Request comprehensive analysis "Perform complete bulk RNA-seq analysis: quality control, normalization, differential expression testing, and generate volcano plots" # 4. 
Follow-up analysis "Show me the top 20 upregulated genes and their functions" # 5. Export results /export ``` ## Mass Spectrometry Proteomics Workflow ### Workflow Overview **Goal**: Analyze label-free quantitative proteomics data to identify differentially abundant proteins. **Agent**: MS Proteomics Expert handles mass spectrometry data analysis. **Time**: 20-40 minutes depending on dataset complexity ### Step 1: Data Loading ```bash # Load MaxQuant output /read proteinGroups.txt # Load Spectronaut results /read spectronaut_results.csv # Load generic proteomics data /read protein_intensities.csv ``` ### Step 2: Data Assessment ``` "Assess the quality of this proteomics data and show missing value patterns" ``` **Quality Assessment**: - **Missing Value Analysis**: MNAR vs MCAR patterns - **Coefficient of Variation**: Technical and biological CV - **Intensity Distributions**: Dynamic range assessment - **Batch Effect Detection**: Systematic biases ### Step 3: Data Preprocessing ``` "Filter proteins with excessive missing values and normalize intensities" ``` **Preprocessing Steps**: 1. **Protein Filtering**: Remove contaminants and reverse sequences 2. **Missing Value Handling**: Imputation strategies (MNAR/MCAR) 3. **Intensity Normalization**: TMM, quantile, or VSN normalization 4. **Log Transformation**: Variance stabilization ### Step 4: Statistical Analysis ``` "Perform differential protein abundance analysis between treatment groups" ``` **Statistical Methods**: - **Linear Models**: limma-based analysis - **Empirical Bayes**: Moderated t-statistics - **Multiple Testing**: FDR control - **Effect Size Estimation**: Protein fold changes ### Step 5: Results Interpretation ``` "Identify significantly changed proteins and perform pathway analysis" ``` **Results Analysis**: - Volcano plots for differential proteins - Protein interaction networks - Pathway enrichment analysis - GO term analysis ### Complete Workflow Example ```bash # Load MaxQuant data /read proteinGroups.txt # Comprehensive analysis "Analyze this label-free proteomics data: assess data quality, handle missing values, normalize intensities, and identify proteins differentially abundant between control and treatment groups" # Pathway analysis "Perform pathway enrichment analysis on the significantly changed proteins" ``` ## Affinity Proteomics Workflow ### Workflow Overview **Goal**: Analyze targeted proteomics data from Olink panels or antibody arrays. **Agent**: Affinity Proteomics Expert specializes in targeted protein analysis. 
**Time**: 15-25 minutes for typical panel ### Step 1: Data Loading ```bash # Load Olink NPX data /read olink_npx_data.csv # Load antibody array data /read antibody_intensities.csv ``` ### Step 2: Quality Assessment ``` "Assess the quality of this Olink panel data and check for batch effects" ``` **Quality Metrics**: - **Coefficient of Variation**: Within and between batch CV - **Detection Rates**: Protein detectability across samples - **Control Performance**: Internal control assessment - **Batch Effects**: Systematic biases between runs ### Step 3: Statistical Analysis ``` "Compare protein levels between disease and healthy control groups" ``` **Analysis Features**: - **Linear Models**: Account for covariates - **Batch Correction**: ComBat or similar methods - **Multiple Testing**: FDR correction - **Effect Size**: Clinical significance assessment ### Complete Workflow Example ```bash # Load Olink data /read olink_cardiovascular_panel.csv # Comprehensive analysis "Analyze this Olink cardiovascular panel data: assess quality, check for batch effects, and identify proteins associated with cardiovascular disease status" ``` ## Multi-Omics Integration Workflow ### Workflow Overview **Goal**: Integrate multiple data modalities for comprehensive biological insights. **Agents**: Multiple agents coordinate for multi-modal analysis. **Time**: 30-60 minutes depending on complexity ### Step 1: Load Multiple Datasets ```bash # Load different modalities /read transcriptomics_data.h5ad /read proteomics_data.csv /read metabolomics_data.xlsx ``` ### Step 2: Data Integration ``` "Integrate the transcriptomics and proteomics data to identify coordinated changes across molecular layers" ``` **Integration Methods**: - **Sample Matching**: Align samples across modalities - **Feature Integration**: Multi-omics factor analysis - **Pathway Integration**: Combine evidence across layers - **Network Analysis**: Multi-layer biological networks ### Step 3: Coordinated Analysis ``` "Find genes and proteins that change together in response to treatment" ``` **Results**: - Correlation analysis across omics layers - Pathway-level integration - Multi-omics visualizations - Integrated statistical models ## Literature Integration Workflow ### Workflow Overview **Goal**: Integrate literature knowledge with experimental data analysis. **Agent**: Research Agent with **automatic PMID/DOI → PDF resolution** (v0.2+) and **structure-aware Docling parsing** (v0.2+). **Key Capabilities**: - **v0.2+**: Automatic resolution of PMIDs and DOIs to accessible PDFs (70-80% success rate) using tiered waterfall strategy: PMC → bioRxiv/medRxiv → Publisher → Alternative suggestions - **v0.2+**: Structure-aware PDF parsing with Docling for intelligent Methods section detection (>90% hit rate vs ~30% previously), complete section extraction, table and formula preservation, and document caching ### Step 1: Literature Search ``` "Find papers about single-cell RNA-seq analysis of liver fibrosis" ``` ### Step 2: Method Extraction (Enhanced with v0.2+ DOI Resolution) **Enhanced (v0.2+)**: Directly provide PMIDs or DOIs - automatic resolution to PDFs happens internally. **Enhanced (v0.2+)**: Robust DOI/PMID auto-detection and resolution with Docling format auto-detection. 
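As a rough illustration of what the auto-detection means in practice, the sketch below shows how bare DOIs, PMIDs, and URLs can be told apart. The patterns are illustrative only and are not Lobster's internal implementation.

```python
import re

# Illustrative patterns only - not Lobster's internal implementation.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+\b")            # e.g. 10.1101/2024.08.29.610467
PMID_PATTERN = re.compile(r"(?:PMID:?\s*)?(\d{7,8})\b", re.IGNORECASE)


def classify_identifier(text: str) -> str:
    """Guess whether a user-supplied string is a URL, a DOI, or a PMID."""
    text = text.strip()
    if text.startswith("http://") or text.startswith("https://"):
        return "url"
    if DOI_PATTERN.search(text):
        return "doi"
    if PMID_PATTERN.fullmatch(text):
        return "pmid"
    return "unknown"


print(classify_identifier("10.1038/s41586-025-09686-5"))  # doi
print(classify_identifier("39370688"))                     # pmid
```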
**All these formats now work seamlessly:** ```bash # Bare DOI (NEW - auto-detected and resolved) "Extract methods from 10.1101/2024.08.29.610467" # DOI with prefix "Extract methods from DOI:10.1038/s41586-025-09686-5" # PMID with or without prefix "Extract methods from PMID:39370688" "Extract methods from 39370688" # Direct URLs (existing behavior maintained) "Extract methods from https://www.nature.com/articles/s41586-025-09686-5" # PMC URLs (now correctly handled as HTML, not PDF) "Extract methods from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12496192/pdf/" ``` **Batch processing for competitive analysis:** ```bash "Extract methods from these papers: 10.1101/2024.01.001, PMID:12345678, DOI:10.1038/s41586-021-12345-6" ``` **Automatic handling**: - ✅ Accessible papers → Methods extracted immediately using Docling structure-aware parsing - ✅ Complete Methods sections extracted (no arbitrary truncation) - ✅ Parameter tables and formulas preserved - ✅ Results cached for fast repeat access - ❌ Paywalled papers → 5 alternative access strategies provided (PMC accepted manuscripts, preprints, institutional access, author contact, Unpaywall) **Quality Improvement (v0.2+)**: - Methods section detection: >90% success rate (vs ~30% with naive truncation) - Complete section extraction (no 10K character limit) - Table extraction: 80%+ of parameter tables detected - Smart image filtering: 40-60% context size reduction - Document caching: 30-50x faster on repeat access ### v0.2+ Enhancement: Robust DOI Resolution **What Changed:** The v0.2+ release fixed critical DOI/PMID resolution bugs and enhanced format detection: **✅ Fixed Issues:** - DOIs and PMIDs are now automatically detected and resolved - No more "URL not found" errors for valid DOIs (e.g., `10.18632/aging.204666`) - PMC URLs serving HTML content correctly handled (not misclassified as PDF) - Eliminated duplicate code paths in research agent **✅ New Capabilities:** - **Bare DOI input:** `"Extract methods from 10.1101/2024.01.001"` (no URL wrapper needed) - **Numeric PMID input:** `"Extract methods from 38448586"` (no "PMID:" prefix needed) - **Format auto-detection:** Docling determines HTML vs PDF automatically - **Graceful error handling:** Paywalled papers return helpful suggestions **Examples that now work reliably:** ```bash # These previously failed with FileNotFoundError, now work: "Extract methods from 10.1101/2024.01.001" # bioRxiv DOI "Extract methods from 38448586" # Numeric PMID "Extract methods from 10.18632/aging.204666" # Paywalled (graceful handling) # These work better with enhanced format detection: "Extract methods from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC..." # HTML auto-detected ``` **See also**: [37-publication-intelligence-deep-dive.md](37-publication-intelligence-deep-dive.md) for comprehensive Docling integration details. ### Step 3: Check Accessibility (Optional) For competitive analysis, check accessibility before extraction: ``` "Check if PMID:12345678 is accessible" ``` ### Step 4: Method Application ``` "Apply the methods from PMID:12345678 to analyze my data using their parameters" ``` ## GEO Database Integration Workflow ### Workflow Overview **Goal**: Download and analyze public datasets from GEO database. **Agent**: Data Expert handles GEO integration. ### Step 1: Dataset Discovery ``` "Find GEO datasets related to liver single-cell RNA-seq" ``` **Research Agent** will search GEO database and return relevant datasets with accession numbers. 
### Step 2: Pre-Download Metadata Validation (Recommended) **Before downloading large datasets**, validate that they contain the required metadata fields: ``` "Validate GSE200997 for required fields: cell_type, tissue" ``` **Or with specific value requirements**: ``` "Check if GSE179994 has treatment_response field with responder and non-responder values" ``` **What This Does**: - Fetches only metadata (no expression data download) - Analyzes sample characteristics from all samples - Checks field presence and coverage (% of samples) - Provides recommendation: proceed/skip/manual_check - Returns confidence score (0-1) **Example Validation Report**: ``` ## Metadata Validation Report for GSE200997 **Recommendation:** ✅ **PROCEED** **Confidence Score:** 1.00/1.00 **Total Samples:** 23 ### Field Analysis: - **cell_type**: ✅ 100.0% coverage (values: 'Colon,Right,Cecum', 'Colon,Left,Sigmoid', ...) - **tissue**: ✅ 100.0% coverage (values: 'Colorectal cancer') ### 💡 Recommendation Rationale: All required fields are present with sufficient coverage. Dataset is suitable for analysis. ``` **Why Validate First?**: - ⏱️ **Save time**: 2-5 seconds vs 5-30 minutes full download - 💾 **Save storage**: Avoid downloading datasets missing critical metadata - 🎯 **Better selection**: Compare metadata across multiple candidates - 📊 **Field coverage**: See actual sample-level completeness **Common Use Cases**: - Drug discovery: Validate treatment response fields - Biomarker studies: Check clinical outcome metadata - Multi-dataset analysis: Filter by metadata completeness - Time series: Verify timepoint field exists ### Step 3: Data Download Once validation confirms the dataset is suitable: ``` "Download GSE200997 and prepare it for analysis" ``` **Data Expert** will download expression data and create analysis-ready dataset. ### Step 4: Comparative Analysis ``` "Compare my results to the downloaded GEO dataset GSE200997" ``` ## Session Continuation and Workspace Management ### Overview Lobster AI v0.2+ includes powerful workspace management capabilities that allow you to save your analysis progress and seamlessly continue work across sessions. This is particularly useful for long-running analyses or when working with multiple datasets. ### Workspace Restoration Workflow #### Step 1: Check Current Workspace State Before starting any analysis session, check what data is currently loaded and what's available in your workspace: ```bash # Check currently loaded data /data # List available datasets in workspace /workspace list # Show comprehensive workspace information /workspace ``` **Natural Language Alternative**: ``` "What data do I have available in my workspace?" 
"Show me my current analysis session status" ``` #### Step 2: Restore Previous Session Use the `/restore` command to load datasets from previous sessions: ```bash # Restore most recent datasets (recommended for session continuation) /restore # Restore specific dataset by name /restore geo_gse123456_processed # Restore all datasets matching a pattern /restore geo_* # All GEO datasets /restore *single_cell* # All single-cell datasets /restore experiment_batch_2* # Specific experiment datasets # Restore all available datasets (use with caution for memory) /restore all ``` **Natural Language Alternative**: ``` "Continue my analysis from yesterday's session" "Load the GSE123456 dataset I was working on" "Restore all my single-cell datasets for comparison" ``` #### Step 3: Verify Restored Data After restoration, verify that your datasets are properly loaded: ```bash # Check loaded modalities /modalities # Get detailed data summary /data # List available plots from previous session /plots ``` ### Complete Session Continuation Example #### Scenario: Continuing Single-Cell Analysis ```bash # Day 1: Initial Analysis "Download and analyze GSE123456 single-cell data" # ... perform quality control, clustering, etc. /save # Save progress # Day 2: Continue Analysis /restore recent # System loads: geo_gse123456, geo_gse123456_filtered, geo_gse123456_clustered "Continue the differential expression analysis on the clustered data" # Agent automatically uses geo_gse123456_clustered for analysis ``` #### Scenario: Comparative Analysis Across Multiple Datasets ```bash # Load multiple related datasets for comparison /restore geo_gse123* # Loads multiple GSE datasets "Compare these datasets and identify common cell types" # Work with specific experiment batches /restore experiment_* "Perform batch correction across these experiment datasets" ``` #### Scenario: Project-Based Workflow ```bash # Organize by project patterns /restore liver_* # All liver-related datasets /restore *cancer_study* # All cancer study datasets /restore proteomics_* # All proteomics datasets "Integrate these liver datasets for multi-omics analysis" ``` ### Advanced Workspace Management #### Pattern Matching Best Practices | Use Case | Pattern | Example | |----------|---------|---------| | Continue recent work | `recent` | `/restore recent` | | Load specific dataset | `exact_name` | `/restore geo_gse123456_processed` | | Load by data type | `*type*` | `/restore *single_cell*` | | Load by experiment | `prefix*` | `/restore batch_2*` | | Load by source | `source_*` | `/restore geo_*` | #### Memory Management ```bash # Check memory usage before loading /modalities # See current memory usage # Load incrementally for large datasets /restore experiment_1* # Load first batch # Perform analysis /restore experiment_2* # Load second batch when needed ``` #### Data Organization Tips **Recommended Naming Conventions**: ``` geo_gse123456 # Raw GEO data geo_gse123456_filtered # After quality control geo_gse123456_clustered # After clustering geo_gse123456_annotated # With cell type annotations custom_liver_study_raw # Custom dataset custom_liver_study_processed # After processing ``` ### Integration with Analysis Workflows #### Single-Cell Workflow Continuation ```bash # Session 1: Initial processing "Download GSE123456 and perform quality control" /save # Session 2: Clustering analysis /restore recent "Perform clustering and find marker genes" /save # Session 3: Cell type annotation /restore recent "Annotate cell types based on marker genes" ``` #### 
Multi-Dataset Comparison Workflow ```bash # Load multiple datasets for comparison /restore geo_gse123456 geo_gse789012 custom_study "Compare these three datasets and identify batch effects" # Load by pattern for systematic comparison /restore *liver* "Perform integrated analysis of all liver datasets" ``` #### Cross-Session Plot Management ```bash # Restore data and plots from previous session /restore recent /plots # List available plots "Generate additional plots comparing the clustered results" # New plots are automatically saved to workspace ``` ### Natural Language Workspace Commands The data expert agent understands various natural language requests for workspace management: ``` "Load my recent datasets" "Continue my analysis from yesterday" "Load all the GEO datasets I downloaded" "Restore the liver study data for comparison" "What datasets do I have available?" "Load the processed single-cell data" "Continue working on the GSE123456 dataset" "Restore all my proteomics experiments" ``` ### Troubleshooting Workspace Issues #### Common Problems and Solutions **Dataset Not Found**: ```bash Problem: "Dataset 'my_dataset' not found" Solution: Check available datasets with /workspace list Verify spelling and use Tab completion ``` **Memory Issues**: ```bash Problem: System runs out of memory Solution: Use more specific patterns instead of /restore all Load datasets incrementally Check current usage with /modalities ``` **Outdated Workspace**: ```bash Problem: Restored data seems outdated Solution: Check workspace location with /workspace Verify you're in the correct project directory Use /workspace list to see available datasets ``` ### Best Practices for Session Management 1. **Regular Saves**: Use `/save` after major analysis steps 2. **Descriptive Names**: Use clear dataset names for easy pattern matching 3. **Incremental Loading**: Load datasets as needed to manage memory 4. **Verify Restoration**: Always check `/data` after restoration 5. **Organize by Project**: Use consistent naming patterns for related analyses 6. **Document Progress**: Keep track of analysis steps and parameters ## Advanced Workspace Management > **Version**: v0.2+ > **Prerequisites**: Basic workspace usage (see [Session Continuation and Workspace Management](#session-continuation-and-workspace-management)) While the basic workspace restoration features enable session continuation, advanced workspace management provides enterprise-grade capabilities for backup, migration, templating, analytics, cleanup, and multi-workspace orchestration. These features are critical for: - **Reproducibility**: Archive complete analysis environments - **Collaboration**: Share workspaces between team members - **Automation**: Template-based workflows for standardized pipelines - **Resource Management**: Monitor and optimize workspace storage - **Project Organization**: Manage multiple concurrent analyses ### 1. Workspace Backup and Restore #### Complete Workspace Backup Create a complete snapshot of your workspace including all datasets, provenance, and configurations. 
**Basic Backup:** ```bash # Backup current workspace to archive /workspace backup --name my_analysis_v1 --destination ./backups/ # With compression and metadata /workspace backup --name liver_study_final \ --destination ./backups/ \ --compress \ --include-metadata ``` **Natural Language Alternative:** ``` "Create a backup of my current workspace named liver_study_final" "Archive this workspace with all datasets and analysis history" ``` **What Gets Backed Up:** - ✅ All H5AD/MuData files in workspace - ✅ Provenance tracking history (W3C-PROV format) - ✅ Download queue state (JSONL) - ✅ Cached plots and visualizations - ✅ Workspace configuration and metadata - ✅ Analysis pipeline exports (Jupyter notebooks) - ❌ Large external files (can be optionally included) **Backup Structure:** ``` backups/ └── liver_study_final_20250116/ ├── workspace.tar.gz # Compressed workspace data ├── manifest.json # File inventory ├── provenance_graph.json # Complete W3C-PROV graph ├── metadata.json # Workspace info └── checksum.sha256 # Integrity verification ``` #### Incremental Backup For large workspaces, use incremental backups to save only changes since the last backup. ```bash # Initial full backup /workspace backup --name project_v1 --destination ./backups/ # Incremental backup (only changes) /workspace backup --name project_v2 \ --destination ./backups/ \ --incremental \ --base project_v1 ``` **Incremental Backup Benefits:** - 80-95% faster than full backups - 70-90% smaller backup size - Maintains complete restore capability - Delta compression using rsync-like algorithm #### Workspace Restore from Backup **Complete Restore:** ```bash # Restore from backup archive /workspace restore --source ./backups/liver_study_final_20250116/ # Restore to specific location /workspace restore --source ./backups/project_v2/ \ --destination ./new_workspace/ \ --verify-checksums ``` **Selective Restore:** ```bash # Restore only specific datasets /workspace restore --source ./backups/liver_study_final/ \ --datasets geo_gse123456,custom_liver_study # Restore datasets matching pattern /workspace restore --source ./backups/proteomics_study/ \ --pattern "*single_cell*" # Restore provenance only (for audit) /workspace restore --source ./backups/project_v1/ \ --provenance-only ``` **Verification After Restore:** ```bash # Verify backup integrity /workspace verify --source ./backups/liver_study_final/ # Compare restored workspace to original /workspace compare --workspace1 ./original/ \ --workspace2 ./restored/ ``` #### Automated Backup Strategies **Scheduled Backups:** ```python # In automation script or config from lobster.core.workspace_manager import WorkspaceBackupScheduler scheduler = WorkspaceBackupScheduler( workspace_path="./my_workspace", backup_dir="./backups", schedule="daily", # Options: hourly, daily, weekly retention_days=30, # Delete backups older than 30 days incremental=True, # Use incremental backups compress=True ) scheduler.start() ``` **Event-Triggered Backups:** ```python # Backup after major analysis steps from lobster.core.workspace_manager import WorkspaceManager wm = WorkspaceManager(workspace_path="./my_workspace") # Register backup trigger wm.register_backup_trigger( event="analysis_complete", backup_name_pattern="auto_{timestamp}", retention_count=10 # Keep last 10 backups ) ``` **Backup Best Practices:** | Scenario | Backup Frequency | Retention Period | Strategy | |----------|------------------|------------------|----------| | Active development | Hourly | 7 days | Incremental | | Production 
analysis | Daily | 30 days | Full + incremental | | Long-term archival | On completion | Indefinite | Full + compression | | Collaboration | Before handoff | Per project | Full + metadata | ### 2. Workspace Migration #### Local to Cloud Migration Migrate workspaces from local development to cloud infrastructure. **Migration Command:** ```bash # Migrate to S3-backed workspace /workspace migrate --source ./local_workspace/ \ --destination s3://my-bucket/workspaces/project_1/ \ --backend s3 \ --verify \ --dry-run # Test first # Execute migration /workspace migrate --source ./local_workspace/ \ --destination s3://my-bucket/workspaces/project_1/ \ --backend s3 \ --verify ``` **Natural Language Alternative:** ``` "Migrate my workspace to S3 storage for cloud analysis" "Move this workspace to cloud infrastructure" ``` **Migration Process:** 1. **Pre-migration Check**: Verify source workspace integrity 2. **Format Conversion**: Convert H5AD to cloud-optimized format if needed 3. **Data Transfer**: Upload with resumable transfers and checksums 4. **Provenance Migration**: Transfer W3C-PROV graph to cloud storage 5. **Configuration Update**: Update workspace config for cloud backend 6. **Verification**: Verify all data accessible in target location 7. **Cleanup** (optional): Remove local copies after verification #### Cross-Platform Migration Migrate between different operating systems or environments. **macOS → Linux Migration:** ```bash # Export workspace for Linux /workspace export --platform linux \ --destination ./linux_compatible_workspace.tar.gz # On Linux machine /workspace import --source ./linux_compatible_workspace.tar.gz \ --verify-platform ``` **Path Translation:** ```python # Automatic path translation during migration from lobster.core.workspace_migrator import WorkspaceMigrator migrator = WorkspaceMigrator() # Migrate with automatic path adjustment migrator.migrate( source_path="./workspace", target_path="/mnt/analysis/workspace", translate_paths=True, # Adjust absolute paths platform="linux", # Target platform preserve_symlinks=False # Convert symlinks to copies ) ``` #### Multi-User Environment Migration Migrate workspaces between users or teams with permission management. **Export for Sharing:** ```bash # Export with anonymization (remove personal paths) /workspace export --anonymize \ --include-data \ --format tar.gz \ --output shared_workspace.tar.gz # Export with access control metadata /workspace export --access-control \ --allowed-users user1,user2 \ --expiration-date 2025-12-31 ``` **Import with Permission Setup:** ```bash # Import to shared location /workspace import --source shared_workspace.tar.gz \ --destination /shared/workspaces/project_1/ \ --permissions group-rw \ --owner analysis_team ``` ### 3. Workspace Templates #### Creating Workspace Templates Templates enable standardized analysis pipelines and reproducible project structures. 
**Template Creation:** ```bash # Create template from existing workspace /workspace create-template --source ./my_workflow/ \ --name single_cell_qc_template \ --description "Standard single-cell QC pipeline" # Create template with parameterization /workspace create-template --source ./bulk_rnaseq_workflow/ \ --name bulk_rnaseq_template \ --parameters design_formula,contrast,fdr_threshold ``` **Template Structure:** ``` templates/ └── single_cell_qc_template/ ├── template.json # Template metadata ├── workspace_structure.yaml # Directory layout ├── analysis_pipeline.py # Analysis script template ├── config_schema.json # Configurable parameters └── example_config.yaml # Example configuration ``` **Template Definition (template.json):** ```json { "name": "single_cell_qc_template", "version": "1.0.0", "description": "Standard single-cell QC pipeline", "author": "Bioinformatics Team", "parameters": { "min_genes": { "type": "integer", "default": 200, "description": "Minimum genes per cell" }, "max_mito_pct": { "type": "float", "default": 20.0, "description": "Maximum mitochondrial percentage" }, "resolution": { "type": "float", "default": 0.5, "description": "Clustering resolution" } }, "expected_inputs": ["raw_counts.h5ad"], "expected_outputs": ["filtered.h5ad", "clustered.h5ad", "markers.csv"] } ``` #### Using Templates **Instantiate New Workspace from Template:** ```bash # Create workspace from template /workspace new --template single_cell_qc_template \ --name liver_study_2025 \ --parameters config.yaml # Create with inline parameters /workspace new --template bulk_rnaseq_template \ --name drug_treatment_study \ --param design_formula="~treatment+batch" \ --param contrast="treatment,drug,control" \ --param fdr_threshold=0.05 ``` **Configuration File (config.yaml):** ```yaml # Parameters for single_cell_qc_template min_genes: 250 max_mito_pct: 15.0 resolution: 0.4 tissue_type: "liver" organism: "human" ``` **Natural Language Template Usage:** ``` "Create a new workspace using the single-cell QC template for my liver study" "Set up a bulk RNA-seq analysis workspace using the standard template" ``` #### Template Library Management **List Available Templates:** ```bash # List all templates /workspace templates list # Search templates by tag /workspace templates search --tag single_cell /workspace templates search --tag proteomics ``` **Install Templates from Repository:** ```bash # Install from GitHub /workspace templates install \ --source https://github.com/omics-os/analysis-templates \ --name community_single_cell_v1 # Install from local file /workspace templates install --source ./custom_template.tar.gz ``` **Share Templates:** ```bash # Export template for sharing /workspace templates export \ --name my_custom_template \ --output ./my_template.tar.gz \ --include-examples # Publish to registry (future feature) /workspace templates publish \ --name my_custom_template \ --registry omics-os-registry \ --visibility public ``` ### 4. Workspace Analytics #### Workspace Health Monitoring Monitor workspace health, identify issues, and optimize performance. 
**Health Check:** ```bash # Comprehensive health check /workspace health-check # Detailed report with recommendations /workspace health-check --detailed --output health_report.json ``` **Health Check Report:** ``` === Workspace Health Report === Overall Status: 🟡 WARNING Workspace: /Users/tyo/analysis/liver_study Last Updated: 2025-01-16 14:30:00 📊 Storage Usage: Total Size: 15.2 GB Datasets: 12.8 GB (84%) Plots: 1.8 GB (12%) Provenance: 0.6 GB (4%) Warning: Approaching 80% of 20GB quota 📁 Dataset Health: Total Datasets: 24 ✅ Healthy: 22 (92%) ⚠️ Warnings: 2 (8%) - geo_gse123456_old: Not accessed in 60 days - temp_analysis: Missing provenance metadata 🔍 Provenance Integrity: ✅ Complete: 20 datasets ⚠️ Partial: 2 datasets ❌ Missing: 2 datasets 🚀 Performance Metrics: Average Load Time: 2.3s (Good) Cache Hit Rate: 76% (Good) Slow Queries: 3 identified 💡 Recommendations: 1. Archive or delete unused datasets (geo_gse123456_old) 2. Clean up temporary files (temp_analysis) 3. Run provenance repair on partial datasets 4. Consider upgrading to S3 backend for better performance ``` #### Storage Analytics **Storage Breakdown:** ```bash # Analyze storage usage by type /workspace storage-usage # Detailed analysis with visualization /workspace storage-usage --visualize --output storage_report.html ``` **Storage Usage Output:** ``` === Storage Usage Analysis === By Data Type: ┌─────────────────┬──────────┬────────┬─────────┐ │ Type │ Size │ Count │ % Total │ ├─────────────────┼──────────┼────────┼─────────┤ │ H5AD │ 10.5 GB │ 18 │ 69% │ │ MuData │ 2.3 GB │ 4 │ 15% │ │ Plots (HTML) │ 1.8 GB │ 156 │ 12% │ │ Provenance │ 0.6 GB │ 24 │ 4% │ └─────────────────┴──────────┴────────┴─────────┘ Top 10 Largest Datasets: 1. geo_gse200997_integrated (2.8 GB) 2. custom_liver_cohort_raw (1.9 GB) 3. geo_gse156793_processed (1.5 GB) ... Growth Trend (Last 30 Days): 📈 +3.2 GB total (+26% growth rate) Average: +107 MB/day Projection: At current growth, workspace will reach 80% quota in 42 days. ``` #### Dataset Usage Analytics **Access Patterns:** ```bash # Analyze dataset access patterns /workspace analytics access-patterns --days 30 # Identify unused datasets /workspace analytics find-unused --threshold-days 60 ``` **Access Pattern Report:** ``` === Dataset Access Patterns (Last 30 Days) === Most Accessed Datasets: 1. geo_gse123456_clustered (48 accesses, last: 1 hour ago) 2. custom_liver_study (32 accesses, last: 3 hours ago) 3. proteomics_batch_2 (21 accesses, last: 1 day ago) Least Accessed Datasets: 1. geo_gse987654_old (0 accesses, last: 87 days ago) ⚠️ 2. temp_analysis_v1 (0 accesses, last: 65 days ago) ⚠️ 3. exploratory_test (1 access, last: 45 days ago) 💡 Cleanup Candidates: - 3 datasets not accessed in >60 days (5.4 GB reclaimable) - 7 temporary datasets with "temp_" prefix (2.1 GB reclaimable) - Total potential savings: 7.5 GB (49% of current usage) ``` #### Provenance Analytics **Analyze Analysis Lineage:** ```bash # Visualize provenance graph /workspace analytics provenance-graph \ --dataset geo_gse123456_final \ --output lineage.html # Find dataset dependencies /workspace analytics dependencies \ --dataset geo_gse123456_final ``` **Dependency Graph Output:** ``` === Dataset Dependency Analysis === Dataset: geo_gse123456_final Direct Dependencies (3): ├─ geo_gse123456_clustered (parent) │ └─ geo_gse123456_filtered (parent) │ └─ geo_gse123456 (root) Processing Steps (5): 1. download_geo → geo_gse123456 2. assess_quality → geo_gse123456_qc 3. filter_normalize → geo_gse123456_filtered 4. 
cluster_leiden → geo_gse123456_clustered 5. annotate_cell_types → geo_gse123456_final Tools Used: (6 unique) - GEOService - QualityService - PreprocessingService - ClusteringService - AnnotationService - VisualizationService ``` ### 5. Cleanup Strategies #### Manual Cleanup **Identify Cleanup Candidates:** ```bash # Find datasets to clean up /workspace cleanup --dry-run \ --threshold-days 60 \ --min-size 500MB # Show what would be deleted /workspace cleanup --preview \ --unused-days 90 \ --temp-files ``` **Selective Cleanup:** ```bash # Delete specific datasets /workspace delete geo_gse987654_old temp_analysis_v1 # Delete by pattern /workspace delete "temp_*" # Delete old plots /workspace cleanup-plots --older-than 30d ``` **Safe Deletion with Backup:** ```bash # Archive before deletion /workspace delete geo_gse123456_old \ --archive ./archive/ \ --verify # Delete with confirmation /workspace delete "exploratory_*" \ --interactive # Prompt for each file ``` #### Automated Cleanup Policies **Define Cleanup Policy:** ```yaml # cleanup_policy.yaml policies: - name: delete_old_temp description: "Delete temporary files older than 7 days" conditions: pattern: "temp_*" age_days: 7 action: delete - name: archive_unused description: "Archive datasets unused for 60 days" conditions: unused_days: 60 min_size_mb: 100 action: archive destination: ./archive/ - name: compress_old_plots description: "Compress plots older than 30 days" conditions: type: plot age_days: 30 action: compress schedule: daily # Run daily at midnight retention: deleted_log: 90 # Keep deletion log for 90 days ``` **Apply Policy:** ```bash # Apply cleanup policy /workspace apply-policy cleanup_policy.yaml --dry-run /workspace apply-policy cleanup_policy.yaml # Run specific policy /workspace apply-policy cleanup_policy.yaml --policy delete_old_temp ``` #### Quota Management **Set Storage Quotas:** ```bash # Set workspace quota /workspace set-quota --size 20GB --warn-at 80% # Set quota by dataset type /workspace set-quota --type h5ad --size 15GB \ --type plots --size 3GB \ --type provenance --size 2GB ``` **Quota Enforcement:** ```python # Automatic quota enforcement from lobster.core.workspace_manager import WorkspaceManager wm = WorkspaceManager(workspace_path="./my_workspace") # Enable quota enforcement wm.set_quota( total_size_gb=20, warn_threshold_pct=80, block_threshold_pct=95, auto_cleanup=True, # Auto-delete old temp files cleanup_policy="cleanup_policy.yaml" ) # Quota will automatically trigger cleanup when 80% reached ``` ### 6. 
Multi-Workspace Workflows #### Managing Multiple Workspaces **Workspace Registry:** ```bash # List all workspaces /workspace list-all # Register new workspace /workspace register --path ./project_1/ --name liver_study /workspace register --path ./project_2/ --name cancer_analysis # Switch between workspaces /workspace switch liver_study /workspace switch cancer_analysis # Show active workspace /workspace current ``` **Workspace Registry Output:** ``` === Registered Workspaces === ┌───────────────────┬─────────────────────────┬──────────┬──────────┐ │ Name │ Path │ Size │ Status │ ├───────────────────┼─────────────────────────┼──────────┼──────────┤ │ liver_study ● │ ./project_1/ │ 15.2 GB │ Active │ │ cancer_analysis │ ./project_2/ │ 8.7 GB │ Inactive │ │ proteomics_cohort │ ./project_3/ │ 12.1 GB │ Inactive │ └───────────────────┴─────────────────────────┴──────────┴──────────┘ Total: 3 workspaces, 36.0 GB used ``` #### Cross-Workspace Data Sharing **Link Datasets Between Workspaces:** ```bash # Link dataset from another workspace (read-only) /workspace link --source liver_study:geo_gse123456 \ --target current \ --mode readonly # Copy dataset to current workspace /workspace copy --source cancer_analysis:processed_cohort \ --target current ``` **Natural Language Alternative:** ``` "Link the GSE123456 dataset from my liver_study workspace" "Copy the processed cohort data from cancer_analysis workspace" ``` #### Workspace Comparison **Compare Workspaces:** ```bash # Compare two workspaces /workspace compare liver_study cancer_analysis # Compare datasets /workspace compare-datasets \ --workspace1 liver_study:geo_gse123456_final \ --workspace2 cancer_analysis:geo_gse987654_final ``` **Comparison Report:** ``` === Workspace Comparison === Workspace 1: liver_study Workspace 2: cancer_analysis Datasets: Unique to liver_study: 12 Unique to cancer_analysis: 8 Shared (by name): 4 - geo_gse111111 - custom_controls - reference_atlas - quality_standards Storage: liver_study: 15.2 GB cancer_analysis: 8.7 GB Difference: +6.5 GB (75% larger) Analysis Pipelines: Common tools used: 8 Unique to liver_study: 3 (trajectory analysis, pseudobulk, enrichment) Unique to cancer_analysis: 2 (survival analysis, CNV detection) ``` #### Workspace Synchronization **Sync Workspaces Across Machines:** ```bash # Push workspace to remote /workspace sync --push \ --destination s3://backup/workspaces/liver_study/ # Pull workspace updates from remote /workspace sync --pull \ --source s3://backup/workspaces/liver_study/ \ --strategy merge # or 'overwrite' # Bidirectional sync /workspace sync --bidirectional \ --remote s3://backup/workspaces/liver_study/ ``` **Sync Strategies:** | Strategy | Description | Use Case | |----------|-------------|----------| | `merge` | Combine changes from both sides | Collaborative work | | `overwrite` | Replace local with remote | Reset to known state | | `mirror` | Exact copy (delete removed files) | Backup/disaster recovery | | `incremental` | Only transfer changes | Bandwidth optimization | #### Multi-Workspace Batch Operations **Batch Commands Across Workspaces:** ```bash # Run cleanup on all workspaces /workspace foreach --command cleanup --args "--dry-run --unused-days 60" # Backup all workspaces /workspace foreach --command backup --args "--destination ./backups/" # Health check all workspaces /workspace foreach --command health-check --output health_summary.json ``` **Aggregate Reporting:** ```bash # Generate report across all workspaces /workspace aggregate-report --output 
workspace_summary.html # Monitor all workspaces /workspace monitor --refresh-interval 60s # Live dashboard ``` ### Best Practices for Advanced Workspace Management #### Backup Strategy 1. **3-2-1 Rule**: 3 copies, 2 different media types, 1 offsite ```bash # Local backup /workspace backup --name daily_backup --destination ./local_backup/ # Remote backup (different medium) /workspace backup --name daily_backup --destination s3://backup/ # Archive important milestones (offsite) /workspace backup --name milestone_v1 --destination gs://archive/ ``` 2. **Incremental Backups for Active Projects**: Save time and space 3. **Full Backups for Milestones**: Before publication, major releases 4. **Automated Schedules**: Daily incrementals, weekly fulls #### Migration Planning 1. **Test Migrations**: Always use `--dry-run` first 2. **Verify Integrity**: Use checksums and validation 3. **Document Paths**: Record absolute paths for reproducibility 4. **Maintain Provenance**: Ensure provenance transfers correctly #### Template Design 1. **Parameterize Everything**: Max flexibility for reuse 2. **Include Examples**: Provide sample configurations 3. **Version Templates**: Track template evolution 4. **Document Assumptions**: Specify expected input formats #### Monitoring and Analytics 1. **Regular Health Checks**: Weekly for active projects 2. **Set Quotas Early**: Prevent runaway storage growth 3. **Track Access Patterns**: Identify unused data 4. **Review Provenance**: Ensure analysis lineage is complete #### Cleanup Guidelines 1. **Archive Before Delete**: Preserve data you might need later 2. **Use Policies**: Automated cleanup reduces manual work 3. **Interactive Mode**: For important deletions, use `--interactive` 4. **Log Deletions**: Maintain audit trail of cleaned data #### Multi-Workspace Organization 1. **Clear Naming**: Use descriptive workspace names 2. **Logical Separation**: One workspace per project or dataset 3. **Shared Standards**: Use templates for consistency 4. **Regular Sync**: Keep remote backups synchronized ## Workflow Best Practices ### General Principles 1. **Start with Data Quality**: Always assess data quality before analysis 2. **Iterative Approach**: Build analysis step-by-step 3. **Parameter Documentation**: Keep track of analysis parameters 4. **Validation**: Cross-validate results with multiple methods 5. **Visualization**: Generate plots at each major step ### Quality Control Guidelines 1. **Check Data Distribution**: Ensure appropriate data characteristics 2. **Assess Missing Values**: Handle missing data appropriately 3. **Batch Effect Detection**: Look for systematic biases 4. **Outlier Identification**: Handle outliers appropriately 5. **Normalization Validation**: Verify normalization effectiveness ### Statistical Considerations 1. **Multiple Testing Correction**: Always apply appropriate corrections 2. **Effect Size Reporting**: Report both significance and effect size 3. **Confidence Intervals**: Provide uncertainty estimates 4. **Sample Size Assessment**: Ensure adequate statistical power 5. **Assumption Validation**: Check statistical model assumptions ### Reproducibility Guidelines 1. **Parameter Recording**: Document all analysis parameters 2. **Version Control**: Track software and data versions 3. **Random Seeds**: Set seeds for reproducible results 4. **Session Export**: Save complete analysis sessions 5. 
**Method Documentation**: Record rationale for method choices ## Troubleshooting Common Issues ### Data Loading Problems **Issue**: File format not recognized ``` # Solution: Check file format and convert if necessary "Convert this Excel file to a format suitable for analysis" ``` **Issue**: Large file loading slowly ``` # Solution: Use streaming or chunked loading "Load this large dataset efficiently in chunks" ``` ### Analysis Issues **Issue**: Poor clustering results ``` # Solution: Adjust parameters or try different methods "The clusters look over-fragmented, can you try different resolution parameters?" ``` **Issue**: No significant results ``` # Solution: Check power and adjust thresholds "I'm not getting significant results, can you assess the statistical power and suggest improvements?" ``` ### Interpretation Challenges **Issue**: Unexpected biological results ``` # Solution: Literature validation and quality assessment "These results seem unexpected, can you check the literature and validate the analysis?" ``` **Issue**: Complex statistical output ``` # Solution: Request explanation and visualization "Can you explain these statistics in simpler terms and create visualizations?" ``` This comprehensive workflow guide covers the major analysis types supported by Lobster AI. Each workflow can be customized based on specific research questions and data characteristics.