# Protein Structure Visualization Expert Agent

**Since v0.2** - Protein structure analysis with PyMOL visualization and BioPython integration

**Agent Name**: `protein_structure_visualization_expert_agent`
**Display Name**: Protein Structure Visualization Expert
**Factory Function**: `lobster.agents.protein_structure_visualization_expert.protein_structure_visualization_expert`

## Overview

The Protein Structure Visualization Expert is a specialized agent for fetching, visualizing, and analyzing 3D protein structures from the RCSB Protein Data Bank (PDB). It integrates PyMOL (open-source) for high-quality molecular visualizations and BioPython for structural analysis, enabling seamless linking between protein structures and omics datasets.

**Version Note**: This agent requires Lobster v0.2+ and is fully supported in both local and cloud modes (with limited interactive visualization in cloud).

### Key Features

- **PDB Structure Fetching**: Download protein structures by PDB ID with comprehensive metadata
- **PyMOL Integration**: Generate professional 3D visualizations with customizable styles and colors
- **Structural Analysis**: Calculate RMSD, secondary structure, geometry, and residue contacts
- **Omics Integration**: Link protein structures to gene expression and proteomics data
- **Structure Comparison**: Compare multiple protein structures and calculate structural similarity
- **Provenance Tracking**: Full W3C-PROV compliant logging with Intermediate Representation (IR)

---

## Architecture

### Services (Stateless, 3-Tuple Pattern)

#### 1. ProteinStructureFetchService

**Location**: `lobster/tools/protein_structure_fetch_service.py`

Handles fetching protein structures from RCSB PDB with caching and metadata extraction.

**Methods**:
- `fetch_structure(pdb_id, format='cif', cache_dir, extract_metadata)` → Tuple[Dict, Dict, AnalysisStep]
- `link_structures_to_genes(adata, gene_column, organism, max_structures_per_gene)` → Tuple[AnnData, Dict, AnalysisStep]

**Features**:
- PDB ID format validation (4-character alphanumeric)
- Automatic caching to avoid redundant downloads
- BioPython-based structure parsing
- Metadata extraction (resolution, organism, experiment method)
- Gene-to-structure mapping via PDB search API

#### 2. PyMOLVisualizationService

**Location**: `lobster/tools/pymol_visualization_service.py`

Creates high-quality 3D visualizations using PyMOL (open-source).

**Methods**:
- `visualize_structure(structure_file, mode, style, color_by, output_image, width, height, execute_commands)` → Tuple[Dict, Dict, AnalysisStep]
- `check_pymol_installation()` → Dict[str, Any]

**Features**:
- Multiple representation styles: cartoon, surface, sticks, spheres, ribbon, lines
- Multiple coloring schemes: chain, secondary_structure, bfactor, element
- Interactive and batch modes (GUI or headless image generation)
- PyMOL command script generation (`.pml` files)
- Automatic PyMOL installation detection
- Graceful fallback when PyMOL is not installed
- High-resolution image export (customizable dimensions)
- Non-blocking GUI launch for interactive exploration

#### 3. StructureAnalysisService

**Location**: `lobster/tools/structure_analysis_service.py`

Performs structural analysis using BioPython.

**Methods**:
- `analyze_structure(structure_file, analysis_type, chain_id)` → Tuple[Dict, Dict, AnalysisStep]
- `calculate_rmsd(structure_file1, structure_file2, chain_id1, chain_id2, align)` → Tuple[Dict, Dict, AnalysisStep]

**Features**:
- Secondary structure analysis (DSSP integration with fallback)
- Geometric properties (center of mass, radius of gyration)
- Residue contact analysis (spatial proximity)
- RMSD calculation with optional superposition alignment
- BioPython Superimposer for structural alignment

---

## Agent Tools

### 1. fetch_protein_structure

**Purpose**: Download protein structure from RCSB PDB

**Parameters**:
- `pdb_id` (str, required): PDB identifier (e.g., '1AKE', '4HHB')
- `format` (str, default='cif'): File format ('pdb' or 'cif')

**Returns**: Summary with metadata, file paths, and structural properties

**Example**:
```python
fetch_protein_structure("1AKE")
fetch_protein_structure("4HHB", format="pdb")
```

**Output Includes**:
- PDB ID, title, organism
- Experiment method and resolution
- Number of chains, residues, atoms
- File path and size
- Publication DOI and citation

---

### 2. link_to_expression_data

**Purpose**: Link gene expression data to protein structures

**Parameters**:
- `modality_name` (str, required): Name of modality with gene/protein data
- `gene_column` (str, default='gene_symbol'): Column in adata.var with gene symbols
- `organism` (str, default='Homo sapiens'): Source organism for structure search
- `max_structures_per_gene` (int, default=5): Maximum structures per gene

**Returns**: Summary of structure links created

**Example**:
```python
link_to_expression_data("rna_seq_normalized")
link_to_expression_data("proteomics_data", gene_column="protein_name", organism="Mus musculus")
```

**Output Includes**:
- Genes searched and genes with structures found
- Total structures found and average per gene
- New modality name with structure links
- Columns added: `pdb_structures` (comma-separated PDB IDs), `has_structure` (boolean)

---

### 3. visualize_with_pymol

**Purpose**: Create high-quality 3D visualization using PyMOL

**Parameters**:
- `pdb_id` (str, required): PDB ID of structure (must be fetched first)
- `mode` (str, default='interactive'): Execution mode
  - Options: 'interactive' (launch GUI for exploration), 'batch' (save PNG and exit)
- `style` (str, default='cartoon'): Representation style
  - Options: 'cartoon', 'surface', 'sticks', 'spheres', 'ribbon', 'lines'
- `color_by` (str, default='chain'): Coloring scheme
  - Options: 'chain', 'secondary_structure', 'bfactor', 'element'
- `width` (int, default=1920): Image width in pixels
- `height` (int, default=1080): Image height in pixels
- `execute` (bool, default=True): Execute PyMOL commands if installed
- `highlight_residues` (str, optional): Residues to highlight (e.g., "15,42,89" or "A:15-20,B:42")
- `highlight_color` (str, default='red'): Color for highlighted residues
- `highlight_style` (str, default='sticks'): Visualization style for highlights
- `highlight_groups` (str, optional): Multiple highlight groups (format: "residues|color|style;...")

**Returns**: Visualization metadata with file paths and execution status

**Examples**:
```python
# Basic visualization
visualize_with_pymol("1AKE")  # Interactive mode by default
visualize_with_pymol("4HHB", mode="batch", style="surface", color_by="bfactor")
visualize_with_pymol("1AKE", mode="interactive")  # Launch GUI for exploration

# Residue highlighting - Single group
visualize_with_pymol("1AKE", highlight_residues="15,42,89", highlight_color="red", highlight_style="sticks")

# Residue highlighting - Chain-specific
visualize_with_pymol("4HHB", highlight_residues="A:15-20,B:30-35", highlight_color="yellow")

# Residue highlighting - Multiple groups
visualize_with_pymol("1AKE", highlight_groups="15,42|red|sticks;100-120|blue|surface;200,215|green|spheres")
```

**Output Includes**:
- Visualization settings (mode, style, color scheme, dimensions)
- Command script path (`.pml` file)
- Output image path (`.png` file)
- Execution status and PyMOL installation info
- Process ID (PID) for interactive mode

---

### 4. analyze_protein_structure

**Purpose**: Analyze protein structure properties

**Parameters**:
- `pdb_id` (str, required): PDB ID of structure (must be fetched first)
- `analysis_type` (str, default='secondary_structure'): Type of analysis
  - Options: 'secondary_structure', 'geometry', 'residue_contacts'
- `chain_id` (str, optional): Specific chain to analyze (None for all chains)

**Returns**: Analysis results with structural properties

**Example**:
```python
analyze_protein_structure("1AKE")
analyze_protein_structure("4HHB", analysis_type="geometry")
analyze_protein_structure("1AKE", analysis_type="residue_contacts", chain_id="A")
```

**Analysis Types**:

#### Secondary Structure
- Helix, sheet, coil percentages
- Per-residue secondary structure assignments
- Requires DSSP binary (with fallback)

#### Geometry
- Total atoms and chains
- Center of mass
- Radius of gyration
- Per-chain geometric properties

#### Residue Contacts
- Total residue-residue contacts (default cutoff: 8 Å)
- Average contacts per residue
- Contact distance matrix

---

### 5. compare_structures

**Purpose**: Compare two protein structures by RMSD

**Parameters**:
- `pdb_id1` (str, required): First PDB ID (must be fetched)
- `pdb_id2` (str, required): Second PDB ID (must be fetched)
- `align` (bool, default=True): Align structures before RMSD calculation
- `chain_id1` (str, optional): Specific chain in first structure
- `chain_id2` (str, optional): Specific chain in second structure

**Returns**: RMSD and structural comparison results

**Example**:
```python
compare_structures("1AKE", "4AKE")
compare_structures("1AKE", "4AKE", align=False)
compare_structures("4HHB", "2HHB", chain_id1="A", chain_id2="A")
```

**RMSD Interpretation**:
- **< 1.0 Å**: Nearly identical structures
- **1-2 Å**: Very similar (close homologs, small conformational changes)
- **2-3 Å**: Similar (homologs, moderate conformational changes)
- **3-5 Å**: Moderately similar (distant homologs, domain movements)
- **> 5 Å**: Different structures (large conformational changes)

---

## Workflows

### Basic Workflow: Fetch and Visualize

```plaintext
1. User: "Visualize protein structure 1AKE"
2. Supervisor → Protein Structure Visualization Expert
3. Agent: fetch_protein_structure("1AKE")
4. Agent: visualize_with_pymol("1AKE", mode="interactive", style="cartoon")
5. Agent → Supervisor: Results with visualization paths
6. Supervisor → User: Visualization complete
```

### Advanced Workflow: Link Structures to Expression Data

```plaintext
1. User: "Link structures to my RNA-seq data"
2. Supervisor → Protein Structure Visualization Expert
3. Agent: link_to_expression_data("rna_seq_normalized", organism="Homo sapiens")
4. Agent creates new modality with structure mappings
5. Agent → Supervisor: Linking results (e.g., "50 genes linked to 75 structures")
6. Supervisor → User: Structure links created
```

### Comparative Workflow: RMSD Analysis

```plaintext
1. User: "Compare structures 1AKE and 4AKE"
2. Supervisor → Protein Structure Visualization Expert
3. Agent: fetch_protein_structure("1AKE")
4. Agent: fetch_protein_structure("4AKE")
5. Agent: compare_structures("1AKE", "4AKE", align=True)
6. Agent → Supervisor: RMSD results (e.g., "RMSD = 1.2 Å, very similar")
7. Supervisor → User: Comparison complete
```

---

## PyMOL Installation

### Why PyMOL?

PyMOL is a professional open-source molecular visualization tool that provides:
- High-quality molecular graphics
- Publication-ready images
- Comprehensive visualization commands
- Python API for automation
- Interactive GUI mode for real-time exploration
- Active open-source community

### Automated Installation (Recommended)

#### Docker Container (Cloud Deployments)

PyMOL is **pre-installed** in the Lobster Docker image. No action required.

**Verify installation**:
```bash
docker run -it omicsos/lobster:latest pymol -c -Q
```

#### Local Development (macOS/Linux)

Install PyMOL via Makefile target:

```bash
# Install PyMOL automatically
make install-pymol
```

This command will:
- Detect your operating system (macOS or Linux)
- Install PyMOL via the appropriate package manager
- Verify the installation

**What it does**:
- **macOS**: Uses Homebrew with brewsci/bio tap
- **Linux**: Uses apt-get (Ubuntu/Debian) or dnf (Fedora/RHEL)
- **Homebrew on Linux**: Fallback if native package manager unavailable

**Requirements**:
- macOS: Homebrew must be installed
- Linux: sudo access for package installation

**Installation output**:
```bash
$ make install-pymol
🔬 Installing PyMOL for protein structure visualization...
🍎 macOS detected - Installing via Homebrew...
📦 Installing PyMOL...
✅ PyMOL installed successfully!

🎉 PyMOL installation complete!
💡 Test with: pymol -c -Q
```

### Manual Installation (Fallback)

If automated installation is not available or fails, you can install PyMOL manually.

#### macOS
```bash
# Install via Homebrew (recommended)
brew install brewsci/bio/pymol

# Or download from official website
# https://pymol.org/

# After installation, PyMOL is automatically added to PATH
```

#### Linux
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install pymol

# Or via Homebrew on Linux
brew install brewsci/bio/pymol

# Arch Linux
sudo pacman -S pymol

# Fedora/CentOS/RHEL
sudo dnf install pymol
```

#### Windows
```plaintext
1. Download installer from https://pymol.org/
2. Run installer and follow instructions
3. PyMOL executable will be added to Start Menu
4. Optionally add to PATH via System Environment Variables
```

### Manual Execution Without Installation

Even without PyMOL installed, the agent generates `.pml` command scripts that can be:
- Executed manually when PyMOL is installed
- Modified for custom visualizations
- Used as templates for batch processing

**Examples**:
```bash
# Interactive mode (with GUI)
pymol 1AKE_cartoon_chain_commands.pml

# Batch mode (headless, save image and exit)
pymol -c 1AKE_cartoon_chain_commands.pml
```

---

## Integration with Omics Workflows

### Single-Cell RNA-seq Integration

Link protein structures to highly expressed genes:

```plaintext
1. Run single-cell analysis (clustering, DE analysis)
2. Identify top expressed genes
3. Use link_to_expression_data() to find structures
4. Visualize structures for key marker genes
```

### Proteomics Integration

Link structures to identified proteins:

```plaintext
1. Run proteomics analysis (quantification, DE)
2. Identify significantly changing proteins
3. Use link_to_expression_data() with protein_name column
4. Compare structures of protein variants
```

### Multi-Omics Integration

Cross-reference structures across modalities:

```plaintext
1. Link structures to both RNA-seq and proteomics
2. Identify genes/proteins with structures in both datasets
3. Visualize structures colored by expression levels
4. Compare structural features with functional changes
```

---

## Performance and Caching

### Structure Caching

- **First fetch**: Downloads from PDB, stores in `protein_structures/` directory
- **Subsequent fetches**: Uses cached file (instant)
- **Cache location**: Workspace directory or current directory
- **Cache benefits**: Avoids redundant downloads, faster workflow iterations

### PDB Provider Rate Limits

- **Rate limit**: 5 requests/second (RCSB PDB API limit)
- **No authentication**: Public PDB API requires no API key
- **Batch operations**: Use link_to_expression_data() for efficient batch queries

### PyMOL Performance

- **Command scripts**: Generated instantly (no execution delay)
- **Interactive mode**: GUI launches in 2-5 seconds (non-blocking)
- **Batch mode (image generation)**: 5-30 seconds per structure (if PyMOL is installed)
- **Headless mode**: PyMOL runs without GUI for automation (use `pymol -c`)
- **Parallel execution**: Multiple structures can be visualized in parallel

---

## Error Handling

### Common Errors and Solutions

#### 1. Invalid PDB ID
**Error**: `Invalid PDB ID format: XYZ. Must be 4 alphanumeric characters.`

**Solution**: Ensure PDB ID is exactly 4 characters (e.g., '1AKE', not '1AK' or '1AKEE')

#### 2. Structure Not Found
**Error**: `Failed to download structure 1XYZ from PDB`

**Solution**: Verify PDB ID exists at https://www.rcsb.org/structure/1XYZ

#### 3. PyMOL Not Installed
**Error**: `PyMOL not found. Install with: brew install brewsci/bio/pymol`

**Solution**: Install PyMOL or use generated command scripts manually

#### 4. Gene Column Not Found
**Error**: `Gene column 'gene_symbol' not found in adata.var`

**Solution**: Check available columns with `adata.var.columns` and specify correct column name

#### 5. DSSP Not Available
**Warning**: `DSSP not available. Using simplified analysis.`

**Solution**: Install DSSP for secondary structure analysis:
```bash
conda install -c salilab dssp
```

---

## API Reference

### ProteinStructureFetchService

```python
from lobster.tools.protein_structure_fetch_service import ProteinStructureFetchService

service = ProteinStructureFetchService()

# Fetch structure
structure_data, stats, ir = service.fetch_structure(
    pdb_id="1AKE",
    format="cif",
    cache_dir=Path("protein_structures"),
    extract_metadata=True,
    data_manager=data_manager
)

# Link structures to genes
adata_linked, stats, ir = service.link_structures_to_genes(
    adata=adata,
    gene_column="gene_symbol",
    organism="Homo sapiens",
    max_structures_per_gene=5,
    data_manager=data_manager
)
```

### PyMOLVisualizationService

```python
from lobster.tools.pymol_visualization_service import PyMOLVisualizationService

service = PyMOLVisualizationService()

# Check installation
install_status = service.check_pymol_installation()

# Create visualization (batch mode - save PNG)
viz_data, stats, ir = service.visualize_structure(
    structure_file=Path("1AKE.cif"),
    mode="batch",
    style="cartoon",
    color_by="chain",
    output_image=Path("output.png"),
    width=1920,
    height=1080,
    execute_commands=True
)

# Or interactive mode (launch GUI)
viz_data, stats, ir = service.visualize_structure(
    structure_file=Path("1AKE.cif"),
    mode="interactive",
    style="cartoon",
    color_by="chain",
    execute_commands=True
)
```

### StructureAnalysisService

```python
from lobster.tools.structure_analysis_service import StructureAnalysisService

service = StructureAnalysisService()

# Analyze structure
analysis_results, stats, ir = service.analyze_structure(
    structure_file=Path("1AKE.cif"),
    analysis_type="secondary_structure",
    chain_id="A"
)

# Calculate RMSD
rmsd_results, stats, ir = service.calculate_rmsd(
    structure_file1=Path("1AKE.cif"),
    structure_file2=Path("4AKE.cif"),
    align=True
)
```

---

## Best Practices

### 1. PDB ID Validation
Always use uppercase 4-character PDB IDs:
```python
# Good
fetch_protein_structure("1AKE")

# Bad
fetch_protein_structure("1ake")  # Works but not consistent
fetch_protein_structure("1AK")   # Error: too short
```

### 2. Structure Caching
Leverage caching for iterative workflows:
```python
# First run: downloads structure
fetch_protein_structure("1AKE")

# Subsequent runs: uses cache (instant)
visualize_with_pymol("1AKE", mode="interactive", style="cartoon")
visualize_with_pymol("1AKE", mode="batch", style="surface")  # No re-download
```

### 3. PyMOL Fallback
Generate scripts even without PyMOL:
```python
# Script generation always works
visualize_with_pymol("1AKE", execute=False)

# Execute manually later when PyMOL is installed
# Interactive mode: pymol 1AKE_commands.pml
# Batch mode: pymol -c 1AKE_commands.pml
```

### 4. Gene-Structure Linking
Search by organism for better results:
```python
# Specific organism
link_to_expression_data("adata", organism="Homo sapiens")

# Mouse data
link_to_expression_data("adata", organism="Mus musculus")
```

### 5. RMSD Interpretation
Use alignment for meaningful comparisons:
```python
# With alignment (recommended)
compare_structures("1AKE", "4AKE", align=True)

# Without alignment (only for pre-aligned structures)
compare_structures("1AKE", "4AKE", align=False)
```

---

## Provenance and Reproducibility

All structure operations generate Intermediate Representation (IR) with:
- **Operation**: Specific operation performed (e.g., 'pdb.fetch_structure')
- **Parameters**: All parameters used (pdb_id, format, style, etc.)
- **Code Template**: Jinja2 template for notebook export
- **Imports**: Required Python imports
- **Parameter Schema**: Papermill-injectable parameters with validation

**Notebook Export**:
```python
# Export pipeline to Jupyter notebook
data_manager.export_notebook("protein_structure_pipeline.ipynb")

# Execute notebook with different PDB ID
papermill protein_structure_pipeline.ipynb output.ipynb -p pdb_id "4HHB"
```

---

## Troubleshooting

### Issue: "Structure file not found"
- Ensure structure was fetched first with `fetch_protein_structure()`
- Check cache directory permissions
- Verify file path in structure_data dictionary

### Issue: "PyMOL execution timed out"
- Large structures may take longer to render (batch mode)
- Increase timeout in service configuration
- Use `execute=False` to generate script without execution
- For interactive mode, the GUI may take 2-5 seconds to launch

### Issue: "No structures found for genes"
- Check organism name (use Latin names: "Homo sapiens", "Mus musculus")
- Verify gene symbols are standard (HGNC for human, MGI for mouse)
- Try reducing `max_structures_per_gene` for faster queries

### Issue: "RMSD calculation failed"
- Ensure both structures have been fetched
- Check chain IDs exist in structures
- Verify structures have matching residues (homologs, not random proteins)

---

## Examples

### Example 1: Basic Structure Visualization

```python
# Fetch and visualize adenylate kinase
fetch_protein_structure("1AKE")
visualize_with_pymol("1AKE", mode="interactive", style="cartoon", color_by="secondary_structure")
```

### Example 2: Comparative Analysis

```python
# Compare open and closed conformations of adenylate kinase
fetch_protein_structure("1AKE")  # Open form
fetch_protein_structure("4AKE")  # Closed form
compare_structures("1AKE", "4AKE", align=True)
# Output: RMSD = 1.2 Å (moderate conformational change)
```

### Example 3: RNA-seq Integration

```python
# After RNA-seq analysis
link_to_expression_data("rna_seq_normalized", organism="Homo sapiens")

# Visualize top expressed genes with structures
# Filter: adata[adata.var['has_structure']]
```

### Example 4: Protein Family Analysis

```python
# Fetch multiple family members
fetch_protein_structure("1AKE")
fetch_protein_structure("2AKE")
fetch_protein_structure("3AKE")

# Pairwise RMSD comparisons
compare_structures("1AKE", "2AKE")
compare_structures("1AKE", "3AKE")
compare_structures("2AKE", "3AKE")
```

### Example 5: Residue Highlighting for Disease Mutations and Functional Sites

```python
# Fetch structure
fetch_protein_structure("1AKE")

# Example 1: Highlight disease mutation sites in red
# Single residue group - useful for showing known pathogenic variants
visualize_with_pymol(
    "1AKE",
    mode="batch",
    style="cartoon",
    color_by="chain",
    highlight_residues="15,42,89",
    highlight_color="red",
    highlight_style="sticks"
)

# Example 2: Chain-specific highlighting for protein-protein interfaces
# Highlight interface residues in hemoglobin subunits
fetch_protein_structure("4HHB")
visualize_with_pymol(
    "4HHB",
    highlight_residues="A:15-20,A:42,B:30-35,B:50",
    highlight_color="yellow",
    highlight_style="sticks"
)

# Example 3: Multiple highlight groups for complex functional annotation
# Show binding site (red), catalytic residues (blue), and allosteric site (green)
visualize_with_pymol(
    "1AKE",
    mode="interactive",  # Launch GUI for interactive exploration
    highlight_groups="15,42,89|red|sticks;100-120|blue|surface;200,215,230|green|spheres"
)

# Example 4: Combining with different color schemes
# Highlight active site residues while showing B-factors for the rest
visualize_with_pymol(
    "1AKE",
    style="cartoon",
    color_by="bfactor",  # Color by temperature factors
    highlight_residues="100-120",  # Active site region
    highlight_color="red",
    highlight_style="sticks"
)
```

**Use Cases for Residue Highlighting**:
- **Disease Mutations**: Highlight known pathogenic variants from ClinVar or GWAS studies
- **Binding Sites**: Show ligand or substrate binding pockets
- **Active Sites**: Emphasize catalytic residues (e.g., catalytic triad in proteases)
- **Post-Translational Modifications**: Highlight phosphorylation, methylation, or acetylation sites
- **Protein-Protein Interfaces**: Show interaction residues in multi-chain complexes
- **Conservation Analysis**: Highlight evolutionarily conserved residues

---

## Related Documentation

- [Agent System Overview](19-agent-system.md)
- [Creating Agents](09-creating-agents.md)
- [Creating Services](10-creating-services.md)
- [Data Formats](07-data-formats.md)
- [Testing Guide](12-testing-guide.md)

---

## References

- **RCSB PDB**: https://www.rcsb.org
- **PDB REST API**: https://data.rcsb.org/redoc/index.html
- **PyMOL**: https://pymol.org/
- **PyMOL Wiki**: https://pymolwiki.org/
- **BioPython**: https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ
- **DSSP**: https://swift.cmbi.umcn.nl/gv/dssp/

---

**Last Updated**: 2025-01-15
**Version**: 1.0.0
**Maintainer**: Lobster Development Team