40 protein structure visualization
Since v0.2 - Protein structure analysis with PyMOL visualization and BioPython integration
Agent Name: protein_structure_visualization_expert_agent
Display Name: Protein Structure Visualization Expert
Factory Function: lobster.agents.protein_structure_visualization_expert.protein_structure_visualization_expert
The Protein Structure Visualization Expert is a specialized agent for fetching, visualizing, and analyzing 3D protein structures from the RCSB Protein Data Bank (PDB). It integrates PyMOL (open-source) for high-quality molecular visualizations and BioPython for structural analysis, enabling seamless linking between protein structures and omics datasets.
Version Note: This agent requires Lobster v0.2+ and is fully supported in both local and cloud modes (with limited interactive visualization in cloud).
- PDB Structure Fetching: Download protein structures by PDB ID with comprehensive metadata
- PyMOL Integration: Generate professional 3D visualizations with customizable styles and colors
- Structural Analysis: Calculate RMSD, secondary structure, geometry, and residue contacts
- Omics Integration: Link protein structures to gene expression and proteomics data
- Structure Comparison: Compare multiple protein structures and calculate structural similarity
- Provenance Tracking: Full W3C-PROV compliant logging with Intermediate Representation (IR)
Location: lobster/tools/protein_structure_fetch_service.py
Handles fetching protein structures from RCSB PDB with caching and metadata extraction.
Methods:
- fetch_structure(pdb_id, format='cif', cache_dir, extract_metadata) → Tuple[Dict, Dict, AnalysisStep]
- link_structures_to_genes(adata, gene_column, organism, max_structures_per_gene) → Tuple[AnnData, Dict, AnalysisStep]
Features:
- PDB ID format validation (4-character alphanumeric)
- Automatic caching to avoid redundant downloads
- BioPython-based structure parsing
- Metadata extraction (resolution, organism, experiment method)
- Gene-to-structure mapping via PDB search API
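The PDB ID check listed above can be sketched as follows. This is a minimal illustration, not the service's actual code: the helper name `normalize_pdb_id` is hypothetical, and the pattern simply encodes the stated rule of exactly four alphanumeric characters.

```python
import re

# Hypothetical helper: encodes the "4-character alphanumeric" rule from the
# feature list. The real service may apply stricter or different checks.
_PDB_ID_PATTERN = re.compile(r"[A-Za-z0-9]{4}")


def normalize_pdb_id(pdb_id: str) -> str:
    """Return the uppercase PDB ID, or raise ValueError if it is malformed."""
    candidate = pdb_id.strip()
    if not _PDB_ID_PATTERN.fullmatch(candidate):
        raise ValueError(
            f"Invalid PDB ID format: {pdb_id}. Must be 4 alphanumeric characters."
        )
    return candidate.upper()
```

Normalizing to uppercase in the same step keeps cache file names consistent regardless of how the ID was typed.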
Location: lobster/tools/pymol_visualization_service.py
Creates high-quality 3D visualizations using PyMOL (open-source).
Methods:
- visualize_structure(structure_file, mode, style, color_by, output_image, width, height, execute_commands) → Tuple[Dict, Dict, AnalysisStep]
- check_pymol_installation() → Dict[str, Any]
Features:
- Multiple representation styles: cartoon, surface, sticks, spheres, ribbon, lines
- Multiple coloring schemes: chain, secondary_structure, bfactor, element
- Interactive and batch modes (GUI or headless image generation)
- PyMOL command script generation (.pml files)
- Automatic PyMOL installation detection
- Graceful fallback when PyMOL is not installed
- High-resolution image export (customizable dimensions)
- Non-blocking GUI launch for interactive exploration
Location: lobster/tools/structure_analysis_service.py
Performs structural analysis using BioPython.
Methods:
- analyze_structure(structure_file, analysis_type, chain_id) → Tuple[Dict, Dict, AnalysisStep]
- calculate_rmsd(structure_file1, structure_file2, chain_id1, chain_id2, align) → Tuple[Dict, Dict, AnalysisStep]
Features:
- Secondary structure analysis (DSSP integration with fallback)
- Geometric properties (center of mass, radius of gyration)
- Residue contact analysis (spatial proximity)
- RMSD calculation with optional superposition alignment
- BioPython Superimposer for structural alignment
Tool: fetch_protein_structure
Purpose: Download a protein structure from RCSB PDB
Parameters:
- pdb_id (str, required): PDB identifier (e.g., '1AKE', '4HHB')
- format (str, default='cif'): File format ('pdb' or 'cif')
Returns: Summary with metadata, file paths, and structural properties
Example:
fetch_protein_structure("1AKE")
fetch_protein_structure("4HHB", format="pdb")
Output Includes:
- PDB ID, title, organism
- Experiment method and resolution
- Number of chains, residues, atoms
- File path and size
- Publication DOI and citation
Tool: link_to_expression_data
Purpose: Link gene expression data to protein structures
Parameters:
- modality_name (str, required): Name of modality with gene/protein data
- gene_column (str, default='gene_symbol'): Column in adata.var with gene symbols
- organism (str, default='Homo sapiens'): Source organism for structure search
- max_structures_per_gene (int, default=5): Maximum structures per gene
Returns: Summary of structure links created
Example:
link_to_expression_data("rna_seq_normalized")
link_to_expression_data("proteomics_data", gene_column="protein_name", organism="Mus musculus")
Output Includes:
- Genes searched and genes with structures found
- Total structures found and average per gene
- New modality name with structure links
- Columns added: pdb_structures (comma-separated PDB IDs), has_structure (boolean)
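Because the linked PDB IDs are stored as one comma-separated string per gene, downstream code usually needs to split them before fetching or visualizing. The sketch below shows one way to do that with plain Python; the helper name `split_pdb_ids` and the placeholder rows (standing in for `adata.var`) are illustrative assumptions, not part of Lobster.

```python
from typing import Dict, List


def split_pdb_ids(pdb_structures: str) -> List[str]:
    """Split a comma-separated PDB ID string into a clean, uppercase list."""
    return [pid.strip().upper() for pid in pdb_structures.split(",") if pid.strip()]


# Placeholder rows standing in for adata.var; gene names and links are
# made up purely to demonstrate the parsing.
rows = [
    {"gene_symbol": "GENE_A", "pdb_structures": "1AKE, 4AKE", "has_structure": True},
    {"gene_symbol": "GENE_B", "pdb_structures": "", "has_structure": False},
]

# Keep only genes flagged as having at least one structure.
genes_to_ids: Dict[str, List[str]] = {
    row["gene_symbol"]: split_pdb_ids(row["pdb_structures"])
    for row in rows
    if row["has_structure"]
}
```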
Tool: visualize_with_pymol
Purpose: Create high-quality 3D visualization using PyMOL
Parameters:
- pdb_id (str, required): PDB ID of structure (must be fetched first)
- mode (str, default='interactive'): Execution mode
  - Options: 'interactive' (launch GUI for exploration), 'batch' (save PNG and exit)
- style (str, default='cartoon'): Representation style
  - Options: 'cartoon', 'surface', 'sticks', 'spheres', 'ribbon', 'lines'
- color_by (str, default='chain'): Coloring scheme
  - Options: 'chain', 'secondary_structure', 'bfactor', 'element'
- width (int, default=1920): Image width in pixels
- height (int, default=1080): Image height in pixels
- execute (bool, default=True): Execute PyMOL commands if installed
- highlight_residues (str, optional): Residues to highlight (e.g., "15,42,89" or "A:15-20,B:42")
- highlight_color (str, default='red'): Color for highlighted residues
- highlight_style (str, default='sticks'): Visualization style for highlights
- highlight_groups (str, optional): Multiple highlight groups (format: "residues|color|style;...")
Returns: Visualization metadata with file paths and execution status
Examples:
# Basic visualization
visualize_with_pymol("1AKE") # Interactive mode by default
visualize_with_pymol("4HHB", mode="batch", style="surface", color_by="bfactor")
visualize_with_pymol("1AKE", mode="interactive") # Launch GUI for exploration
# Residue highlighting - Single group
visualize_with_pymol("1AKE", highlight_residues="15,42,89", highlight_color="red", highlight_style="sticks")
# Residue highlighting - Chain-specific
visualize_with_pymol("4HHB", highlight_residues="A:15-20,B:30-35", highlight_color="yellow")
# Residue highlighting - Multiple groups
visualize_with_pymol("1AKE", highlight_groups="15,42|red|sticks;100-120|blue|surface;200,215|green|spheres")
Output Includes:
- Visualization settings (mode, style, color scheme, dimensions)
- Command script path (.pml file)
- Output image path (.png file)
- Execution status and PyMOL installation info
- Process ID (PID) for interactive mode
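The `highlight_residues` and `highlight_groups` string formats described in the parameters can be parsed roughly as shown below. This is a sketch of the documented syntax only; the names `ResidueRange`, `parse_residues`, and `parse_highlight_groups` are hypothetical, and the agent's own parser may handle edge cases (such as negative residue numbers or insertion codes) differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ResidueRange:
    """A residue span, optionally restricted to one chain (hypothetical type)."""
    chain: Optional[str]
    start: int
    end: int


def parse_residues(spec: str) -> List[ResidueRange]:
    """Parse strings such as "15,42,89" or "A:15-20,B:42"."""
    ranges = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        chain = None
        if ":" in token:
            chain, token = token.split(":", 1)
            chain = chain.strip()
        # "15-20" is a range; a bare "42" is a single residue.
        # Negative residue numbers are not handled in this sketch.
        start_text, _, end_text = token.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        ranges.append(ResidueRange(chain, start, end))
    return ranges


def parse_highlight_groups(spec: str) -> List[Tuple[List[ResidueRange], str, str]]:
    """Parse "residues|color|style;..." into (ranges, color, style) tuples."""
    groups = []
    for group in spec.split(";"):
        if not group.strip():
            continue
        residues, color, style = (part.strip() for part in group.split("|"))
        groups.append((parse_residues(residues), color, style))
    return groups
```

Parsing into a structured form first makes it straightforward to validate residue numbers against the loaded structure before any PyMOL selection is built.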
Tool: analyze_protein_structure
Purpose: Analyze protein structure properties
Parameters:
- pdb_id (str, required): PDB ID of structure (must be fetched first)
- analysis_type (str, default='secondary_structure'): Type of analysis
  - Options: 'secondary_structure', 'geometry', 'residue_contacts'
- chain_id (str, optional): Specific chain to analyze (None for all chains)
Returns: Analysis results with structural properties
Example:
analyze_protein_structure("1AKE")
analyze_protein_structure("4HHB", analysis_type="geometry")
analyze_protein_structure("1AKE", analysis_type="residue_contacts", chain_id="A")
Analysis Types:
secondary_structure:
- Helix, sheet, coil percentages
- Per-residue secondary structure assignments
- Requires DSSP binary (with fallback)
geometry:
- Total atoms and chains
- Center of mass
- Radius of gyration
- Per-chain geometric properties
residue_contacts:
- Total residue-residue contacts (default cutoff: 8 Å)
- Average contacts per residue
- Contact distance matrix
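To make the geometry and contact quantities concrete, the sketch below computes them from bare coordinates. It rests on stated assumptions: all atoms are given equal weight (so the "center" is a geometric centroid rather than a mass-weighted center of mass), each residue is represented by a single point such as its C-alpha atom, and the function names are hypothetical. The service itself uses BioPython and may define these quantities differently.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def centroid(coords: Sequence[Coord]) -> Coord:
    """Unweighted geometric center of a set of points."""
    n = len(coords)
    return (
        sum(c[0] for c in coords) / n,
        sum(c[1] for c in coords) / n,
        sum(c[2] for c in coords) / n,
    )


def radius_of_gyration(coords: Sequence[Coord]) -> float:
    """Root-mean-square distance of the points from their centroid."""
    center = centroid(coords)
    mean_sq = sum(math.dist(c, center) ** 2 for c in coords) / len(coords)
    return math.sqrt(mean_sq)


def count_contacts(coords: Sequence[Coord], cutoff: float = 8.0) -> int:
    """Count unordered pairs of points no farther apart than cutoff (in Å)."""
    contacts = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts += 1
    return contacts
```

The pairwise loop is quadratic in the number of residues; for large structures a spatial index such as BioPython's `NeighborSearch` is the usual alternative.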
Tool: compare_structures
Purpose: Compare two protein structures by RMSD
Parameters:
- pdb_id1 (str, required): First PDB ID (must be fetched)
- pdb_id2 (str, required): Second PDB ID (must be fetched)
- align (bool, default=True): Align structures before RMSD calculation
- chain_id1 (str, optional): Specific chain in first structure
- chain_id2 (str, optional): Specific chain in second structure
Returns: RMSD and structural comparison results
Example:
compare_structures("1AKE", "4AKE")
compare_structures("1AKE", "4AKE", align=False)
compare_structures("4HHB", "2HHB", chain_id1="A", chain_id2="A")
RMSD Interpretation:
- < 1.0 Å: Nearly identical structures
- 1-2 Å: Very similar (close homologs, small conformational changes)
- 2-3 Å: Similar (homologs, moderate conformational changes)
- 3-5 Å: Moderately similar (distant homologs, domain movements)
- > 5 Å: Different structures (large conformational changes)
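For reference, RMSD over N matched atom pairs is the square root of the mean squared distance between corresponding atoms. The sketch below computes it for coordinates that are already superimposed (the `align=False` case) and maps the value onto the interpretation bands listed above. The function names are hypothetical, and the treatment of exact boundary values (for example, 5.0 Å counted as "moderately similar") is an assumption, since the table does not specify it.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """RMSD between two equally long, pre-aligned coordinate lists (in Å)."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("Both structures need the same non-zero number of atoms")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def interpret_rmsd(value: float) -> str:
    """Map an RMSD value (Å) onto the qualitative bands from the table."""
    if value < 1.0:
        return "Nearly identical structures"
    if value < 2.0:
        return "Very similar"
    if value < 3.0:
        return "Similar"
    if value <= 5.0:
        return "Moderately similar"
    return "Different structures"
```

With `align=True`, the service first superimposes the structures (BioPython's `Superimposer`, per the feature list), which minimizes this quantity over rigid-body rotations and translations before it is reported.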
1. User: "Visualize protein structure 1AKE"
2. Supervisor → Protein Structure Visualization Expert
3. Agent: fetch_protein_structure("1AKE")
4. Agent: visualize_with_pymol("1AKE", mode="interactive", style="cartoon")
5. Agent → Supervisor: Results with visualization paths
6. Supervisor → User: Visualization complete
1. User: "Link structures to my RNA-seq data"
2. Supervisor → Protein Structure Visualization Expert
3. Agent: link_to_expression_data("rna_seq_normalized", organism="Homo sapiens")
4. Agent creates new modality with structure mappings
5. Agent → Supervisor: Linking results (e.g., "50 genes linked to 75 structures")
6. Supervisor → User: Structure links created
1. User: "Compare structures 1AKE and 4AKE"
2. Supervisor → Protein Structure Visualization Expert
3. Agent: fetch_protein_structure("1AKE")
4. Agent: fetch_protein_structure("4AKE")
5. Agent: compare_structures("1AKE", "4AKE", align=True)
6. Agent → Supervisor: RMSD results (e.g., "RMSD = 1.2 Å, very similar")
7. Supervisor → User: Comparison complete
PyMOL is a professional open-source molecular visualization tool that provides:
- High-quality molecular graphics
- Publication-ready images
- Comprehensive visualization commands
- Python API for automation
- Interactive GUI mode for real-time exploration
- Active open-source community
PyMOL is pre-installed in the Lobster Docker image. No action required.
Verify installation:
docker run -it omicsos/lobster:latest pymol -c -Q
Install PyMOL via Makefile target:
# Install PyMOL automatically
make install-pymol
This command will:
- Detect your operating system (macOS or Linux)
- Install PyMOL via the appropriate package manager
- Verify the installation
What it does:
- macOS: Uses Homebrew with brewsci/bio tap
- Linux: Uses apt-get (Ubuntu/Debian) or dnf (Fedora/RHEL)
- Homebrew on Linux: Fallback if native package manager unavailable
Requirements:
- macOS: Homebrew must be installed
- Linux: sudo access for package installation
Installation output:
$ make install-pymol
🔬 Installing PyMOL for protein structure visualization...
🍎 macOS detected - Installing via Homebrew...
📦 Installing PyMOL...
✅ PyMOL installed successfully!
🎉 PyMOL installation complete!
💡 Test with: pymol -c -Q
If automated installation is not available or fails, you can install PyMOL manually.
# Install via Homebrew (recommended)
brew install brewsci/bio/pymol
# Or download from official website
# https://pymol.org/
# After installation, PyMOL is automatically added to PATH
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install pymol
# Or via Homebrew on Linux
brew install brewsci/bio/pymol
# Arch Linux
sudo pacman -S pymol
# Fedora/CentOS/RHEL
sudo dnf install pymol
1. Download installer from https://pymol.org/
2. Run installer and follow instructions
3. PyMOL executable will be added to Start Menu
4. Optionally add to PATH via System Environment Variables
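After a manual install, a quick way to confirm from Python that a `pymol` executable is reachable on PATH is `shutil.which`. The helper below is a hypothetical sketch of such a check; it is not the implementation of `check_pymol_installation()`, whose returned fields are not documented here.

```python
import shutil
from typing import Any, Dict


def find_pymol(executable: str = "pymol") -> Dict[str, Any]:
    """Report whether the given executable can be found on PATH."""
    path = shutil.which(executable)
    return {"installed": path is not None, "path": path}
```

If `installed` is false right after installation, the usual cause is that the install location has not been added to PATH in the current shell session.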
Even without PyMOL installed, the agent generates .pml command scripts that can be:
- Executed manually when PyMOL is installed
- Modified for custom visualizations
- Used as templates for batch processing
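To give a sense of what such a command script might contain, the sketch below assembles one from standard PyMOL commands (`load`, `hide`, `show`, `util.cbc`, `spectrum`, `color`, `png`, `quit`). The function name `build_pml` and the specific commands chosen for each coloring scheme are assumptions made for illustration; the scripts the agent actually writes may differ.

```python
from typing import Dict, List, Optional

# Assumed mapping from the documented color_by options to PyMOL commands.
_COLOR_COMMANDS: Dict[str, List[str]] = {
    "chain": ["util.cbc"],
    "bfactor": ["spectrum b, blue_white_red"],
    "secondary_structure": [
        "color red, ss h",
        "color yellow, ss s",
        "color green, ss l+''",
    ],
    "element": ["color atomic"],
}


def build_pml(
    structure_file: str,
    object_name: str,
    style: str = "cartoon",
    color_by: str = "chain",
    output_image: Optional[str] = None,
    width: int = 1920,
    height: int = 1080,
) -> str:
    """Return the text of a .pml script for one structure."""
    lines = [
        f"load {structure_file}, {object_name}",
        "hide everything",
        f"show {style}",
    ]
    lines += _COLOR_COMMANDS[color_by]
    if output_image is not None:
        # Batch-style ending: render to PNG, then exit PyMOL.
        lines.append(f"png {output_image}, width={width}, height={height}, ray=1")
        lines.append("quit")
    return "\n".join(lines) + "\n"
```

Omitting `output_image` leaves out the `png` and `quit` lines, so the same script can be opened in the GUI without closing immediately.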
Examples:
# Interactive mode (with GUI)
pymol 1AKE_cartoon_chain_commands.pml
# Batch mode (headless, save image and exit)
pymol -c 1AKE_cartoon_chain_commands.pml
Link protein structures to highly expressed genes:
1. Run single-cell analysis (clustering, DE analysis)
2. Identify top expressed genes
3. Use link_to_expression_data() to find structures
4. Visualize structures for key marker genes
Link structures to identified proteins:
1. Run proteomics analysis (quantification, DE)
2. Identify significantly changing proteins
3. Use link_to_expression_data() with protein_name column
4. Compare structures of protein variants
Cross-reference structures across modalities:
1. Link structures to both RNA-seq and proteomics
2. Identify genes/proteins with structures in both datasets
3. Visualize structures colored by expression levels
4. Compare structural features with functional changes
- First fetch: Downloads from PDB, stores in protein_structures/ directory
- Subsequent fetches: Uses cached file (instant)
- Cache location: Workspace directory or current directory
- Cache benefits: Avoids redundant downloads, faster workflow iterations
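The caching behavior described above amounts to a check-then-download pattern. The sketch below illustrates it with an injected download function so it runs without network access; `fetch_with_cache`, the file-naming scheme, and the fake downloader are all assumptions for illustration rather than the service's real code.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def fetch_with_cache(
    pdb_id: str,
    cache_dir: Path,
    download: Callable[[str], bytes],
    file_format: str = "cif",
) -> Path:
    """Return the cached file for pdb_id, downloading it only if missing."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{pdb_id.upper()}.{file_format}"
    if not target.exists():
        target.write_bytes(download(pdb_id))
    return target


# Demonstration with a stand-in downloader that records each call.
calls: List[str] = []


def fake_download(pdb_id: str) -> bytes:
    calls.append(pdb_id)
    return b"data_demo\n"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_with_cache("1ake", Path(tmp), fake_download)
    second = fetch_with_cache("1AKE", Path(tmp), fake_download)
    same_file = first == second
    cached_name = first.name
```

In the demonstration the second call finds the file already present, so the downloader runs only once.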
- Rate limit: 5 requests/second (RCSB PDB API limit)
- No authentication: Public PDB API requires no API key
- Batch operations: Use link_to_expression_data() for efficient batch queries
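When issuing many requests in a loop, a small client-side throttle helps stay under the stated request rate. The class below is a minimal sketch of one way to space calls evenly; `RateLimiter` is a hypothetical name, and whether Lobster throttles this way internally is not stated in this document. The clock and sleep functions are injectable so the behavior can be demonstrated without real waiting.

```python
import time
from typing import Callable, List, Optional


class RateLimiter:
    """Space successive calls at least 1 / max_per_second seconds apart."""

    def __init__(
        self,
        max_per_second: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the delay applied."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._clock()
        self._next_allowed = now + self._min_interval
        return delay


# Demonstration with a fake clock so no real time passes.
fake_now = [0.0]
slept: List[float] = []


def fake_sleep(seconds: float) -> None:
    slept.append(seconds)
    fake_now[0] += seconds


limiter = RateLimiter(max_per_second=5.0, clock=lambda: fake_now[0], sleep=fake_sleep)
delays = [limiter.wait() for _ in range(3)]
```

Calling `limiter.wait()` immediately before each request keeps back-to-back requests about 0.2 seconds apart at the default setting.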
- Command scripts: Generated instantly (no execution delay)
- Interactive mode: GUI launches in 2-5 seconds (non-blocking)
- Batch mode (image generation): 5-30 seconds per structure (if PyMOL is installed)
- Headless mode: PyMOL runs without GUI for automation (use pymol -c)
- Parallel execution: Multiple structures can be visualized in parallel
Error: Invalid PDB ID format: XYZ. Must be 4 alphanumeric characters.
Solution: Ensure PDB ID is exactly 4 characters (e.g., '1AKE', not '1AK' or '1AKEE')
Error: Failed to download structure 1XYZ from PDB
Solution: Verify PDB ID exists at https://www.rcsb.org/structure/1XYZ
Error: PyMOL not found. Install with: brew install brewsci/bio/pymol
Solution: Install PyMOL or use generated command scripts manually
Error: Gene column 'gene_symbol' not found in adata.var
Solution: Check available columns with adata.var.columns and specify correct column name
Warning: DSSP not available. Using simplified analysis.
Solution: Install DSSP for secondary structure analysis:
conda install -c salilab dssp
from pathlib import Path
from lobster.tools.protein_structure_fetch_service import ProteinStructureFetchService
service = ProteinStructureFetchService()
# Fetch structure
structure_data, stats, ir = service.fetch_structure(
pdb_id="1AKE",
format="cif",
cache_dir=Path("protein_structures"),
extract_metadata=True,
data_manager=data_manager
)
# Link structures to genes
adata_linked, stats, ir = service.link_structures_to_genes(
adata=adata,
gene_column="gene_symbol",
organism="Homo sapiens",
max_structures_per_gene=5,
data_manager=data_manager
)
from pathlib import Path
from lobster.tools.pymol_visualization_service import PyMOLVisualizationService
service = PyMOLVisualizationService()
# Check installation
install_status = service.check_pymol_installation()
# Create visualization (batch mode - save PNG)
viz_data, stats, ir = service.visualize_structure(
structure_file=Path("1AKE.cif"),
mode="batch",
style="cartoon",
color_by="chain",
output_image=Path("output.png"),
width=1920,
height=1080,
execute_commands=True
)
# Or interactive mode (launch GUI)
viz_data, stats, ir = service.visualize_structure(
structure_file=Path("1AKE.cif"),
mode="interactive",
style="cartoon",
color_by="chain",
execute_commands=True
)
from pathlib import Path
from lobster.tools.structure_analysis_service import StructureAnalysisService
service = StructureAnalysisService()
# Analyze structure
analysis_results, stats, ir = service.analyze_structure(
structure_file=Path("1AKE.cif"),
analysis_type="secondary_structure",
chain_id="A"
)
# Calculate RMSD
rmsd_results, stats, ir = service.calculate_rmsd(
structure_file1=Path("1AKE.cif"),
structure_file2=Path("4AKE.cif"),
align=True
)
Always use uppercase 4-character PDB IDs:
# Good
fetch_protein_structure("1AKE")
# Bad
fetch_protein_structure("1ake") # Works, but lowercase is inconsistent with the uppercase convention
fetch_protein_structure("1AK") # Error: too short
Leverage caching for iterative workflows:
# First run: downloads structure
fetch_protein_structure("1AKE")
# Subsequent runs: uses cache (instant)
visualize_with_pymol("1AKE", mode="interactive", style="cartoon")
visualize_with_pymol("1AKE", mode="batch", style="surface") # No re-download
Generate scripts even without PyMOL:
# Script generation always works
visualize_with_pymol("1AKE", execute=False)
# Execute manually later when PyMOL is installed
# Interactive mode: pymol 1AKE_commands.pml
# Batch mode: pymol -c 1AKE_commands.pml
Search by organism for better results:
# Specific organism
link_to_expression_data("adata", organism="Homo sapiens")
# Mouse data
link_to_expression_data("adata", organism="Mus musculus")
Use alignment for meaningful comparisons:
# With alignment (recommended)
compare_structures("1AKE", "4AKE", align=True)
# Without alignment (only for pre-aligned structures)
compare_structures("1AKE", "4AKE", align=False)
All structure operations generate Intermediate Representation (IR) with:
- Operation: Specific operation performed (e.g., 'pdb.fetch_structure')
- Parameters: All parameters used (pdb_id, format, style, etc.)
- Code Template: Jinja2 template for notebook export
- Imports: Required Python imports
- Parameter Schema: Papermill-injectable parameters with validation
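The fields listed above can be pictured as a small record that knows how to render itself into notebook code. The sketch below is purely illustrative: `StepRecord` is a hypothetical stand-in for the real `AnalysisStep` type, and it uses Python's built-in `string.Template` instead of Jinja2 so that it runs with the standard library alone.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Any, Dict, List


@dataclass
class StepRecord:
    """Hypothetical provenance record mirroring the fields listed above."""
    operation: str
    parameters: Dict[str, Any]
    code_template: str
    imports: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Fill the template with repr()-quoted parameter values."""
        values = {name: repr(value) for name, value in self.parameters.items()}
        body = Template(self.code_template).substitute(values)
        return "\n".join(self.imports + [body])


record = StepRecord(
    operation="pdb.fetch_structure",
    parameters={"pdb_id": "1AKE", "format": "cif"},
    code_template="result = service.fetch_structure(pdb_id=$pdb_id, format=$format)",
    imports=[
        "from lobster.tools.protein_structure_fetch_service import ProteinStructureFetchService",
    ],
)
rendered = record.render()
```

Keeping parameters separate from the template is what allows a tool such as papermill to re-run the exported notebook with a different `pdb_id`.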
Notebook Export:
# Export pipeline to Jupyter notebook
data_manager.export_notebook("protein_structure_pipeline.ipynb")
# Execute notebook with different PDB ID
papermill protein_structure_pipeline.ipynb output.ipynb -p pdb_id "4HHB"
- Ensure structure was fetched first with fetch_protein_structure()
- Check cache directory permissions
- Verify file path in structure_data dictionary
- Large structures may take longer to render (batch mode)
- Increase timeout in service configuration
- Use execute=False to generate script without execution
- For interactive mode, the GUI may take 2-5 seconds to launch
- Check organism name (use Latin names: "Homo sapiens", "Mus musculus")
- Verify gene symbols are standard (HGNC for human, MGI for mouse)
- Try reducing max_structures_per_gene for faster queries
- Ensure both structures have been fetched
- Check chain IDs exist in structures
- Verify structures have matching residues (homologs, not random proteins)
# Fetch and visualize adenylate kinase
fetch_protein_structure("1AKE")
visualize_with_pymol("1AKE", mode="interactive", style="cartoon", color_by="secondary_structure")
# Compare open and closed conformations of adenylate kinase
fetch_protein_structure("1AKE") # Closed form (inhibitor-bound)
fetch_protein_structure("4AKE") # Open form (ligand-free)
compare_structures("1AKE", "4AKE", align=True)
# Example output: RMSD = 1.2 Å (very similar, per the RMSD interpretation ranges)
# After RNA-seq analysis
link_to_expression_data("rna_seq_normalized", organism="Homo sapiens")
# Visualize top expressed genes with structures
# Filter genes (columns) with a linked structure: adata[:, adata.var['has_structure']]
# Fetch multiple family members
fetch_protein_structure("1AKE")
fetch_protein_structure("2AKE")
fetch_protein_structure("3AKE")
# Pairwise RMSD comparisons
compare_structures("1AKE", "2AKE")
compare_structures("1AKE", "3AKE")
compare_structures("2AKE", "3AKE")
# Fetch structure
fetch_protein_structure("1AKE")
# Example 1: Highlight disease mutation sites in red
# Single residue group - useful for showing known pathogenic variants
visualize_with_pymol(
"1AKE",
mode="batch",
style="cartoon",
color_by="chain",
highlight_residues="15,42,89",
highlight_color="red",
highlight_style="sticks"
)
# Example 2: Chain-specific highlighting for protein-protein interfaces
# Highlight interface residues in hemoglobin subunits
fetch_protein_structure("4HHB")
visualize_with_pymol(
"4HHB",
highlight_residues="A:15-20,A:42,B:30-35,B:50",
highlight_color="yellow",
highlight_style="sticks"
)
# Example 3: Multiple highlight groups for complex functional annotation
# Show binding site (red), catalytic residues (blue), and allosteric site (green)
visualize_with_pymol(
"1AKE",
mode="interactive", # Launch GUI for interactive exploration
highlight_groups="15,42,89|red|sticks;100-120|blue|surface;200,215,230|green|spheres"
)
# Example 4: Combining with different color schemes
# Highlight active site residues while showing B-factors for the rest
visualize_with_pymol(
"1AKE",
style="cartoon",
color_by="bfactor", # Color by temperature factors
highlight_residues="100-120", # Active site region
highlight_color="red",
highlight_style="sticks"
)
Use Cases for Residue Highlighting:
- Disease Mutations: Highlight known pathogenic variants from ClinVar or GWAS studies
- Binding Sites: Show ligand or substrate binding pockets
- Active Sites: Emphasize catalytic residues (e.g., catalytic triad in proteases)
- Post-Translational Modifications: Highlight phosphorylation, methylation, or acetylation sites
- Protein-Protein Interfaces: Show interaction residues in multi-chain complexes
- Conservation Analysis: Highlight evolutionarily conserved residues
- RCSB PDB: https://www.rcsb.org
- PDB REST API: https://data.rcsb.org/redoc/index.html
- PyMOL: https://pymol.org/
- PyMOL Wiki: https://pymolwiki.org/
- BioPython: https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ
- DSSP: https://swift.cmbi.umcn.nl/gv/dssp/
Last Updated: 2025-01-15 Version: 1.0.0 Maintainer: Lobster Development Team