# Developer Overview - Lobster AI Architecture ## ๐Ÿ—๏ธ Overview This guide provides a comprehensive introduction to developing within the Lobster AI codebase, covering architecture patterns, design principles, and development workflows. Lobster AI is a professional multi-agent bioinformatics analysis platform that combines specialized AI agents with proven scientific tools. ## ๐ŸŽฏ Core Design Principles ### 1. Agent-Based Architecture - **Specialized Agents**: Each agent handles specific bioinformatics domains (transcriptomics, proteomics) - **Centralized Registry**: Single source of truth for agent configuration via `AGENT_REGISTRY` - **Natural Language Interface**: Users describe analyses in plain English ### 2. Modular Service Design - **Stateless Services**: All analysis services are stateless and return `(processed_adata, statistics_dict)` - **Separation of Concerns**: Agents coordinate workflows, services handle computation - **Reusable Components**: Services can be used independently or composed in workflows ### 3. Multi-Modal Data Management - **DataManagerV2**: Centralized orchestrator for multi-omics data with modality management - **Professional Naming**: Consistent naming conventions for dataset versions and analysis stages - **Provenance Tracking**: W3C-PROV compliant analysis history for reproducibility ### 4. Cloud/Local Hybrid Architecture - **BaseClient Interface**: Consistent API for local and cloud execution - **Seamless Switching**: Automatic detection and fallback between cloud and local modes - **Unified CLI**: Single interface supporting both execution environments ## ๐Ÿ›๏ธ Architecture Components ### Core Directories ``` lobster/ โ”œโ”€โ”€ agents/ # Specialized AI agents for bioinformatics domains โ”œโ”€โ”€ core/ # Data management, client infrastructure, interfaces โ”œโ”€โ”€ tools/ # Stateless analysis services โ”œโ”€โ”€ config/ # Configuration management and agent registry โ”œโ”€โ”€ cli.py # Modern terminal interface with autocomplete โ””โ”€โ”€ utils/ # Shared utilities and logging ``` ### Key Architectural Patterns #### 1. **Agent Registry Pattern** ```python # lobster/config/agent_registry.py @dataclass class AgentRegistryConfig: name: str # Unique identifier display_name: str # Human-readable name description: str # Agent capabilities factory_function: str # Module path to factory handoff_tool_name: Optional[str] # Auto-generated tool name AGENT_REGISTRY = { 'data_expert_agent': AgentRegistryConfig(...), 'transcriptomics_expert': AgentRegistryConfig(...), 'proteomics_expert': AgentRegistryConfig(...), # ... more agents } ``` #### 2. **Service Pattern** ```python class QualityService: """Stateless service for data quality assessment.""" def assess_quality(self, adata: anndata.AnnData, **params) -> Tuple[anndata.AnnData, Dict]: """ Returns: Tuple of (processed_adata, statistics_dict) """ # Stateless processing logic return processed_adata, statistics ``` #### 3. **Agent Tool Pattern** ```python @tool def assess_data_quality(modality_name: str, **params) -> str: """Standard pattern for all agent tools.""" # 1. Validate modality exists if modality_name not in data_manager.list_modalities(): raise ModalityNotFoundError(f"Modality '{modality_name}' not found") # 2. Get data and call stateless service adata = data_manager.get_modality(modality_name) result_adata, stats = service.assess_quality(adata, **params) # 3. Store results with descriptive naming new_modality = f"{modality_name}_quality_assessed" data_manager.modalities[new_modality] = result_adata # 4. Log operation for provenance data_manager.log_tool_usage("assess_data_quality", params, stats) return formatted_response(stats, new_modality) ``` #### 4. **Client Adapter Pattern** ```python # lobster/core/interfaces/base_client.py class BaseClient(ABC): @abstractmethod def query(self, user_input: str, stream: bool = False) -> Dict[str, Any]: pass @abstractmethod def get_status(self) -> Dict[str, Any]: pass # Implementations: AgentClient (local), CloudLobsterClient (cloud) ``` ## ๐Ÿ”ง Development Setup ### 1. Environment Setup ```bash # Clone repository git clone cd lobster # Install development dependencies make dev-install # Activate environment source .venv/bin/activate # Verify installation python -m lobster --help ``` ### 2. Required Environment Variables ```bash # Required API Keys export AWS_BEDROCK_ACCESS_KEY="your-aws-access-key" export AWS_BEDROCK_SECRET_ACCESS_KEY="your-aws-secret-key" # Optional export NCBI_API_KEY="your-ncbi-api-key" export LOBSTER_CLOUD_KEY="your-cloud-api-key" # Enables cloud mode ``` ### 3. Development Commands ```bash # Run all tests make test # Fast parallel testing make test-fast # Code formatting make format # Linting make lint # Type checking make type-check # Start CLI lobster chat ``` ## ๐Ÿงช Scientific Workflows ### Professional Naming Convention ``` geo_gse12345 # Raw downloaded data โ”œโ”€โ”€ geo_gse12345_quality_assessed # QC metrics added โ”œโ”€โ”€ geo_gse12345_filtered_normalized # Preprocessed data โ”œโ”€โ”€ geo_gse12345_doublets_detected # Doublet annotations โ”œโ”€โ”€ geo_gse12345_clustered # Leiden clustering + UMAP โ”œโ”€โ”€ geo_gse12345_markers # Differential expression โ”œโ”€โ”€ geo_gse12345_annotated # Cell type annotations โ””โ”€โ”€ geo_gse12345_pseudobulk # Aggregated for DE analysis ``` ### Data Flow Architecture ``` User Input (CLI) โ†“ LobsterClientAdapter โ†’ BaseClient (AgentClient | CloudLobsterClient) โ†“ Agent Registry โ†’ Specialized Agent (data_expert, transcriptomics_expert, etc.) โ†“ Agent Tools โ†’ Stateless Services (QualityService, ClusteringService, etc.) โ†“ DataManagerV2 โ†’ Modality Management โ†’ Storage Backends (H5AD, MuData) โ†“ Results โ†’ CLI Response with Visualizations ``` ## ๐ŸŽจ Code Style Guidelines ### 1. Python Standards - Follow PEP 8 style guidelines - Use type hints for all functions and methods - Line length: 88 characters (Black formatting) - Comprehensive docstrings for all public functions ### 2. Scientific Accuracy - Prioritize scientific accuracy over performance optimizations - Include comprehensive QC metrics at each analysis step - Support batch effect detection and correction - Implement proper missing value handling strategies ### 3. Error Handling ```python # Use specific exceptions class ModalityNotFoundError(Exception): pass class ServiceError(Exception): pass # Proper error handling in tools try: result = service.process(data) except ServiceError as e: logger.error(f"Service error: {e}") return f"Analysis failed: {str(e)}" ``` ## ๐Ÿš€ Development Workflow ### 1. Adding New Features 1. **Design First**: Consider how the feature fits into existing patterns 2. **Use Registry**: For agents, add to `AGENT_REGISTRY` instead of manual graph edits 3. **Follow Patterns**: Use established service, tool, and adapter patterns 4. **Test Thoroughly**: Include unit, integration, and scientific validation tests 5. **Document**: Update relevant documentation files ### 2. Code Quality Checklist - [ ] Type hints on all functions - [ ] Comprehensive docstrings - [ ] Error handling with specific exceptions - [ ] Unit tests with 80%+ coverage - [ ] Integration tests with real data - [ ] Scientific validation where applicable - [ ] CLI compatibility (local and cloud) ### 3. Pre-commit Hooks ```bash # Install pre-commit hooks pre-commit install # Run manually pre-commit run --all-files ``` ## ๐Ÿ“Š Performance Considerations ### 1. Memory Management - Use memory-efficient data loading for large datasets - Implement lazy loading where possible - Monitor memory usage in long-running analyses ### 2. Computation Optimization - Leverage GPU acceleration when available (ScVI, rapids) - Use efficient algorithms for large-scale data - Implement progress tracking for long operations ### 3. Caching Strategy - File operations: 60s cache for cloud, 10s for local - Intelligent caching for expensive computations - Clear cache invalidation strategies ## ๐Ÿ” Debugging and Troubleshooting ### 1. Common Issues - **Import Errors**: Check environment activation and dependencies - **Agent Registry**: Verify factory function paths are correct - **Data Loading**: Check file permissions and formats - **Cloud Integration**: Verify API keys and network connectivity ### 2. Debugging Tools ```python # Use structured logging from lobster.utils.logger import get_logger logger = get_logger(__name__) # Enable debug mode logger.setLevel(logging.DEBUG) # Check system status lobster chat /status ``` ### 3. Testing Connectivity ```bash # Test agent registry python -c "from lobster.config.agent_registry import AGENT_REGISTRY; print(list(AGENT_REGISTRY.keys()))" # Test CLI with both clients LOBSTER_CLOUD_KEY="" python -m lobster chat # Local mode LOBSTER_CLOUD_KEY="key" python -m lobster chat # Cloud mode ``` ## ๐Ÿ“š Further Reading - **[Creating Agents Guide](09-creating-agents.md)** - Detailed agent development - **[Creating Services Guide](10-creating-services.md)** - Service implementation patterns - **[Creating Adapters Guide](11-creating-adapters.md)** - Data adapter development - **[Testing Guide](12-testing-guide.md)** - Comprehensive testing framework - **[CLAUDE.md](CLAUDE.md)** - Complete architectural documentation ## ๐ŸŽฏ Quick Reference ### Key Files to Know - `lobster/config/agent_registry.py` - Agent configuration registry - `lobster/core/interfaces/base_client.py` - Client interface definition - `lobster/core/data_manager_v2.py` - Multi-modal data orchestrator - `lobster/cli.py` - CLI implementation with autocomplete - `tests/conftest.py` - Test configuration and fixtures ### Essential Commands ```bash make dev-install # Development setup make test # Run all tests lobster chat # Start interactive CLI /help # Show available commands /status # System status /files # List workspace files ``` This overview provides the foundation for contributing to Lobster AI. Each component follows established patterns that promote consistency, maintainability, and scientific rigor.