Contributing to Chorus

Thank you for your interest in contributing to Chorus! This guide walks you through implementing a new oracle (a genomic sequence prediction model) step by step.

Overview

Chorus provides a unified interface for genomic sequence oracles. Each oracle runs in its own isolated conda environment to avoid dependency conflicts. To add a new oracle, you'll need to:

1. Create the oracle implementation
2. Define the conda environment requirements
3. Implement required methods
4. Add tests and examples
5. Submit a pull request

Step-by-Step Guide to Implementing a New Oracle

Step 1: Fork and Clone the Repository

# Fork the repository on GitHub, then:
git clone https://github.com/YOUR_USERNAME/chorus.git
cd chorus
pip install -e .

Step 2: Create Your Oracle Implementation

Create a new file in chorus/oracles/ named after your oracle (e.g., borzoi.py):

# chorus/oracles/borzoi.py
"""Borzoi oracle implementation."""

import numpy as np
from typing import List, Dict, Optional, Tuple, Union, Any
import logging

from ..core.base import OracleBase
from ..core.exceptions import ModelNotLoadedError

logger = logging.getLogger(__name__)


class BorzoiOracle(OracleBase):
    """Borzoi oracle implementation."""
    
    def __init__(self, use_environment: bool = True, reference_fasta: Optional[str] = None):
        """
        Initialize Borzoi oracle.
        
        Args:
            use_environment: Whether to use isolated conda environment
            reference_fasta: Path to reference genome FASTA file
        """
        # Set oracle name before calling super().__init__
        self.oracle_name = 'borzoi'
        
        super().__init__(use_environment=use_environment)
        
        # Model-specific parameters
        self.sequence_length = 524288  # Example: Borzoi uses 524kb sequences
        self.bin_size = 128  # Example bin size; use your model's output resolution
        self.num_tracks = 7919  # Example track count
        
        # Store reference genome path
        self.reference_fasta = reference_fasta
        
        # Model components (will be loaded later)
        self._model = None

Step 3: Implement Required Methods

Your oracle must implement these abstract methods from OracleBase:

3.1 Model Loading

def load_pretrained_model(self, weights: Optional[str] = None) -> None:
    """Load pre-trained model weights."""
    if weights is None:
        weights = "default_model_path_or_url"
    
    logger.info(f"Loading {self.oracle_name} model from {weights}")
    
    if self.use_environment:
        # Run loading in isolated environment
        load_code = f"""
import torch  # or tensorflow, depending on your model
# Your model loading code here
model = load_your_model({weights!r})  # load_your_model is a placeholder for your loader
result = {{'loaded': True, 'description': 'Model loaded successfully'}}
"""
        
        result = self.run_code_in_environment(load_code, timeout=300)
        if result and result.get('loaded'):
            self.loaded = True
            logger.info(f"{self.oracle_name} model loaded successfully!")
        else:
            raise ModelNotLoadedError(f"Failed to load {self.oracle_name} model")
    else:
        # Direct loading if not using environment
        self._load_direct(weights)
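The pattern above (build a code string, run it in another interpreter, read back a `result` variable) can be sketched with the standard library alone. This is only an illustration of the idea, not Chorus's actual `run_code_in_environment`: the helper name `run_code_in_subprocess` and the JSON-over-stdout convention are assumptions, and the sketch uses the current interpreter where the real method would presumably use the oracle's conda environment.

```python
import json
import subprocess
import sys
import textwrap


def run_code_in_subprocess(code: str, timeout: int = 60, python: str = sys.executable):
    """Run `code` in a separate interpreter and return its `result` variable.

    Hypothetical stand-in for run_code_in_environment; the real method may differ.
    """
    # Append a footer that serialises `result` as the last line of stdout.
    footer = "\nimport json as _json\nprint(_json.dumps(result))\n"
    wrapped = textwrap.dedent(code) + footer
    proc = subprocess.run(
        [python, "-c", wrapped],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"Subprocess failed: {proc.stderr.strip()}")
    # Only the final stdout line carries the JSON payload.
    return json.loads(proc.stdout.strip().splitlines()[-1])


load_code = """
result = {'loaded': True, 'description': 'Model loaded successfully'}
"""
outcome = run_code_in_subprocess(load_code)
```

In this sketch every call starts a fresh interpreter, so nothing loaded in one call is available to the next. That is why the prediction snippet in 3.3 reloads the model from a cache instead of reusing an in-memory object.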

3.2 Track Information

def list_assay_types(self) -> List[str]:
    """Return list of available assay types."""
    return [
        "DNase", "ATAC-seq", "ChIP-seq", "RNA-seq", 
        # Add your model's supported assay types
    ]

def list_cell_types(self) -> List[str]:
    """Return list of available cell types."""
    return [
        "K562", "GM12878", "HepG2", "H1-hESC",
        # Add your model's supported cell types
    ]

3.3 Prediction Method

def _predict(self, seq: Union[str, Tuple[str, int, int]], assay_ids: List[str]) -> np.ndarray:
    """
    Make predictions for given sequence and assays.
    
    Args:
        seq: Either DNA sequence string or (chrom, start, end) tuple
        assay_ids: List of assay identifiers
        
    Returns:
        numpy array of shape (num_bins, num_tracks)
    """
    if not self.loaded:
        raise ModelNotLoadedError("Model not loaded")
    
    # Handle genomic coordinates if provided
    if isinstance(seq, tuple):
        if self.reference_fasta is None:
            raise ValueError("Reference FASTA required for coordinate input")
        chrom, start, end = seq
        # Use the utility function to extract sequence with padding
        from ..utils.sequence import extract_sequence_with_padding
        seq = extract_sequence_with_padding(
            self.reference_fasta, chrom, start, end, self.sequence_length
        )
    
    if self.use_environment:
        # Run prediction in isolated environment
        import os
        import tempfile
        with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
            f.write(seq)
            seq_path = f.name
        
        predict_code = f"""
# Read sequence
with open({seq_path!r}, 'r') as f:
    seq = f.read().strip()

# Your prediction code here
import torch  # or tensorflow
model = load_cached_model()  # Load from cache
predictions = model.predict(seq, {repr(assay_ids)})
result = predictions.tolist()
"""
        
        try:
            predictions = self.run_code_in_environment(predict_code, timeout=120)
        finally:
            os.remove(seq_path)  # Clean up the temporary sequence file
        return np.array(predictions)
    else:
        # Direct prediction
        return self._predict_direct(seq, assay_ids)
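For raw sequence input, the model usually needs exactly `sequence_length` bases. The sketch below shows one way to center a shorter sequence with `N` padding, or to trim a longer one to its central window. The helper `pad_to_length` is hypothetical and is not the same as `extract_sequence_with_padding`, which may instead extend the genomic window using flanking reference sequence.

```python
def pad_to_length(seq: str, target_length: int, pad_char: str = "N") -> str:
    """Center `seq` in a window of `target_length`, padding or trimming evenly.

    Hypothetical helper for illustration; not part of the Chorus API.
    """
    if len(seq) >= target_length:
        # Keep the central window when the input is too long.
        offset = (len(seq) - target_length) // 2
        return seq[offset:offset + target_length]
    missing = target_length - len(seq)
    left = missing // 2
    right = missing - left  # Any odd base of padding goes on the right
    return pad_char * left + seq + pad_char * right


padded = pad_to_length("ACGT", 10)       # "NNNACGTNNN"
trimmed = pad_to_length("AACCGGTT", 4)   # "CCGG"
```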

3.4 Required Helper Methods

def _get_context_size(self) -> int:
    """Return the required context size for the model."""
    return self.sequence_length

def _get_sequence_length_bounds(self) -> Tuple[int, int]:
    """Return min and max sequence lengths accepted by the model."""
    return (1000, self.sequence_length)

def _get_bin_size(self) -> int:
    """Return the bin size for predictions."""
    return self.bin_size
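These helpers determine the shape of the prediction array. As a rough check using the example values from Step 2, and assuming the model emits one bin per `bin_size` bases across the whole input (many models crop the edges, so the real bin count can be smaller):

```python
sequence_length = 524288  # Example value from Step 2
bin_size = 128            # Example value from Step 2
num_tracks = 7919         # Example value from Step 2

# Number of output bins if the full input window is covered.
num_bins = sequence_length // bin_size
expected_shape = (num_bins, num_tracks)


def bin_index(position: int, size: int = bin_size) -> int:
    """Map a 0-based offset within the input window to its output bin."""
    return position // size
```

With these values `expected_shape` is `(4096, 7919)`, and offsets 0 through 127 all fall into bin 0.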

Step 4: Define the Conda Environment

Create an environment configuration that we can integrate into the setup system. Provide us with:

Conda packages needed:

# Example for a PyTorch-based model
channels:
  - pytorch
  - conda-forge
  - bioconda
  - defaults

dependencies:
  - python=3.9
  - pytorch=2.0
  - torchvision
  - numpy
  - pandas
  - scikit-learn
  - pysam
  - bedtools
  - pip
  - pip:
    - your-special-package==1.0.0

Installation commands:

# Any special setup commands
# For example, downloading model weights:
mkdir -p ~/.cache/borzoi  # Make sure the target directory exists
wget https://example.com/model_weights.pt -O ~/.cache/borzoi/weights.pt

Step 5: Register Your Oracle

Add your oracle to chorus/oracles/__init__.py:

from .borzoi import BorzoiOracle

ORACLES = {
    'enformer': EnformerOracle,
    'borzoi': BorzoiOracle,  # Add your oracle
    # ...
}

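Update chorus/__init__.py so that the oracle uses environment isolation by default while callers can still override it:

if oracle_name.lower() == 'borzoi':
    from .oracles.borzoi import BorzoiOracle
    # Default to isolation, but respect an explicit use_environment=False
    kwargs.setdefault('use_environment', True)
    return BorzoiOracle(**kwargs)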
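One way to picture how the registry and the factory fit together is a small lookup-based `create_oracle`. This is a sketch under assumptions: `StubOracle` stands in for a real oracle class, and the actual factory in chorus/__init__.py may be structured differently.

```python
from typing import Any, Callable, Dict


class StubOracle:
    """Hypothetical stand-in for an oracle class such as BorzoiOracle."""

    def __init__(self, use_environment: bool = True, reference_fasta: Any = None):
        self.use_environment = use_environment
        self.reference_fasta = reference_fasta


# Registry mapping lowercase oracle names to their classes.
ORACLES: Dict[str, Callable[..., Any]] = {"borzoi": StubOracle}


def create_oracle(oracle_name: str, **kwargs: Any) -> Any:
    """Look up an oracle by name and build it, isolating by default."""
    key = oracle_name.lower()
    if key not in ORACLES:
        known = ", ".join(sorted(ORACLES))
        raise ValueError(f"Unknown oracle '{oracle_name}'. Available: {known}")
    # Callers may pass use_environment=False explicitly (as the tests do).
    kwargs.setdefault("use_environment", True)
    return ORACLES[key](**kwargs)


isolated = create_oracle("Borzoi")
direct = create_oracle("borzoi", use_environment=False)
```

Using `setdefault` avoids passing `use_environment` twice when the caller already supplies it.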

Step 6: Add Tests

Create a test file tests/test_borzoi.py:

import pytest
import chorus


def test_borzoi_creation():
    """Test Borzoi oracle creation."""
    oracle = chorus.create_oracle('borzoi', use_environment=False)
    assert oracle.oracle_name == 'borzoi'
    assert oracle.sequence_length == 524288


def test_borzoi_tracks():
    """Test track listing."""
    oracle = chorus.create_oracle('borzoi', use_environment=False)
    assays = oracle.list_assay_types()
    assert 'DNase' in assays
    
    cells = oracle.list_cell_types()
    assert 'K562' in cells


# Add more tests for predictions, etc.
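Prediction tests typically cover two things: calling the oracle before the model is loaded should fail, and the output should have the expected shape. The sketch below shows both against a self-contained fake. `FakeOracle`, its public `predict` method, and the `"DNase:K562"` identifier format are all hypothetical; in a real test you would create the oracle with `chorus.create_oracle` and would probably use `pytest.raises` for the error case.

```python
class ModelNotLoadedError(RuntimeError):
    """Local stand-in for chorus.core.exceptions.ModelNotLoadedError."""


class FakeOracle:
    """Hypothetical minimal oracle used only to illustrate the test shapes."""

    sequence_length = 1024
    bin_size = 128

    def __init__(self):
        self.loaded = False

    def load_pretrained_model(self):
        self.loaded = True

    def predict(self, seq, assay_ids):
        if not self.loaded:
            raise ModelNotLoadedError("Model not loaded")
        num_bins = self.sequence_length // self.bin_size
        # One row per bin, one column per requested assay.
        return [[0.0] * len(assay_ids) for _ in range(num_bins)]


def test_predict_requires_loaded_model():
    """Predicting before loading should raise ModelNotLoadedError."""
    oracle = FakeOracle()
    try:
        oracle.predict("A" * 1024, ["DNase:K562"])
    except ModelNotLoadedError:
        return
    raise AssertionError("expected ModelNotLoadedError")


def test_prediction_shape():
    """Output should be (num_bins, number of requested assays)."""
    oracle = FakeOracle()
    oracle.load_pretrained_model()
    preds = oracle.predict("A" * 1024, ["DNase:K562", "ATAC-seq:HepG2"])
    assert len(preds) == 8
    assert all(len(row) == 2 for row in preds)
```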

Step 7: Create an Example Notebook

Create examples/borzoi_example.ipynb demonstrating your oracle's features:

# Example notebook structure
1. Oracle initialization
2. Model loading
3. Basic sequence prediction
4. Genomic coordinate prediction (if supported)
5. Track visualization
6. Special features of your model

Step 8: Document Your Oracle

Add a section to the README.md describing:

Model capabilities
Sequence length requirements
Number of tracks
Special features
Citation information

Environment Configuration Format

When submitting your oracle, provide the environment configuration in this format:

# In your oracle implementation or a separate config file
BORZOI_ENV_CONFIG = {
    'channels': ['pytorch', 'conda-forge', 'bioconda', 'defaults'],
    'dependencies': [
        'python=3.9',
        'pytorch=2.0',
        'numpy',
        'pandas',
        # ... other conda packages
    ],
    'pip_packages': [
        'special-package==1.0.0',
        # ... other pip packages
    ],
    'post_install_commands': [
        'mkdir -p ~/.cache/borzoi',
        'wget https://example.com/weights.pt -O ~/.cache/borzoi/weights.pt',
        # ... other setup commands
    ]
}
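To see how this dictionary maps onto a conda environment file like the one in Step 4, here is a sketch that renders it as environment.yml text. The function `env_config_to_yaml` and the environment name `chorus-borzoi` are assumptions for illustration. The sketch ignores `post_install_commands`, and the setup system may process the config differently.

```python
from typing import Dict, List


def env_config_to_yaml(name: str, config: Dict[str, List[str]]) -> str:
    """Render an oracle env config dict as conda environment.yml text.

    Hypothetical helper; not part of the Chorus API.
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in config.get("channels", [])]

    deps = list(config.get("dependencies", []))
    pip_packages = config.get("pip_packages", [])
    # conda needs pip itself installed before it can handle the pip section.
    if pip_packages and "pip" not in deps:
        deps.append("pip")

    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in deps]
    if pip_packages:
        lines.append("  - pip:")
        lines += [f"    - {pkg}" for pkg in pip_packages]
    return "\n".join(lines) + "\n"


example_config = {
    "channels": ["conda-forge"],
    "dependencies": ["python=3.9"],
    "pip_packages": ["special-package==1.0.0"],
}
yaml_text = env_config_to_yaml("chorus-borzoi", example_config)
```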

Best Practices

Lazy Imports: Import model-specific packages inside methods, so that importing chorus does not fail when those packages are installed only in the oracle's environment:

def _load_direct(self, weights):
    import torch  # Import here, not at module level

Memory Management: Be mindful of memory usage, especially for large models
Error Handling: Provide clear error messages for common issues
Logging: Use the logger for important status updates
Type Hints: Use proper type annotations for all methods
Documentation: Include docstrings for all public methods
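The lazy-import and error-handling points above combine naturally. A small helper can defer the import and turn a missing package into an actionable message. `lazy_import` and its `install_hint` parameter are hypothetical names for this sketch, which uses only `importlib` from the standard library.

```python
import importlib
from types import ModuleType


def lazy_import(module_name: str, install_hint: str = "") -> ModuleType:
    """Import a module on first use, with a clearer error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        message = f"'{module_name}' is required for this oracle but is not installed."
        if install_hint:
            message += f" {install_hint}"
        raise ImportError(message) from exc


# Works for any importable module; json is used here because it is always available.
json_module = lazy_import("json")
```

Inside an oracle, a method such as `_load_direct` could call `lazy_import("torch", "Install it in the oracle's conda environment.")` rather than importing at module level.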

Submitting Your Contribution

Create a Pull Request with:
- Your oracle implementation
- Environment configuration
- Tests
- Example notebook
- Documentation updates

PR Description should include:
- Model description and capabilities
- Environment setup instructions
- Any special requirements
- Link to model paper/repository

Testing: Ensure all tests pass and the oracle works in both modes:
- With environment isolation (use_environment=True)
- Without environment isolation (use_environment=False)

Example PR Structure

chorus/
├── oracles/
│   └── borzoi.py          # Your oracle implementation
├── tests/
│   └── test_borzoi.py     # Tests
├── examples/
│   └── borzoi_example.ipynb  # Example notebook
└── README.md              # Updated with your oracle info

Getting Help

Open an issue for questions
Join discussions in existing oracle implementation PRs
Tag maintainers for review: @pinellolab

Current Priorities

We're particularly interested in implementations for:

Borzoi - Enhanced Enformer model
ChromBPNet - Base-pair resolution TF binding
Sei - Sequence regulatory effects
Basset - Chromatin accessibility
DeepSEA - Variant effects

Thank you for contributing to Chorus! Your implementation will help make genomic deep learning models more accessible to the research community.
