- Overview
- Core Classes
- Prediction Methods
- Utility Functions
- Track Management
- Environment Management
- Examples
Chorus provides a unified interface for genomic sequence prediction models (oracles). Each oracle predicts regulatory activity from DNA sequences, with support for various genomic manipulations and analyses.
Base class for all oracle implementations. Provides common functionality and defines the interface that all oracles must implement.
class OracleBase(ABC):
    def __init__(self, use_environment: bool = True)
Attributes:
- oracle_name (str): Name of the oracle (e.g., 'enformer')
- reference_fasta (str): Path to reference genome FASTA file
- loaded (bool): Whether the model is loaded
- use_environment (bool): Whether to use an isolated conda environment
Implementation of the Enformer model for predicting gene expression and chromatin states.
class EnformerOracle(OracleBase):
    def __init__(self, use_environment: bool = True, reference_fasta: Optional[str] = None)
Enformer-specific attributes:
- sequence_length (int): 393,216 bp input sequence length
- target_length (int): 896 bins in output
- bin_size (int): 128 bp per bin
- Output window: 114,688 bp (896 × 128)
- Offset from input edges: 139,264 bp on each side
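The window arithmetic above can be checked directly. The following is a minimal sketch that uses only the numbers listed; the constant names and the `output_window` helper are illustrative and are not part of the Chorus API.

```python
from typing import Tuple

# Values copied from the attribute list above; the names are illustrative.
SEQUENCE_LENGTH = 393_216  # bp of input sequence
TARGET_LENGTH = 896        # number of output bins
BIN_SIZE = 128             # bp per output bin


def output_window(input_start: int) -> Tuple[int, int]:
    """Return (start, end) of the span covered by predictions.

    input_start is the first coordinate of the input window; the result
    uses the same coordinate convention, with end exclusive.
    """
    output_length = TARGET_LENGTH * BIN_SIZE
    offset = (SEQUENCE_LENGTH - output_length) // 2
    return input_start + offset, input_start + offset + output_length


print(TARGET_LENGTH * BIN_SIZE)                           # 114688
print((SEQUENCE_LENGTH - TARGET_LENGTH * BIN_SIZE) // 2)  # 139264
print(output_window(0))                                   # (139264, 253952)
```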
Basic prediction method for DNA sequences or genomic coordinates.
def predict(
    input_data: Union[str, Tuple[str, int, int]],
    assay_ids: List[str],
    create_tracks: bool = False
) -> Dict[str, np.ndarray]
Parameters:
- input_data: Either:
  - DNA sequence string (must be the model's required length)
  - Tuple of (chromosome, start, end) for genomic coordinates
- assay_ids: List of track identifiers (oracle-specific)
  - Enformer: ENCODE IDs (e.g., 'ENCFF413AHU'), CAGE IDs (e.g., 'CNhs11250'), or descriptions (e.g., 'DNase:K562')
- create_tracks: Whether to create track files (not implemented)
Returns:
- Dictionary mapping track IDs to prediction arrays
- Each array has shape (n_bins,) where n_bins = output_length / bin_size
Logic:
- If input is coordinates, extracts sequence from reference genome
- Validates sequence length matches model requirements
- Runs model prediction
- Returns predictions for requested tracks
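The first two steps of this logic (resolve the input to a sequence, then validate it) might look roughly like the sketch below. It is not Chorus's implementation: `resolve_input`, `toy_fetch`, and the use of `ValueError` are hypothetical stand-ins, and accepting `N` as a valid base is an assumption.

```python
from typing import Callable, Tuple, Union

InputData = Union[str, Tuple[str, int, int]]


def resolve_input(
    input_data: InputData,
    required_length: int,
    fetch: Callable[[str, int, int], str],
) -> str:
    """Turn a sequence or a (chrom, start, end) tuple into a validated sequence."""
    if isinstance(input_data, tuple):
        # Coordinates: pull the sequence from the reference genome.
        chrom, start, end = input_data
        seq = fetch(chrom, start, end)
    else:
        seq = input_data
    seq = seq.upper()
    if len(seq) != required_length:
        raise ValueError(f"expected {required_length} bp, got {len(seq)} bp")
    invalid = set(seq) - set("ACGTN")  # assumption: N is tolerated
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    return seq


def toy_fetch(chrom: str, start: int, end: int) -> str:
    """Toy stand-in for genome access: returns poly-A of the requested length."""
    return "A" * (end - start)


print(resolve_input("acgt" * 2, 8, toy_fetch))          # ACGTACGT
print(resolve_input(("chr1", 100, 108), 8, toy_fetch))  # AAAAAAAA
```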
Example:
# From sequence
seq = 'ACGT' * 98304 # 393,216 bp for Enformer
predictions = oracle.predict(seq, ['ENCFF413AHU', 'CNhs11250'])
# From coordinates
predictions = oracle.predict(('chrX', 48777634, 48790694), ['ENCFF413AHU'])
Replace a genomic region with a new sequence and predict the effects.
def predict_region_replacement(
    genomic_region: Union[str, pd.DataFrame],
    seq: str,
    assay_ids: List[str],
    create_tracks: bool = False,
    genome: Optional[str] = None
) -> Dict
Parameters:
- genomic_region: Region to replace
  - String format: "chr1:1000-2000" (1-based, inclusive)
  - DataFrame: First row with columns 'chrom', 'start', 'end'
- seq: Replacement DNA sequence (must match region length exactly)
- assay_ids: List of track identifiers
- create_tracks: Whether to save track files
- genome: Reference genome path (uses oracle's reference_fasta if None)
Returns: Dictionary with:
- raw_predictions: Dict[track_id, np.ndarray] - Raw model outputs
- normalized_scores: Dict[track_id, np.ndarray] - Min-max normalized (0-1)
- track_objects: List[Track] - Track objects if create_tracks=True
- track_files: List[str] - File paths if create_tracks=True
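For readers unfamiliar with min-max normalization, here is a small sketch of how `normalized_scores` could relate to `raw_predictions`. It uses plain lists instead of NumPy arrays to stay self-contained. Normalizing each track independently and mapping a constant track to zeros are both assumptions, and the values are made up.

```python
from typing import Dict, List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a flat track has no dynamic range, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up raw outputs for a single track.
raw_predictions: Dict[str, List[float]] = {"ENCFF413AHU": [2.0, 4.0, 6.0, 10.0]}
normalized_scores = {
    track: min_max_normalize(values) for track, values in raw_predictions.items()
}
print(normalized_scores)  # {'ENCFF413AHU': [0.0, 0.25, 0.5, 1.0]}
```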
Logic:
- Validates replacement sequence length matches region length
- Calculates full context window centered on region
- Extracts context sequence from reference genome
- Replaces specified region within context
- Runs prediction on modified full-length sequence
- Returns predictions for the output window
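The steps above can be sketched on a toy sequence. This is an illustration only: `replace_in_context` is a hypothetical helper, the whole "genome" is a single in-memory string, and centering the context on the region midpoint with integer division is an assumption about how ties are broken.

```python
def replace_in_context(
    genome_seq: str,
    start: int,
    end: int,
    replacement: str,
    context_length: int,
) -> str:
    """Return a context window centered on [start, end] with the region replaced.

    start and end are 1-based and inclusive, as in "chr1:1000-2000".
    """
    region_length = end - start + 1
    if len(replacement) != region_length:
        raise ValueError("replacement must match the region length exactly")
    if region_length > context_length:
        raise ValueError("region is longer than the model context")

    region_start0 = start - 1  # convert to 0-based
    center = region_start0 + region_length // 2
    ctx_start = center - context_length // 2
    ctx_end = ctx_start + context_length
    if ctx_start < 0 or ctx_end > len(genome_seq):
        raise ValueError("context window runs past the end of the sequence")

    context = genome_seq[ctx_start:ctx_end]
    rel = region_start0 - ctx_start  # region offset inside the context
    return context[:rel] + replacement + context[rel + region_length:]


toy_genome = "A" * 40
print(replace_in_context(toy_genome, 19, 22, "GGGG", 12))  # AAAAGGGGAAAA
```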
Example:
# Replace a 200 bp region with GATA motif repeats
enhancer = 'GATA' * 50  # 200 bp
results = oracle.predict_region_replacement(
    'chr11:5247401-5247600',  # 200 bp (1-based, inclusive), matching len(enhancer)
    enhancer,
    ['ENCFF413AHU']
)
Insert a sequence at a specific genomic position.
def predict_region_insertion_at(
    genomic_position: Union[str, pd.DataFrame],
    seq: str,
    assay_ids: List[str],
    create_tracks: bool = False,
    genome: Optional[str] = None
) -> Dict
Parameters:
- genomic_position: Insertion point
  - String format: "chr1:1000" (1-based)
  - DataFrame: First row with columns 'chrom', 'pos'
- seq: DNA sequence to insert (any length that fits in context)
- assay_ids: List of track identifiers
- create_tracks: Whether to save track files
- genome: Reference genome path
Returns:
Same format as predict_region_replacement()
Logic:
- Calculates required flanking sequence sizes
- Extracts left flank (before insertion point)
- Extracts right flank (after insertion point)
- Constructs: left_flank + inserted_seq + right_flank
- Ensures total length matches model requirements
- Runs prediction on modified sequence
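A minimal sketch of the flank construction follows, again on a toy in-memory sequence. `insert_in_context` is a hypothetical helper. Placing the insert immediately after the 1-based position and giving any odd leftover base to the right flank are assumptions; the text does not specify either.

```python
def insert_in_context(
    genome_seq: str,
    position: int,
    insert: str,
    context_length: int,
) -> str:
    """Build left_flank + insert + right_flank with a fixed total length.

    position is 1-based; the insert is placed right after that base (assumption).
    """
    flank_total = context_length - len(insert)
    if flank_total < 0:
        raise ValueError("insert is longer than the model context")
    left_len = flank_total // 2
    right_len = flank_total - left_len

    left_start = position - left_len
    right_end = position + right_len
    if left_start < 0 or right_end > len(genome_seq):
        raise ValueError("flanks run past the end of the sequence")

    left_flank = genome_seq[left_start:position]   # bases up to and including position
    right_flank = genome_seq[position:right_end]   # bases after position
    modified = left_flank + insert + right_flank
    assert len(modified) == context_length
    return modified


toy_genome = "A" * 20 + "C" * 20
print(insert_in_context(toy_genome, 20, "GG", 10))  # AAAAGGCCCC
```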
Example:
# Insert enhancer at specific position
results = oracle.predict_region_insertion_at(
    'chr11:5247500',
    'GATA' * 50,  # Insert 200 bp
    ['CNhs11250']
)
Analyze effects of genetic variants (SNPs, indels).
def predict_variant_effect(
    genomic_region: Union[str, pd.DataFrame],
    variant_position: Union[str, pd.DataFrame],
    alleles: Union[List[str], pd.DataFrame],
    assay_ids: List[str],
    create_tracks: bool = False,
    genome: Optional[str] = None
) -> Dict
Parameters:
- genomic_region: Region containing the variant
  - Should be large enough for model context
- variant_position: Position of variant
  - String format: "chr1:1000"
  - Must be within genomic_region
- alleles: List of alleles to test
  - First element is reference allele
  - Remaining elements are alternative alleles
  - Can also be DataFrame with 'ref' and 'alt' columns
- assay_ids: List of track identifiers
- create_tracks: Whether to save track files
- genome: Reference genome path
Returns: Dictionary with:
- predictions: Dict of allele_name → track predictions
  - 'reference': predictions for reference allele
  - 'alt_1', 'alt_2', etc.: predictions for alternatives
- effect_sizes: Dict of alt_allele → track → effect array
  - Effect = alternative - reference
- track_objects: Dict if create_tracks=True
- track_files: Dict if create_tracks=True
- variant_info: Summary of variant tested
Logic:
- Extracts reference sequence for region
- Validates reference allele matches genome
- Creates modified sequences for each allele
- Runs predictions for all alleles
- Calculates effect sizes (alt - ref)
- Returns comprehensive results
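The flow above, and the shape of the returned dictionary, can be illustrated with a toy model. Everything here is a sketch under stated assumptions: `variant_effects` and `toy_predict` are hypothetical, the "model" just counts G/C bases in each half of the sequence, and plain lists stand in for NumPy arrays. Indels change the sequence length, which a real implementation would have to compensate for; this sketch does not.

```python
from typing import Callable, Dict, List

Tracks = Dict[str, List[float]]


def variant_effects(
    region_seq: str,
    offset: int,             # 0-based position of the variant within region_seq
    alleles: List[str],      # alleles[0] is the reference allele
    predict: Callable[[str], Tracks],
) -> Dict[str, Dict]:
    ref = alleles[0]
    if region_seq[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")

    predictions: Dict[str, Tracks] = {}
    for i, allele in enumerate(alleles):
        name = "reference" if i == 0 else f"alt_{i}"
        seq = region_seq[:offset] + allele + region_seq[offset + len(ref):]
        predictions[name] = predict(seq)

    # Effect = alternative - reference, computed per track and per bin.
    effect_sizes = {
        name: {
            track: [a - r for a, r in zip(values, predictions["reference"][track])]
            for track, values in tracks.items()
        }
        for name, tracks in predictions.items()
        if name != "reference"
    }
    return {"predictions": predictions, "effect_sizes": effect_sizes}


def toy_predict(seq: str) -> Tracks:
    """Toy stand-in for a model: G/C count in each half of the sequence."""
    half = len(seq) // 2
    count = lambda s: float(sum(base in "GC" for base in s))
    return {"toy_track": [count(seq[:half]), count(seq[half:])]}


result = variant_effects("AACAAAAA", 2, ["C", "A", "G"], toy_predict)
print(result["effect_sizes"])
# {'alt_1': {'toy_track': [-1.0, 0.0]}, 'alt_2': {'toy_track': [0.0, 0.0]}}
```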
Example:
# Test all possible SNPs at a position
results = oracle.predict_variant_effect(
    'chr11:5247000-5248000',  # ~1 kb region
    'chr11:5247500',          # Variant position
    ['C', 'A', 'G', 'T'],     # C is reference
    ['ENCFF413AHU']
)
# Access results
ref_pred = results['predictions']['reference']['ENCFF413AHU']
alt1_pred = results['predictions']['alt_1']['ENCFF413AHU']
effect = results['effect_sizes']['alt_1']['ENCFF413AHU']
def extract_sequence(
    genomic_region: str,
    genome: str = "hg38.fa"
) -> str
Extracts DNA sequence from reference genome.
Parameters:
- genomic_region: "chr1:1000-2000" format (1-based, inclusive)
- genome: Path to indexed FASTA file
Returns:
- DNA sequence string (uppercase)
Note: Properly handles coordinate conversion from 1-based genomic to 0-based pysam.
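The coordinate conversion in the note amounts to subtracting one from the start and leaving the end unchanged, because pysam's fetch takes 0-based, half-open intervals. The sketch below shows only the parsing and conversion; `parse_region` is a hypothetical helper, not the function Chorus uses, and it does not open any FASTA file.

```python
import re
from typing import Tuple


def parse_region(region: str) -> Tuple[str, int, int]:
    """Parse "chr1:1000-2000" (1-based, inclusive) into 0-based, half-open coords."""
    match = re.fullmatch(r"([^:]+):([\d,]+)-([\d,]+)", region.strip())
    if match is None:
        raise ValueError(f"cannot parse region: {region!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in region: {region!r}")
    # 1-based inclusive [start, end] -> 0-based half-open [start - 1, end)
    return chrom, start - 1, end


chrom, start0, end0 = parse_region("chr1:1000-2000")
print(chrom, start0, end0)  # chr1 999 2000
print(end0 - start0)        # 1001 bases, matching the inclusive range
```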
def apply_variant(
    reference_seq: str,
    position: int,
    ref: str,
    alt: str
) -> str
Applies a variant to a sequence.
Parameters:
- reference_seq: Original DNA sequence
- position: 0-based position in sequence
- ref: Reference allele (must match sequence)
- alt: Alternative allele
Returns:
- Modified sequence with variant applied
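To make the parameter semantics concrete, here is a sketch with the same signature. It is named `apply_variant_sketch` to make clear it is not the library's implementation; the case-insensitive reference check and the `ValueError` are assumptions.

```python
def apply_variant_sketch(reference_seq: str, position: int, ref: str, alt: str) -> str:
    """Replace ref with alt at a 0-based position, after checking ref matches."""
    observed = reference_seq[position:position + len(ref)]
    if observed.upper() != ref.upper():
        raise ValueError(
            f"reference allele {ref!r} does not match sequence {observed!r} "
            f"at position {position}"
        )
    return reference_seq[:position] + alt + reference_seq[position + len(ref):]


print(apply_variant_sketch("ACGTACGT", 2, "G", "T"))   # SNP:      ACTTACGT
print(apply_variant_sketch("ACGTACGT", 2, "GT", "G"))  # deletion: ACGACGT
```

Note that the deletion shortens the sequence by one base, so a caller feeding a fixed-length model would need to pad or re-extract flanking sequence afterwards.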
def get_genome(genome_name: str = 'hg38') -> Path
Downloads the reference genome if needed and returns its path.
Parameters:
genome_name: One of 'hg38', 'hg19', 'mm10', 'mm9', 'dm6', 'ce11'
Returns:
- Path object to genome FASTA file
Logic:
- Checks if genome already downloaded
- Downloads from UCSC if needed
- Creates FASTA index
- Returns path
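The check-then-download pattern in this logic can be sketched as below. `ensure_genome`, the `<name>.fa` file layout, and the injected `download` callable are all hypothetical; the real function fetches from UCSC and also builds a FASTA index, which this sketch leaves out so that it runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ensure_genome(
    name: str,
    cache_dir: Path,
    download: Callable[[str, Path], None],
) -> Path:
    """Return the cached FASTA path, calling download only if it is missing."""
    fasta = cache_dir / f"{name}.fa"  # assumption: one <name>.fa per genome
    if not fasta.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        download(name, fasta)
    return fasta


calls: List[str] = []


def fake_download(name: str, dest: Path) -> None:
    """Stand-in for a real download: writes a tiny FASTA and records the call."""
    calls.append(name)
    dest.write_text(">chrTest\nACGT\n")


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_genome("hg38", Path(tmp), fake_download)
    second = ensure_genome("hg38", Path(tmp), fake_download)
    print(first == second, len(calls))  # True 1
```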
def download_gencode(
    version: str = 'v48',
    annotation_type: str = 'basic'
) -> Path
Downloads GENCODE gene annotations.
Parameters:
- version: GENCODE version (e.g., 'v48')
- annotation_type: 'basic' or 'comprehensive'
Returns:
- Path to GTF file
def get_gene_tss(gene_name: str) -> pd.DataFrame
Gets transcription start sites for a gene.
Parameters:
gene_name: Gene symbol (e.g., 'GATA1')
Returns:
- DataFrame with columns: transcript_id, chrom, tss, strand, gene_name
def visualize_chorus_predictions(
    predictions: Dict[str, np.ndarray],
    chrom: str,
    start: int,
    track_ids: List[str],
    output_file: Optional[str] = None,
    bin_size: int = 128,
    style: str = 'modern',
    use_pygenometracks: bool = True,
    gtf_file: Optional[str] = None,
    show_gene_names: bool = True
) -> None
Creates publication-quality visualizations of predictions.
Parameters:
- predictions: Dict of track_id → prediction array
- chrom: Chromosome name
- start: Start coordinate
- track_ids: List of tracks to plot
- output_file: Save to file if provided
- bin_size: Bin size for predictions
- style: 'modern', 'classic', or 'minimal'
- use_pygenometracks: Use pyGenomeTracks if available
- gtf_file: Gene annotation file for gene track
- show_gene_names: Whether to label genes
class Track:
    def __init__(
        self,
        name: str,
        assay_type: str,
        cell_type: str,
        data: pd.DataFrame,
        color: Optional[str] = None
    )
Represents a genomic signal track.
Methods:
- to_bedgraph(filename): Save as BedGraph
- to_bigwig(filename, chrom_sizes): Save as BigWig
- normalize(method): Normalize values
- smooth(window_size): Smooth signal
def save_predictions_as_bedgraph(
    predictions: Dict[str, np.ndarray],
    chrom: str,
    start: int,
    end: Optional[int] = None,
    output_dir: str = ".",
    prefix: str = "",
    bin_size: Optional[int] = None,
    track_colors: Optional[Dict[str, str]] = None
) -> List[str]
Saves predictions as BedGraph files for genome browser visualization.
Note for Enformer: Automatically handles coordinate mapping from input window to output window.
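The coordinate mapping mentioned in this note can be illustrated as follows. This is a sketch, not the library's code: `to_bedgraph_lines` is a hypothetical helper, it assumes `input_start` is the 0-based start of the 393,216 bp input window, and it formats values to four decimals arbitrarily. BedGraph intervals are 0-based and half-open.

```python
from typing import List, Sequence


def to_bedgraph_lines(
    values: Sequence[float],
    chrom: str,
    input_start: int,
    bin_size: int = 128,
    offset: int = 139_264,
) -> List[str]:
    """Map per-bin predictions to BedGraph lines.

    The first output bin starts `offset` bp after the start of the input
    window, because predictions cover only the middle of the input.
    """
    first_bin_start = input_start + offset
    lines = []
    for i, value in enumerate(values):
        bin_start = first_bin_start + i * bin_size
        lines.append(f"{chrom}\t{bin_start}\t{bin_start + bin_size}\t{value:.4f}")
    return lines


for line in to_bedgraph_lines([0.5, 1.25], "chr11", 5_000_000):
    print(line)
# chr11	5139264	5139392	0.5000
# chr11	5139392	5139520	1.2500
```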
# Set up oracle environment
chorus setup --oracle enformer
# Check environment health
chorus health
# List environments
chorus list
# Remove environment
chorus remove --oracle enformer
# Create oracle with environment
oracle = chorus.create_oracle('enformer', use_environment=True)
# Run code in oracle's environment
result = oracle.run_code_in_environment(
    "import tensorflow; print(tensorflow.__version__)"
)
import chorus
from chorus.utils import get_genome, download_gencode
# Setup
genome = get_genome('hg38')
gtf = download_gencode()
oracle = chorus.create_oracle('enformer', reference_fasta=str(genome))
oracle.load_pretrained_model()
# Define tracks (Enformer-specific)
tracks = ['ENCFF413AHU', 'CNhs11250'] # DNase:K562, CAGE:K562
# 1. Wild-type prediction
wt = oracle.predict(('chr11', 5247000, 5248000), tracks)
# 2. Test enhancer insertion
enhancer = 'GATA' * 50
inserted = oracle.predict_region_insertion_at(
    'chr11:5247500',
    enhancer,
    tracks
)
# 3. Test variant
variant = oracle.predict_variant_effect(
    'chr11:5247000-5248000',
    'chr11:5247500',
    ['C', 'A', 'G', 'T'],  # C is reference
    tracks
)
# 4. Analyze gene expression
expr = oracle.analyze_gene_expression(
    predictions=wt,
    gene_name='HBB',  # Beta-globin
    chrom='chr11',
    start=5247000,
    end=5248000,
    gtf_file=str(gtf),
    cage_track_ids=['CNhs11250']
)
# 5. Save for visualization
oracle.save_predictions_as_bedgraph(
    wt,
    chrom='chr11',
    start=5247000,
    end=5248000,
    output_dir='results'
)
Enformer notes:
- Requires exactly 393,216 bp input sequence
- Output covers middle 114,688 bp of input
- Uses ENCODE and CAGE track identifiers
- Supports gene expression analysis via CAGE at TSS
Other oracles:
- Borzoi: Similar to Enformer, enhanced performance
- ChromBPNet: Base-resolution, different track naming
- Sei: 21,907 profiles, custom track names
Common exceptions:
- ModelNotLoadedError: Call load_pretrained_model() first
- InvalidSequenceError: Check sequence length and content
- InvalidAssayError: Use valid track identifiers for the oracle
- InvalidRegionError: Check genomic coordinates
- FileFormatError: Ensure genome file is indexed