EarlGrey Pangenome Pipeline (ParTEA)

🎉 Multiple genomes? Time for a parTEA! 🎉

Because analyzing transposable elements across genomes should be a parTEA, not a chore.


What is ParTEA?

ParTEA (Pangenome Transposable Element Analysis) is a Snakemake-based pipeline that brings the party to multi-genome TE annotation! It extends EarlGrey to process multiple genomes in parallel, build pangenome TE libraries, and perform comparative transposable element analysis across species.

Why ParTEA?

  • 🎊 Parallel Processing: Analyze multiple genomes simultaneously
  • 🧬 Pangenome Libraries: Build consensus TE libraries across species
  • 🔄 Optional Clustering: Merge TE libraries using cd-hit for consistent and traceable naming in annotations across all genomes
  • 📊 Rich Outputs: Get annotations, divergence metrics, and visualizations
  • ⚡ Dynamic Threading: Automatically optimizes resource allocation
  • 🔍 Version-Agnostic: Works seamlessly with any EarlGrey version

📖 Citation

If you use ParTEA in your research, please cite:

Baril, T., Galbraith, J. and Hayward, A., 2024. Earl Grey: a fully automated user-friendly transposable element annotation and analysis pipeline. Molecular Biology and Evolution, 41(4), p.msae068.

ParTEA manuscript in preparation.


🔄 Pipeline Overview

ParTEA orchestrates TE analysis across multiple genomes with smart parallelization. Here's the workflow:

                    ┌─────────────────────────────────────┐
                    │   Input: Multiple Genome FASTAs     │
                    └──────────────┬──────────────────────┘
                                   │
                    ┌──────────────▼──────────────────────┐
                    │      prep_genome (per genome)       │
                    │   • Format validation               │
                    │   • Dictionary creation             │
                    └──────────────┬──────────────────────┘
                                   │
          ┌────────────────────────┼────────────────────────┐
          │                        │                        │
    ┌─────▼──────┐          ┌─────▼──────┐          ┌─────▼──────┐
    │ build_db   │          │ build_db   │          │ build_db   │
    │ (genome1)  │          │ (genome2)  │   ...    │ (genomeN)  │
    └─────┬──────┘          └─────┬──────┘          └─────┬──────┘
          │                        │                        │
    ┌─────▼──────┐          ┌─────▼──────┐          ┌─────▼──────┐
    │RepeatModel │          │RepeatModel │          │RepeatModel │
    │ (genome1)  │          │ (genome2)  │   ...    │ (genomeN)  │
    └─────┬──────┘          └─────┬──────┘          └─────┬──────┘
          │                        │                        │
    ┌─────▼──────┐          ┌─────▼──────┐          ┌─────▼──────┐
    │TEstrainer  │          │TEstrainer  │          │TEstrainer  │
    │ (genome1)  │          │ (genome2)  │   ...    │ (genomeN)  │
    └─────┬──────┘          └─────┬──────┘          └─────┬──────┘
          │                        │                        │
          └────────────────────────┼────────────────────────┘
                                   │
                    ┌──────────────▼──────────────────────┐
                    │     cluster_all_species             │
                    │   • Combine all TE libraries        │
                    │   • Optional: cd-hit clustering     │
                    │   • Add RepeatMasker/custom lib     │
                    └──────────────┬──────────────────────┘
                                   │
          ┌────────────────────────┼────────────────────────┐
          │                        │                        │
    ┌─────▼──────┐          ┌─────▼──────┐          ┌─────▼──────┐
    │RepeatMasker│          │RepeatMasker│          │RepeatMasker│
    │ (genome1)  │          │ (genome2)  │   ...    │ (genomeN)  │
    └─────┬──────┘          └─────┬──────┘          └─────┬──────┘
          │                        │                        │
    ┌─────▼──────┐          ┌─────▼──────┐          ┌─────▼──────┐
    │merge_repea │          │merge_repea │          │merge_repea │
    │ (genome1)  │          │ (genome2)  │   ...    │ (genomeN)  │
    └─────┬──────┘          └─────┬──────┘          └─────┬──────┘
          │                        │                        │
    ┌─────▼──────┐          ┌─────▼──────┐          ┌─────▼──────┐
    │ divergence │          │ divergence │          │ divergence │
    │ charts etc │          │ charts etc │   ...    │ charts etc │
    └────────────┘          └────────────┘          └────────────┘

Optional Steps

  • 🎨 HELIANO detection (run_heliano: true/false) - Helitron-specific detection
  • 🔄 Clustering (skip_clustering: true/false) - Merge similar TEs across genomes
  • 🎭 Initial masking (repeatmasker_species or custom_library) - Pre-mask known repeats
  • 📊 DAG visualization (generate_dag: true/false) - Generate workflow graphs

📈 See detailed workflow visualization: Example Rulegraph

📦 Installation

Via conda/mamba (Recommended) - Party in a Package!

Option 1: Add to Existing EarlGrey Environment

If you already have EarlGrey installed in a conda environment:

# Activate your existing EarlGrey environment
mamba activate earlgrey  # or whatever your environment is named

# Install ParTEA into the same environment
mamba install -c conda-forge -c bioconda earlgrey-partea

# Make sure everything's ready to party
earlGreyParTEA --help

Option 2: Create a New Environment

For a fresh installation with both EarlGrey and ParTEA:

# Create a new environment with both packages
mamba create -n partea -c conda-forge -c bioconda earlgrey-partea

# Activate the environment
mamba activate partea

# Make sure everything's ready to party
earlGreyParTEA --help

✨ Magic Feature: ParTEA automatically detects your EarlGrey installation (any version ≥7.0.3) and adapts on the fly. Update EarlGrey anytime - no config changes needed!

Manual Installation - For the DIY ParTEA Planners

git clone https://github.com/TobyBaril/EarlGreyParTEA.git
cd EarlGreyParTEA
chmod +x earlGreyParTEA*
export PATH="$PWD:$PATH"

⚠️ IMPORTANT: Configure EarlGrey Before Running ParTEA

ParTEA requires EarlGrey to be properly configured with Dfam library partitions. The pipeline will check for this and fail with helpful instructions if not configured.

After installing EarlGrey (via conda or manually), you must download additional Dfam partitions:

# Activate your environment
mamba activate partea  # or your environment name

# Check your RepeatMasker library location
which RepeatMasker

# Download Dfam partitions (this may take a while)
# The pipeline will generate a configuration script for you if this step is missing

What happens if you skip this?

ParTEA will detect the missing configuration and:

  1. ✋ Stop the pipeline before wasting compute time
  2. 📝 Generate a configuration script: configure_dfam39.sh
  3. 📋 Provide clear instructions to fix the issue

To configure manually:

When EarlGrey is first installed, only Dfam partition 0 is included. For comprehensive TE annotation, download partitions 0-16:

# Navigate to your RepeatMasker famdb directory
cd "$CONDA_PREFIX/share/RepeatMasker/Libraries/famdb/"

# Download all partitions (0-16)
curl -o 'dfam39_full.#1.h5.gz' 'https://dfam.org/releases/current/families/FamDB/dfam39_full.[0-16].h5.gz'

# Decompress
gunzip *.gz

# Reconfigure RepeatMasker
cd "$CONDA_PREFIX/share/RepeatMasker/"
perl ./configure \
    -libdir "$CONDA_PREFIX/share/RepeatMasker/Libraries" \
    -trf_prgm "$CONDA_PREFIX/bin/trf" \
    -rmblast_dir "$CONDA_PREFIX/bin" \
    -hmmer_dir "$CONDA_PREFIX/bin" \
    -default_search_engine rmblast

# Mark configuration as complete
touch "$CONDA_PREFIX/share/RepeatMasker/Libraries/famdb/.earlgrey.config.complete"

Verification:

# The pipeline will automatically check this on startup
# You can also verify manually:
ls -lh $CONDA_PREFIX/share/RepeatMasker/Libraries/famdb/

# Should see multiple dfam39_full.*.h5 files (not just partition 0)
# Should see .earlgrey.config.complete marker file
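The manual check above can also be scripted. The following is a minimal sketch, not part of ParTEA: `check_famdb` is a hypothetical helper, and it assumes the decompressed files are named `dfam39_full.N.h5` for N = 0 to 16 and that the marker file is `.earlgrey.config.complete`, as in the steps above.

```python
from pathlib import Path

# File names follow the manual configuration steps above.
# check_famdb is a hypothetical helper, not a ParTEA function.
EXPECTED_PARTITIONS = range(17)  # partitions 0-16
MARKER = ".earlgrey.config.complete"


def check_famdb(famdb_dir):
    """Return a list of problems found in a RepeatMasker famdb directory."""
    famdb = Path(famdb_dir)
    problems = []
    for n in EXPECTED_PARTITIONS:
        name = f"dfam39_full.{n}.h5"
        if not (famdb / name).is_file():
            problems.append(f"missing partition file: {name}")
    if not (famdb / MARKER).is_file():
        problems.append(f"missing marker file: {MARKER}")
    return problems
```

Calling `check_famdb` on `$CONDA_PREFIX/share/RepeatMasker/Libraries/famdb` and getting an empty list back suggests the directory matches what the manual steps produce. ParTEA's own startup check may test for more than this.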

🚀 Quick Start

Three simple steps to get the parTEA started!

1️⃣ Generate a config file

earlGreyParTEA --generate-config my_config.yaml

2️⃣ Add your genomes (the more, the merrier!)

genome:
  species1: /path/to/genome1.fasta
  species2: /path/to/genome2.fasta
  species3: /path/to/genome3.fasta

species:
  - species1
  - species2
  - species3

output_dir: /path/to/output

3️⃣ Let the parTEA begin!

earlGreyParTEA -c my_config.yaml -t 16
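If all your genomes sit in one directory, the `genome` and `species` entries from step 2 can be generated rather than typed by hand. The sketch below is not part of ParTEA: `build_config` is a hypothetical helper, it assumes species names can be taken from file stems and that FASTA files end in `.fasta`, `.fa` or `.fna`, and it emits only the three keys shown in step 2.

```python
from pathlib import Path

# Assumed FASTA extensions; adjust to match your files.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}


def build_config(genome_dir, output_dir):
    """Return config text holding the genome, species and output_dir keys."""
    fastas = sorted(
        p for p in Path(genome_dir).iterdir()
        if p.is_file() and p.suffix in FASTA_SUFFIXES
    )
    if not fastas:
        raise ValueError(f"no FASTA files found in {genome_dir}")
    names = [p.stem for p in fastas]
    if len(set(names)) != len(names):
        raise ValueError("duplicate species names derived from file stems")
    lines = ["genome:"]
    lines += [f"  {p.stem}: {p.resolve()}" for p in fastas]
    lines += ["", "species:"]
    lines += [f"  - {name}" for name in names]
    lines += ["", f"output_dir: {output_dir}"]
    return "\n".join(lines) + "\n"
```

Paste the returned text over the matching keys in the template produced by `--generate-config` so the other parameters keep their defaults. File stems containing spaces or YAML-special characters would need quoting, which this sketch does not handle.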

🎭 Pipeline Modes

ParTEA offers three ways to party - choose your adventure!

🎊 Full Pipeline (earlGreyParTEA)

The complete parTEA experience!

Runs the full celebration: library construction → clustering → annotation

earlGreyParTEA -c config.yaml -t 16

What you get:

  • 🧬 Pangenome TE library (clustered across all genomes)
  • 📍 TE annotations for each genome (BED, GFF)
  • 📊 Divergence analysis and landscape plots
  • 📈 Summary charts and statistics
  • 🎨 Workflow visualizations

Perfect for: Complete comparative TE analysis across multiple species


🏗️ Library Construction Only (earlGreyParTEA_LibConstruct)

Build the guest list!

Creates a pangenome TE library without annotation.

earlGreyParTEA_LibConstruct -c config.yaml -t 16

What you get:

  • 📚 {output_dir}/combinedLibraries/combined_all_species.clstrd.fa

Perfect for: Building a curated TE library to annotate other genomes later


🎯 Annotation Only (earlGreyParTEA_AnnotationOnly)

Use an existing playlist!

Annotates genomes using a pre-made TE library (bring your own TEs).

earlGreyParTEA_AnnotationOnly -c config.yaml -t 16

Requirements:

  • Must specify annotation_library in config.yaml
  • Library should be in FASTA format

Perfect for: Annotating multiple genomes with a curated TE library from a previous run or an external source

Command-Line Options

| Option | Short | Description |
| --- | --- | --- |
| --config FILE | -c | Config file (required) |
| --threads INT | -t | Number of threads (required) |
| --memory INT | -m | Max memory in MB (optional) |
| --dry-run | -n | Show what would run without executing |
| --generate-config FILE | - | Generate example config template |
| --unlock | - | Unlock directory after crash |
| --rerun-incomplete | - | Rerun incomplete jobs |
| --help | -h | Show help message |
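When ParTEA is launched from a script, it can help to build the argument list in one place so a dry run and the real run share the same settings. This is a minimal sketch using only the options listed above; `build_command` is a hypothetical helper and not part of ParTEA.

```python
def build_command(config, threads, memory_mb=None, dry_run=False,
                  program="earlGreyParTEA"):
    """Build an argument list from the documented command-line options."""
    cmd = [program, "--config", str(config), "--threads", str(threads)]
    if memory_mb is not None:
        # --memory takes a value in MB
        cmd += ["--memory", str(memory_mb)]
    if dry_run:
        cmd.append("--dry-run")
    return cmd
```

The list can be passed to `subprocess.run(cmd, check=True)`, first with `dry_run=True` and then without. Set `program` to `earlGreyParTEA_LibConstruct` or `earlGreyParTEA_AnnotationOnly` for the other modes, assuming they accept the same options.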

Configuration Parameters

Required Parameters

genome:                    # Dictionary of genome paths
  species1: /path/to/genome1.fasta
  
species: [species1]       # List of species to analyze

output_dir: /path/to/out  # Output directory

Note: The EarlGrey script_dir parameter is automatically detected and does not need to be specified in your config file. ParTEA will find the correct EarlGrey installation regardless of version (7.x, 8.x, etc.). Only set script_dir manually if you have a custom installation location.

Library Construction Parameters

iterations: 10            # BLAST-extend-align cycles
flank: 1000              # Flanking basepairs to extract
max_consensus_seqs: 20   # Max sequences for consensus
min_consensus_seqs: 3    # Min sequences for consensus

Initial Masking (Optional)

Choose ONE or leave both empty:

repeatmasker_species: "fungi"        # Use RepeatMasker database
# OR
custom_library: "/path/to/lib.fa"   # Use custom library

Clustering Options

skip_clustering: false     # Set true to skip clustering
clustering_identity: 0.8   # cd-hit identity threshold (0.0-1.0)
clustering_coverage: 0.8   # cd-hit coverage threshold (0.0-1.0)

Output Options

softmask: false           # Generate softmasked genomes
margin: false             # Remove short TEs (<100bp)
run_heliano: true         # Run HELIANO for Helitron detection

Visualization Options

generate_dag: true        # Generate workflow DAG graphs
dag_format: "svg"         # Format: svg, png, or pdf

Output Structure

output_dir/
├── combinedLibraries/
│   ├── combined_all_species.clstrd.fa      # Pangenome TE library
│   └── combined_all_species.nonclstrd.fa   # Unclustered library
│
├── species1_EarlGrey/
│   ├── species1_Database/              # RepeatModeler database
│   ├── species1_RepeatModeler/         # RepeatModeler working files
│   ├── species1_strainer/              # TEstrainer output
│   ├── species1_RepeatMasker_Against_Custom_Library/
│   ├── species1_mergedRepeats/         # Merged annotations
│   └── species1_summaryFiles/          # Final outputs
│       ├── species1.filteredRepeats.bed
│       ├── species1.filteredRepeats.gff
│       ├── species1.highLevelCount.txt
│       ├── species1.summaryPie.pdf
│       ├── species1_divergence_summary_table.tsv
│       └── species1.softmasked.fasta (if enabled)
│
├── species2_EarlGrey/
│   └── ...
│
├── workflow_visualization/
│   ├── dag_full_mode.svg               # Workflow DAG visualization
│   └── dag_full_mode_rulegraph.svg     # Simplified rule graph
│
└── validated_config.yaml               # Config used for run
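Because every species follows the same directory layout, the final outputs are easy to gather for downstream comparison. The sketch below assumes the layout shown above; `find_summary_files` is a hypothetical helper, not something ParTEA provides.

```python
from pathlib import Path


def find_summary_files(output_dir, suffix=".filteredRepeats.bed"):
    """Map each species name to its summary file ending in `suffix`."""
    found = {}
    for species_dir in sorted(Path(output_dir).glob("*_EarlGrey")):
        # Directory names look like species1_EarlGrey
        species = species_dir.name[: -len("_EarlGrey")]
        candidate = (
            species_dir / f"{species}_summaryFiles" / f"{species}{suffix}"
        )
        if candidate.is_file():
            found[species] = candidate
    return found
```

Species whose run has not finished are simply absent from the result, so comparing its keys against the species list in your config gives a quick completeness check. Pass `suffix=".filteredRepeats.gff"` or `suffix=".highLevelCount.txt"` to collect the other outputs.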

Example Workflows

Example 1: Full Analysis of Multiple Genomes

# Generate config
earlGreyParTEA --generate-config analysis.yaml

# Edit config with genome paths
# Then run
earlGreyParTEA -c analysis.yaml -t 32 -m 128000

Example 2: Build Pangenome Library

# Generate config for library construction
earlGreyParTEA_LibConstruct --generate-config build_lib.yaml

# Edit config, then build library
earlGreyParTEA_LibConstruct -c build_lib.yaml -t 16

# Output: {output_dir}/combinedLibraries/combined_all_species.clstrd.fa

Example 3: Annotate with Pre-existing Library

# Generate config for annotation
earlGreyParTEA_AnnotationOnly --generate-config annotate.yaml

# Edit config and set annotation_library parameter
# annotation_library: "/path/to/combined_all_species.clstrd.fa"

# Run annotation
earlGreyParTEA_AnnotationOnly -c annotate.yaml -t 16

Example 4: Dry Run to Check Pipeline

earlGreyParTEA -c config.yaml -t 16 --dry-run

⚡ Dynamic Resource Allocation

ParTEA knows how to share! The pipeline automatically distributes computing power across your genomes:

| Cores | Genomes | Threads/Genome | Parallel Jobs | Party Size |
| --- | --- | --- | --- | --- |
| 8 | 2 | 4 | 2 genomes | Intimate 🥂 |
| 16 | 4 | 4 | 4 genomes | Cozy 🎵 |
| 32 | 2 | 16 | 2 genomes | Focused 🎯 |
| 64 | 8 | 8 | 8 genomes | Epic 🎆 |

Smart scaling means:

  • 🎊 Multiple genomes party together when you have the cores
  • 🎪 Fair sharing - everyone gets their turn on the dance floor
  • 🎨 Optimal thread allocation prevents bottlenecks
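The rows in the table are consistent with an even split of cores across genomes. The following sketch reproduces those four rows. It is an illustration under that assumption, not ParTEA's actual scheduling code, and both `allocate` and its handling of the case with more genomes than cores are hypothetical.

```python
def allocate(cores, genomes):
    """Return (threads_per_genome, parallel_jobs) for an even split of cores."""
    if cores < 1 or genomes < 1:
        raise ValueError("cores and genomes must be positive integers")
    # Each genome gets an equal share, but never fewer than one thread.
    threads = max(1, cores // genomes)
    # With more genomes than cores, only as many jobs as fit run at once.
    parallel = min(genomes, cores // threads)
    return threads, parallel
```

For example, `allocate(32, 2)` gives 16 threads per genome with both genomes running together, matching the third row. With 4 cores and 10 genomes the sketch falls back to one thread per genome and four jobs at a time.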

Requirements

To join the parTEA, you'll need:

  • EarlGrey ≥7.0.3 (with all dependencies) + configured with Dfam partitions
  • Snakemake ≥7.0,<8.0
  • Python ≥3.9,<3.11
  • cd-hit (for clustering)
  • Graphviz (optional, for DAG visualization)

⚠️ Critical: EarlGrey must be configured with Dfam library partitions before running ParTEA. See the Installation section for configuration instructions. The pipeline will check this automatically and guide you if configuration is missing.

Troubleshooting

Error: "EarlGrey RepeatMasker libraries not configured!"

This is the most common issue - ParTEA detected that EarlGrey is missing required Dfam partitions.

Solution:

The pipeline automatically generates a configuration script for you. Simply run:

chmod +x configure_dfam39.sh
./configure_dfam39.sh

Or follow the manual configuration steps in the Installation section above.

Why does this happen?

  • Fresh EarlGrey installations only include Dfam partition 0 (minimal database)
  • Full TE annotation requires partitions 0-16 for comprehensive coverage
  • The download is ~10GB and takes time, so it's not included by default

After configuring:

Re-run your ParTEA command and it will proceed normally:

earlGreyParTEA -c config.yaml -t 16

Error: "Config file required"

Make sure you specify the config file:

earlGreyParTEA -c config.yaml -t 16

Error: "Both RepeatMasker species and custom library specified"

Choose only ONE initial masking method in your config:

# Either
repeatmasker_species: "fungi"
custom_library: ""

# Or
repeatmasker_species: ""
custom_library: "/path/to/library.fa"

Error: "Pipeline mode 'annotate' requires 'annotation_library'"

For annotation-only mode, you must specify a TE library:

pipeline_mode: "annotate"
annotation_library: "/path/to/TE_library.fasta"
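The two config errors above come down to simple rules that can be checked before launching a long run. This is a minimal sketch that works on a config already loaded into a dictionary. `validate_config` is a hypothetical helper, and the species and threshold checks are assumptions drawn from the parameter descriptions, not a copy of ParTEA's own validation.

```python
def validate_config(cfg):
    """Return a list of human-readable problems found in a config mapping."""
    problems = []
    # Assumption: every listed species needs a path under 'genome'.
    genomes = cfg.get("genome") or {}
    for name in cfg.get("species") or []:
        if name not in genomes:
            problems.append(f"species '{name}' has no entry under 'genome'")
    # Only one initial masking method may be set.
    if cfg.get("repeatmasker_species") and cfg.get("custom_library"):
        problems.append("set only one of repeatmasker_species or custom_library")
    # Annotation-only mode needs a library to annotate with.
    if cfg.get("pipeline_mode") == "annotate" and not cfg.get("annotation_library"):
        problems.append("pipeline_mode 'annotate' requires annotation_library")
    # Assumption: both thresholds must lie in the documented 0.0-1.0 range.
    for key in ("clustering_identity", "clustering_coverage"):
        value = cfg.get(key)
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{key} must be between 0.0 and 1.0")
    return problems
```

An empty list means none of these rules were broken. The pipeline still runs its own checks and writes the config it accepted to validated_config.yaml.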

Pipeline stops early or has incomplete output

Try rerunning incomplete jobs:

earlGreyParTEA -c config.yaml -t 16 --rerun-incomplete

Snakemake directory locked after crash

Unlock the directory:

earlGreyParTEA -c config.yaml -t 16 --unlock

Error: "Script directory not found" or "TEstrainer module not found"

This means ParTEA couldn't auto-detect your EarlGrey installation, which usually happens with custom installations. First check that EarlGrey is installed:

# Check if EarlGrey is available
which earlGrey

# Check conda environment
conda list | grep earlgrey

If installed, manually specify the script directory in your config:

script_dir: "/path/to/earlgrey/scripts"

For conda installations, this is typically:

script_dir: "$CONDA_PREFIX/share/earlgrey-7.0.3-0/scripts"  # Adjust version
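If you are unsure which versioned directory exists in your environment, you can list the candidates. The sketch below assumes the conda layout shown above (`share/earlgrey-<version>-<build>/scripts` under the environment prefix); `find_script_dirs` is a hypothetical helper, not part of ParTEA or EarlGrey.

```python
import re
from pathlib import Path


def find_script_dirs(conda_prefix):
    """Return (version_tuple, scripts_path) pairs, newest version first."""
    candidates = []
    for d in Path(conda_prefix, "share").glob("earlgrey-*"):
        # Directory names are assumed to look like earlgrey-7.0.3-0
        match = re.match(r"earlgrey-(\d+(?:\.\d+)*)", d.name)
        scripts = d / "scripts"
        if match and scripts.is_dir():
            version = tuple(int(x) for x in match.group(1).split("."))
            candidates.append((version, scripts))
    return sorted(candidates, reverse=True)
```

Call it with the value of `$CONDA_PREFIX` and copy the first path into `script_dir`. Writing the absolute path avoids depending on `$CONDA_PREFIX` being expanded inside the YAML file.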

DAG visualization not generated

Install graphviz:

mamba install graphviz

Or disable DAG generation in config:

generate_dag: false

Additional Documentation

For technical details, packaging information, and implementation documentation, see the docs/ directory.

Found ParTEA useful? Give us a ⭐ on GitHub!

License

This project is distributed under the same license as EarlGrey. See the LICENSE file for details.
