Because analyzing transposable elements across genomes should be a parTEA, not a chore.
ParTEA (Pangenome Transposable Element Analysis) is a Snakemake-based pipeline that brings the party to multi-genome TE annotation! It extends EarlGrey to process multiple genomes in parallel, build pangenome TE libraries, and perform comparative transposable element analysis across species.
Why ParTEA?
- 🎊 Parallel Processing: Analyze multiple genomes simultaneously
- 🧬 Pangenome Libraries: Build consensus TE libraries across species
- 🔄 Optional Clustering: Merge TE libraries using cd-hit for consistent and traceable naming in annotations across all genomes
- 📊 Rich Outputs: Get annotations, divergence metrics, and visualizations
- ⚡ Dynamic Threading: Automatically optimizes resource allocation
- 🔍 Version-Agnostic: Works with any EarlGrey version ≥7.0.3
If you use ParTEA in your research, please cite:
Baril, T., Galbraith, J. and Hayward, A., 2024. Earl Grey: a fully automated user-friendly transposable element annotation and analysis pipeline. Molecular Biology and Evolution, 41(4), p.msae068.
ParTEA manuscript in preparation.
- What is ParTEA?
- Citation
- Pipeline Overview
- Installation
- Quick Start
- Pipeline Modes
- Command-Line Options
- Configuration Parameters
- Output Structure
- Example Workflows
- Dynamic Resource Allocation
- Requirements
- Troubleshooting
- Additional Documentation
- Support & Contributing
- License
ParTEA orchestrates TE analysis across multiple genomes with smart parallelization. Here's the workflow:
                ┌─────────────────────────────────────┐
                │  Input: Multiple Genome FASTAs      │
                └──────────────┬──────────────────────┘
                               │
                ┌──────────────▼──────────────────────┐
                │  prep_genome (per genome)           │
                │  • Format validation                │
                │  • Dictionary creation              │
                └──────────────┬──────────────────────┘
                               │
      ┌────────────────────────┼────────────────────────┐
      │                        │                        │
┌─────▼──────┐           ┌─────▼──────┐           ┌─────▼──────┐
│  build_db  │           │  build_db  │           │  build_db  │
│ (genome1)  │           │ (genome2)  │    ...    │ (genomeN)  │
└─────┬──────┘           └─────┬──────┘           └─────┬──────┘
      │                        │                        │
┌─────▼──────┐           ┌─────▼──────┐           ┌─────▼──────┐
│RepeatModel │           │RepeatModel │           │RepeatModel │
│ (genome1)  │           │ (genome2)  │    ...    │ (genomeN)  │
└─────┬──────┘           └─────┬──────┘           └─────┬──────┘
      │                        │                        │
┌─────▼──────┐           ┌─────▼──────┐           ┌─────▼──────┐
│TEstrainer  │           │TEstrainer  │           │TEstrainer  │
│ (genome1)  │           │ (genome2)  │    ...    │ (genomeN)  │
└─────┬──────┘           └─────┬──────┘           └─────┬──────┘
      │                        │                        │
      └────────────────────────┼────────────────────────┘
                               │
                ┌──────────────▼──────────────────────┐
                │  cluster_all_species                │
                │  • Combine all TE libraries         │
                │  • Optional: cd-hit clustering      │
                │  • Add RepeatMasker/custom lib      │
                └──────────────┬──────────────────────┘
                               │
      ┌────────────────────────┼────────────────────────┐
      │                        │                        │
┌─────▼──────┐           ┌─────▼──────┐           ┌─────▼──────┐
│RepeatMasker│           │RepeatMasker│           │RepeatMasker│
│ (genome1)  │           │ (genome2)  │    ...    │ (genomeN)  │
└─────┬──────┘           └─────┬──────┘           └─────┬──────┘
      │                        │                        │
┌─────▼──────┐           ┌─────▼──────┐           ┌─────▼──────┐
│merge_repea │           │merge_repea │           │merge_repea │
│ (genome1)  │           │ (genome2)  │    ...    │ (genomeN)  │
└─────┬──────┘           └─────┬──────┘           └─────┬──────┘
      │                        │                        │
┌─────▼──────┐           ┌─────▼──────┐           ┌─────▼──────┐
│ divergence │           │ divergence │           │ divergence │
│ charts etc │           │ charts etc │    ...    │ charts etc │
└────────────┘           └────────────┘           └────────────┘
- 🎨 HELIANO detection (`run_heliano: true/false`) - Helitron-specific detection
- 🔄 Clustering (`skip_clustering: true/false`) - Merge similar TEs across genomes
- 🎭 Initial masking (`repeatmasker_species` or `custom_library`) - Pre-mask known repeats
- 📊 DAG visualization (`generate_dag: true/false`) - Generate workflow graphs
📈 See detailed workflow visualization: Example Rulegraph
If you already have EarlGrey installed in a conda environment:
# Activate your existing EarlGrey environment
mamba activate earlgrey # or whatever your environment is named
# Install ParTEA into the same environment
mamba install -c conda-forge -c bioconda earlgrey-partea
# Make sure everything's ready to party
earlGreyParTEA --help
For a fresh installation with both EarlGrey and ParTEA:
# Create a new environment with both packages
mamba create -n partea -c conda-forge -c bioconda earlgrey-partea
# Activate the environment
mamba activate partea
# Make sure everything's ready to party
earlGreyParTEA --help
✨ Magic Feature: ParTEA automatically detects your EarlGrey installation (any version ≥7.0.3) and adapts on the fly. Update EarlGrey anytime - no config changes needed!
git clone https://github.com/TobyBaril/EarlGreyParTEA.git
cd EarlGreyParTEA
chmod +x earlGreyParTEA*
export PATH="$PWD:$PATH"
ParTEA requires EarlGrey to be properly configured with Dfam library partitions. The pipeline will check for this and fail with helpful instructions if not configured.
After installing EarlGrey (via conda or manually), you must download additional Dfam partitions:
# Activate your environment
mamba activate partea # or your environment name
# Check your RepeatMasker library location
which RepeatMasker
# Download Dfam partitions (this may take a while)
# The pipeline will generate a configuration script for you if this step is missing
What happens if you skip this?
ParTEA will detect the missing configuration and:
- ✋ Stop the pipeline before wasting compute time
- 📝 Generate a configuration script: `configure_dfam39.sh`
- 📋 Provide clear instructions to fix the issue
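To see in advance whether this check is likely to pass, something like the following can be run by hand. It is a minimal sketch, not ParTEA's actual implementation: the function name `check_dfam` is hypothetical, and it assumes the check amounts to finding all 17 decompressed `dfam39_full.N.h5` partitions (0-16) plus the `.earlgrey.config.complete` marker in the famdb directory.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (illustrative; ParTEA's own check may differ).
# Usage: check_dfam "$CONDA_PREFIX/share/RepeatMasker/Libraries/famdb"
check_dfam() {
    local famdb_dir="$1"
    local expected=17   # assumption: partitions 0-16 are all required
    local found=0
    local f

    if [ ! -d "$famdb_dir" ]; then
        echo "famdb directory not found: $famdb_dir" >&2
        return 2
    fi

    # Count decompressed partition files (dfam39_full.<N>.h5).
    for f in "$famdb_dir"/dfam39_full.*.h5; do
        if [ -e "$f" ]; then
            found=$((found + 1))
        fi
    done

    if [ "$found" -lt "$expected" ]; then
        echo "Only $found of $expected Dfam partitions present" >&2
        return 1
    fi
    if [ ! -e "$famdb_dir/.earlgrey.config.complete" ]; then
        echo "Marker file .earlgrey.config.complete is missing" >&2
        return 1
    fi
    echo "Dfam looks configured ($found partitions)"
}
```

A non-zero exit status means the partitions or the marker file still need to be set up.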
To configure manually:
When EarlGrey is first installed, only Dfam partition 0 is included. For comprehensive TE annotation, download partitions 0-16:
# Navigate to your RepeatMasker famdb directory
cd $CONDA_PREFIX/share/RepeatMasker/Libraries/famdb/
# Download all partitions (0-16)
curl -o 'dfam39_full.#1.h5.gz' 'https://dfam.org/releases/current/families/FamDB/dfam39_full.[0-16].h5.gz'
# Decompress
gunzip *.gz
# Reconfigure RepeatMasker
cd $CONDA_PREFIX/share/RepeatMasker/
perl ./configure \
-libdir $CONDA_PREFIX/share/RepeatMasker/Libraries \
-trf_prgm $CONDA_PREFIX/bin/trf \
-rmblast_dir $CONDA_PREFIX/bin \
-hmmer_dir $CONDA_PREFIX/bin \
-default_search_engine rmblast
# Mark configuration as complete
touch $CONDA_PREFIX/share/RepeatMasker/Libraries/famdb/.earlgrey.config.complete
Verification:
# The pipeline will automatically check this on startup
# You can also verify manually:
ls -lh $CONDA_PREFIX/share/RepeatMasker/Libraries/famdb/
# Should see multiple dfam39_full.*.h5 files (not just partition 0)
# Should see .earlgrey.config.complete marker file
1️⃣ Generate a config file
earlGreyParTEA --generate-config my_config.yaml
2️⃣ Add your genomes (the more, the merrier!)
genome:
  species1: /path/to/genome1.fasta
  species2: /path/to/genome2.fasta
  species3: /path/to/genome3.fasta
species:
- species1
- species2
- species3
output_dir: /path/to/output
3️⃣ Let the parTEA begin!
earlGreyParTEA -c my_config.yaml -t 16
ParTEA offers three ways to party - choose your adventure!
The complete parTEA experience!
Runs the full celebration: library construction → clustering → annotation
earlGreyParTEA -c config.yaml -t 16
What you get:
- 🧬 Pangenome TE library (clustered across all genomes)
- 📍 TE annotations for each genome (BED, GFF)
- 📊 Divergence analysis and landscape plots
- 📈 Summary charts and statistics
- 🎨 Workflow visualizations
Perfect for: Complete comparative TE analysis across multiple species
Build the guest list!
Creates a pangenome TE library without annotation.
earlGreyParTEA_LibConstruct -c config.yaml -t 16
What you get:
- 📚 `{output_dir}/combinedLibraries/combined_all_species.clstrd.fa`
Perfect for: Building a curated TE library to annotate other genomes later
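Conceptually, library construction ends by concatenating the per-genome libraries and, unless clustering is skipped, collapsing similar sequences with cd-hit. The sketch below illustrates that idea only; it is not ParTEA's actual rule. The helper names are hypothetical, and mapping `clustering_identity` to cd-hit-est's `-c` and `clustering_coverage` to `-aS` is an assumption.

```bash
#!/usr/bin/env bash
# Illustrative only: ParTEA's real clustering step may use different options.

# Concatenate per-genome TE libraries into one unclustered FASTA.
# Usage: combine_libraries OUT.fa LIB1.fa LIB2.fa ...
combine_libraries() {
    local out="$1"
    shift
    cat "$@" > "$out"
}

# Print a cd-hit-est command line for the given thresholds.
#   -c   sequence identity threshold
#   -aS  alignment coverage of the shorter sequence
#   -d 0 keep sequence names up to the first space in the .clstr file
#   -T   number of threads
cdhit_command() {
    local identity="$1" coverage="$2" in_fa="$3" out_fa="$4" threads="$5"
    printf 'cd-hit-est -i %s -o %s -c %s -aS %s -d 0 -T %s\n' \
        "$in_fa" "$out_fa" "$identity" "$coverage" "$threads"
}

# Example (paths are placeholders):
#   combine_libraries combined_all_species.nonclstrd.fa sp1.fa sp2.fa
#   cdhit_command 0.8 0.8 combined_all_species.nonclstrd.fa \
#       combined_all_species.clstrd.fa 16
```

Printing the command rather than running it makes it easy to inspect the thresholds before launching a long clustering job.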
Use an existing playlist!
Annotates genomes using a pre-made TE library (bring your own TEs).
earlGreyParTEA_AnnotationOnly -c config.yaml -t 16
Requirements:
- Must specify `annotation_library` in config.yaml
- Library should be in FASTA format
Use case: Annotate multiple genomes with a curated TE library from a previous run or external source.
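Since annotation-only mode depends on `annotation_library` pointing at a FASTA file, a quick sanity check before launching can save a failed run. The helper below is an illustrative sketch and not part of ParTEA; `is_fasta` is a hypothetical name, and it only checks that the file is non-empty and that its first non-blank line is a FASTA header.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the file looks like FASTA.
# Usage: is_fasta /path/to/TE_library.fasta && echo "looks OK"
is_fasta() {
    local lib="$1"

    # File must exist and be non-empty.
    [ -s "$lib" ] || return 1

    # The first non-blank line must start with '>'.
    awk 'NF { ok = ($0 ~ /^>/); exit } END { exit !ok }' "$lib"
}
```

This is a shallow check: it does not validate sequence characters or TE classification labels in the headers.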
| Option | Short | Description |
|---|---|---|
| `--config FILE` | `-c` | Config file (required) |
| `--threads INT` | `-t` | Number of threads (required) |
| `--memory INT` | `-m` | Max memory in MB (optional) |
| `--dry-run` | `-n` | Show what would run without executing |
| `--generate-config FILE` | - | Generate example config template |
| `--unlock` | - | Unlock directory after crash |
| `--rerun-incomplete` | - | Rerun incomplete jobs |
| `--help` | `-h` | Show help message |
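One way to combine these options is to validate the plan with `--dry-run` before committing compute time. The wrapper below is a hypothetical convenience sketch: `run_partea` and the `PARTEA_BIN` override are illustrative names, not part of ParTEA, and it uses only the `-c`, `-t`, and `--dry-run` options from the table.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: dry-run first, launch the real run only if that succeeds.
# Usage: run_partea config.yaml 16
run_partea() {
    local config="$1" threads="$2"
    # PARTEA_BIN is an illustrative override, handy for testing with a stub.
    local runner="${PARTEA_BIN:-earlGreyParTEA}"

    if ! "$runner" -c "$config" -t "$threads" --dry-run; then
        echo "Dry run failed; fix the config before launching" >&2
        return 1
    fi

    "$runner" -c "$config" -t "$threads"
}
```

If the dry run reports a problem, the full run is never started.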
genome: # Dictionary of genome paths
  species1: /path/to/genome1.fasta
species: [species1] # List of species to analyze
output_dir: /path/to/out # Output directory
Note: The EarlGrey script_dir parameter is automatically detected and does not need to be specified in your config file. ParTEA will find the correct EarlGrey installation regardless of version (7.x, 8.x, etc.). Only set script_dir manually if you have a custom installation location.
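If you do need to set script_dir by hand, the directory can usually be found by listing the versioned EarlGrey folders in the environment. The sketch below assumes the conda layout `share/earlgrey-<version>/scripts` under the environment prefix; `find_earlgrey_scripts` is a hypothetical helper, not ParTEA's auto-detection code, and custom installations may be laid out differently.

```bash
#!/usr/bin/env bash
# List candidate EarlGrey script directories under a conda prefix.
# Assumes the share/earlgrey-<version>/scripts layout.
# Usage: find_earlgrey_scripts "$CONDA_PREFIX"
find_earlgrey_scripts() {
    local prefix="$1"
    local d
    for d in "$prefix"/share/earlgrey-*/scripts; do
        # An unmatched glob stays literal, so confirm the directory exists.
        if [ -d "$d" ]; then
            echo "$d"
        fi
    done
}
```

If more than one path is printed, several EarlGrey versions are present in the environment; pick the one you intend to use.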
iterations: 10 # BLAST-extend-align cycles
flank: 1000 # Flanking basepairs to extract
max_consensus_seqs: 20 # Max sequences for consensus
min_consensus_seqs: 3 # Min sequences for consensus
Choose ONE or leave both empty:
repeatmasker_species: "fungi" # Use RepeatMasker database
# OR
custom_library: "/path/to/lib.fa" # Use custom library
skip_clustering: false # Set true to skip clustering
clustering_identity: 0.8 # cd-hit identity threshold (0.0-1.0)
clustering_coverage: 0.8 # cd-hit coverage threshold (0.0-1.0)
softmask: false # Generate softmasked genomes
margin: false # Remove short TEs (<100bp)
run_heliano: true # Run HELIANO for Helitron detection
generate_dag: true # Generate workflow DAG graphs
dag_format: "svg" # Format: svg, png, or pdf
output_dir/
├── combinedLibraries/
│ ├── combined_all_species.clstrd.fa # Pangenome TE library
│ └── combined_all_species.nonclstrd.fa # Unclustered library
│
├── species1_EarlGrey/
│ ├── species1_Database/ # RepeatModeler database
│ ├── species1_RepeatModeler/ # RepeatModeler working files
│ ├── species1_strainer/ # TEstrainer output
│ ├── species1_RepeatMasker_Against_Custom_Library/
│ ├── species1_mergedRepeats/ # Merged annotations
│ └── species1_summaryFiles/ # Final outputs
│ ├── species1.filteredRepeats.bed
│ ├── species1.filteredRepeats.gff
│ ├── species1.highLevelCount.txt
│ ├── species1.summaryPie.pdf
│ ├── species1_divergence_summary_table.tsv
│ └── species1.softmasked.fasta (if enabled)
│
├── species2_EarlGrey/
│ └── ...
│
├── workflow_visualization/
│ ├── dag_full_mode.svg # Workflow DAG visualization
│ └── dag_full_mode_rulegraph.svg # Simplified rule graph
│
└── validated_config.yaml # Config used for run
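For downstream comparisons it is often handy to gather the per-species annotation files in one pass. The following is a small sketch that relies only on the directory layout shown above; `list_annotations` is a hypothetical helper and not shipped with ParTEA.

```bash
#!/usr/bin/env bash
# Print "<species><TAB><path to filtered GFF>" for every species directory
# found under the given output directory.
# Usage: list_annotations /path/to/output
list_annotations() {
    local output_dir="$1"
    local dir species gff
    for dir in "$output_dir"/*_EarlGrey; do
        [ -d "$dir" ] || continue
        # species1_EarlGrey -> species1
        species=$(basename "$dir" _EarlGrey)
        gff="$dir/${species}_summaryFiles/${species}.filteredRepeats.gff"
        if [ -f "$gff" ]; then
            printf '%s\t%s\n' "$species" "$gff"
        else
            echo "warning: no filtered GFF found for $species" >&2
        fi
    done
}
```

Species whose run has not finished are reported on stderr, so the tab-separated list on stdout stays clean for piping into other tools.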
# Generate config
earlGreyParTEA --generate-config analysis.yaml
# Edit config with genome paths
# Then run
earlGreyParTEA -c analysis.yaml -t 32 -m 128000
# Generate config for library construction
earlGreyParTEA_LibConstruct --generate-config build_lib.yaml
# Edit config, then build library
earlGreyParTEA_LibConstruct -c build_lib.yaml -t 16
# Output: build_lib_output/combinedLibraries/combined_all_species.clstrd.fa
# Generate config for annotation
earlGreyParTEA_AnnotationOnly --generate-config annotate.yaml
# Edit config and set annotation_library parameter
# annotation_library: "/path/to/combined_all_species.clstrd.fa"
# Run annotation
earlGreyParTEA_AnnotationOnly -c annotate.yaml -t 16
earlGreyParTEA -c config.yaml -t 16 --dry-run
ParTEA knows how to share! The pipeline automatically distributes computing power across your genomes:
| Cores | Genomes | Threads/Genome | Parallel Jobs | Party Size |
|---|---|---|---|---|
| 8 | 2 | 4 | 2 genomes | Intimate 🥂 |
| 16 | 4 | 4 | 4 genomes | Cozy 🎵 |
| 32 | 2 | 16 | 2 genomes | Focused 🎯 |
| 64 | 8 | 8 | 8 genomes | Epic 🎆 |
Smart scaling means:
- 🎊 Multiple genomes party together when you have the cores
- 🎪 Fair sharing - everyone gets their turn on the dance floor
- 🎨 Optimal thread allocation prevents bottlenecks
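The rows in the table are consistent with a simple even split: threads per genome equals total cores divided by the number of genomes, rounded down, with one job per genome. The sketch below reproduces the table under that assumption. ParTEA's actual rule may be more nuanced, for example when genomes outnumber cores, and `threads_per_genome` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Even split of cores across genomes (an assumption inferred from the table).
# Usage: threads_per_genome CORES GENOMES
threads_per_genome() {
    local cores="$1" genomes="$2"
    if [ "$genomes" -lt 1 ]; then
        echo "need at least one genome" >&2
        return 1
    fi
    local t=$(( cores / genomes ))
    # Never go below one thread per job.
    if [ "$t" -lt 1 ]; then
        t=1
    fi
    echo "$t"
}

# Reproduce the table rows.
for pair in "8 2" "16 4" "32 2" "64 8"; do
    read -r cores genomes <<< "$pair"
    echo "$cores cores, $genomes genomes -> $(threads_per_genome "$cores" "$genomes") threads/genome"
done
```

Under this split, adding cores without adding genomes simply gives each genome's tools more threads.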
To join the parTEA, you'll need:
- EarlGrey ≥7.0.3 (with all dependencies) + configured with Dfam partitions
- Snakemake ≥7.0,<8.0
- Python ≥3.9,<3.11
- cd-hit (for clustering)
- Graphviz (optional, for DAG visualization)
This is the most common issue - ParTEA detected that EarlGrey is missing required Dfam partitions.
Solution:
The pipeline automatically generates a configuration script for you. Simply run:
chmod +x configure_dfam39.sh
./configure_dfam39.sh
Or follow the manual configuration steps in the Installation section above.
Why does this happen?
- Fresh EarlGrey installations only include Dfam partition 0 (minimal database)
- Full TE annotation requires partitions 0-16 for comprehensive coverage
- The download is ~10GB and takes time, so it's not included by default
After configuring:
Re-run your ParTEA command and it will proceed normally:
earlGreyParTEA -c config.yaml -t 16
Make sure you specify the config file:
earlGreyParTEA -c config.yaml -t 16
Choose only ONE initial masking method in your config:
# Either
repeatmasker_species: "fungi"
custom_library: ""
# Or
repeatmasker_species: ""
custom_library: "/path/to/library.fa"
For annotation-only mode, you must specify a TE library:
pipeline_mode: "annotate"
annotation_library: "/path/to/TE_library.fasta"
Try rerunning incomplete jobs:
earlGreyParTEA -c config.yaml -t 16 --rerun-incomplete
Unlock the directory:
earlGreyParTEA -c config.yaml -t 16 --unlock
This means ParTEA couldn't auto-detect your EarlGrey installation, which usually happens with custom installations. Check that EarlGrey is installed:
# Check if EarlGrey is available
which earlGrey
# Check conda environment
conda list | grep earlgrey
If installed, manually specify the script directory in your config:
script_dir: "/path/to/earlgrey/scripts"
For conda installations, this is typically:
script_dir: "$CONDA_PREFIX/share/earlgrey-7.0.3-0/scripts" # Adjust version
Install graphviz:
mamba install graphviz
Or disable DAG generation in config:
generate_dag: false
For technical details, packaging information, and implementation documentation, see the docs/ directory:
- Packaging Guide - Conda/mamba package creation and setup
- Version Compatibility - How version-agnostic detection works
- Auto-Detection - Implementation details for script_dir auto-detection
- DAG Visualization - Understanding workflow visualizations
Support & Contributing
- 🐛 Bug Reports: Open an issue
- 💡 Feature Requests: Suggest a feature
- 📧 Email: tobias.baril[at]unine.ch
- 📚 Documentation: Technical docs
Found ParTEA useful? Give us a ⭐ on GitHub!
- Documentation: https://github.com/TobyBaril/EarlGreyParTEA
This project is distributed under the same license as EarlGrey. See the LICENSE file for details.
