
Sylvan Tutorial: Running with Toy Data

This tutorial walks you through running Sylvan on the included toy dataset (A. thaliana chromosome 4). The commands and parameter choices shown below are specific to this experiment; when annotating your own genome, adjust the configuration files to match your organism, data, and cluster environment. For a description of the pipeline's architecture and available modules, see the README.

Prerequisites

  • Linux system with SLURM
  • Singularity 3.x+
  • Conda/Mamba
  • Git LFS

Host Dependencies

Most bioinformatics tools run inside the Singularity container. The host environment needs:

Package Purpose
Python 3.10+ Pipeline orchestration and filter scripts
Snakemake 7 Workflow engine
pandas Data manipulation in Snakemake and filter scripts
scikit-learn Random forest classifier (Filter.py, score_filter.py)
NumPy Numerical operations
PyYAML Config file parsing
rich Logging format (optional)

Perl (fillingEndsOfGeneModels.pl) and R (filter_distributions.R) scripts run inside the container and do not need host installation.

Step 1: Environment Setup

# Create conda environment
conda create -n sylvan -c conda-forge -c bioconda python=3.11 snakemake=7 -y
conda activate sylvan

# Install Git LFS
git lfs install
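
The host-dependency table above also lists pandas, scikit-learn, NumPy, PyYAML, and rich. If they are not already present on the host, one way to add them to the same environment (a sketch; versions left unpinned) is:

# Install the remaining host-side Python packages into the sylvan environment
conda install -n sylvan -c conda-forge pandas scikit-learn numpy pyyaml rich -y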

Step 2: Download Containers

Sylvan Container

singularity pull --arch amd64 sylvan.sif library://wyim/sylvan/sylvan:latest

EDTA Container (for repeat library)

export SINGULARITY_CACHEDIR=$PWD
singularity pull EDTA.sif docker://quay.io/biocontainers/edta:2.2.0--hdfd78af_1

Clone Repository

git clone https://github.com/plantgenomicslab/Sylvan.git
cd Sylvan

# Verify LFS files are downloaded (not pointers)
git lfs pull
ls -la toydata/  # Files should be > 200 bytes
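
If you want to confirm that no LFS pointer stubs remain (pointer files are small text files containing the LFS spec URL), a quick check is the sketch below:

# List any files under toydata/ that are still Git LFS pointer stubs
grep -rIl "git-lfs.github.com/spec" toydata/ && echo "Pointers found: run 'git lfs pull'"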

Step 3: Prepare Repeat Library (EDTA)

Note: The toy data includes a pre-computed repeat library. This step is for reference or if you need to regenerate it.

sbatch -c 16 --mem=68g --wrap="singularity exec --cleanenv --env PYTHONNOUSERSITE=1 \
  EDTA.sif EDTA.pl \
  --genome toydata/genome_input/genome.fasta \
  --cds toydata/cds_aa/neighbor.cds \
  --anno 1 --threads 16 --force 1"

Expected runtime: ~1.5 hours for 18.6 Mb genome

Output: genome.fasta.mod.EDTA.TElib.fa

EDTA Benchmark (Toy Data)

Stage Duration
LTR detection ~3 min
SINE detection ~6 min
LINE detection ~45 min
TIR detection ~8 min
Helitron detection ~10 min
Filtering & annotation ~8 min
Total ~1.5 hours

Experiment Setup (Toy Data)

The toy data experiment uses the following configuration choices. These are pre-set in toydata/config/config_annotate.yml and toydata/config/config_filter.yml:

Parameter Choice for this experiment Alternatives
Genome A. thaliana Chr4, 18.6 Mb, 3 segments Any FASTA assembly
Repeat masking RepeatMasker with Embryophyta library + custom EDTA library Any RepeatMasker species; RepeatModeler for de novo
RNA-seq 12 paired-end samples (leaf, rosette, whole plant) Any number of paired-end samples
RNA-seq alignment STAR pathway with StringTie + PsiCLASS HiSat2 is also available
Protein homology Combined neighbor-species proteins (neighbor.aa) UniProt, OrthoDB, or any protein FASTA
Neighbor species 3 species: A. lyrata, C. rubella, C. hirsuta One or more annotated relatives
Ab initio (Helixer) land_plant model, subsequence length 64152 vertebrate, invertebrate, fungi
Ab initio (Augustus) Start from arabidopsis model, train on target data Any existing Augustus species; or skip training
EVM weights PASA=10, Augustus=7, GETA=5, Helixer=3 Adjust to evidence quality
Filter cutoffs TPM=3, coverage=0.5, BLAST identity=0.6 Adjust per organism
BUSCO lineage eudicots_odb10 Any BUSCO lineage

These choices are experiment-specific. The pipeline supports all the alternatives listed above — see the README for the full set of configurable modules.
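
As a rough illustration of where such choices live, a few of the configuration keys documented elsewhere on this page look something like the hypothetical excerpt below; treat the shipped toydata/config/config_annotate.yml as the authoritative schema.

# Hypothetical excerpt from config_annotate.yml (illustrative values only)
rna_seq: toydata/RNASeq              # directory containing paired-end FASTQ files
augustus_start_from: arabidopsis     # existing Augustus model to seed training
num_evm_files: 126                   # number of parallel EVM partitions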


Step 4: Run Annotation Pipeline

Note: The commands below use the pre-configured toy data settings. For your own data, replace toydata/config/config_annotate.yml with your config file path.

Environment Variables

The pipeline uses several environment variables for configuration:

Variable Description Example
SYLVAN_CONFIG Path to pipeline config file (also used as --cluster-config for SLURM settings) toydata/config/config_annotate.yml
SYLVAN_FILTER_CONFIG Path to filter config file toydata/config/config_filter.yml
SYLVAN_RESULTS_DIR Override results output directory (default: $(pwd)/results/). Useful on HPC systems with separate storage. /scratch/$USER/sylvan_results
TMPDIR Temporary directory for intermediate files $(pwd)/results/TMP
SLURM_TMPDIR SLURM job temporary directory (should match TMPDIR) $TMPDIR

Why set TMPDIR?

Setting TMPDIR explicitly is critical on many HPC systems:

  1. Memory-backed /tmp (tmpfs): Some HPC nodes have no local disk storage. The default /tmp is mounted as tmpfs, which stores files in RAM. Large temporary files from tools like STAR, RepeatMasker, or Augustus can quickly exhaust memory and crash your jobs with cryptic "out of memory", "no space left on device", or even segmentation fault errors, since the OS may kill processes or fail allocations once tmpfs fills up.

  2. Quota limits: Shared /tmp partitions often have strict per-user quotas (e.g., 1-10 GB). Genome annotation tools easily exceed this.

  3. Job isolation: When TMPDIR points to your project directory, temp files persist after job completion for debugging. Cleanup is also straightforward with rm -rf results/TMP/*.

  4. Singularity compatibility: Containers inherit TMPDIR from the host. Setting it to a bound path ensures temp files are written to accessible storage.

Tip: Check if your cluster uses tmpfs with df -h /tmp. If it shows tmpfs as the filesystem type, you must set TMPDIR to avoid memory issues.
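
A small sketch that automates this check and falls back to the project-local TMP directory used later in this tutorial:

# Redirect TMPDIR to project storage when /tmp is memory-backed (tmpfs)
if df --output=fstype /tmp | tail -n 1 | grep -q tmpfs; then
  mkdir -p "$(pwd)/results/TMP"
  export TMPDIR="$(pwd)/results/TMP"
  export SLURM_TMPDIR="$TMPDIR"
  echo "tmpfs detected on /tmp; TMPDIR set to $TMPDIR"
fi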

Complete Environment Setup

Copy this block before running the pipeline:

# Required: Pipeline configuration
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"

# Required: Temp directory (create if not exists)
mkdir -p results/TMP
export TMPDIR="$(pwd)/results/TMP"
export SLURM_TMPDIR="$TMPDIR"

# Optional: Bind additional paths for Singularity
# export SINGULARITY_BIND="/scratch,/data"

# Optional: Increase open file limit (some tools need this)
ulimit -n 65535 2>/dev/null || true

Additional Singularity Variables

Variable Description When to Use
SINGULARITY_BIND Bind additional host paths into container When input files are outside working directory
SINGULARITY_CACHEDIR Location for Singularity cache When home directory has quota limits
SINGULARITY_TMPDIR Singularity's internal temp directory Should match TMPDIR

When do you need SINGULARITY_BIND?

Singularity automatically binds your current working directory, $HOME, and /tmp. However, you need explicit binding when:

  • Input files (genome, RNA-seq, proteins) are on a separate filesystem (e.g., /scratch, /project, /data)
  • Your home directory has quota limits and you store data elsewhere
  • Using shared lab storage mounted at non-standard paths
  • Config file references absolute paths outside the working directory
# Common scenarios:

# 1. Data on scratch space
export SINGULARITY_BIND="/scratch/$USER"

# 2. Multiple paths (comma-separated)
export SINGULARITY_BIND="/scratch,/project/shared_data,/data/genomes"

# 3. Bind with different container path (host:container)
export SINGULARITY_BIND="/long/path/on/host:/data"

# 4. Read-only binding (for shared reference data)
export SINGULARITY_BIND="/shared/databases:/databases:ro"

Diagnosing bind issues:

# Error: "file not found" inside container but exists on host
# → The path isn't bound. Add it to SINGULARITY_BIND

# Test if path is accessible inside container:
singularity exec sylvan.sif ls /your/data/path

# See what's currently bound:
singularity exec sylvan.sif cat /proc/mounts | grep -E "scratch|project|data"

Verifying Your Environment

Before submitting jobs, verify your setup:

# Check TMPDIR is on real disk (not tmpfs)
df -h $TMPDIR

# Verify Singularity can access paths
singularity exec sylvan.sif ls $TMPDIR

# Test config file is readable
cat $SYLVAN_CONFIG | head -5

Debugging Environment Issues

When jobs fail unexpectedly, environment variables are often the culprit. Use these techniques to diagnose:

1. Print all relevant variables:

# Add to your script or run interactively
echo "=== Environment Check ==="
echo "SYLVAN_CONFIG: $SYLVAN_CONFIG"
echo "TMPDIR: $TMPDIR"
echo "SLURM_TMPDIR: $SLURM_TMPDIR"
echo "SINGULARITY_BIND: $SINGULARITY_BIND"
echo "PWD: $PWD"
df -h $TMPDIR

2. Check what SLURM jobs actually see:

# Submit a diagnostic job
sbatch --wrap='env | grep -E "TMPDIR|SINGULARITY|SYLVAN" && df -h /tmp $TMPDIR'

3. Common environment-related errors:

Error Message Likely Cause Solution
No space left on device TMPDIR on tmpfs or quota exceeded Set TMPDIR to project storage
Segmentation fault Memory exhausted (tmpfs full) Set TMPDIR to disk-backed storage
file not found (in container) Path not bound in Singularity Add path to SINGULARITY_BIND
Permission denied Singularity can't write to TMPDIR Check directory permissions, ensure path is bound
cannot create temp file TMPDIR doesn't exist or not writable Run mkdir -p $TMPDIR && touch $TMPDIR/test

4. Interactive debugging inside container:

# Start interactive shell in container with same bindings
singularity shell --cleanenv sylvan.sif

# Inside container, verify paths exist
ls -la $TMPDIR
ls -la /path/to/your/data

5. Check if variables survive into SLURM jobs:

SLURM doesn't always pass environment variables. Ensure your submit script exports them:

# In your sbatch script or wrapper
#SBATCH --export=ALL    # Pass all environment variables

# Or explicitly export in the script:
export TMPDIR="$(pwd)/results/TMP"
export SLURM_TMPDIR="$TMPDIR"

Dry Run (Recommended)

Always do a dry run first to verify configuration:

export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
snakemake -n --snakefile bin/Snakefile_annotate

Submit to SLURM

sbatch -A [account] -p [partition] -c 1 --mem=1g \
  -J annotate -o annotate.out -e annotate.err \
  --wrap="./bin/annotate_toydata.sh"

Output locations

  • All intermediate/final results are written under the repo root results/.
  • RepeatMasker/RepeatModeler run inside results/GETA/RepeatMasker/..., so .RepeatMaskerCache and RM_* temp folders also stay there.
  • EVM commands and outputs live in results/EVM/; no EVM -> results/EVM symlink is needed.
  • For the filter pipeline, set RexDB to a RepeatExplorer protein DB (e.g., Viridiplantae_v4.0.fasta from https://github.com/repeatexplorer/rexdb). You can download directly via:
    wget -O toydata/misc/Viridiplantae_v4.0.fasta https://raw.githubusercontent.com/repeatexplorer/rexdb/refs/heads/main/Viridiplantae_v4.0.fasta

Expected Runtime (Toy Data)

Stage Time
RepeatMasking 15-30 min
RNA-seq alignment 30-60 min
Transcript assembly 20-40 min
Homology search 30-60 min
Augustus training 1-2 hours
Gene model combination 30-60 min
EvidenceModeler 30-60 min
Total 4-8 hours

Force Rerun

# Rerun all jobs
./bin/annotate_toydata.sh --forceall

# Rerun specific rule
./bin/annotate_toydata.sh --forcerun helixer

# Rerun incomplete jobs
./bin/rerun-incomplete_toydata.sh

Step 5: Run Filter Pipeline

Note: The filter cutoff thresholds and parameters below are configured for the toy data experiment. Adjust config_filter.yml for your organism.

After annotation completes:

# Set filter config
export SYLVAN_FILTER_CONFIG="toydata/config/config_filter.yml"

# Dry run
snakemake -n --snakefile bin/Snakefile_filter

# Submit to SLURM
sbatch -A [account] -p [partition] -c 1 --mem=4g \
  -J filter -o filter.out -e filter.err \
  --wrap="./bin/filter_toydata.sh"

Filter Pipeline Steps

The filter phase runs these steps in order:

  1. Extract sequences: CDS and peptide extraction from annotated GFF (gff3_file_to_proteins.pl)
  2. PfamScan: Identify conserved protein domains using Pfam-A HMMs
  3. RSEM: Quantify transcript expression (TPM) from re-aligned RNA-seq reads
  4. BLASTp: Homolog similarity (protein DB) and repeat similarity (RexDB) — parallelized across 20 split peptide files
  5. Ab initio coverage: Calculate overlap with Augustus, Helixer, and RepeatMasker predictions
  6. lncDC: XGBoost-based classification of long non-coding RNAs using plant-specific pre-trained models
  7. BUSCO: Identify conserved gene models (monitoring only, not used as a filter feature)
  8. Semi-supervised random forest: Iterative training with configurable thresholds

Filter Cutoff Thresholds

Thresholds in config_filter.yml under Cutoff control the initial heuristic gene selection:

Parameter Description Default
tpm TPM threshold 3
rsem_cov RNA-seq coverage 0.5
blast_pident / blast_qcovs BLASTp identity/coverage 0.6 / 0.6
rex_pident / rex_qcovs RexDB identity/coverage 0.6 / 0.6
helixer_cov / augustus_cov Ab initio overlap 0.8 / 0.8
repeat_cov Repeat overlap 0.5
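
For reference, these defaults correspond to a Cutoff block along the lines of the sketch below; confirm the exact nesting against the shipped toydata/config/config_filter.yml.

# Sketch of the Cutoff block (toy-data defaults from the table above)
Cutoff:
  tpm: 3
  rsem_cov: 0.5
  blast_pident: 0.6
  blast_qcovs: 0.6
  rex_pident: 0.6
  rex_qcovs: 0.6
  helixer_cov: 0.8
  augustus_cov: 0.8
  repeat_cov: 0.5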

Filter Random Forest Parameters

Set via Filter.py arguments (configured in Snakefile_filter):

Parameter Description Default
--trees Number of trees in the random forest 100
--predictors Number of predictors for tree splitting 5
--recycle Predicted accuracy required to add observation to next iteration 0.95
--max-iter Maximum re-training iterations 10
--seed Random seed for reproducibility 123

The filter typically converges within 3-5 iterations. Each iteration adds high-confidence predictions back into the training set and retrains. The process stops when no new observations exceed the --recycle threshold or --max-iter is reached.
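
For orientation, a hypothetical invocation with the defaults above would look roughly like the sketch below; the real command, including its input files, is defined in Snakefile_filter, and Filter.py is assumed to sit in bin/ like the other pipeline scripts.

# Hypothetical invocation (input arguments elided; see Snakefile_filter for the real rule)
python bin/Filter.py [inputs defined in Snakefile_filter] \
  --trees 100 --predictors 5 --recycle 0.95 --max-iter 10 --seed 123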

Important: chrom_regex

The chrom_regex field in config_filter.yml is required for proper chromosome identification. It must match your genome's chromosome naming convention:

# Common patterns:
chrom_regex: (^Chr)|(^chr)|(^LG)|(^Ch)|(^\d)

Output Files

File Description
results/FILTER/filter.gff3 Kept gene models
results/FILTER/discard.gff3 Discarded gene models
results/FILTER/data.tsv Feature matrix used by random forest
results/FILTER/keep_data.tsv Evidence data for kept genes
results/FILTER/discard_data.tsv Evidence data for discarded genes
results/FILTER/{prefix}.cdna Transcript sequences
results/FILTER/{prefix}.pep Peptide sequences
results/FILTER/lncrna_predict.csv lncDC predictions
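
A quick sanity check on the kept/discarded split is to count gene records (GFF3 column 3) in the two files, as sketched below:

# Count gene features in the kept and discarded sets
awk -F'\t' '$3 == "gene"' results/FILTER/filter.gff3 | wc -l
awk -F'\t' '$3 == "gene"' results/FILTER/discard.gff3 | wc -l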

Step 5b: Alternative Score-based Filter

An alternative scoring pipeline (Snakefile_filter_score) is available. Instead of the iterative semi-supervised approach, it uses logistic regression and random forest scoring with pseudo-labels derived from evidence data:

  • Positive pseudo-labels: genes with Pfam hits or homolog BLAST hits
  • Negative pseudo-labels: genes with RexDB (repeat) hits
  • Threshold selection: maximizes F1 score on pseudo-labels via precision-recall curve
export SYLVAN_FILTER_CONFIG="toydata/config/config_filter.yml"
./bin/filter_score_toydata.sh

Output:

  • results/FILTER/scores.csv — Per-gene scores, features, and predictions
  • results/FILTER/scores.metrics.txt — AUC, precision-recall, F1, and chosen thresholds

This approach is simpler and faster but may be less accurate than the semi-supervised method for organisms with limited prior annotation data.

Step 6: Format Output (TidyGFF)

singularity exec sylvan.sif python bin/TidyGFF.py \
  Ath4 results/FILTER/filter.gff3 \
  --out Ath4_v1.0 \
  --splice-name t \
  --justify 5 \
  --sort \
  --chrom-regex "^Chr" \
  --source Sylvan

Step 7: Cleanup Intermediate Files

After both annotation and filter phases have completed successfully, run the cleanup script to remove intermediate files:

./bin/cleanup.sh

What it removes:

  • Untracked Snakemake workflow files from the annotation phase
  • Trimmed RNA-seq files (GETA/fastp/single/*.fq.gz, GETA/fastp/paired/*.fq.gz)
  • RepeatMasker cache directories (.RepeatMaskerCache/)
  • Empty directories
  • Temporary PASA and HMM files

What it preserves:

  • Final outputs (complete_draft.gff3, FILTER/)
  • Log files (results/logs/)
  • Configuration files, Singularity images, and pipeline scripts
  • LncDC database files

Warning: Only run this after you have verified both pipeline phases completed successfully. The annotation phase intermediate files cannot be regenerated without re-running the full annotation pipeline.
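
If you want an extra safeguard, you can archive the preserved outputs before running cleanup (a sketch using the paths listed above; adjust to your results directory):

# Archive final outputs and logs before removing intermediates
tar -czf sylvan_final_outputs.tar.gz \
  results/complete_draft.gff3 results/FILTER results/logs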


Advanced Configuration

EVM Weights (evm_weights.txt)

Controls how EvidenceModeler prioritizes evidence sources. Higher weights give more influence:

ABINITIO_PREDICTION  AUGUSTUS     7
ABINITIO_PREDICTION  Helixer     3
OTHER_PREDICTION     Liftoff     2
OTHER_PREDICTION     GETA        5
OTHER_PREDICTION     Genewise    2
TRANSCRIPT           assembler-pasa.sqlite  10
TRANSCRIPT           StringTie   1
TRANSCRIPT           PsiClass    1
PROTEIN              GeneWise    2
PROTEIN              miniprot    2
  • PASA transcripts (weight 10) have the highest weight as direct transcript evidence
  • Augustus (weight 7) is weighted higher than Helixer (weight 3) as the primary ab initio predictor
  • GETA combined models (weight 5) reflect the integrated GETA pipeline evidence
  • Adjust weights based on evidence quality for your organism

EVM Partitioning (num_evm_files)

The num_evm_files parameter in config_annotate.yml controls how many parallel EVM partitions to create. The genome is split into overlapping segments (1 Mb segments with 20 kb overlap):

  • Higher values = more parallel SLURM jobs = faster wall-clock time, but more job scheduling overhead
  • Lower values = fewer jobs = less cluster burden, but slower
  • Default: 126 (works well for genomes up to ~500 Mb)
  • For very large genomes (>1 Gb), consider increasing to 200-500

Mikado Scoring (config/plant.yaml)

The plant.yaml file controls Mikado transcript selection parameters. Default values are tuned for plants with intron sizes typical of most angiosperms. Key parameters:

  • requirements: Minimum CDS fraction, exon count, intron length constraints
  • not_fragmentary: Criteria for rejecting incomplete models
  • scoring: Weights for transcript ranking (blast score, CDS length, UTR length, etc.)

Most users should not need to modify this file. For organisms with unusual intron sizes or gene structures, consult the Mikado documentation.

Augustus Training Options

Config Parameter Description
augustus_species Species name for Augustus training (e.g., arabidopsis)
augustus_start_from Start training from an existing species model (faster convergence for closely related species)
use_augustus Use a pre-trained Augustus model without re-training (set to species name; placeholder = train fresh)
  • If your organism is close to a well-annotated species, set augustus_start_from to that species for faster, more accurate training
  • If Augustus training fails (requires ~500 training genes minimum), use use_augustus with a close species
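
A hypothetical illustration of the two setups described above (key names from the table; values are examples only, and the shipped toy config is authoritative for how they are combined):

# Option 1: train a new model, seeded from a close, well-annotated species
augustus_species: my_species
augustus_start_from: arabidopsis

# Option 2: reuse a pre-trained Augustus model without re-training
use_augustus: arabidopsis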

Helixer GPU Configuration

Helixer runs ~10x faster with GPU acceleration. Configure a separate GPU partition in the per-rule overrides section of config_annotate.yml:

helixer:
  ncpus: 4
  memory: 32g
  account: your-gpu-account
  partition: your-gpu-partition

RNA-seq File Naming

The pipeline supports paired-end reads with two naming conventions:

  • {sample}_1.fastq.gz / {sample}_2.fastq.gz
  • {sample}_R1.fastq.gz / {sample}_R2.fastq.gz

All paired-end FASTQ files should be placed in the directory specified by rna_seq in the config. Sample names are automatically detected from file names.
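
For example, the toy dataset follows the first convention:

toydata/RNASeq/sub_SRR1019221_1.fastq.gz   # read 1
toydata/RNASeq/sub_SRR1019221_2.fastq.gz   # read 2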

Annotation Pipeline Architecture

The annotation Snakefile supports two parallel RNA-seq alignment pathways:

  1. STAR pathway: STAR_paired → psiClass_STAR / stringtie_STAR
  2. HiSat2 pathway: HiSat2_PAIRED → psiClass_HiSat / stringtie_HiSat

Both pathways feed into transcript assembly and the GETA pipeline. The choice is determined by the config and available resources. STAR is generally preferred for accuracy; HiSat2 uses less memory.

The protein homology pipeline is sequential: Miniprot (fast protein-to-genome alignment) → miniprot2genewise (converts to gene regions) → GeneWise (refined gene structure prediction on identified regions).

Snakemake Job Grouping

The annotate.sh script uses Snakemake job grouping for the Sam2Transfrag rule:

--groups Sam2Transfrag=group0 --group-components group0=100

This groups up to 100 Sam2Transfrag jobs into a single SLURM submission, reducing cluster scheduling overhead for this highly parallelized step. Adjust the group-components value if you experience SLURM scheduling issues or want different parallelism.


Monitoring and Debugging

Check Job Status

squeue -u $USER

View Logs

# Snakemake log
tail -f .snakemake/log/*.snakemake.log

# Find recent error logs
ls -lt results/logs/*.err | head -10

# Search for errors
grep -l 'Error\|Traceback' results/logs/*.err

# View specific log (pattern: {rule}_{wildcards}.err)
cat results/logs/liftoff_.err
cat results/logs/geneRegion2Genewise_seqid=group17400.err

Common Issues

Issue Solution
Out of memory Increase memory in the per-rule overrides section of config_annotate.yml for the failing rule
No space left on device TMPDIR is on tmpfs or quota exceeded — set TMPDIR to project storage
Segmentation fault Often caused by tmpfs exhaustion — set TMPDIR to disk-backed storage
LFS files are pointers Run git lfs pull; verify with ls -la toydata/ (files should be > 200 bytes)
Singularity bind error Ensure paths are within working directory or use SINGULARITY_BIND
Augustus training fails Needs ~500 training genes minimum; use augustus_start_from with a close species or use_augustus to skip training
Job timeout Increase time in the per-rule overrides section of config_annotate.yml for the rule
Variables not in SLURM job Add #SBATCH --export=ALL or explicitly export in submit script
Filter chrom_regex error Ensure chrom_regex in config_filter.yml matches your chromosome naming convention

Toy Data Details

Overview

The toy dataset contains Arabidopsis thaliana chromosome 4 split into 3 segments (~18.6 Mb total).

Directory Structure

toydata/
├── config/                      # Configuration files
│   ├── config_annotate.yml      # Annotation pipeline + SLURM resource config (pre-configured)
│   ├── config_filter.yml        # Filter pipeline config
│   ├── evm_weights.txt          # EVM evidence weights
│   └── plant.yaml               # Mikado transcript scoring parameters
├── genome_input/
│   └── genome.fasta.gz          # A. thaliana Chr4 (3 parts)
├── RNASeq/                      # 12 paired-end RNA-seq samples
│   ├── sub_SRR1019221_1.fastq.gz
│   └── ...
├── neighbor_genome/             # Neighbor species genomes
│   ├── aly4.fasta               # A. lyrata
│   ├── ath4.fasta               # A. thaliana
│   ├── chi4.fasta               # C. hirsuta
│   └── cru4.fasta               # C. rubella
├── neighbor_gff3/               # Neighbor annotations
│   ├── aly4.gff3
│   └── ...
├── cds_aa/
│   ├── neighbor.cds             # Combined CDS for EDTA
│   └── neighbor.aa              # Proteins for homology
├── misc/
│   └── Viridiplantae_v4.0.fasta # RexDB plant repeat protein database
└── EDTA/                        # Pre-computed repeat library
    └── genome.fasta.mod.EDTA.TElib.fa

Genome Statistics

Segment Length GC (%)
Chr4_1 6,195,060 bp 36.66
Chr4_2 6,195,060 bp 35.24
Chr4_3 6,195,018 bp 36.69
Total 18,585,138 bp 36.20

TAIR10 Reference Annotation (Chr4)

For comparison, the official TAIR10 annotation of Arabidopsis thaliana (Col-0) chromosome 4 contains:

Feature Type Count
Protein-coding genes 4,124
pre-tRNA genes 79
rRNA genes 0
snRNA genes 0
snoRNA genes 11
miRNA genes 28
Other RNA genes 62
Pseudogenes 121
Transposable element (TE) genes 711
Total annotated loci 5,410

Note: TAIR10 is the current reference annotation standard for A. thaliana. The protein-coding gene count increased from 3,744 (original Chr4 paper) to 4,124 as annotation methods improved.

Source: Phoenix Bioinformatics - Genome Annotation at TAIR

Sylvan Pipeline Results (Toy Data)

Running the Sylvan pipeline on the Chr4 toy dataset produces the following results:

Annotation Phase (complete_draft.gff3):

Metric Count
Total genes 5,720
Total mRNA 5,800

Filter Phase (filter.gff3):

Metric Count
Genes kept 3,756
mRNA kept 3,834
Genes discarded 1,964
mRNA discarded 1,966

Output files:

  • results/FILTER/filter.gff3 - Kept gene models
  • results/FILTER/discard.gff3 - Discarded gene models
  • results/FILTER/data.tsv - Feature matrix used by random forest (input to feature importance analysis)
  • results/FILTER/keep_data.tsv - Evidence data for kept genes
  • results/FILTER/discard_data.tsv - Evidence data for discarded genes
  • results/FILTER/{prefix}.cdna - Extracted transcript sequences
  • results/FILTER/{prefix}.pep - Extracted peptide sequences
  • results/FILTER/lncrna_predict.csv - lncDC long non-coding RNA predictions
  • results/complete_draft.gff3.map - ID mapping between original and new IDs

Comparison: The 3,756 kept genes represent ~91% of TAIR10's 4,124 protein-coding genes on Chr4. The higher initial count (5,720) includes transposable elements (TAIR10 has 711 TE genes) and low-confidence predictions that are filtered out.

Neighbor Species

Code Species Common Name
aly4 Arabidopsis lyrata Lyrate rockcress
cru4 Capsella rubella Pink shepherd's purse
chi4 Cardamine hirsuta Hairy bittercress

RNA-seq Samples

SRA Tissue Size
SRR1019221 Leaf (14-day) 4.6 Gb
SRR1105822 Rosette (19-day) 3.1 Gb
SRR1105823 Rosette (19-day) 4.5 Gb
SRR1106559 Rosette (19-day) 3.6 Gb
SRR446027 Whole plant 2.5 Gb
SRR446028 Whole plant 5.4 Gb
SRR446033 Whole plant 5.3 Gb
SRR446034 Whole plant 5.4 Gb
SRR446039 Whole plant 2.5 Gb
SRR446040 Whole plant 6.3 Gb
SRR764885 Leaf (4-week) 4.6 Gb
SRR934391 Whole plant 4.0 Gb

How the Toy Data Was Created

1. Genome segmentation

seqkit split2 -p 3 Chr4.fasta

2. Neighbor CDS extraction

Syntenic regions were identified using MCscan/jcvi:

python -m jcvi.compara.catalog ortholog ath aly --no_strip_names
python -m jcvi.compara.synteny mcscan ath.bed ath.aly.lifted.anchors --iter=1

3. RNA-seq subsetting

Reads mapping to Chr4 were extracted:

STAR --genomeDir star_index --readFilesIn sample_1.fq.gz sample_2.fq.gz \
  --readFilesCommand zcat --outSAMtype BAM Unsorted
samtools view -b -F 4 Aligned.out.bam | samtools fastq -1 out_1.fq -2 out_2.fq -

Test Environment

The toy data was tested on:

Specification Value
Nodes 4
Total CPUs 256
CPU Intel Xeon E5-2683 v4 @ 2.10GHz
Cores per node 64 (2 sockets × 16 cores × 2 threads)
Memory per node 256 GB
Storage GPFS

Runtime Statistics

The following summarizes the runtime distribution across all pipeline rules when running on the toy dataset.

Key observations:

  • Most time-consuming steps: aggregate_CombineGeneModels (~50,000s), geneRegion2Genewise (~1,000s), and Sam2Transfrag (~100-200s) are the bottlenecks
  • Fast steps: Most preprocessing and formatting rules complete in under 10 seconds
  • Parallelizable rules: Rules like geneRegion2Genewise, gmapExon, and STAR_paired run as multiple parallel jobs, significantly reducing wall-clock time
  • GPU-accelerated: helixer benefits from GPU acceleration when available

Runtime variability:

Actual runtime will vary significantly depending on your hardware and cluster configuration:

Factor Impact
CPU speed Faster clock speeds reduce single-threaded bottlenecks
Available nodes More nodes = more parallel jobs = faster wall-clock time
Memory per node Insufficient memory causes job failures or swapping
Storage I/O GPFS/Lustre faster than NFS; SSD faster than HDD
Queue wait time Busy clusters add significant delays between jobs
GPU availability Helixer runs ~10x faster with GPU acceleration

With the test environment above (4 nodes, 256 CPUs, 256 GB/node), the toy dataset completes in 4-8 hours wall-clock time. On smaller clusters or shared resources, expect longer runtimes.


Utility Scripts

Generate Cluster Config

bin/generate_cluster_from_config.py extracts the SLURM-relevant sections (__default__ and per-rule overrides) from config_annotate.yml into a standalone cluster-only YAML. This is optional — config_annotate.yml already serves as both pipeline config and --cluster-config, so you only need this script if you want a minimal, separate cluster file.

# Extract standalone cluster YAML from production config
python bin/generate_cluster_from_config.py \
  --config config/config_annotate.yml \
  --out config/cluster_annotate.yml \
  --account your-account --partition your-partition

# Extract from toydata config
python bin/generate_cluster_from_config.py \
  --config toydata/config/config_annotate.yml \
  --out toydata/config/cluster_annotate.yml \
  --account your-account --partition your-partition

Key Python Scripts

Script Purpose
Filter.py Semi-supervised random forest gene model classification
score_filter.py Alternative logistic regression + RF scoring pipeline
filter_feature_importance.py Leave-one-feature-out ablation study for filter
TidyGFF.py Reformat GFF for public distribution (renumber IDs, sort, validate)
CombineDuck.py DuckDB-based gene annotation database for model selection
combine_genemodel.py Merge Augustus, TransFrag, and GeneWise gene models
Pick_Primaries.py Select primary transcript per gene locus
repeat_gene_removal.py Remove gene models overlapping with transposable elements
gff_to_evm.py Convert GFF3 to EVM input format
splitEVMCommands.py Partition EVM commands for parallel execution
clusterGeneWiseRegions.py Cluster GeneWise alignment regions
miniprot2Genewise.py Convert Miniprot output to GeneWise format
splitBam.py Split BAM files for parallel processing
MonitorFilter.py Visualize filter iteration progress (matplotlib)

Runtime Benchmark

The following benchmarks were measured on the toy dataset (A. thaliana chromosome 4, 18.6 Mb) using the test environment described in Toy Data Details.

Test Environment

Specification Value
Nodes 4
Total CPUs 256
CPU Intel Xeon E5-2683 v4 @ 2.10GHz
Cores per node 64 (2 sockets x 16 cores x 2 threads)
Memory per node 256 GB
Storage GPFS
GPU NVIDIA V100 (Helixer only)

Phase 1 — Annotate

Step Wall-clock time Notes
EDTA (repeat library) ~1.5 h Pre-pipeline; LINE detection is the bottleneck
RepeatMasking 15–30 min
RNA-seq alignment (STAR) 30–60 min Parallelized across samples
Transcript assembly (StringTie/PsiCLASS) 20–40 min
Protein homology (Miniprot/GeneWise) 30–60 min Parallelized across genome regions
Helixer (GPU) ~5 min Single NVIDIA V100; see GPU note below
Augustus training + prediction 1–2 h Longest single-threaded bottleneck
Gene model combination (CombineDuck) 30–60 min ~50,000 s total CPU across parallel jobs
EvidenceModeler 30–60 min
Phase 1 total 4–8 h Wall-clock with 4 nodes / 256 CPUs

Phase 2 — Filter

Step Wall-clock time Notes
CDS/peptide extraction < 1 min
PfamScan 5–10 min
RSEM (expression quantification) 10–20 min
BLASTp + RexDB 10–20 min Parallelized across 20 splits
Ab initio coverage (Helixer/Augustus/RM) < 5 min
lncDC < 5 min
BUSCO 5–10 min
Semi-supervised random forest < 5 min 3–5 iterations to convergence
Phase 2 total ~1 h

End-to-End Summary

Component Time
EDTA (pre-pipeline) ~1.5 h
Phase 1 — Annotate 4–8 h
Phase 2 — Filter ~1 h
Total ~7–11 h

Helixer GPU vs CPU

Helixer is the only pipeline step that benefits from GPU acceleration. On the toy dataset (18.6 Mb):

Hardware Helixer runtime Notes
NVIDIA V100 (1 GPU) ~5 min Default in pipeline; --gres=gpu:1 in SLURM config
CPU-only (16 threads) ~50 min ~10x slower; set via Helixer --no-gpu flag

Helixer GPU time represents < 2% of total Sylvan wall-clock time, so GPU availability is helpful but not a bottleneck. The pipeline runs end-to-end without a GPU if needed.

Scalability for EBP-Scale Genomes

The Earth BioGenome Project (EBP) targets thousands of species with genomes ranging from hundreds of Mb to several Gb. Key scaling considerations:

Factor Scaling behavior
Genome size Most steps scale linearly (repeat masking, alignment, EVM). Augustus training and CombineDuck scale super-linearly due to increased model complexity.
RNA-seq volume STAR/HiSat2 alignment scales linearly with read count. Parallelized across samples.
Protein DB size Miniprot/GeneWise scale linearly. BLASTp is parallelized across 20 splits.
Helixer (GPU) Scales linearly with genome size via --subsequence-length. A 1 Gb genome takes ~30–60 min on a V100.
Filter phase Scales with gene count, not genome size. Typically < 2 h even for large annotations (50k+ genes).

Estimated runtimes by genome size (4 nodes / 256 CPUs / 1 GPU):

Genome size Example organism Estimated wall-clock
18.6 Mb (toy) A. thaliana chr4 7–11 h
~135 Mb A. thaliana (full) 1–2 days
~750 Mb Rice, tomato 3–5 days
~2.5 Gb Maize, wheat subgenome 1–2 weeks

These estimates assume continuous job scheduling. In practice, SLURM queue wait times can dominate wall-clock time on shared clusters. Snakemake's job-level parallelism means that adding nodes provides near-linear speedup for the parallelizable steps (RNA-seq alignment, GeneWise, BLASTp, EVM).


Getting Help

Feature Importance Analysis

After finishing the filter phase you will have FILTER/data.tsv (the feature matrix used by Filter.py) and a BUSCO run directory such as results/busco/eudicots_odb10. Reviewers often ask for a feature ablation study, so we provide an automated helper:

python bin/filter_feature_importance.py FILTER/data.tsv results/busco/<lineage>/full_table.tsv \
  --output-table FILTER/feature_importance.tsv
  • What is the BUSCO full table? Every BUSCO run writes a full_table.tsv inside its lineage-specific run folder. Each non-Missing BUSCO row lists the BUSCO ID, status (Complete/Duplicated/Fragmented), and the transcript/gene ID it matched. The feature-importance script reuses this file to count how many BUSCOs remain in the “keep” set during each iteration—no new BUSCO analysis is required.
  • Outputs: FILTER/feature_importance.tsv (table) plus FILTER/feature_importance.json (machine-readable). Both include the baseline run (all features) and each leave-one-feature-out run, along with final out-of-bag (OOB) error, BUSCO counts, and iteration counts.
  • Optional flags:
    • --features TPM COVERAGE PFAM ... restricts the analysis to specific columns from FILTER/data.tsv.
    • --ignore TPM_missing singleExon removes metadata columns so the script automatically uses every other feature column.

Workflow summary:

  1. Run Filter.py as usual to create FILTER/data.tsv.
  2. Identify the BUSCO full_table.tsv path you already used for filter monitoring (e.g., results/busco/eudicots_odb10/full_table.tsv).
  3. Execute the command above. Inspect FILTER/feature_importance.tsv to see how dropping each feature affects OOB error (positive delta ⇒ feature is important).
  4. Incorporate the results (table/plot) into your manuscript or reviewer response.
