
Sylvan Genome Annotation

Sylvan is a comprehensive genome annotation pipeline that combines EVM/PASA, GETA, AUGUSTUS, and Helixer with semi-supervised random forest filtering to generate high-quality gene models from raw genome assemblies.

(Figure: Sylvan workflow overview)

Features

  • Multi-evidence integration: RNA-seq, protein homology, neighbor species annotations
  • Dual RNA-seq alignment pathways: STAR and HISAT2 with StringTie/PsiCLASS
  • Multiple ab initio predictors: Helixer (GPU-accelerated), Augustus
  • Semi-supervised filtering: Random forest-based spurious gene removal
  • Score-based filtering: Alternative logistic regression + random forest scoring pipeline
  • HPC-ready: SLURM cluster support with Singularity containers
  • Customizable cluster command: sbatch template lives in the config YAML — no shell script edits needed
  • TidyGFF: Format annotations for public distribution
  • Cleanup utility: Remove intermediate files after pipeline completion

Quick Start

  1. Complete the Installation steps (conda environment, Singularity image, git clone)
  2. Run with toy data:
# Point at the toy-data config, then dry-run first
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
snakemake -n --snakefile bin/Snakefile_annotate

# Run annotation
./bin/annotate_toydata.sh

The toy dataset uses A. thaliana chromosome 4 with 12 paired-end RNA-seq samples, 3 neighbor species, and the land_plant Helixer model. For a detailed walkthrough, see the Wiki.


Installation

Requirements

  • Linux (tested on CentOS/RHEL)
  • Singularity 3.x+
  • Conda/Mamba
  • SLURM for cluster execution (no HPC? deploy via Cloud Cluster Toolkit)
  • Git LFS (for toy data)

Dependencies

Most bioinformatics tools (STAR, Augustus, GeneWise, PASA, EVM, BLAST, BUSCO, etc.) are bundled inside the Singularity container. The host environment needs:

Package        Purpose
Python 3.10+   Pipeline orchestration
Snakemake 7    Workflow engine
pandas         Data manipulation
scikit-learn   Random forest classifier
NumPy          Numerical operations
PyYAML         Config parsing
rich           Logging (optional)

Perl and R scripts (fillingEndsOfGeneModels.pl, filter_distributions.R) run inside the Singularity container and do not require host installation.
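
The host-side Python packages above can be added to the same conda environment created in Setup; a hedged one-liner (channels and versions are not pinned by the project):

# Assumed sufficient; adjust versions if your site requires pinning
conda install -n sylvan -c conda-forge pandas scikit-learn numpy pyyaml rich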

Setup

# Create conda environment
conda create -n sylvan -c conda-forge -c bioconda python=3.11 snakemake=7 -y
conda activate sylvan

# Download Singularity image
singularity pull --arch amd64 sylvan.sif library://wyim/sylvan/sylvan:latest

# Clone repository (with Git LFS for toy data)
git lfs install
git clone https://github.com/plantgenomicslab/Sylvan.git

Build from source (optional)

cd Sylvan/singularity
sudo singularity build sylvan.sif Sylvan.def

Pipeline Design

The Sylvan pipeline consists of two main phases — annotation and filtration — with configurable modules that process evidence from multiple sources and combine it into a unified set of gene models. The following sections describe the available tools and modules. Users configure which components to enable, and how to parameterize them, via config_annotate.yml and config_filter.yml.

(Figure: pipeline DAG)

Phase 1: Annotate

The annotation phase generates gene models by integrating multiple configurable evidence sources.

Evidence Generation

  • Repeat Masking

    • Runs RepeatMasker with a user-specified species library (e.g. Embryophyta, Viridiplantae, Metazoa — configured via geta.RM_species)
    • Can optionally run RepeatModeler for de novo repeat identification
    • Supports user-supplied custom repeat libraries (e.g. from EDTA, configured via geta.RM_lib)
  • RNA-seq Processing

    • Quality-trims reads with fastp
    • Aligns reads via STAR (default) or HISAT2 (alternative pathway; both are available in the pipeline, and the active pathway depends on the Snakemake rule graph)
    • Assembles transcripts with StringTie and PsiCLASS
    • Optionally performs de novo transcript assembly with SPAdes + Evigene clustering
    • Refines and clusters transcripts with PASA
  • Protein Homology (sequential pipeline)

    • Miniprot performs fast protein-to-genome alignment to identify candidate gene regions
    • GeneWise refines gene structures on Miniprot-identified regions
    • GMAP provides exonerate-style exon-level alignments
  • Ab Initio Prediction

    • Helixer: deep learning–based gene prediction (optionally GPU-accelerated; model selected via helixer_model: land_plant, vertebrate, invertebrate, or fungi)
    • Augustus: HMM-based prediction, either trained de novo on the target genome or initialized from an existing species model (via augustus_start_from), or skipped entirely if a pre-trained model is supplied (via use_augustus)
  • Liftover

    • LiftOff transfers annotations from one or more neighbor species (configured via liftoff.neighbor_gff and liftoff.neighbor_fasta)

Evidence Combination

  • GETA Pipeline

    • TransDecoder predicts ORFs from assembled transcripts
    • Gene models are combined and filtered; repeat-overlapping genes are removed
  • Portcullis

    • Filters splice junctions from transcript evidence
  • EvidenceModeler (EVM)

    • Integrates all evidence sources using configurable weights (evm_weights.txt)
    • Generates consensus gene models
    • Genome is partitioned into overlapping segments for parallel execution (partition count configured via num_evm_files)
  • PASA Post-processing

    • PASA operates at two stages in the pipeline: (1) initial transcript assembly and clustering before EVM, and (2) post-EVM refinement for UTR addition and alternative isoform incorporation
  • AGAT

    • Final GFF3 format cleaning and validation

Output: results/complete_draft.gff3

Phase 2: Semi-Supervised Random Forest Filter

The filter phase computes additional evidence features for each gene model and applies a semi-supervised random forest classifier to separate high-quality genes from spurious predictions.

Feature Generation

The following features are computed for every gene model in the draft annotation:

  • PfamScan — identifies conserved protein domains using the Pfam-A HMM database
  • RSEM — quantifies transcript expression (TPM) from re-aligned RNA-seq reads; bedtools computes read coverage
  • BLASTp (homolog) — measures similarity to a user-supplied protein database (parallelized across 20 split peptide files)
  • BLASTp (RexDB) — measures similarity to a repeat element protein database (e.g. RepeatExplorer Viridiplantae)
  • Ab initio overlap — computes the fraction of each gene model overlapping with Augustus predictions, Helixer predictions, and RepeatMasker annotations
  • lncDC — classifies transcripts as protein-coding or long non-coding RNA using an XGBoost model with plant-specific pre-trained parameters
  • BUSCO — identifies conserved single-copy orthologs (used only to monitor the filtration process; not used as a classifier feature)

Semi-supervised Classification

  1. Initial gene set selection: A data-driven heuristic selects high-confidence positive genes (strong homolog/Pfam/expression evidence) and high-confidence negative genes (repeat-like, no expression) using configurable cutoff thresholds (TPM, coverage, BLAST identity/coverage, repeat overlap)
  2. Random forest training: A binary classifier is trained on the initial gene set
  3. Iterative refinement: High-confidence predictions (above the --recycle threshold, default 0.95) are added back to the training set, and the model is retrained. This repeats for up to --max-iter iterations (default 5) or until convergence
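
The loop in steps 2–3 is a standard self-training scheme. A minimal sketch in Python with scikit-learn, assuming a feature table shaped like results/FILTER/data.tsv and an initial label column produced by the cutoff heuristic (all column names here are hypothetical, not Sylvan's actual schema):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative self-training loop; Sylvan's implementation may differ in detail.
data = pd.read_csv("results/FILTER/data.tsv", sep="\t")
X = data.drop(columns=["gene_id", "label"])   # hypothetical column names
y = data["label"]                             # 1 = keep, 0 = discard, -1 = unlabeled

RECYCLE, MAX_ITER = 0.95, 5                   # mirrors --recycle / --max-iter defaults
labeled = y != -1
for _ in range(MAX_ITER):
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[~labeled])[:, 1]
    confident = (proba >= RECYCLE) | (proba <= 1 - RECYCLE)
    if not confident.any():                   # convergence: nothing new to recycle
        break
    idx = X.index[~labeled][confident]        # recycle high-confidence predictions
    y.loc[idx] = (proba[confident] >= RECYCLE).astype(int)
    labeled = y != -1

keep = clf.predict(X) == 1                    # final keep/discard calls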

Output files:

  • results/FILTER/filter.gff3 — Kept gene models
  • results/FILTER/discard.gff3 — Discarded gene models
  • results/FILTER/data.tsv — Feature matrix used by random forest
  • results/FILTER/keep_data.tsv — Evidence data for kept genes
  • results/FILTER/discard_data.tsv — Evidence data for discarded genes
  • results/FILTER/{prefix}.cdna — Extracted transcript sequences
  • results/FILTER/{prefix}.pep — Extracted peptide sequences

Alternative: Score-based Filter

An alternative scoring pipeline (Snakefile_filter_score) uses logistic regression and random forest scoring with pseudo-labels instead of the iterative semi-supervised approach. This requires the same feature generation outputs and produces:

  • results/FILTER/scores.csv — Per-gene scores and features
  • results/FILTER/scores.metrics.txt — AUC/PR/F1 and chosen thresholds

Run it on the toy data:

export SYLVAN_FILTER_CONFIG="toydata/config/config_filter.yml"
./bin/filter_score_toydata.sh

Running the Annotate Phase

This section describes the inputs, configuration, and commands needed to run the annotation pipeline on your data.

Input Requirements

Input              Description                                                                                           Config Field
Genome assembly    FASTA file (.fa, .fasta, .fa.gz, .fasta.gz)                                                           genome
RNA-seq data       Paired-end gzipped FASTQ files (*_1.fastq.gz/*_2.fastq.gz or *_R1.fastq.gz/*_R2.fastq.gz) in a folder rna_seq
Protein sequences  FASTA from UniProt, OrthoDB, etc. (comma-separated for multiple files)                                proteins
Neighbor species   Directories containing GFF3 and genome FASTA (.fa, .fasta, .fna) files, one per species               liftoff.neighbor_gff, liftoff.neighbor_fasta
Repeat library     EDTA output (.TElib.fa)                                                                               geta.RM_lib
Singularity image  Path to sylvan.sif                                                                                    singularity
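
A hedged sketch of how these inputs map onto config_annotate.yml (paths are placeholders and the exact layout is assumed; the toy data config under toydata/config/ is the authoritative reference):

genome: /data/my_species/genome.fa
rna_seq: /data/my_species/rnaseq/          # folder of *_1.fastq.gz / *_2.fastq.gz pairs
proteins: /data/uniprot_plants.fa,/data/orthodb_plants.fa
liftoff:
  neighbor_gff: /data/neighbors/gff/       # GFF3 files, one per neighbor species
  neighbor_fasta: /data/neighbors/fasta/   # matching genome FASTA files
geta:
  RM_lib: /data/my_species/EDTA.TElib.fa   # optional custom repeat library
singularity: /path/to/sylvan.sif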

Running the Pipeline

# Set config (required)
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"

# Dry run
snakemake -n --snakefile bin/Snakefile_annotate

# Submit to SLURM
sbatch -A [account] -p [partition] -c 1 --mem=1g \
  -J annotate -o annotate.out -e annotate.err \
  --wrap="./bin/annotate_toydata.sh"

# Or run directly
./bin/annotate_toydata.sh

Output: results/complete_draft.gff3


Running the Filter Phase

This section describes the inputs and commands for the filter pipeline. All inputs below are specified in config_filter.yml.

Input Requirements

Input              Description                                                           Config Field
Annotated GFF      Output from the Annotate phase (results/complete_draft.gff3)          anot_gff
Genome             Same as Annotate phase                                                 genome
RNA-seq data       Same as Annotate phase                                                 rna_seq
Protein sequences  Same as Annotate phase                                                 protein
Augustus GFF       Augustus predictions (results/GETA/Augustus/augustus.gff3)             augustus_gff
Helixer GFF        Helixer predictions (results/AB_INITIO/Helixer/helixer.gff3)           helixer_gff
Repeat GFF         RepeatMasker output (results/GETA/RepeatMasker/genome.repeat.gff3)     repeat_gff
HmmDB              Pfam database directory (default: /usr/local/src inside container)     HmmDB
RexDB              RepeatExplorer protein DB (e.g. Viridiplantae_v4.0.fasta from rexdb)   RexDB
BUSCO lineage      e.g., eudicots_odb10                                                   busco_lin
Chromosome regex   Regex to match chromosome prefixes (e.g. (^Chr)|(^chr)|(^LG))          chrom_regex

Filter cutoff thresholds (in config_filter.yml under Cutoff):

Parameter                   Description                               Default
tpm                         TPM threshold for initial gene selection  3
rsem_cov                    RNA-seq coverage threshold                0.5
blast_pident / blast_qcovs  BLASTp identity / coverage                0.6 / 0.6
rex_pident / rex_qcovs      RexDB identity / coverage                 0.6 / 0.6
helixer_cov / augustus_cov  Ab initio overlap                         0.8 / 0.8
repeat_cov                  Repeat overlap coverage threshold         0.5
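
Restated as a hedged Cutoff block for config_filter.yml (the nesting is assumed from the table above; verify key names against the toy data config):

Cutoff:
  tpm: 3              # TPM threshold for initial positive selection
  rsem_cov: 0.5       # RNA-seq coverage threshold
  blast_pident: 0.6   # homolog BLASTp identity
  blast_qcovs: 0.6    # homolog BLASTp query coverage
  rex_pident: 0.6     # RexDB identity
  rex_qcovs: 0.6      # RexDB query coverage
  helixer_cov: 0.8    # Helixer overlap fraction
  augustus_cov: 0.8   # Augustus overlap fraction
  repeat_cov: 0.5     # repeat overlap threshold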

Running the Pipeline

# Set config (required)
export SYLVAN_FILTER_CONFIG="toydata/config/config_filter.yml"

# Dry run
snakemake -n --snakefile bin/Snakefile_filter

# Submit to SLURM
sbatch -A [account] -p [partition] -c 1 --mem=4g \
  -J filter -o filter.out -e filter.err \
  --wrap="./bin/filter_toydata.sh"

# Or run directly
./bin/filter_toydata.sh

Output: results/FILTER/filter.gff3

Feature Importance Test

After a filter run completes, run the leave-one-feature-out ablation test:

python bin/filter_feature_importance.py FILTER/data.tsv results/busco/full_table.tsv \
  --output-table FILTER/feature_importance.tsv

See the Wiki for detailed usage, optional flags, and workflow.


Configuration

Sylvan uses several configuration files. Each config YAML serves as both the pipeline config and the Snakemake --cluster-config, so no separate cluster file is needed.

File                 Purpose
config_annotate.yml  Pipeline options and SLURM resources: input paths, species parameters, tool settings, plus per-rule CPU/memory/partition allocation and the cluster_cmd template
config_filter.yml    Filter options: input paths, cutoff thresholds, thread allocation, and the cluster_cmd template
evm_weights.txt      EVM evidence weights: priority of each evidence source
config/plant.yaml    Mikado scoring: transcript selection parameters (plant-specific defaults provided)

Pipeline Config (config_annotate.yml)

Contains:

  • Input file paths (genome, RNA-seq, proteins, neighbor species)
  • Species-specific settings (Helixer model, Augustus species)
  • Tool parameters (max intron length, EVM weights)
  • Output prefix and directories
  • SLURM resource allocation (__default__ section with account, partition, memory, ncpus, time, cluster_cmd, plus per-rule overrides)

EVM Weights (evm_weights.txt)

Controls how EvidenceModeler prioritizes different evidence sources. Higher weights give more influence. Example (from toy data):

ABINITIO_PREDICTION  AUGUSTUS     7
ABINITIO_PREDICTION  Helixer     3
OTHER_PREDICTION     Liftoff     2
OTHER_PREDICTION     GETA        5
OTHER_PREDICTION     Genewise    2
TRANSCRIPT           assembler-pasa.sqlite  10
TRANSCRIPT           StringTie   1
TRANSCRIPT           PsiClass    1
PROTEIN              GeneWise    2
PROTEIN              miniprot    2

Adjust weights based on the quality of each evidence type for your organism. PASA transcripts (weight 10) typically have the highest weight as they represent direct transcript evidence.

Environment Variables

Variable              Phase     Description
SYLVAN_CONFIG         Annotate  Path to config_annotate.yml (default: config_annotate.yml in cwd)
SYLVAN_FILTER_CONFIG  Filter    Path to config_filter.yml (default: config_filter.yml in cwd)
SYLVAN_RESULTS_DIR    Annotate  Override the results output directory (default: $(pwd)/results/)
TMPDIR                Both      Temporary directory — critical on HPC (see below)
SLURM_TMPDIR          Both      Should match TMPDIR
SINGULARITY_BIND      Both      Bind additional host paths into the container

Why TMPDIR matters: Many HPC nodes mount /tmp as tmpfs (RAM-backed). Large temporary files from STAR, RepeatMasker, or Augustus can exhaust memory, causing cryptic segmentation faults or "no space left on device" errors. Always set TMPDIR to disk-backed project storage:

mkdir -p results/TMP
export TMPDIR="$(pwd)/results/TMP"
export SLURM_TMPDIR="$TMPDIR"

Using Custom Config Location

# For toydata
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"

# For custom project
export SYLVAN_CONFIG="/path/to/my_config.yml"

This is required for any Snakemake command (dry-run, unlock, etc.):

export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
snakemake -n --snakefile bin/Snakefile_annotate  # dry-run
snakemake --unlock --snakefile bin/Snakefile_annotate  # unlock

Key Parameters

Parameter            Description                                                                                                      Example
prefix               Output file prefix                                                                                               my_species
helixer_model        land_plant, vertebrate, invertebrate, or fungi                                                                   land_plant
helixer_subseq       64152 (plants), 21384 (fungi), 213840 (vertebrates)                                                              64152
augustus_species     Augustus species name for training                                                                               arabidopsis
augustus_start_from  Start Augustus training from an existing species model (skips de novo training if a close match is available)    arabidopsis
use_augustus         Use a pre-trained Augustus species without re-training (set to a species name, or placeholder to train fresh)    placeholder
num_evm_files        Number of parallel EVM partitions (more = faster but more SLURM jobs)                                            126
geta.RM_species      RepeatMasker species database (e.g. Embryophyta, Viridiplantae, Metazoa)                                         Embryophyta
Helixer GPU Configuration

Helixer benefits significantly from GPU acceleration (~10x speedup). To use a separate GPU partition, add the following per-rule override in config_annotate.yml:

helixer:
  ncpus: 4
  memory: 32g
  account: your-gpu-account      # GPU-specific billing account
  partition: your-gpu-partition   # GPU partition name

SLURM Configuration

Find your SLURM account and partition:

# Show your accounts and partitions
sacctmgr show user "$USER" withassoc format=Account,Partition -nP

# List all available partitions
sinfo -s

# Show partition details (time limits, nodes, etc.)
sinfo -o "%P %l %D %c %m"

Set in config_annotate.yml under the __default__ section:

__default__:
  account: your-account
  partition: your-partition
  time: "3-00:00:00"
  memory: 8g
  ncpus: 1

Customizing the Cluster Submit Command

The cluster_cmd field in the __default__ section defines the sbatch template used by Snakemake's --cluster option. All entry scripts (annotate.sh, filter.sh, etc.) read this template via bin/get_cluster_cmd.py, so you only need to edit the config YAML:

__default__:
  cluster_cmd: >-
    sbatch -N {cluster.nodes} --mem={cluster.memory} --cpus-per-task={cluster.ncpus}
    -J {cluster.name} --parsable
    -A {cluster.account} -p {cluster.partition}
    -t {cluster.time} -o {cluster.output} -e {cluster.error}

Common customizations:

  • Remove -A {cluster.account} if your HPC doesn't require a billing account
  • Add --gres=gpu:1 for GPU rules
  • Add --export=ALL to pass environment variables
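
For example, a hedged variant of the template for a site with no billing accounts that also exports the caller's environment (keep the {cluster.*} placeholders intact; --gres=gpu:1 could be appended analogously for a GPU-only setup):

__default__:
  cluster_cmd: >-
    sbatch -N {cluster.nodes} --mem={cluster.memory} --cpus-per-task={cluster.ncpus}
    -J {cluster.name} --parsable --export=ALL
    -p {cluster.partition}
    -t {cluster.time} -o {cluster.output} -e {cluster.error}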

Output

All outputs are organized under results/:

results/
├── complete_draft.gff3          # Annotate phase final output
│
├── AB_INITIO/
│   └── Helixer/                 # Helixer predictions
│
├── GETA/
│   ├── RepeatMasker/            # Repeat masking results
│   ├── Augustus/                # Augustus predictions
│   ├── transcript/              # TransDecoder results
│   ├── homolog/                 # Protein alignments (Miniprot → GeneWise)
│   └── CombineGeneModels/       # GETA gene models
│
├── LIFTOVER/
│   └── LiftOff/                 # Neighbor species liftover
│
├── TRANSCRIPT/
│   ├── PASA/                    # PASA assemblies
│   ├── spades/                  # De novo assembly
│   └── evigene/                 # Evigene transcript clustering
│
├── PROTEIN/                     # Merged protein alignments
│
├── EVM/                         # EvidenceModeler output
│
├── FILTER/
│   ├── filter.gff3              # Kept gene models
│   ├── discard.gff3             # Discarded gene models
│   ├── data.tsv                 # Feature matrix (input to random forest)
│   ├── keep_data.tsv            # Evidence data for kept genes
│   ├── discard_data.tsv         # Evidence data for discarded genes
│   ├── {prefix}.cdna            # Transcript sequences
│   ├── {prefix}.pep             # Peptide sequences
│   ├── STAR/                    # RNA-seq realignment for filter
│   ├── rsem_outdir/             # RSEM quantification
│   ├── splitPep/                # Parallelized BLAST inputs
│   ├── busco_*/                 # BUSCO results (monitoring only)
│   └── lncrna_predict.csv       # lncDC predictions
│
├── config/                      # Runtime config copies
│
└── logs/                        # SLURM job logs

Formatting Output

Use TidyGFF to prepare annotations for public distribution:

singularity exec sylvan.sif python bin/TidyGFF.py \
  MySpecies results/FILTER/filter.gff3 \
  --out MySpecies_v1.0 --splice-name t --justify 5 --sort \
  --chrom-regex "^Chr" --source Sylvan

TidyGFF options:

Option            Description
pre (positional)  Prefix for gene IDs (e.g. Ath produces Ath01G000010)
gff (positional)  Input GFF3 file
--out             Output file basename (produces .gff3, .cdna, and .pep files)
--splice-name     Splice variant label style (e.g. t gives mRNA1.t1, mRNA1.t2)
--justify         Number of digits in gene IDs (default: 8)
--sort            Sort output by chromosome and start coordinate
--chrom-regex     Regex for chromosome prefixes (auto-detects Chr, chr, LG, Ch, ^\d)
--contig-regex    Regex for contig/scaffold naming (e.g. HiC_scaffold_(\d+$),Scaf)
--source          Value for GFF column 2 (e.g. Sylvan)
--remove-names    Remove Name attributes from the GFF
--no-chrom-id     Do not number gene IDs by chromosome

Useful Commands

# Force rerun all
./bin/annotate.sh --forceall

# Rerun specific rule
./bin/annotate.sh --forcerun helixer

# Rerun incomplete jobs (jobs that started but didn't finish)
./bin/rerun-incomplete.sh

# Generate report after completion
snakemake --report report.html --snakefile bin/Snakefile_annotate

# Unlock after interruption
./bin/annotate.sh --unlock

# Clean up intermediate files (run after BOTH phases complete)
./bin/cleanup.sh

Cleanup

bin/cleanup.sh removes intermediate files generated during the annotation phase while preserving:

  • Final outputs (complete_draft.gff3, filter.gff3)
  • Log files (results/logs/)
  • Configuration files
  • Filter phase outputs (FILTER/)

Run this only after both annotation and filter phases have completed successfully.

Helper Scripts

Script                               Purpose
bin/generate_cluster_from_config.py  Regenerate per-rule SLURM defaults within config_annotate.yml — keeps resource requests in sync with the pipeline's threads/memory
bin/get_cluster_cmd.py               Extract the cluster_cmd sbatch template from a config YAML (used internally by all entry scripts)

Troubleshooting

Check logs

# Find recent errors
ls -lt results/logs/*.err | head -10
grep -l 'Error\|Traceback' results/logs/*.err

# View specific log
cat results/logs/{rule}_{wildcards}.err

Common Issues

Issue                           Solution
Out of memory                   Increase memory in config_annotate.yml for the rule
No space left on device         TMPDIR is on tmpfs or quota exceeded; set TMPDIR to project storage
Segmentation fault              Often caused by tmpfs exhaustion; set TMPDIR to disk-backed storage
File not found (Singularity)    Path not bound in the container; add it to SINGULARITY_BIND
Singularity bind error          Ensure paths are within the working directory, or use SINGULARITY_BIND
Permission denied in container  Check directory permissions and ensure the path is bound
SLURM account error             Use account (billing account), not username
cluster_cmd not found           Ensure __default__.cluster_cmd exists in your config YAML
LFS files not downloaded        Run git lfs pull; verify with ls -la toydata/ (files should be > 200 bytes)
Augustus training fails         Needs a minimum of ~500 training genes; use augustus_start_from with a close species
Job timeout                     Increase time in config_annotate.yml for the rule
Variables not in SLURM job      Add #SBATCH --export=ALL or explicitly export in the submit script
Filter chrom_regex error        Ensure chrom_regex in config_filter.yml matches your chromosome naming convention

Memory Guidelines

  • General recommendation: 4 GB per thread
  • Example: 48 threads = 192 GB memory
  • ncpus and threads should match in config_annotate.yml
  • Some rules need more: mergeSTAR may require ~18 GB per thread for large datasets
  • Check df -h $TMPDIR to ensure temp storage is on real disk, not tmpfs
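
As a sketch, per-rule overrides in config_annotate.yml following these guidelines (mergeSTAR is named above; the star rule name is hypothetical, check Snakefile_annotate for the actual rule ids):

star:              # hypothetical rule name
  ncpus: 12
  memory: 48g      # 4 GB per thread
mergeSTAR:
  ncpus: 8
  memory: 144g     # ~18 GB per thread for large datasets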

Citation

Sylvan: A comprehensive genome annotation pipeline. Under review.

License

MIT License - see LICENSE

Contact

Issues: https://github.com/plantgenomicslab/Sylvan/issues
