Sylvan is a comprehensive genome annotation pipeline that combines EVM/PASA, GETA, and Helixer with semi-supervised random forest filtering for generating high-quality gene models from raw genome assemblies.
- Multi-evidence integration: RNA-seq, protein homology, neighbor species annotations
- Dual RNA-seq alignment pathways: STAR and HiSat2 with StringTie/PsiCLASS
- Multiple ab initio predictors: Helixer (GPU-accelerated), Augustus
- Semi-supervised filtering: Random forest-based spurious gene removal
- Score-based filtering: Alternative logistic regression + random forest scoring pipeline
- HPC-ready: SLURM cluster support with Singularity containers
- Customizable cluster command: the `sbatch` template lives in the config YAML — no shell script edits needed
- TidyGFF: Format annotations for public distribution
- Cleanup utility: Remove intermediate files after pipeline completion
- Complete Installation (conda environment, Singularity image, git clone)
- Run with toy data:
# Dry-run first
snakemake -n --snakefile bin/Snakefile_annotate
# Run annotation
./bin/annotate_toydata.sh

The toy data experiment uses *A. thaliana* chromosome 4 with 12 paired-end RNA-seq samples, 3 neighbor species, and the land_plant Helixer model. For a detailed walkthrough, see the Wiki.
- Linux (tested on CentOS/RHEL)
- Singularity 3.x+
- Conda/Mamba
- SLURM for cluster execution (no HPC? deploy via Cloud Cluster Toolkit)
- Git LFS (for toy data)
Most bioinformatics tools (STAR, Augustus, GeneWise, PASA, EVM, BLAST, BUSCO, etc.) are bundled inside the Singularity container. The host environment needs:
| Package | Purpose |
|---|---|
| Python 3.10+ | Pipeline orchestration |
| Snakemake 7 | Workflow engine |
| pandas | Data manipulation |
| scikit-learn | Random forest classifier |
| NumPy | Numerical operations |
| PyYAML | Config parsing |
| rich | Logging (optional) |
Perl and R scripts (fillingEndsOfGeneModels.pl, filter_distributions.R) run inside the Singularity container and do not require host installation.
# Create conda environment
conda create -n sylvan -c conda-forge -c bioconda python=3.11 snakemake=7 -y
conda activate sylvan
# Download Singularity image
singularity pull --arch amd64 sylvan.sif library://wyim/sylvan/sylvan:latest
# Clone repository (with Git LFS for toy data)
git lfs install
git clone https://github.com/plantgenomicslab/Sylvan.git
cd Sylvan/singularity
sudo singularity build sylvan.sif Sylvan.def

The Sylvan pipeline consists of two main phases — annotation and filtration — with configurable modules that process evidence from multiple sources and combine it into a unified gene model. The following describes the available tools and modules. Users configure which components to enable and how to parameterize them via config_annotate.yml and config_filter.yml.
The annotation phase generates gene models by integrating multiple configurable evidence sources.
- Repeat Masking
  - Runs RepeatMasker with a user-specified species library (e.g. `Embryophyta`, `Viridiplantae`, `Metazoa` — configured via `geta.RM_species`)
  - Can optionally run RepeatModeler for de novo repeat identification
  - Supports user-supplied custom repeat libraries (e.g. from EDTA, configured via `geta.RM_lib`)
- RNA-seq Processing
  - Quality-trims reads with fastp
  - Aligns reads via STAR (default) or HiSat2 (alternative pathway — both are available in the pipeline; the active pathway depends on the Snakemake rule graph)
  - Assembles transcripts with StringTie and PsiCLASS
  - Optionally performs de novo transcript assembly with SPAdes + Evigene clustering
  - Refines and clusters transcripts with PASA
- Protein Homology (sequential pipeline)
  - Miniprot performs fast protein-to-genome alignment to identify candidate gene regions
  - GeneWise refines gene structures on Miniprot-identified regions
  - GMAP provides exonerate-style exon-level alignments
- Ab Initio Prediction
  - Helixer: deep learning–based gene prediction (optionally GPU-accelerated; model selected via `helixer_model` — `land_plant`, `vertebrate`, `invertebrate`, or `fungi`)
  - Augustus: HMM-based prediction, either trained de novo on the target genome or initialized from an existing species model (via `augustus_start_from`), or skipped entirely if a pre-trained model is supplied (via `use_augustus`)
- Liftover
  - LiftOff transfers annotations from one or more neighbor species (configured via `liftoff.neighbor_gff` and `liftoff.neighbor_fasta`)
- GETA Pipeline
  - TransDecoder predicts ORFs from assembled transcripts
  - Gene models are combined and filtered; repeat-overlapping genes are removed
- Portcullis
  - Filters splice junctions from transcript evidence
- EvidenceModeler (EVM)
  - Integrates all evidence sources using configurable weights (`evm_weights.txt`)
  - Generates consensus gene models
  - Genome is partitioned into overlapping segments for parallel execution (partition count configured via `num_evm_files`)
- PASA Post-processing
  - PASA operates at two stages in the pipeline: (1) initial transcript assembly and clustering before EVM, and (2) post-EVM refinement for UTR addition and alternative isoform incorporation
- AGAT
  - Final GFF3 format cleaning and validation
Output: results/complete_draft.gff3
The filter phase computes additional evidence features for each gene model and applies a semi-supervised random forest classifier to separate high-quality genes from spurious predictions.
The following features are computed for every gene model in the draft annotation:
- PfamScan — identifies conserved protein domains using the Pfam-A HMM database
- RSEM — quantifies transcript expression (TPM) from re-aligned RNA-seq reads; bedtools computes read coverage
- BLASTp (homolog) — measures similarity to a user-supplied protein database (parallelized across 20 split peptide files)
- BLASTp (RexDB) — measures similarity to a repeat element protein database (e.g. RepeatExplorer Viridiplantae)
- Ab initio overlap — computes the fraction of each gene model overlapping with Augustus predictions, Helixer predictions, and RepeatMasker annotations
- lncDC — classifies transcripts as protein-coding or long non-coding RNA using an XGBoost model with plant-specific pre-trained parameters
- BUSCO — identifies conserved single-copy orthologs (used only to monitor the filtration process; not used as a classifier feature)
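The ab initio overlap feature reduces to interval arithmetic: clip the prediction intervals to the gene span, merge overlaps, and divide covered bases by gene length. A minimal sketch of the idea (the `overlap_fraction` helper is hypothetical, not Sylvan's actual code):

```python
def overlap_fraction(gene, intervals):
    """Fraction of a gene model's span covered by another annotation's
    intervals (e.g. Helixer, Augustus, or RepeatMasker features).
    gene: (start, end); intervals: list of (start, end) on the same contig.
    Coordinates are half-open."""
    gstart, gend = gene
    merged = []
    for s, e in sorted(intervals):
        s, e = max(s, gstart), min(e, gend)   # clip to the gene span
        if s >= e:
            continue                          # no overlap with the gene
        if merged and s <= merged[-1][1]:     # merge touching/overlapping
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    covered = sum(e - s for s, e in merged)
    return covered / (gend - gstart)
```

A gene spanning 100–200 with a prediction spanning 150–250 would score 0.5; two predictions that jointly tile the gene score 1.0.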
- Initial gene set selection: A data-driven heuristic selects high-confidence positive genes (strong homolog/Pfam/expression evidence) and high-confidence negative genes (repeat-like, no expression) using configurable cutoff thresholds (TPM, coverage, BLAST identity/coverage, repeat overlap)
- Random forest training: A binary classifier is trained on the initial gene set
- Iterative refinement: High-confidence predictions (above the `--recycle` threshold, default 0.95) are added back to the training set, and the model is retrained. This repeats for up to `--max-iter` iterations (default 5) or until convergence
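The iterative refinement described above amounts to a standard self-training loop. A minimal sketch with scikit-learn, for illustration only: `self_train`, its arguments, and the convention that `-1` marks unlabeled genes are assumptions, not Sylvan's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train(X, y, recycle=0.95, max_iter=5, seed=0):
    """y holds 1 (confident keep), 0 (confident discard), -1 (unlabeled)."""
    y = y.copy()
    clf = None
    for _ in range(max_iter):
        labeled = y != -1
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X[labeled], y[labeled])
        unlabeled = np.flatnonzero(y == -1)
        if unlabeled.size == 0:
            break
        proba = clf.predict_proba(X[unlabeled])[:, 1]  # P(keep)
        keep = unlabeled[proba >= recycle]
        drop = unlabeled[proba <= 1 - recycle]
        if keep.size == 0 and drop.size == 0:
            break  # converged: no new high-confidence pseudo-labels
        y[keep] = 1  # recycle confident predictions into the training set
        y[drop] = 0
    return clf, y
```

After the loop, the final classifier scores every gene model; genes still labeled `-1` received no confident pseudo-label in any iteration.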
Output files:
- `results/FILTER/filter.gff3` — Kept gene models
- `results/FILTER/discard.gff3` — Discarded gene models
- `results/FILTER/data.tsv` — Feature matrix used by the random forest
- `results/FILTER/keep_data.tsv` — Evidence data for kept genes
- `results/FILTER/discard_data.tsv` — Evidence data for discarded genes
- `results/FILTER/{prefix}.cdna` — Extracted transcript sequences
- `results/FILTER/{prefix}.pep` — Extracted peptide sequences
An alternative scoring pipeline (Snakefile_filter_score) uses logistic regression and random forest scoring with pseudo-labels instead of the iterative semi-supervised approach. This requires the same feature generation outputs and produces:
- `results/FILTER/scores.csv` — Per-gene scores and features
- `results/FILTER/scores.metrics.txt` — AUC/PR/F1 and chosen thresholds
export SYLVAN_FILTER_CONFIG="toydata/config/config_filter.yml"
./bin/filter_score_toydata.sh

This section describes the inputs, configuration, and commands needed to run the annotation pipeline on your data.
| Input | Description | Config Field |
|---|---|---|
| Genome assembly | FASTA file (`.fa`, `.fasta`, `.fa.gz`, `.fasta.gz`) | `genome` |
| RNA-seq data | Paired-end gzipped FASTQ files (`*_1.fastq.gz`/`*_2.fastq.gz` or `*_R1.fastq.gz`/`*_R2.fastq.gz`) in a folder | `rna_seq` |
| Protein sequences | FASTA from UniProt, OrthoDB, etc. (comma-separated for multiple files) | `proteins` |
| Neighbor species | Directories containing GFF3 and genome FASTA (`.fa`, `.fasta`, `.fna`) files, one per species | `liftoff.neighbor_gff`, `liftoff.neighbor_fasta` |
| Repeat library | EDTA output (`.TElib.fa`) | `geta.RM_lib` |
| Singularity image | Path to `sylvan.sif` | `singularity` |
# Set config (required)
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
# Dry run
snakemake -n --snakefile bin/Snakefile_annotate
# Submit to SLURM
sbatch -A [account] -p [partition] -c 1 --mem=1g \
-J annotate -o annotate.out -e annotate.err \
--wrap="./bin/annotate_toydata.sh"
# Or run directly
./bin/annotate_toydata.sh

Output: results/complete_draft.gff3
This section describes the inputs and commands for the filter pipeline. All inputs below are specified in config_filter.yml.
| Input | Description | Config Field |
|---|---|---|
| Annotated GFF | Output from Annotate phase (`results/complete_draft.gff3`) | `anot_gff` |
| Genome | Same as Annotate phase | `genome` |
| RNA-seq data | Same as Annotate phase | `rna_seq` |
| Protein sequences | Same as Annotate phase | `protein` |
| Augustus GFF | Augustus predictions (`results/GETA/Augustus/augustus.gff3`) | `augustus_gff` |
| Helixer GFF | Helixer predictions (`results/AB_INITIO/Helixer/helixer.gff3`) | `helixer_gff` |
| Repeat GFF | RepeatMasker output (`results/GETA/RepeatMasker/genome.repeat.gff3`) | `repeat_gff` |
| HmmDB | Pfam database directory (default: `/usr/local/src` inside container) | `HmmDB` |
| RexDB | RepeatExplorer protein DB (e.g. `Viridiplantae_v4.0.fasta` from rexdb) | `RexDB` |
| BUSCO lineage | e.g., `eudicots_odb10` | `busco_lin` |
| Chromosome regex | Regex to match chromosome prefixes (e.g. `(^Chr)\|(^chr)\|(^LG)`) | `chrom_regex` |
Filter cutoff thresholds (in config_filter.yml under Cutoff):

| Parameter | Description | Default |
|---|---|---|
| `tpm` | TPM threshold for initial gene selection | 3 |
| `rsem_cov` | RNA-seq coverage threshold | 0.5 |
| `blast_pident` / `blast_qcovs` | BLASTp identity / coverage | 0.6 / 0.6 |
| `rex_pident` / `rex_qcovs` | RexDB identity / coverage | 0.6 / 0.6 |
| `helixer_cov` / `augustus_cov` | Ab initio overlap | 0.8 / 0.8 |
| `repeat_cov` | Repeat overlap coverage threshold | 0.5 |
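As an illustration, these thresholds might appear in config_filter.yml roughly as follows (the exact nesting under `Cutoff` is an assumption; check the shipped toy config for the authoritative layout):

```yaml
Cutoff:
  tpm: 3
  rsem_cov: 0.5
  blast_pident: 0.6
  blast_qcovs: 0.6
  rex_pident: 0.6
  rex_qcovs: 0.6
  helixer_cov: 0.8
  augustus_cov: 0.8
  repeat_cov: 0.5
```

Raising `tpm` or the BLAST cutoffs makes the initial positive set stricter; raising `repeat_cov` makes the negative (repeat-like) set stricter.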
# Set config (required)
export SYLVAN_FILTER_CONFIG="toydata/config/config_filter.yml"
# Dry run
snakemake -n --snakefile bin/Snakefile_filter
# Submit to SLURM
sbatch -A [account] -p [partition] -c 1 --mem=4g \
-J filter -o filter.out -e filter.err \
--wrap="./bin/filter_toydata.sh"
# Or run directly
./bin/filter_toydata.sh

Output: results/FILTER/filter.gff3
After a filter run completes, run the leave-one-feature-out ablation test:
python bin/filter_feature_importance.py FILTER/data.tsv results/busco/full_table.tsv \
--output-table FILTER/feature_importance.tsv

See the Wiki for detailed usage, optional flags, and workflow.
Sylvan uses several configuration files. Each config YAML serves as both the pipeline config and the Snakemake --cluster-config, so no separate cluster file is needed.
| File | Purpose |
|---|---|
| `config_annotate.yml` | Pipeline options and SLURM resources: input paths, species parameters, tool settings, plus per-rule CPU/memory/partition allocation and `cluster_cmd` template |
| `config_filter.yml` | Filter options: input paths, cutoff thresholds, thread allocation, and `cluster_cmd` template |
| `evm_weights.txt` | EVM evidence weights: priority of each evidence source |
| `config/plant.yaml` | Mikado scoring: transcript selection parameters (plant-specific defaults provided) |
Contains:
- Input file paths (genome, RNA-seq, proteins, neighbor species)
- Species-specific settings (Helixer model, Augustus species)
- Tool parameters (max intron length, EVM weights)
- Output prefix and directories
- SLURM resource allocation (`__default__` section with account, partition, memory, ncpus, time, `cluster_cmd`, plus per-rule overrides)
Controls how EvidenceModeler prioritizes different evidence sources. Higher weights give more influence. Example (from toy data):
ABINITIO_PREDICTION AUGUSTUS 7
ABINITIO_PREDICTION Helixer 3
OTHER_PREDICTION Liftoff 2
OTHER_PREDICTION GETA 5
OTHER_PREDICTION Genewise 2
TRANSCRIPT assembler-pasa.sqlite 10
TRANSCRIPT StringTie 1
TRANSCRIPT PsiClass 1
PROTEIN GeneWise 2
PROTEIN miniprot 2
Adjust weights based on the quality of each evidence type for your organism. PASA transcripts (weight 10) typically have the highest weight as they represent direct transcript evidence.
| Variable | Phase | Description |
|---|---|---|
| `SYLVAN_CONFIG` | Annotate | Path to config_annotate.yml (default: `config_annotate.yml` in cwd) |
| `SYLVAN_FILTER_CONFIG` | Filter | Path to config_filter.yml (default: `config_filter.yml` in cwd) |
| `SYLVAN_RESULTS_DIR` | Annotate | Override results output directory (default: `$(pwd)/results/`) |
| `TMPDIR` | Both | Temporary directory — critical on HPC (see below) |
| `SLURM_TMPDIR` | Both | Should match `TMPDIR` |
| `SINGULARITY_BIND` | Both | Bind additional host paths into container |
Why TMPDIR matters: Many HPC nodes mount /tmp as tmpfs (RAM-backed). Large temporary files from STAR, RepeatMasker, or Augustus can exhaust memory, causing cryptic segmentation faults or "no space left on device" errors. Always set TMPDIR to disk-backed project storage:
mkdir -p results/TMP
export TMPDIR="$(pwd)/results/TMP"
export SLURM_TMPDIR="$TMPDIR"

# For toydata
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
# For custom project
export SYLVAN_CONFIG="/path/to/my_config.yml"

This is required for any Snakemake command (dry-run, unlock, etc.):
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
snakemake -n --snakefile bin/Snakefile_annotate # dry-run
snakemake --unlock --snakefile bin/Snakefile_annotate  # unlock

| Parameter | Description | Example |
|---|---|---|
| `prefix` | Output file prefix | `my_species` |
| `helixer_model` | `land_plant`, `vertebrate`, `invertebrate`, `fungi` | `land_plant` |
| `helixer_subseq` | 64152 (plants), 21384 (fungi), 213840 (vertebrates) | 64152 |
| `augustus_species` | Augustus species name for training | `arabidopsis` |
| `augustus_start_from` | Start Augustus training from an existing species model (skips de novo training if a close match is available) | `arabidopsis` |
| `use_augustus` | Use a pre-trained Augustus species without re-training (set to species name, or `placeholder` to train fresh) | `placeholder` |
| `num_evm_files` | Number of parallel EVM partitions (more = faster but more SLURM jobs) | 126 |
| `geta.RM_species` | RepeatMasker species database (e.g. `Embryophyta`, `Viridiplantae`, `Metazoa`) | `Embryophyta` |
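Put together, the top of a project config might look like this (illustrative values only; dotted names such as `geta.RM_species` are shown nested, but verify the field layout against the toy config):

```yaml
prefix: my_species
helixer_model: land_plant
helixer_subseq: 64152
augustus_species: my_species
use_augustus: placeholder
num_evm_files: 126
geta:
  RM_species: Embryophyta
```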
Helixer benefits significantly from GPU acceleration (~10x speedup). To use a separate GPU partition, add the following per-rule override in config_annotate.yml:
helixer:
ncpus: 4
memory: 32g
account: your-gpu-account # GPU-specific billing account
partition: your-gpu-partition # GPU partition name

Find your SLURM account and partition:
# Show your accounts and partitions
sacctmgr show user "$USER" withassoc format=Account,Partition -nP
# List all available partitions
sinfo -s
# Show partition details (time limits, nodes, etc.)
sinfo -o "%P %l %D %c %m"

Set in config_annotate.yml under the __default__ section:
__default__:
account: your-account
partition: your-partition
time: "3-00:00:00"
memory: 8g
ncpus: 1

The cluster_cmd field in the __default__ section defines the sbatch template used by Snakemake's --cluster option. All entry scripts (annotate.sh, filter.sh, etc.) read this template via bin/get_cluster_cmd.py, so you only need to edit the config YAML:
__default__:
cluster_cmd: >-
sbatch -N {cluster.nodes} --mem={cluster.memory} --cpus-per-task={cluster.ncpus}
-J {cluster.name} --parsable
-A {cluster.account} -p {cluster.partition}
-t {cluster.time} -o {cluster.output} -e {cluster.error}

Common customizations:
- Remove `-A {cluster.account}` if your HPC doesn't require a billing account
- Add `--gres=gpu:1` for GPU rules
- Add `--export=ALL` to pass environment variables
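For example, a template with the account flag removed and environment export added could read as follows (site-specific; whether `--gres=gpu:1` belongs in the global template or only in GPU rules depends on your setup):

```yaml
__default__:
  cluster_cmd: >-
    sbatch -N {cluster.nodes} --mem={cluster.memory} --cpus-per-task={cluster.ncpus}
    -J {cluster.name} --parsable --export=ALL
    -p {cluster.partition} -t {cluster.time}
    -o {cluster.output} -e {cluster.error}
```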
All outputs are organized under results/:
results/
├── complete_draft.gff3 # Annotate phase final output
│
├── AB_INITIO/
│ └── Helixer/ # Helixer predictions
│
├── GETA/
│ ├── RepeatMasker/ # Repeat masking results
│ ├── Augustus/ # Augustus predictions
│ ├── transcript/ # TransDecoder results
│ ├── homolog/ # Protein alignments (Miniprot → GeneWise)
│ └── CombineGeneModels/ # GETA gene models
│
├── LIFTOVER/
│ └── LiftOff/ # Neighbor species liftover
│
├── TRANSCRIPT/
│ ├── PASA/ # PASA assemblies
│ ├── spades/ # De novo assembly
│ └── evigene/ # Evigene transcript clustering
│
├── PROTEIN/ # Merged protein alignments
│
├── EVM/ # EvidenceModeler output
│
├── FILTER/
│ ├── filter.gff3 # Kept gene models
│ ├── discard.gff3 # Discarded gene models
│ ├── data.tsv # Feature matrix (input to random forest)
│ ├── keep_data.tsv # Evidence data for kept genes
│ ├── discard_data.tsv # Evidence data for discarded genes
│ ├── {prefix}.cdna # Transcript sequences
│ ├── {prefix}.pep # Peptide sequences
│ ├── STAR/ # RNA-seq realignment for filter
│ ├── rsem_outdir/ # RSEM quantification
│ ├── splitPep/ # Parallelized BLAST inputs
│ ├── busco_*/ # BUSCO results (monitoring only)
│ └── lncrna_predict.csv # lncDC predictions
│
├── config/ # Runtime config copies
│
└── logs/ # SLURM job logs
Use TidyGFF to prepare annotations for public distribution:
singularity exec sylvan.sif python bin/TidyGFF.py \
MySpecies results/FILTER/filter.gff3 \
--out MySpecies_v1.0 --splice-name t --justify 5 --sort \
--chrom-regex "^Chr" --source Sylvan

TidyGFF options:
| Option | Description |
|---|---|
| `pre` (positional) | Prefix for gene IDs (e.g. `Ath` produces `Ath01G000010`) |
| `gff` (positional) | Input GFF3 file |
| `--out` | Output file basename (produces `.gff3`, `.cdna`, `.pep` files) |
| `--splice-name` | Splice variant label style (e.g. `t` → `mRNA1.t1`, `mRNA1.t2`) |
| `--justify` | Number of digits in gene IDs (default: 8) |
| `--sort` | Sort output by chromosome and start coordinate |
| `--chrom-regex` | Regex for chromosome prefixes (auto-detects `Chr`, `chr`, `LG`, `Ch`, `^\d`) |
| `--contig-regex` | Regex for contig/scaffold naming (e.g. `HiC_scaffold_(\d+$)`, `Scaf`) |
| `--source` | Value for GFF column 2 (e.g. `Sylvan`) |
| `--remove-names` | Remove Name attributes from GFF |
| `--no-chrom-id` | Do not number gene IDs by chromosome |
# Force rerun all
./bin/annotate.sh --forceall
# Rerun specific rule
./bin/annotate.sh --forcerun helixer
# Rerun incomplete jobs (jobs that started but didn't finish)
./bin/rerun-incomplete.sh
# Generate report after completion
snakemake --report report.html --snakefile bin/Snakefile_annotate
# Unlock after interruption
./bin/annotate.sh --unlock
# Clean up intermediate files (run after BOTH phases complete)
./bin/cleanup.sh

bin/cleanup.sh removes intermediate files generated during the annotation phase while preserving:
- Final outputs (`complete_draft.gff3`, `filter.gff3`)
- Log files (`results/logs/`)
- Configuration files
- Filter phase outputs (`FILTER/`)
Run this only after both annotation and filter phases have completed successfully.
| Script | Purpose |
|---|---|
| `bin/generate_cluster_from_config.py` | Regenerate per-rule SLURM defaults within config_annotate.yml — keeps resource requests in sync with the pipeline's threads/memory |
| `bin/get_cluster_cmd.py` | Extract the `cluster_cmd` sbatch template from a config YAML (used internally by all entry scripts) |
# Find recent errors
ls -lt results/logs/*.err | head -10
grep -l 'Error\|Traceback' results/logs/*.err
# View specific log
cat results/logs/{rule}_{wildcards}.err

| Issue | Solution |
|---|---|
| Out of memory | Increase memory in config_annotate.yml for the rule |
| `No space left on device` | TMPDIR is on tmpfs or quota exceeded — set TMPDIR to project storage |
| Segmentation fault | Often caused by tmpfs exhaustion — set TMPDIR to disk-backed storage |
| File not found (Singularity) | Path not bound in container — add to SINGULARITY_BIND |
| Singularity bind error | Ensure paths are within working directory or use SINGULARITY_BIND |
| Permission denied in container | Check directory permissions, ensure path is bound |
| SLURM account error | Use account (billing account), not username |
| `cluster_cmd` not found | Ensure `__default__.cluster_cmd` exists in your config YAML |
| LFS files not downloaded | Run `git lfs pull`; verify with `ls -la toydata/` (files should be > 200 bytes) |
| Augustus training fails | Needs minimum ~500 training genes; use `augustus_start_from` with a close species |
| Job timeout | Increase time in config_annotate.yml for the rule |
| Variables not in SLURM job | Add `#SBATCH --export=ALL` or explicitly export in submit script |
| Filter `chrom_regex` error | Ensure `chrom_regex` in config_filter.yml matches your chromosome naming convention |
- General recommendation: 4 GB per thread
- Example: 48 threads = 192 GB memory
- `ncpus` and `threads` should match in config_annotate.yml
- Some rules need more: `mergeSTAR` may require ~18 GB per thread for large datasets
- Check `df -h $TMPDIR` to ensure temp storage is on real disk, not tmpfs
Sylvan: A comprehensive genome annotation pipeline. Under review.
MIT License - see LICENSE

