TSEBRA IS DEVELOPED AND MAINTAINED AT https://github.com/Gaius-Augustus/TSEBRA.
TSEBRA is a combiner tool that selects transcripts from gene predictions based on their support by extrinsic evidence in the form of introns and start/stop codons. It was developed to combine BRAKER1 [1] and BRAKER2 [2] predictions to increase their accuracy.
Python 3.5.2 or higher is required.
Download TSEBRA:
git clone https://github.com/LarsGab/TSEBRA
Or download TSEBRA as a submodule of BRAKER with:
git clone --recurse-submodules https://github.com/Gaius-Augustus/BRAKER
The main script is ./bin/tsebra.py. For usage information, run ./bin/tsebra.py --help.
TSEBRA needs a list of gene prediction files, a list of hint files and a configuration file as input.
The gene prediction files need to be in GTF format. This is the standard output format of a BRAKER or AUGUSTUS [3,4] gene prediction.
Example:
2L AUGUSTUS gene 83268 87026 0.88 - . g5332
2L AUGUSTUS transcript 83268 87026 0.88 - . g5332.t1
2L AUGUSTUS intron 84278 87019 1 - . transcript_id "file_1_file_1_g5332.t1"; gene_id "file_1_file_1_g5332";
2L AUGUSTUS CDS 87020 87026 0.88 - 0 transcript_id "file_1_file_1_g5332.t1"; gene_id "file_1_file_1_g5332";
2L AUGUSTUS exon 87020 87026 . - . transcript_id "file_1_file_1_g5332.t1"; gene_id "file_1_file_1_g5332";
The hint files have to be in GFF format. The last column must include an attribute for the source of the hint ('src=') and can include the number of hints supporting the gene structure segment ('mult='). This is the standard format of the hintsfile.gff in a BRAKER working directory.
Example:
2L ProtHint intron 279806 279869 2 + . src=P;mult=25;pri=4;al_score=0.437399;
2L ProtHint intron 275252 275318 2 - . src=P;mult=19;pri=4;al_score=0.430006;
2L ProtHint stop 293000 293002 1 + 0 grp=7220_0:002b08_g42;src=C;pri=4;
2L ProtHint intron 207632 207710 1 + . grp=7220_0:002afa_g26;src=C;pri=4;
2L ProtHint start 207512 207514 1 + 0 grp=7220_0:002afa_g26;src=C;pri=4;
The configuration file has to include three types of parameters:
- The weight for each hint source. If the configuration file does not specify a weight for a source, that weight is set to 1.
- Required fraction of supported introns or supported start/stop-codons for a transcript.
- Allowed difference between two overlapping transcripts for each feature type.
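A minimal sketch of how such a file could be read, assuming the whitespace-separated `key value` layout and `#` comment lines seen in the example that follows. The names `parse_cfg` and `source_weight` are hypothetical and not part of TSEBRA; the fallback weight of 1 follows the rule stated in the first bullet.

```python
def parse_cfg(text):
    """Parse 'key value' lines into a dict of floats, skipping '#' comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split()
        params[key] = float(value)
    return params


def source_weight(params, src):
    """Return the weight of a hint source, falling back to 1 if it is unset."""
    return params.get(src, 1.0)


cfg = parse_cfg("""
# src weights
P 0.1
E 10
# Low evidence support
intron_support 0.75
""")
print(source_weight(cfg, "E"))  # weight given in the file
print(source_weight(cfg, "M"))  # not listed, so the default applies
```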
Example:
# src weights
P 0.1
E 10
C 5
M 1
# Low evidence support
intron_support 0.75
stasto_support 1
# Feature differences
e_1 0
e_2 0.5
e_3 25
e_4 10
The recommended and most common usage for TSEBRA is to combine the resulting braker.gtf files of a BRAKER1 and a BRAKER2 run using the hintsfile.gff from both working directories. However, TSEBRA can be applied to any number (>1) of gene predictions and hint files as long as they are in the correct format.
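Since every input has to be in the expected format, it can help to check a hint file before running TSEBRA. The following is a rough sketch, not TSEBRA's own parser. The helper names are hypothetical, and it relies only on what is described above: nine GFF columns, a mandatory `src=` attribute and an optional `mult=` attribute in the last column. Treating a missing `mult=` as a single supporting hint is an assumption.

```python
def parse_hint_attributes(line):
    """Return the attributes of the last GFF column as a dict."""
    # Split on any whitespace so tab- and space-separated lines both work.
    columns = line.rstrip("\n").split(None, 8)
    if len(columns) != 9:
        raise ValueError("expected 9 GFF columns, found %d" % len(columns))
    attributes = {}
    for field in columns[8].split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def check_hint_line(line):
    """Return (source, multiplicity); raise ValueError if 'src=' is missing."""
    attributes = parse_hint_attributes(line)
    if "src" not in attributes:
        raise ValueError("hint line lacks a 'src=' attribute")
    # Assumption: a hint without 'mult=' counts as one supporting hint.
    return attributes["src"], int(attributes.get("mult", 1))


with_mult = check_hint_line(
    "2L ProtHint intron 279806 279869 2 + . src=P;mult=25;pri=4;al_score=0.437399;"
)
without_mult = check_hint_line(
    "2L ProtHint stop 293000 293002 1 + 0 grp=7220_0:002b08_g42;src=C;pri=4;"
)
print(with_mult, without_mult)
```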
A common case might be that a user wants to annotate a novel genome with BRAKER and has:
- a novel genome with repeats masked: genome.fasta.masked,
- hints for intron positions from RNA-seq reads: rna_seq_hints.gff,
- a database of homologous proteins: proteins.fa.
- Run BRAKER1 and BRAKER2 for example with
### BRAKER1
braker.pl --genome=genome.fasta.masked --hints=rna_seq_hints.gff \
--softmasking --species=species_name --workingdir=braker1_out
### BRAKER2
braker.pl --genome=genome.fasta.masked --prot_seq=proteins.fa \
--softmasking --species=species_name --epmode --prg=ph \
--workingdir=braker2_out
- Make sure that the gene and transcript IDs of the gene prediction files are in order (this step is optional)
./bin/fix_gtf_ids.py --gtf braker1_out/braker.gtf --out braker1_fixed.gtf
./bin/fix_gtf_ids.py --gtf braker2_out/braker.gtf --out braker2_fixed.gtf
- Combine predictions with TSEBRA
./bin/tsebra.py -g braker1_fixed.gtf,braker2_fixed.gtf -c default.cfg \
-e braker1_out/hintsfile.gff,braker2_out/hintsfile.gff \
-o braker1+2_combined.gtf
The combined gene prediction is braker1+2_combined.gtf.
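When the combination step is scripted, the comma-separated arguments can be assembled programmatically. The sketch below is only an illustration: `build_tsebra_command` is a hypothetical helper, and it uses only the `-g`, `-c`, `-e` and `-o` options that appear in the command above.

```python
def build_tsebra_command(gtf_files, hint_files, cfg, out, script="./bin/tsebra.py"):
    """Assemble the tsebra.py argument list from Python lists of file paths."""
    if not gtf_files or not hint_files:
        raise ValueError("gene prediction and hint file lists must not be empty")
    return [
        script,
        "-g", ",".join(gtf_files),   # gene predictions, comma-separated
        "-c", cfg,                   # configuration file
        "-e", ",".join(hint_files),  # hint files, comma-separated
        "-o", out,                   # combined output
    ]


command = build_tsebra_command(
    ["braker1_fixed.gtf", "braker2_fixed.gtf"],
    ["braker1_out/hintsfile.gff", "braker2_out/hintsfile.gff"],
    "default.cfg",
    "braker1+2_combined.gtf",
)
print(" ".join(command))
# To run it from the TSEBRA directory:
# import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names such as braker1+2_combined.gtf.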
A small example is located at example/. Run ./example/run_prevco_example.sh to execute the example and to check if TSEBRA runs properly.
All source code, i.e. bin/*.py, is under the Artistic License (see https://opensource.org/licenses/Artistic-2.0).
[1] Hoff, Katharina J., Simone Lange, Alexandre Lomsadze, Mark Borodovsky, and Mario Stanke. 2015. “BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS.” Bioinformatics 32 (5). Oxford University Press: 767-769.
[2] Bruna, Tomas, Katharina J. Hoff, Alexandre Lomsadze, Mario Stanke, and Mark Borodovsky. 2021. “BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database.” NAR Genomics and Bioinformatics 3 (1): lqaa108.
[3] Stanke, Mario, Mark Diekhans, Robert Baertsch, and David Haussler. 2008. “Using Native and Syntenically Mapped cDNA Alignments to Improve de Novo Gene Finding.” Bioinformatics 24 (5). Oxford University Press: 637-644.
[4] Stanke, Mario, Oliver Schöffmann, Burkhard Morgenstern, and Stephan Waack. 2006. “Gene Prediction in Eukaryotes with a Generalized Hidden Markov Model That Uses Hints from External Sources.” BMC Bioinformatics 7 (1). BioMed Central: 62.