TransGenic

TransGenic is a transformer for DNA-to-annotation machine translation. Gene annotations specify the structure of a gene within a DNA sequence by providing the composition of each mRNA transcript based on the coordinate locations of sub-genic features, including coding sequences (CDS), introns, and untranslated regions (UTR). TransGenic uses a HyenaDNA encoder with a Longformer decoder to predict a text-based annotation format from raw DNA sequence.

Architecture

TransGenic uses an encoder-decoder architecture:

Encoder: HyenaDNA, a long-range genomic foundation model capable of processing sequences up to 1 million nucleotides using sub-quadratic convolution operations instead of full attention
Decoder: Longformer-based autoregressive decoder that generates structured text annotations

This design enables the model to capture long-range dependencies in DNA while producing human-readable outputs.

Key Features

De novo annotation: Generate complete gene structures from unannotated DNA sequences
Splice variant prediction: Predict alternative isoforms via prompt completion given an existing transcript
Compact output format: Gene Sentence Format (GSF) reduces annotation redundancy for efficient generation
Plant-focused: Trained on 9 phylogenetically diverse plant species
High accuracy: Achieves 92% base-level F1 score on Arabidopsis thaliana test data

Gene sentence format (GSF)

TransGenic produces output in a format modified from the standard General Feature Format (GFF). Gene sentence format (GSF) contains the same information as GFF but reduces the redundancy and length of output annotations. This keeps generative decoding within reasonable memory requirements for the decoder's attention mechanisms.

Gene sentence format specifies gene model outputs in two parts, a feature list and a transcript list. The feature list specifies the coordinate locations of sub-genic features (CDS, 5'-UTR, and 3'-UTR) and the transcript list specifies the composition of spliced mRNA transcripts based on the components in the feature list.

GSF Format Structure

GSF consists of two parts separated by >:

<feature_list>><transcript_list>

Feature List

Each feature follows the format: start|type|end|strand|phase

start: 0-indexed start coordinate (relative to extracted sequence)
type: Feature type with unique number (CDS1, CDS2, five_prime_UTR1, three_prime_UTR1, etc.)
end: End coordinate (exclusive, like Python slicing)
strand: + (forward) or - (reverse)
phase: Reading frame for CDS features
- A = phase 0 (codon starts at position 0)
- B = phase 1 (codon starts at position 1)
- C = phase 2 (codon starts at position 2)
- . = not applicable (for UTRs)

Multiple features are separated by ;

Transcript List

After the > separator, transcripts list their component features:

Features are separated by |
Multiple transcripts (isoforms) are separated by ;
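
The separators described above are enough to split a GSF string back into its parts. The following is a minimal sketch of such a parser, not code from the TransGenic package: the `Feature` class and `parse_gsf` function are hypothetical names introduced here for illustration, and the sketch assumes a well-formed string with exactly one feature list and one transcript list.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Phase letters as described in the feature list section.
PHASE_LETTERS = {"A": 0, "B": 1, "C": 2, ".": None}


@dataclass(frozen=True)
class Feature:
    name: str              # e.g. "CDS1" or "five_prime_UTR1"
    start: int             # start coordinate, relative to the extracted sequence
    end: int               # end coordinate
    strand: str            # "+" or "-"
    phase: Optional[int]   # 0, 1, 2, or None for UTRs


def parse_gsf(gsf: str) -> tuple[dict[str, Feature], list[list[str]]]:
    """Split a GSF string into a feature table and a list of transcripts."""
    feature_part, transcript_part = gsf.strip().split(">", 1)

    features = {}
    for item in feature_part.split(";"):
        start, name, end, strand, phase = item.split("|")
        features[name] = Feature(name, int(start), int(end), strand, PHASE_LETTERS[phase])

    # Each transcript is an ordered list of feature names.
    transcripts = [t.split("|") for t in transcript_part.split(";")]
    return features, transcripts


features, transcripts = parse_gsf("0|CDS1|50|+|A;100|CDS2|180|+|C;250|CDS3|300|+|B>CDS1|CDS2|CDS3")
print(transcripts)  # [['CDS1', 'CDS2', 'CDS3']]
```

Model output is not guaranteed to be well formed, so real code would likely want to catch the `ValueError` and `KeyError` this sketch raises on malformed input.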

Examples

Example 1: Simple single-transcript gene (3 CDS)

GFF:

Chr1  source  gene  100  400  .  +  .  ID=gene1
Chr1  source  mRNA  100  400  .  +  .  ID=mRNA1
Chr1  source  CDS   100  150  .  +  0  ID=cds1
Chr1  source  CDS   200  280  .  +  2  ID=cds2
Chr1  source  CDS   350  400  .  +  1  ID=cds3

GSF:

0|CDS1|50|+|A;100|CDS2|180|+|C;250|CDS3|300|+|B>CDS1|CDS2|CDS3

Note: Coordinates are relative to the extracted sequence (gene start = 0).

Example 2: Gene with alternative splicing (2 transcripts)

GFF:

Chr1  source  gene  100  350  .  +  .  ID=gene1
Chr1  source  mRNA  100  350  .  +  .  ID=mRNA1
Chr1  source  CDS   100  130  .  +  0  ID=cds1
Chr1  source  CDS   180  220  .  +  1  ID=cds2
Chr1  source  CDS   280  350  .  +  0  ID=cds3
Chr1  source  mRNA  180  350  .  +  .  ID=mRNA2
Chr1  source  CDS   180  220  .  +  1  ID=cds2
Chr1  source  CDS   280  350  .  +  0  ID=cds3

GSF:

0|CDS1|30|+|A;80|CDS2|120|+|B;180|CDS3|250|+|A>CDS1|CDS2|CDS3;CDS2|CDS3

First transcript uses all three CDS: CDS1|CDS2|CDS3
Second transcript skips CDS1 (alternative start): CDS2|CDS3
Coordinates are relative to gene start (100 → 0)
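
GSF does not list introns explicitly; they are implied by the gaps between consecutive features in a transcript. As a rough illustration, the sketch below derives those gaps for each transcript in Example 2. The function name `transcript_introns` is hypothetical, and the sketch assumes that a gap between the end of one feature and the start of the next corresponds to an intron.

```python
def transcript_introns(gsf):
    """Return, per transcript, the gaps between consecutive features as (start, end) pairs."""
    feature_part, transcript_part = gsf.split(">", 1)

    # Map feature name -> (start, end) using the start|type|end|strand|phase layout.
    coords = {}
    for item in feature_part.split(";"):
        start, name, end, _strand, _phase = item.split("|")
        coords[name] = (int(start), int(end))

    introns = []
    for transcript in transcript_part.split(";"):
        spans = sorted(coords[name] for name in transcript.split("|"))
        # Keep only real gaps; directly adjacent features (e.g. UTR then CDS) yield none.
        gaps = [
            (prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
            if next_start > prev_end
        ]
        introns.append(gaps)
    return introns


example = "0|CDS1|30|+|A;80|CDS2|120|+|B;180|CDS3|250|+|A>CDS1|CDS2|CDS3;CDS2|CDS3"
print(transcript_introns(example))
# [[(30, 80), (120, 180)], [(120, 180)]]
```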

Example 3: Gene with UTRs

GFF:

Chr1  source  gene            500  900  .  +  .  ID=gene1
Chr1  source  mRNA            500  900  .  +  .  ID=mRNA1
Chr1  source  five_prime_UTR  500  550  .  +  .  ID=utr5
Chr1  source  CDS             550  650  .  +  0  ID=cds1
Chr1  source  CDS             700  800  .  +  1  ID=cds2
Chr1  source  three_prime_UTR 800  900  .  +  .  ID=utr3

GSF:

0|five_prime_UTR1|50|+|.;50|CDS1|150|+|A;200|CDS2|300|+|B;300|three_prime_UTR1|400|+|.>five_prime_UTR1|CDS1|CDS2|three_prime_UTR1

UTRs use . for phase since they are non-coding
Transcript includes UTRs in the proper order
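
The mapping used in these examples can be sketched in a few lines of Python. This is an illustration of the arithmetic shown above, not the implementation in scripts/gff2gsf.py: `rows_to_gsf` and its arguments are hypothetical, and it assumes both coordinates are simply shifted by the gene start, as in the examples. The actual script may treat the 1-based inclusive GFF coordinates differently.

```python
PHASE_TO_LETTER = {"0": "A", "1": "B", "2": "C", ".": "."}


def rows_to_gsf(gene_start, rows, transcripts):
    """Build a GSF string.

    rows: list of (feature_type, start, end, strand, phase) in genomic coordinates
    transcripts: list of lists of 0-based indices into rows
    """
    counters = {}   # per-type running number, so the first CDS becomes CDS1
    names = []
    parts = []
    for ftype, start, end, strand, phase in rows:
        counters[ftype] = counters.get(ftype, 0) + 1
        name = f"{ftype}{counters[ftype]}"
        names.append(name)
        # Assumption: shift both coordinates by the gene start, as the examples do.
        parts.append(
            f"{start - gene_start}|{name}|{end - gene_start}|{strand}|{PHASE_TO_LETTER[phase]}"
        )
    transcript_list = ";".join("|".join(names[i] for i in t) for t in transcripts)
    return ";".join(parts) + ">" + transcript_list


# Example 3: one transcript that uses all four features in order.
rows = [
    ("five_prime_UTR", 500, 550, "+", "."),
    ("CDS", 550, 650, "+", "0"),
    ("CDS", 700, 800, "+", "1"),
    ("three_prime_UTR", 800, 900, "+", "."),
]
print(rows_to_gsf(500, rows, [[0, 1, 2, 3]]))
```

Run on the rows of Example 3, this reproduces the GSF string shown for that example.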

Converting GFF3 to GSF

Use scripts/gff2gsf.py to convert existing GFF3 annotations to GSF format:

# Basic usage (output to stdout)
python scripts/gff2gsf.py annotation.gff3

# Save to file
python scripts/gff2gsf.py annotation.gff3 -o output.gsf

# Use absolute coordinates instead of relative
python scripts/gff2gsf.py annotation.gff3 --absolute

Output format (tab-separated):

gene_id    GSF_string
AT1G01010  0|CDS1|150|+|A;200|CDS2|350|+|B>CDS1|CDS2
AT1G01020  0|five_prime_UTR1|50|+|.;50|CDS1|200|+|A>five_prime_UTR1|CDS1
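
The tab-separated output is straightforward to load for downstream work. The sketch below reads it into a dictionary keyed by gene ID. `read_gsf_table` is a hypothetical helper, and the sketch assumes the file starts with the `gene_id` header row shown above; if the script writes no header, the check simply never triggers.

```python
import csv
import io


def read_gsf_table(handle):
    """Read gene_id<TAB>GSF_string rows into a dict, skipping the header row."""
    reader = csv.reader(handle, delimiter="\t")
    table = {}
    for row in reader:
        if len(row) != 2 or row[0] == "gene_id":
            continue  # skip the header and any blank or malformed line
        gene_id, gsf = row
        table[gene_id] = gsf
    return table


# In-memory stand-in for a file written with `-o output.gsf`.
sample = io.StringIO(
    "gene_id\tGSF_string\n"
    "AT1G01010\t0|CDS1|150|+|A;200|CDS2|350|+|B>CDS1|CDS2\n"
)
table = read_gsf_table(sample)
print(table["AT1G01010"])
```

For a file on disk, pass `open("output.gsf", newline="")` instead of the `StringIO` object.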

Using TransGenic

Quick start

TransGenic can be tried on Google Colab with no local installation required.

Minimal Example

import torch
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizers from HuggingFace
model_name = "jlomas/HyenaTransgenic-768L12A6-400M"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
gffTokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
dnaTokenizer = AutoTokenizer.from_pretrained(
    "LongSafari/hyenadna-large-1m-seqlen-hf", trust_remote_code=True
)

# Tokenize DNA sequence
seq = "ATGCGT...your_sequence...TGATGA"
input_ids = dnaTokenizer.batch_encode_plus(
    [seq], return_tensors="pt"
)["input_ids"][:, :-1]

# Generate annotation
model.eval()
if torch.cuda.is_available():
    input_ids = input_ids.to("cuda")
    model.to("cuda")

outputs = model.generate(
    inputs=input_ids,
    max_length=2048,
    num_beams=2,
    do_sample=True
)

# Decode to GSF format
gsf_prediction = gffTokenizer.batch_decode(
    outputs.detach().cpu().numpy(),
    skip_special_tokens=True
)[0]
print(gsf_prediction)
# Example output (varies between runs because do_sample=True): 0|CDS1|150|+|A;200|CDS2|350|+|B>CDS1|CDS2

For local development, run notebook examples from the examples/ folder after setting up an environment as described below.

Set-up

# Clone the repo
git clone git@github.com:JohnnyLomas/transgenic.git
cd transgenic

Check Your System

Run the system check script to determine which environment file to use:

./scripts/check_system.sh

Example output:

============================================
TransGenic System Check
============================================

[Architecture]
  uname -m: x86_64
  Type: x86_64 (Intel/AMD 64-bit)

[GPU Check]
  nvidia-smi: Found
  GPU: NVIDIA GeForce RTX 4090
  Driver: 550.54.14
  CUDA: 12.4

[Recommended Environment]
  → x86 with CUDA GPU
  → Use: environment.yml
============================================

Architecture	GPU	Recommendation
x86_64	NVIDIA GPU detected	`environment.yml`
x86_64	No GPU	`environment.cpu.yml`
aarch64	NVIDIA GB10	`environment.gb10.base.yml` + install script
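
The same decision table can be expressed in Python for use inside setup tooling. This is a sketch only, not a port of scripts/check_system.sh: `recommend_environment` is a hypothetical function, the presence of `nvidia-smi` on the PATH is assumed to be a rough proxy for an NVIDIA GPU, and any NVIDIA GPU on aarch64 is treated as a GB10 for simplicity.

```python
import platform
import shutil


def recommend_environment(machine, has_nvidia_gpu):
    """Map (architecture, GPU presence) to the environment file from the table above."""
    machine = machine.lower()
    if machine in ("x86_64", "amd64"):  # Windows reports AMD64 for x86_64
        return "environment.yml" if has_nvidia_gpu else "environment.cpu.yml"
    if machine == "aarch64" and has_nvidia_gpu:
        # Followed by scripts/install_ml_stack_gb10.sh
        return "environment.gb10.base.yml"
    return None  # combination not covered by the table


if __name__ == "__main__":
    # Assumption: nvidia-smi on PATH indicates a usable NVIDIA GPU.
    gpu = shutil.which("nvidia-smi") is not None
    print(recommend_environment(platform.machine(), gpu))
```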

Verify CUDA after installation:

python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import torch; print(f'CUDA version: {torch.version.cuda}')"
python -c "import torch; print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"

x86 with CUDA GPU

For Linux/Windows systems with an NVIDIA GPU (GTX, RTX, Tesla, etc.). Includes CUDA 12.4 support.

# Create environment with all dependencies
conda env create -f environment.yml
conda activate transgenic

# Install the transgenic module
pip install -e .

x86 CPU Only (No GPU)

For systems without an NVIDIA GPU (Intel/AMD CPU only, macOS, or VMs without GPU passthrough). Inference will be slower but fully functional.

# Create CPU-only environment
conda env create -f environment.cpu.yml
conda activate transgenic

# Install the transgenic module
pip install -e .

GB10 ARM CPU (NVIDIA Grace Blackwell)

environment.gb10.base.yml and scripts/install_ml_stack_gb10.sh are installation files for NVIDIA GB10 ARM CPU environments. This two-step approach is recommended for aarch64 platforms:

Base environment via conda-forge: Only basic dependencies (numpy, pandas, etc.) are installed through conda-forge
PyTorch via pip from PyTorch index: torch, torchvision, and torchaudio are installed exclusively from the official PyTorch wheel index

This separation provides the most stable setup on aarch64 architectures, avoiding common dependency conflicts.

# Remove existing environment (if present)
conda env remove -n transgenic -y || true

# Create base environment
conda env create -f environment.gb10.base.yml -y
conda activate transgenic

# Install ML stack (PyTorch CUDA + HuggingFace + transgenic)
chmod +x scripts/install_ml_stack_gb10.sh
./scripts/install_ml_stack_gb10.sh

# Optional: For development with editable install, run additionally:
# pip install -e .

Pretrained Checkpoints on HuggingFace

All checkpoints were trained on nine plant genomes spanning diverse lineages, including dicot, monocot, and moss species. The best test-set performance (92% base-level F1 in Arabidopsis) was achieved with the 400M-parameter model. Both checkpoints used sequences padded with neighboring genomic sequence to the next multiple of 6,144 nucleotides.

Training Data

Nine phylogenetically diverse plant species:

Arabidopsis thaliana, Glycine max (Soybean), Oryza sativa (Rice)
Sorghum bicolor, Populus trichocarpa (Poplar), Brachypodium distachyon
Vitis vinifera (Grape), Setaria italica (Millet), Physcomitrella patens (Moss)

Available Models

Model	Parameters	Hidden Size	Layers	Attention Heads	F1 Score
HyenaTransgenic-768L12A6-400M	~400M	768	12	6	92%
HyenaTransgenic-512L9A4-160M	~160M	512	9	4	-

Training Configuration

Learning rate: 5e-5
Batch size: 96 (effective)
Loss: Cross Entropy
Mixed precision: BF16
Input length: Multiples of 6,144 nt (max 49,152 nt)
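
To make the input-length rule concrete, the sketch below computes the padded length for a given region. `padded_length` is a hypothetical helper, not part of the TransGenic package, and it assumes that a length already equal to a multiple of 6,144 is left unchanged.

```python
BLOCK = 6144          # padding unit in nucleotides
MAX_LENGTH = 49152    # 8 blocks, the maximum input length listed above


def padded_length(n_nucleotides):
    """Round a sequence length up to the next multiple of BLOCK."""
    if n_nucleotides <= 0:
        raise ValueError("sequence length must be positive")
    padded = -(-n_nucleotides // BLOCK) * BLOCK   # ceiling division
    if padded > MAX_LENGTH:
        raise ValueError(f"{n_nucleotides} nt exceeds the {MAX_LENGTH} nt maximum")
    return padded


print(padded_length(2500))    # 6144
print(padded_length(6145))    # 12288
print(padded_length(49152))   # 49152
```

The difference between the padded length and the gene length is the amount of neighboring genomic sequence that would need to be included around the gene.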

Intended Uses

Generate de novo annotations for plant DNA sequences containing genes
Add alternatively spliced isoforms to known primary mRNA transcripts via prompt completion

Inference

The general outline of an inference workflow is:

Create a DuckDB database from a FASTA file and a GFF3 or BED file that describes the sequences to be used for prediction
Initialize a PyTorch Dataset and DataLoader for the database
Generate annotations using model.generate
Convert GSF outputs to a GFF3 formatted output file
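
The last step of this workflow, turning a GSF string back into GFF3 rows, can be illustrated with a short sketch. This is not the converter shipped with TransGenic: `gsf_to_gff3`, its arguments, and the ID naming scheme are hypothetical, and coordinates are assumed to be shifted back by a plain offset (the gene start), mirroring the arithmetic in the GSF examples.

```python
LETTER_TO_PHASE = {"A": "0", "B": "1", "C": "2", ".": "."}


def gsf_to_gff3(gsf, seqid, gene_id, offset=0, source="TransGenic"):
    """Render one GSF string as GFF3 lines (gene, mRNA, and child features)."""
    feature_part, transcript_part = gsf.split(">", 1)

    # Feature name -> (start, end, strand, phase), shifted to genomic coordinates.
    features = {}
    for item in feature_part.split(";"):
        start, name, end, strand, phase = item.split("|")
        features[name] = (int(start) + offset, int(end) + offset, strand, LETTER_TO_PHASE[phase])

    gene_start = min(f[0] for f in features.values())
    gene_end = max(f[1] for f in features.values())
    strand = next(iter(features.values()))[2]  # all features of a gene share one strand

    def row(ftype, start, end, phase, attrs):
        return "\t".join([seqid, source, ftype, str(start), str(end), ".", strand, phase, attrs])

    lines = [row("gene", gene_start, gene_end, ".", f"ID={gene_id}")]
    for i, transcript in enumerate(transcript_part.split(";"), start=1):
        names = transcript.split("|")
        mrna_id = f"{gene_id}.{i}"  # hypothetical ID scheme
        t_start = min(features[n][0] for n in names)
        t_end = max(features[n][1] for n in names)
        lines.append(row("mRNA", t_start, t_end, ".", f"ID={mrna_id};Parent={gene_id}"))
        for n in names:
            start, end, _, phase = features[n]
            ftype = n.rstrip("0123456789")  # "CDS2" -> "CDS"
            lines.append(row(ftype, start, end, phase, f"ID={mrna_id}.{n};Parent={mrna_id}"))
    return lines


gsf = "0|CDS1|30|+|A;80|CDS2|120|+|B;180|CDS3|250|+|A>CDS1|CDS2|CDS3;CDS2|CDS3"
print("\n".join(gsf_to_gff3(gsf, "Chr1", "gene1", offset=100)))
```

Features shared between isoforms are written once per transcript here, which is the redundancy that GSF avoids.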

Example Notebooks

Single Sequence Inference

Annotate a single DNA sequence using a pretrained model
Basic workflow: load model → encode sequence → generate GSF → convert to GFF3

Multi-Sequence Inference

Batch annotation of multiple gene regions from a genome
De novo prediction from BED file (gene coordinates only)
Splice variant prediction from GFF3 file (prompt completion with existing transcript)

GFF3 Sorting Requirement

When building databases from GFF3 files, TransGenic expects the GFF3 to be sorted using a sort order similar to the one used by AGAT (Another GFF Analysis Toolkit). To sort using AGAT:

agat_convert_sp_gxf2gxf.pl -g [file.gff3] -o [file.sorted.gff3]

See AGAT documentation for installation and usage.

Test Scripts

The test/ folder contains evaluation and benchmark scripts for different model configurations:

Script	Description
`test_AgroSegmentNT.py`	Segmentation evaluation with AgroNT + Segment-NT encoder
`test_HyenaSegmentNT.py`	Segmentation evaluation with HyenaDNA encoder
`testingAdjustCoords.py`	Combined segmentation + generation with coordinate refinement
`testingHyena.py`	HyenaDNA generation model (without post-processing)
`testingHyenaCompletion.py`	Prompt completion for splice variant prediction
`testingHyenaDual.py`	Separate decoder and segmentation model pipeline
`testingHyenaPostnoPost.py`	Compare raw vs post-processed predictions
`testingNT.py`	Nucleotide Transformer based generation + segmentation
`testingT5Hyena.py`	T5 decoder with HyenaDNA encoder
`testingT5Transgenic.py`	T5 decoder with AgroNT encoder + segmentation
`testSingle_tomato.py`	Single sequence inference example (tomato gene)

Scripts

The scripts/ folder contains utility scripts:

Script	Description
`check_system.sh`	Check system architecture and GPU for environment selection
`gff2gsf.py`	Convert GFF3 annotations to GSF format
`install_ml_stack_gb10.sh`	Install PyTorch + HuggingFace stack for GB10 ARM
`test_torch_cuda_gb10.py`	CUDA verification test for GB10
