This repository contains the data processing, analysis code, and figure-generation workflows that accompany the Mach-1 RNA language model study. It covers preprocessing of RNA sequencing datasets, downstream probing of the trained model, and assembly of the results reported in the manuscript. Model checkpoints, interactive notebooks, and core inference utilities are maintained in the companion Mach-1 repository.
- pre-mach/ – data curation pipelines that transform raw sequencing and annotation resources into model-ready inputs
  - long-reads/ – long-read RNA-seq processing via Bambu, ESPRESSO, IsoQuant, PIGEON, and TALON
  - short-reads/ – short-read processing including STAR alignment, Kallisto quantification, and alternative splicing analyses (rMATS, MISO, SUPPA, PSI-Sigma)
  - additional utilities for MPAQT joint analyses, differential expression, cell-type specificity, and Cerberus transcript annotations
- post-mach/ – analyses driven by Mach-1 outputs
  - probing/ – cell-line-specific abundance and upregulation evaluations
  - zero-shot/ – studies of abundance, ClinVar variants, DepMap mutations, exon trap validation, motif discovery, RBP binding, and splice sites
  - knockdown/ – scripts and notebooks for perturbation experiments, including post-mach/knockdown/analyze_knockdown_data.R
- Mach-1 notebooks (in the Mach-1 repository) – interactive workflows for preparing sequences and variants (prepare_seqs.ipynb, prepare_variants.ipynb, prepare_exon_trap_seqs.ipynb), tokenizing sequences (tokenize_seqs.ipynb), running inference (get_likelihoods.ipynb, get_variant_lls.ipynb, get_exon_trap_lls.ipynb), extracting representations (get_embeddings.ipynb), scoring variants (get_variant_scores.ipynb, get_exon_trap_scores.ipynb), and generating sequences (generate_seqs.ipynb). These notebooks rely on the StripedHyena architecture and specialized tokenization that distinguishes exonic (uppercase) and intronic (lowercase) nucleotides to produce the outputs analyzed here.
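
The uppercase/lowercase convention can be illustrated with a small sketch. This is not the Mach-1 tokenizer itself (that lives in the companion repository); the function names and the 0-based, half-open exon coordinates are assumptions made only for this example.

```python
def mark_exons(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Return pre_mrna with exonic bases uppercase and intronic bases lowercase.

    `exons` holds 0-based, half-open (start, end) intervals; this coordinate
    convention is an assumption for the example, not taken from Mach-1.
    """
    chars = list(pre_mrna.lower())  # start with every base marked intronic
    for start, end in exons:
        if not 0 <= start <= end <= len(chars):
            raise ValueError(f"exon ({start}, {end}) lies outside the sequence")
        chars[start:end] = [base.upper() for base in chars[start:end]]
    return "".join(chars)


def exonic_only(cased: str) -> str:
    """Recover the spliced sequence by keeping only the uppercase bases."""
    return "".join(base for base in cased if base.isupper())


if __name__ == "__main__":
    cased = mark_exons("ACGTACGTAC", [(0, 3), (7, 10)])
    print(cased)               # ACGtacgTAC
    print(exonic_only(cased))  # ACGTAC
```

Encoding exon/intron status in letter case keeps the annotation inside the sequence string, so a single text file can carry both the nucleotides and the splicing structure.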
Each folder contains task-specific scripts, configuration files, and intermediate data products that can be chained together to reproduce the figures in the manuscript.
- Use the Mach-1 notebooks or command-line scripts to generate sequence likelihoods, embeddings, variant scores, and synthetic sequences.
- Stage the resulting files in the expected directory structure (see comments within each analysis script).
- Execute the pipelines in pre-mach/ to prepare RNA-seq inputs, followed by the appropriate workflows under post-mach/ to recreate figures and supplementary tables.
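
Before launching the downstream workflows, it can help to confirm that the Mach-1 outputs have been staged. The sketch below is one possible pre-flight check; the file names in `EXPECTED` are hypothetical placeholders, since the real locations are documented in the comments of each analysis script.

```python
from pathlib import Path

# Hypothetical file names: replace them with the paths that the analysis
# scripts you intend to run actually expect.
EXPECTED = {
    "likelihoods": "likelihoods.tsv",
    "embeddings": "embeddings.npy",
    "variant_scores": "variant_scores.tsv",
    "generated_seqs": "generated_seqs.fasta",
}


def missing_outputs(stage_dir, expected=EXPECTED):
    """Return the sorted names of expected outputs not found under stage_dir."""
    root = Path(stage_dir)
    return sorted(
        name for name, rel_path in expected.items()
        if not (root / rel_path).is_file()
    )


if __name__ == "__main__":
    missing = missing_outputs("staged_outputs")  # hypothetical directory name
    if missing:
        print("Missing Mach-1 outputs:", ", ".join(missing))
    else:
        print("All expected outputs are staged.")
```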
Many scripts are designed for high-performance computing environments and make use of SLURM submission helpers. Replace cluster-specific paths and modules as needed before execution.
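
One way to keep cluster-specific settings in a single place is to render submission scripts from a template. The following is a minimal sketch, not the repository's own submission helper: `render_sbatch`, its parameters, and the default resource values are assumptions, while the `#SBATCH` options shown are standard SLURM directives.

```python
from string import Template

# Template for a basic SLURM batch script; every $placeholder is filled in
# by render_sbatch below.
SBATCH_TEMPLATE = Template("""\
#!/bin/bash
#SBATCH --job-name=$job_name
#SBATCH --time=$time
#SBATCH --cpus-per-task=$cpus
#SBATCH --mem=$mem
$module_lines
cd $workdir
$command
""")


def render_sbatch(job_name, command, workdir, modules=(),
                  time="04:00:00", cpus=4, mem="16G"):
    """Return the text of a batch script with site-specific values filled in."""
    module_lines = "\n".join(f"module load {name}" for name in modules)
    return SBATCH_TEMPLATE.substitute(
        job_name=job_name, time=time, cpus=cpus, mem=mem,
        module_lines=module_lines, workdir=workdir, command=command,
    )


if __name__ == "__main__":
    # The module name, working directory, and command are placeholders.
    print(render_sbatch("star_align", "bash run_star.sh",
                        "/scratch/project", modules=["star"]))
```

The rendered text can be written to a file and passed to `sbatch`; adapting to a new cluster then means changing only the arguments rather than editing each script.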
The repository is provided under the terms of the included LICENSE.