Mach-1 Manuscript

This repository contains the data processing, analysis code, and figure-generation workflows that accompany the Mach-1 RNA language model study. It covers preprocessing of RNA sequencing datasets, downstream probing of the trained model, and assembly of the results reported in the manuscript. Model checkpoints, interactive notebooks, and core inference utilities are maintained in the companion Mach-1 repository.

pre-mach/ – data curation pipelines that transform raw sequencing and annotation resources into model-ready inputs
- long-reads/ – long-read RNA-seq processing via Bambu, ESPRESSO, IsoQuant, PIGEON, and TALON
- short-reads/ – short-read processing including STAR alignment, Kallisto quantification, and alternative splicing analyses (rMATS, MISO, SUPPA, PSI-Sigma)
- additional utilities for MPAQT joint analyses, differential expression, cell-type specificity, and Cerberus transcript annotations
post-mach/ – analyses driven by Mach-1 outputs
- probing/ – cell-line-specific abundance and upregulation evaluations
- zero-shot/ – studies of abundance, ClinVar variants, DepMap mutations, exon trap validation, motif discovery, RBP binding, and splice sites
- knockdown/ – scripts and notebooks for perturbation experiments, including post-mach/knockdown/analyze_knockdown_data.R
Mach-1 notebooks (in the Mach-1 repository) – interactive workflows that produce the outputs analyzed here; they rely on the StripedHyena architecture and a specialized tokenization that distinguishes exonic (uppercase) from intronic (lowercase) nucleotides
- preparing sequences and variants – prepare_seqs.ipynb, prepare_variants.ipynb, prepare_exon_trap_seqs.ipynb
- tokenizing sequences – tokenize_seqs.ipynb
- running inference – get_likelihoods.ipynb, get_variant_lls.ipynb, get_exon_trap_lls.ipynb
- extracting representations – get_embeddings.ipynb
- scoring variants – get_variant_scores.ipynb, get_exon_trap_scores.ipynb
- generating sequences – generate_seqs.ipynb
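
The uppercase/lowercase convention mentioned for the Mach-1 notebooks can be illustrated with a small sketch. This is not the tokenizer from the Mach-1 repository. The function names `case_encode` and `split_by_case`, the 0-based half-open exon coordinates, and the toy sequence are all assumptions made for illustration only.

```python
def case_encode(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Uppercase exonic bases and lowercase intronic ones.

    `exons` holds 0-based, half-open (start, end) intervals on `pre_mrna`.
    This coordinate convention is an assumption, not taken from Mach-1.
    """
    encoded = list(pre_mrna.lower())  # start by treating every base as intronic
    for start, end in exons:
        if not 0 <= start <= end <= len(pre_mrna):
            raise ValueError(f"exon ({start}, {end}) falls outside the sequence")
        encoded[start:end] = pre_mrna[start:end].upper()
    return "".join(encoded)


def split_by_case(encoded: str) -> list[tuple[str, str]]:
    """Split a case-encoded sequence into ("exon" | "intron", segment) runs."""
    runs: list[tuple[str, str]] = []
    for base in encoded:
        label = "exon" if base.isupper() else "intron"
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1] + base)  # extend the current run
        else:
            runs.append((label, base))  # case changed, so start a new run
    return runs
```

For example, `case_encode("acgtacgtac", [(0, 3), (7, 10)])` marks the first three and last three bases as exonic, and `split_by_case` recovers the exon and intron segments from the casing alone.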

Each folder contains task-specific scripts, configuration files, and intermediate data products that can be chained together to reproduce the figures in the manuscript.

Reproducing the Manuscript Analyses

Use the Mach-1 notebooks or command-line scripts to generate sequence likelihoods, embeddings, variant scores, and synthetic sequences.
Stage the resulting files in the expected directory structure (see comments within each analysis script).
Execute the pipelines in pre-mach/ to prepare RNA-seq inputs, followed by the appropriate workflows under post-mach/ to recreate figures and supplementary tables.
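
Before launching a workflow under post-mach/, it can help to confirm that the staged Mach-1 outputs are where a script expects them. The following is a minimal sketch. The helper `missing_outputs` and the file names in the usage note are hypothetical. The real layout is described in the comments of each analysis script.

```python
from pathlib import Path


def missing_outputs(root: str, expected: list[str]) -> list[str]:
    """Return the entries of `expected` (paths relative to `root`) that are not files."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_file()]
```

A call such as `missing_outputs("staged", ["likelihoods.tsv", "embeddings.npy"])` (placeholder names) returns an empty list when everything is in place, so a script can stop early with a clear message otherwise.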

Many scripts are designed for high-performance computing environments and make use of SLURM submission helpers. Replace cluster-specific paths and modules as needed before execution.
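
One way to keep cluster-specific paths and modules out of the scripts themselves is to hold them in a single configuration object and render the SLURM batch script from it. The sketch below is an assumption about how this could be organized. It is not the submission helper shipped in this repository, and `ClusterConfig`, `render_sbatch`, and every default value are placeholders.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class ClusterConfig:
    """Site-specific settings; every default here is a placeholder to replace."""
    modules: list[str] = field(default_factory=list)
    work_dir: str = "/path/to/workdir"
    time_limit: str = "04:00:00"
    memory: str = "16G"
    cpus: int = 4


def render_sbatch(job_name: str, command: list[str], cfg: ClusterConfig) -> str:
    """Build the text of a batch script that runs `command` under SLURM."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={cfg.time_limit}",
        f"#SBATCH --mem={cfg.memory}",
        f"#SBATCH --cpus-per-task={cfg.cpus}",
        f"#SBATCH --chdir={cfg.work_dir}",
        "set -eo pipefail",  # stop at the first failing command
    ]
    # Load each environment module the job needs, one per line
    lines += [f"module load {shlex.quote(name)}" for name in cfg.modules]
    # Quote the command so arguments with spaces survive the shell
    lines.append(shlex.join(command))
    return "\n".join(lines) + "\n"
```

For instance, `render_sbatch("knockdown", ["Rscript", "post-mach/knockdown/analyze_knockdown_data.R"], ClusterConfig(modules=["R/4.3.1"]))` yields a script that can be written to a file and passed to `sbatch`. The module name is a placeholder, and running the R script without arguments is an assumption. Only the `ClusterConfig` values need to change when moving to another cluster.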

License

The repository is provided under the terms of the included LICENSE.
