csglab/mach-1-manuscript

Mach-1 Manuscript

This repository contains the data processing, analysis code, and figure-generation workflows that accompany the Mach-1 RNA language model study. It covers preprocessing of RNA sequencing datasets, downstream probing of the trained model, and assembly of the results reported in the manuscript. Model checkpoints, interactive notebooks, and core inference utilities are maintained in the companion Mach-1 repository.

Contents

  • pre-mach/ – data curation pipelines that transform raw sequencing and annotation resources into model-ready inputs
    • long-reads/ – long-read RNA-seq processing via Bambu, ESPRESSO, IsoQuant, PIGEON, and TALON
    • short-reads/ – short-read processing including STAR alignment, Kallisto quantification, and alternative splicing analyses (rMATS, MISO, SUPPA, PSI-Sigma)
    • additional utilities for MPAQT joint analyses, differential expression, cell-type specificity, and Cerberus transcript annotations
  • post-mach/ – analyses driven by Mach-1 outputs
    • probing/ – cell-line-specific abundance and upregulation evaluations
    • zero-shot/ – studies of abundance, ClinVar variants, DepMap mutations, exon trap validation, motif discovery, RBP binding, and splice sites
    • knockdown/ – scripts and notebooks for perturbation experiments, including post-mach/knockdown/analyze_knockdown_data.R
  • Mach-1 notebooks (in the Mach-1 repository) – interactive workflows that rely on the StripedHyena architecture and a specialized tokenization distinguishing exonic (uppercase) from intronic (lowercase) nucleotides to produce the outputs analyzed here
    • preparing sequences and variants – prepare_seqs.ipynb, prepare_variants.ipynb, prepare_exon_trap_seqs.ipynb
    • tokenizing sequences – tokenize_seqs.ipynb
    • running inference – get_likelihoods.ipynb, get_variant_lls.ipynb, get_exon_trap_lls.ipynb
    • extracting representations – get_embeddings.ipynb
    • scoring variants – get_variant_scores.ipynb, get_exon_trap_scores.ipynb
    • generating sequences – generate_seqs.ipynb
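
The uppercase/lowercase convention mentioned for the Mach-1 notebooks can be illustrated with a minimal sketch. This is not the Mach-1 tokenizer itself; it only shows one way a pre-mRNA string could be cased from exon coordinates before tokenization. The function name `mark_exons`, the 0-based half-open interval format, and the toy sequence are all assumptions made for illustration.

```python
def mark_exons(seq, exons):
    """Return seq with exonic bases uppercase and intronic bases lowercase.

    seq   -- nucleotide string (any case)
    exons -- list of (start, end) pairs, 0-based and half-open (assumed format)
    """
    chars = list(seq.lower())  # start with everything treated as intronic
    for start, end in exons:
        if not 0 <= start <= end <= len(seq):
            raise ValueError(f"exon interval out of range: {(start, end)}")
        chars[start:end] = [c.upper() for c in chars[start:end]]
    return "".join(chars)


# Toy example: two 3-nt exons flanking an intron that begins with GT and ends with AG.
pre_mrna = "ACGGTAAGTCTTTCAGGCA"
marked = mark_exons(pre_mrna, [(0, 3), (16, 19)])
print(marked)  # ACGgtaagtctttcagGCA
```

Refer to tokenize_seqs.ipynb in the Mach-1 repository for the actual preprocessing and tokenization steps.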

Each folder contains task-specific scripts, configuration files, and intermediate data products that can be chained together to reproduce the figures in the manuscript.

Reproducing the Manuscript Analyses

  1. Use the Mach-1 notebooks or command-line scripts to generate sequence likelihoods, embeddings, variant scores, and synthetic sequences.
  2. Stage the resulting files in the expected directory structure (see comments within each analysis script).
  3. Execute the pipelines in pre-mach/ to prepare RNA-seq inputs, followed by the appropriate workflows under post-mach/ to recreate figures and supplementary tables.
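
Before launching a long analysis, it can help to confirm that the staged Mach-1 outputs from step 2 are where a script expects them. The sketch below is one possible pre-flight check; the helper `missing_inputs` and the file names in `EXPECTED` are hypothetical placeholders, and the real paths are documented in the comments of each analysis script.

```python
from pathlib import Path
import tempfile


def missing_inputs(root, expected):
    """Return the relative paths in `expected` that are not files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


# Hypothetical file names; replace with the paths named in each analysis script.
EXPECTED = ["likelihoods/seqs.tsv", "embeddings/seqs.npy"]

# Demonstration in a temporary directory where only one of the two files is staged.
with tempfile.TemporaryDirectory() as tmp:
    staged = Path(tmp) / "likelihoods" / "seqs.tsv"
    staged.parent.mkdir(parents=True)
    staged.touch()
    missing = missing_inputs(tmp, EXPECTED)
    print("missing:", missing)
```

An empty list means every expected file is present, so the downstream workflow can be started.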

Many scripts are designed for high-performance computing environments and make use of SLURM submission helpers. Replace cluster-specific paths and modules as needed before execution.
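
To locate what needs replacing, one option is to scan each submission script for `#SBATCH` directives, `module load` lines, and absolute paths. The following is a rough sketch under that assumption: the function `find_cluster_specifics`, the regular expressions, and the example script (including its module name and paths) are illustrative and do not come from this repository.

```python
import re

SBATCH = re.compile(r"^#SBATCH\s+(.*)$")
MODULE = re.compile(r"^\s*module\s+load\s+(.+)$")
# Absolute path with at least one directory component, e.g. /scratch/user/file
ABS_PATH = re.compile(r"(?<![\w.])/(?:[\w.\-]+/)+[\w.\-]*")


def find_cluster_specifics(script_text):
    """Collect SLURM directives, loaded modules, and absolute paths from a script."""
    found = {"sbatch": [], "modules": [], "paths": []}
    for line in script_text.splitlines():
        m = SBATCH.match(line)
        if m:
            found["sbatch"].append(m.group(1).strip())
            continue
        m = MODULE.match(line)
        if m:
            found["modules"].extend(m.group(1).split())
            continue
        if line.lstrip().startswith("#"):
            continue  # skip the shebang and ordinary comments
        found["paths"].extend(ABS_PATH.findall(line))
    return found


# Hypothetical submission script used only to exercise the scanner.
example = """#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=8G
module load star
STAR --genomeDir /cluster/refs/index --readFilesIn /scratch/user/reads.fastq
"""

report = find_cluster_specifics(example)
for key, values in report.items():
    print(key, values)
```

The report is only a starting point for manual review; relative paths, environment variables, and paths built at runtime are not detected.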

License

The repository is provided under the terms of the included LICENSE.

About

Pre-/Post-Mach-1 Analyses Scripts and Notebooks
