Lasso Workbench

A GUI application for semantic lasso peptide precursor discovery and dataset curation.

Overview

The Lasso Workbench implements a semantic discovery pipeline combining 6-frame translation with ESM-2 embeddings to identify lasso peptide precursors in genomic data.

Key Features:

  • Semantic Discovery: Encodes candidate ORFs using ESM-2 (t6/t12/t30/t33) and ranks them by cosine similarity to validated precursors.
  • Exhaustive Search: Extracts all potential ORFs (limited by min/max AA length) from 6-frame translation, ensuring no candidates are missed due to annotation errors.
  • Rule-Based Ranking: Filters candidates using customizable heuristic constraints (leader motifs, core length) on top of semantic scores.
  • Dataset Management: Utilities for curating validated precursors.
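The semantic scoring idea above (embed candidates, then rank by mean similarity to the closest validated precursors) can be sketched as follows. This is a minimal illustration, not the repo's implementation; the function name `top_n_mean_scores` is hypothetical.

```python
import numpy as np

def top_n_mean_scores(candidates: np.ndarray, references: np.ndarray,
                      n: int = 5) -> np.ndarray:
    """Score each candidate embedding by the mean of its top-N cosine
    similarities against the validated reference embeddings."""
    # L2-normalise rows so the dot product equals cosine similarity
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    sims = c @ r.T                           # (num_candidates, num_references)
    n = min(n, sims.shape[1])
    top_n = np.sort(sims, axis=1)[:, -n:]    # largest N similarities per row
    return top_n.mean(axis=1)
```

Averaging the top-N similarities (rather than taking a single best hit) rewards candidates that resemble a cluster of validated precursors, not just one outlier.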

Gaps:

  • No tail handling: The current pipeline does not prune peptide tails during core prediction.

Quick Start

Installation

# Install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -e .

# Launch Application
lasso-workbench

Usage

  1. Pipeline: Upload GBK files (MiBIG, antiSMASH, or GenBank).
  2. Configuration: Select an ESM-2 model (default: esm2_t6_8M_UR50D) and a validated precursor reference set.
  3. Execution: The pipeline performs:
    • Extraction: 6-frame translation of input sequences.
    • Embedding: Batch inference via ESM-2 (masked padding).
    • Scoring: Compute pairwise cosine similarity against references.
    • Ranking: Sort by Top-N Mean similarity and apply rule filters.
  4. Analysis: Review ranked candidates in the tabular view and HTML visualisation, and export results to JSON/CSV/FASTA.
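The extraction step above can be sketched in pure Python (the repo itself uses BioPython for this; `six_frame_orfs` and its details are illustrative assumptions):

```python
# Standard genetic code, TCAG codon ordering; '*' marks stop codons
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES) for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frame_orfs(dna: str, min_aa: int = 20, max_aa: int = 120):
    """Yield (frame, peptide) for every Met-to-stop ORF in all six
    reading frames whose length falls within [min_aa, max_aa]."""
    dna = dna.upper()
    for strand, nuc in ((+1, dna), (-1, dna.translate(COMPLEMENT)[::-1])):
        for offset in range(3):
            protein = "".join(CODON_TABLE.get(nuc[i:i + 3], "X")
                              for i in range(offset, len(nuc) - 2, 3))
            for chunk in protein.split("*"):     # split at stop codons
                start = chunk.find("M")          # first start codon
                if start != -1 and min_aa <= len(chunk) - start <= max_aa:
                    yield strand * (offset + 1), chunk[start:]
```

Because every Met-to-stop span in every frame is enumerated, candidates are found even when the source GenBank annotation missed them.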

Architecture

  • Core Logic: lasso_workbench/core - Embedding scoring (NumPy/Scikit-learn), Rule Engine.
  • Pipeline: lasso_workbench/pipeline - ORF extraction (BioPython), ESM-2 Embedding (HuggingFace Transformers).
  • UI: lasso_workbench/ui - Gradio-based interface.
  • Data: JSON-based persistence for easy portability.

Precursor Datasets

Validated precursor reference sets live in data/precursors/.

Generated artifacts (do not edit by hand):

  • precursor_proteins_multi_strict.faa — multi‑candidate validated set generated by scripts/generate_precursor_dataset_multi.py
  • lab_core_candidates_multi_strict.tsv
  • lab_core_loci_multi_strict.tsv
  • lab_core_cases_summary.tsv
  • lab_core_dataset_summary.json

Persistent curated additions (safe to edit via script):

  • precursor_proteins_curated.faa
  • precursor_proteins_curated.tsv

Default validated set used by the UI:

  • precursor_proteins_verified.faa — merged, de‑duplicated set combining multi_strict + curated
  • precursor_proteins_verified.tsv — merged metadata table (read‑only; used by the Dataset tab)

Reference FASTA format (Pipeline + Bench UI)

Any reference FASTA you drop into the Pipeline or Bench UI must include a locus= token in the header. This is required for grouped top‑N mean scoring.

Minimal example:

>name=foo|locus=foo_001
MSEQ...

Lasso benchmark (peptidase holdout) expects:

>pep=WP_012345678|locus=lab_0001_locus_01|case=lab_0001|orf=0001
MSEQ...

Beta‑lactamase benchmark expects:

>prot=AAC09015|locus=AAC09015|name=AER-1|class=A|source=ambler_table
MSEQ...

If locus= is missing, scoring will error with: Missing locus= token in reference id.
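The header convention above is a simple key=value format separated by pipes, which can be parsed as follows (a sketch; `parse_header_tokens` and `require_locus` are illustrative names, not the repo's API):

```python
def parse_header_tokens(header: str) -> dict:
    """Split a '>key=value|key=value' FASTA header into a dict."""
    tokens = {}
    for part in header.lstrip(">").split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            tokens[key] = value
    return tokens

def require_locus(header: str) -> str:
    """Return the locus token, mirroring the pipeline's error message."""
    tokens = parse_header_tokens(header)
    if "locus" not in tokens:
        raise ValueError("Missing locus= token in reference id")
    return tokens["locus"]
```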

Add new curated precursors

Use the helper script to append curated sequences without touching generated files:

.venv/bin/python scripts/add_validated_precursors.py \
  --input data/lab_dataset/newlassos.txt \
  --format newlassos \
  --output-dir data/precursors \
  --merge

The script also accepts TSV/CSV files with these columns (case-insensitive):

  • sequence (or precursor / full_precursor)
  • optional core (or core_aa)
  • optional name
  • optional orientation (leader_core / core_leader)

Example:

.venv/bin/python scripts/add_validated_precursors.py \
  --input data/lab_dataset/curated_pairs.tsv \
  --format tsv \
  --output-dir data/precursors \
  --merge

This updates:

  • precursor_proteins_curated.faa
  • precursor_proteins_curated.tsv
  • precursor_proteins_verified.faa (merged set used by default)
  • precursor_proteins_verified.tsv (merged metadata table)

Alt‑Frame Switches (Experimental)

Beyond precursor retrieval, the codebase includes an experimental alt‑frame conservation analysis. It looks for alt‑frame ORFs overlapping annotated genes that show unexpected conservation patterns. The current implementation lives in:

lasso_workbench/altframe/experimental/constraint_conservation

The Null Model

We anchor on CDS position (not ORF), bin the gene into regions, and ask:

  • Survival: does the alt‑frame avoid stop codons?
  • Identity: when it survives, is the peptide conserved?

A codon‑preserving null shuffles synonymous codons while keeping the primary protein intact, then re‑checks alt‑frame survival and identity across permutations (default: 200).
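The codon-preserving shuffle can be sketched like this: permute codons only within groups that encode the same amino acid, so the primary protein is provably unchanged while the alt-frame sequence is randomised. The function is illustrative, not the repo's implementation.

```python
import random
from collections import defaultdict

# Standard genetic code, TCAG codon ordering; '*' marks stop codons
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES) for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def synonymous_shuffle(cds: str, rng: random.Random) -> str:
    """Permute codons among positions that encode the same amino acid,
    leaving the primary protein sequence exactly intact."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    groups = defaultdict(list)               # amino acid -> codon positions
    for pos, codon in enumerate(codons):
        groups[CODON_TABLE[codon]].append(pos)
    shuffled = codons[:]
    for positions in groups.values():
        pool = [codons[p] for p in positions]
        rng.shuffle(pool)
        for pos, codon in zip(positions, pool):
            shuffled[pos] = codon
    return "".join(shuffled)
```

Each permutation of a real CDS is one draw from the null; alt-frame survival and identity are then recomputed on the shuffled sequence.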

Interpreting the stats (example row)

Example single locus from results/altframe_constraint_tests/altframe_constraint_conservation.tsv (phnU, out_of_frame, +2, bin 0–0):

Metric     Observed   Null      Interpretation
Survival   ~47%       ~94.6%    lower than null → stops are favored
Identity   ~69%       ~27%      higher than null → survivors are conserved
Z‑score    70         —         very large separation

Translation: Some lineages appear to turn the alt‑frame “OFF” (introduce stops) while others keep it highly conserved (ON). This is compatible with a binary regulatory switch hypothesis, but the statistics are null‑model‑dependent.

Important: p‑values in the TSV are empirical with a finite number of permutations (with 200 iterations, the minimum is ~0.005). Treat Z‑scores as effect sizes, not exact tail probabilities.

Cost heuristic

Low survival + High identity = High cost

This is a practical heuristic: a peptide that is costly to maintain will be absent in many genomes (OFF) but highly conserved when present (ON). It is not a formal likelihood.
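One hypothetical way to turn the heuristic into a single number (this formula is an illustrative assumption, not the repo's scoring):

```python
def switch_cost(survival: float, identity: float) -> float:
    """Hypothetical cost score: high when the alt-frame rarely survives
    (low survival) but is well conserved when it does (high identity)."""
    return identity * (1.0 - survival)
```

On the example row above, the observed locus scores far higher than its null (≈0.69 × 0.53 versus ≈0.27 × 0.05), which is the ON/OFF pattern the heuristic is meant to flag.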

Files

  • results/altframe_constraint_tests/altframe_constraint_conservation.tsv

Lasso ORF Index (Essence)

The streamlined “essence” path is an ORF indexer that scans only antiSMASH lasso regions and stores 20–120 aa peptides in a local SQLite database for fast conservation queries.

Build the DB:

.venv/bin/python scripts/build_lasso_orf_db.py \
  --gbk-dir data/antismash_lasso/gbk \
  --db-path results/lasso_orf_index.sqlite \
  --min-aa 20 \
  --max-aa 120 \
  --reset

The pipeline/UI will automatically attach genome_count (found‑in‑X‑genomes) to candidates when results/lasso_orf_index.sqlite exists.

Spec reference:

lasso_workbench/altframe/specs/orf_indexer.md

LICENSE

MIT License

Made by Magnus Ohle

NOTICE

See NOTICE for attribution and reuse guidance.
