Lasso Workbench

A GUI application for semantic lasso peptide precursor discovery and dataset curation.

Overview

The Lasso Workbench implements a semantic discovery pipeline combining 6-frame translation with ESM-2 embeddings to identify lasso peptide precursors in genomic data.

Key Features:

  • Semantic Discovery: Encodes candidate ORFs using ESM-2 (t6/t12/t30/t33) and ranks them by cosine similarity to validated precursors.
  • Exhaustive Search: Extracts all potential ORFs (limited by min/max AA length) from 6-frame translation, ensuring no candidates are missed due to annotation errors.
  • Rule-Based Ranking: Filters candidates using customizable heuristic constraints (leader motifs, core length) on top of semantic scores.
  • Dataset Management: Utilities for curating validated precursors.
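The semantic scoring idea above (embed candidates, then rank by mean similarity to the closest validated precursors) can be sketched as follows. This is a minimal illustration, not the repo's implementation; the function name `top_n_mean_scores` is hypothetical.

```python
import numpy as np

def top_n_mean_scores(candidates: np.ndarray, references: np.ndarray,
                      n: int = 5) -> np.ndarray:
    """Score each candidate embedding by the mean of its top-N cosine
    similarities against the validated reference embeddings."""
    # L2-normalise rows so the dot product equals cosine similarity
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    sims = c @ r.T                           # (num_candidates, num_references)
    n = min(n, sims.shape[1])
    top_n = np.sort(sims, axis=1)[:, -n:]    # largest N similarities per row
    return top_n.mean(axis=1)
```

Averaging the top-N similarities (rather than taking a single best hit) rewards candidates that resemble a cluster of validated precursors, not just one outlier.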

Gaps:

  • No tail handling: The current pipeline does not prune peptide tails during core prediction.

Quick Start

Installation

# Install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -e .

# Launch Application
lasso-workbench

Usage

  1. Pipeline: Upload GBK files (MiBIG, antiSMASH, or GenBank).
  2. Configuration: Select an ESM-2 model (default: esm2_t6_8M_UR50D) and a validated precursor reference set.
  3. Execution: The pipeline performs:
    • Extraction: 6-frame translation of input sequences.
    • Embedding: Batch inference via ESM-2 (masked padding).
    • Scoring: Compute pairwise cosine similarity against references.
    • Ranking: Sort by Top-N Mean similarity and apply rule filters.
  4. Analysis: Review ranked candidates in the tabular view and HTML visualisation, and export results to JSON/CSV/FASTA.
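The extraction step above can be sketched in pure Python (the repo itself uses BioPython for this; `six_frame_orfs` and its details are illustrative assumptions):

```python
# Standard genetic code, TCAG codon ordering; '*' marks stop codons
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES) for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frame_orfs(dna: str, min_aa: int = 20, max_aa: int = 120):
    """Yield (frame, peptide) for every Met-to-stop ORF in all six
    reading frames whose length falls within [min_aa, max_aa]."""
    dna = dna.upper()
    for strand, nuc in ((+1, dna), (-1, dna.translate(COMPLEMENT)[::-1])):
        for offset in range(3):
            protein = "".join(CODON_TABLE.get(nuc[i:i + 3], "X")
                              for i in range(offset, len(nuc) - 2, 3))
            for chunk in protein.split("*"):     # split at stop codons
                start = chunk.find("M")          # first start codon
                if start != -1 and min_aa <= len(chunk) - start <= max_aa:
                    yield strand * (offset + 1), chunk[start:]
```

Because every Met-to-stop span in every frame is enumerated, candidates are found even when the source GenBank annotation missed them.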

Architecture

  • Core Logic: lasso_workbench/core - Embedding scoring (NumPy/Scikit-learn), Rule Engine.
  • Pipeline: lasso_workbench/pipeline - ORF extraction (BioPython), ESM-2 Embedding (HuggingFace Transformers).
  • UI: lasso_workbench/ui - Gradio-based interface.
  • Data: JSON-based persistence for easy portability.

Precursor Datasets

Validated precursor reference sets live in data/precursors/.

Generated artifacts (do not edit by hand):

  • precursor_proteins_multi_strict.faa — multi‑candidate validated set generated by scripts/generate_precursor_dataset_multi.py
  • lab_core_candidates_multi_strict.tsv
  • lab_core_loci_multi_strict.tsv
  • lab_core_cases_summary.tsv
  • lab_core_dataset_summary.json

Persistent curated additions (safe to edit via script):

  • precursor_proteins_curated.faa
  • precursor_proteins_curated.tsv

Default validated set used by the UI:

  • precursor_proteins_verified.faa — merged, de‑duplicated set combining multi_strict + curated
  • precursor_proteins_verified.tsv — merged metadata table (read‑only; used by the Dataset tab)

Reference FASTA format (Pipeline + Bench UI)

Any reference FASTA you drop into the Pipeline or Bench UI must include a locus= token in the header. This is required for grouped top‑N mean scoring.

Minimal example:

>name=foo|locus=foo_001
MSEQ...

Lasso benchmark (peptidase holdout) expects:

>pep=WP_012345678|locus=lab_0001_locus_01|case=lab_0001|orf=0001
MSEQ...

Beta‑lactamase benchmark expects:

>prot=AAC09015|locus=AAC09015|name=AER-1|class=A|source=ambler_table
MSEQ...

If locus= is missing, scoring will error with: Missing locus= token in reference id.
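The header convention above is a simple key=value format separated by pipes, which can be parsed as follows (a sketch; `parse_header_tokens` and `require_locus` are illustrative names, not the repo's API):

```python
def parse_header_tokens(header: str) -> dict:
    """Split a '>key=value|key=value' FASTA header into a dict."""
    tokens = {}
    for part in header.lstrip(">").split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            tokens[key] = value
    return tokens

def require_locus(header: str) -> str:
    """Return the locus token, mirroring the pipeline's error message."""
    tokens = parse_header_tokens(header)
    if "locus" not in tokens:
        raise ValueError("Missing locus= token in reference id")
    return tokens["locus"]
```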

Add new curated precursors

Use the helper script to append curated sequences without touching generated files:

.venv/bin/python scripts/add_validated_precursors.py \
  --input data/lab_dataset/newlassos.txt \
  --format newlassos \
  --output-dir data/precursors \
  --merge

The script also accepts TSV/CSV files with these columns (case-insensitive):

  • sequence (or precursor / full_precursor)
  • optional core (or core_aa)
  • optional name
  • optional orientation (leader_core / core_leader)

Example:

.venv/bin/python scripts/add_validated_precursors.py \
  --input data/lab_dataset/curated_pairs.tsv \
  --format tsv \
  --output-dir data/precursors \
  --merge

This updates:

  • precursor_proteins_curated.faa
  • precursor_proteins_curated.tsv
  • precursor_proteins_verified.faa (merged set used by default)
  • precursor_proteins_verified.tsv (merged metadata table)

Alt‑Frame Switches (Experimental)

Beyond precursor retrieval, the codebase includes an experimental alt‑frame conservation analysis. It looks for alt‑frame ORFs overlapping annotated genes that show unexpected conservation patterns. The current implementation lives in:

lasso_workbench/altframe/experimental/constraint_conservation

The Null Model

We anchor on CDS position (not ORF), bin the gene into regions, and ask:

  • Survival: does the alt‑frame avoid stop codons?
  • Identity: when it survives, is the peptide conserved?

A codon‑preserving null shuffles synonymous codons while keeping the primary protein intact, then re‑checks alt‑frame survival and identity across permutations (default: 200).
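The codon-preserving shuffle can be sketched like this: permute codons only within groups that encode the same amino acid, so the primary protein is provably unchanged while the alt-frame sequence is randomised. The function is illustrative, not the repo's implementation.

```python
import random
from collections import defaultdict

# Standard genetic code, TCAG codon ordering; '*' marks stop codons
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES) for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def synonymous_shuffle(cds: str, rng: random.Random) -> str:
    """Permute codons among positions that encode the same amino acid,
    leaving the primary protein sequence exactly intact."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    groups = defaultdict(list)               # amino acid -> codon positions
    for pos, codon in enumerate(codons):
        groups[CODON_TABLE[codon]].append(pos)
    shuffled = codons[:]
    for positions in groups.values():
        pool = [codons[p] for p in positions]
        rng.shuffle(pool)
        for pos, codon in zip(positions, pool):
            shuffled[pos] = codon
    return "".join(shuffled)
```

Each permutation of a real CDS is one draw from the null; alt-frame survival and identity are then recomputed on the shuffled sequence.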

Interpreting the stats (example row)

Example single locus from results/altframe_constraint_tests/altframe_constraint_conservation.tsv (phnU, out_of_frame, +2, bin 0–0):

Metric     Observed   Null      Interpretation
Survival   ~47%       ~94.6%    lower than null → stops are favored
Identity   ~69%       ~27%      higher than null → survivors are conserved
Z‑score    70         —         very large separation

Translation: Some lineages appear to turn the alt‑frame “OFF” (introduce stops) while others keep it highly conserved (ON). This is compatible with a binary regulatory switch hypothesis, but the statistics are null‑model‑dependent.

Important: p‑values in the TSV are empirical with a finite number of permutations (with 200 iterations, the minimum is ~0.005). Treat Z‑scores as effect sizes, not exact tail probabilities.

Cost heuristic

Low survival + High identity = High cost

This is a practical heuristic: a peptide that is costly to maintain will be absent in many genomes (OFF) but highly conserved when present (ON). It is not a formal likelihood.
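One hypothetical way to turn the heuristic into a single number (this formula is an illustrative assumption, not the repo's scoring):

```python
def switch_cost(survival: float, identity: float) -> float:
    """Hypothetical cost score: high when the alt-frame rarely survives
    (low survival) but is well conserved when it does (high identity)."""
    return identity * (1.0 - survival)
```

On the example row above, the observed locus scores far higher than its null (≈0.69 × 0.53 versus ≈0.27 × 0.05), which is the ON/OFF pattern the heuristic is meant to flag.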

Files

  • results/altframe_constraint_tests/altframe_constraint_conservation.tsv

Lasso ORF Index (Essence)

The streamlined “essence” path is an ORF indexer that scans only antiSMASH lasso regions and stores 20–120 aa peptides in a local SQLite database for fast conservation queries.

Build the DB:

.venv/bin/python scripts/build_lasso_orf_db.py \
  --gbk-dir data/antismash_lasso/gbk \
  --db-path results/lasso_orf_index.sqlite \
  --min-aa 20 \
  --max-aa 120 \
  --reset

The pipeline/UI will automatically attach genome_count (found‑in‑X‑genomes) to candidates when results/lasso_orf_index.sqlite exists.

Spec reference:

lasso_workbench/altframe/specs/orf_indexer.md

LICENSE

MIT License

Made by Magnus Ohle

NOTICE

See NOTICE for attribution and reuse guidance.
