A GUI application for semantic lasso peptide precursor discovery and dataset curation.
The Lasso Workbench implements a semantic discovery pipeline combining 6-frame translation with ESM-2 embeddings to identify lasso peptide precursors in genomic data.
Key Features:
- Semantic Discovery: Encodes candidate ORFs using ESM-2 (t6/t12/t30/t33) and ranks them by cosine similarity to validated precursors.
- Exhaustive Search: Extracts all potential ORFs (limited by min/max AA length) from 6-frame translation, ensuring no candidates are missed due to annotation errors.
- Rule-Based Ranking: Filters candidates using customizable heuristic constraints (leader motifs, core length) on top of semantic scores.
- Dataset Management: Utilities for curating validated precursors.
Gaps:
- No tail handling: The current pipeline does not prune C-terminal tails during core prediction.
# Install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -e .
# Launch Application
lasso-workbench

- Pipeline: Upload GBK files (MiBIG, antiSMASH, or GenBank).
- Configuration: Select an ESM-2 model (default: esm2_t6_8M_UR50D) and a validated precursor reference set.
- Execution: The pipeline performs:
  - Extraction: 6-frame translation of input sequences.
  - Embedding: Batch inference via ESM-2 (masked padding).
  - Scoring: Compute pairwise cosine similarity against references.
  - Ranking: Sort by Top-N Mean similarity and apply rule filters.
- Analysis: Review ranked candidates in the tabular view or the HTML visualisation, and export to JSON/CSV/FASTA.
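
The extraction step can be illustrated with a small, dependency-free sketch. This is not the code in lasso_workbench/pipeline (which uses BioPython); the function name `six_frame_orfs`, the ORF definition (first M in each stop-delimited segment), and the default length bounds are assumptions made for illustration.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate in frame 0; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_aa=20, max_aa=120):
    """Return (strand, frame, peptide) for every candidate ORF.

    Assumption: an ORF runs from the first M of a stop-delimited segment
    to the end of that segment. A segment that reaches the sequence end
    without a stop codon is also kept.
    """
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, seq in strands.items():
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                start = segment.find("M")
                if start == -1:
                    continue
                peptide = segment[start:]
                if min_aa <= len(peptide) <= max_aa:
                    orfs.append((strand, frame, peptide))
    return orfs
```

For example, `six_frame_orfs("ATGAAACCCTAG", min_aa=2, max_aa=10)` returns `[("+", 0, "MKP")]`. Because every frame is scanned independently of existing annotations, a mis-annotated or unannotated precursor still appears as a candidate.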
- Core Logic: lasso_workbench/core - Embedding scoring (NumPy/Scikit-learn), Rule Engine.
- Pipeline: lasso_workbench/pipeline - ORF extraction (BioPython), ESM-2 Embedding (HuggingFace Transformers).
- UI: lasso_workbench/ui - Gradio-based interface.
- Data: JSON-based persistence for easy portability.
Validated precursor reference sets live in data/precursors/.
Generated artifacts (do not edit by hand):
- precursor_proteins_multi_strict.faa: multi-candidate validated set generated by scripts/generate_precursor_dataset_multi.py
- lab_core_candidates_multi_strict.tsv
- lab_core_loci_multi_strict.tsv
- lab_core_cases_summary.tsv
- lab_core_dataset_summary.json
Persistent curated additions (safe to edit via script):
- precursor_proteins_curated.faa
- precursor_proteins_curated.tsv
Default validated set used by the UI:
- precursor_proteins_verified.faa: merged, de-duplicated set combining multi_strict + curated
- precursor_proteins_verified.tsv: merged metadata table (read-only; used by the Dataset tab)
Any reference FASTA you drop into the Pipeline or Bench UI must include a locus= token
in the header. This is required for grouped top‑N mean scoring.
Minimal example:
>name=foo|locus=foo_001
MSEQ...
Lasso benchmark (peptidase holdout) expects:
>pep=WP_012345678|locus=lab_0001_locus_01|case=lab_0001|orf=0001
MSEQ...
Beta‑lactamase benchmark expects:
>prot=AAC09015|locus=AAC09015|name=AER-1|class=A|source=ambler_table
MSEQ...
If locus= is missing, scoring will error with:
Missing locus= token in reference id.
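
To make the role of the locus= token concrete, here is a minimal sketch of header parsing and one plausible reading of grouped top-N mean scoring: keep the best cosine similarity per locus, then average the N highest locus-level scores. The function names, the default `top_n=3`, and this exact grouping rule are assumptions; the authoritative logic lives in lasso_workbench/core.

```python
import math


def parse_locus(header):
    """Extract the locus= value from a pipe-delimited FASTA header."""
    for token in header.lstrip(">").split("|"):
        key, _, value = token.partition("=")
        if key.strip() == "locus" and value.strip():
            return value.strip()
    raise ValueError("Missing locus= token in reference id.")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def grouped_top_n_mean(candidate, references, top_n=3):
    """Score one candidate embedding against (header, embedding) references.

    Assumed rule: best similarity per locus, then the mean of the top_n
    locus scores, so a locus with many near-identical references cannot
    dominate the average.
    """
    best_per_locus = {}
    for header, embedding in references:
        locus = parse_locus(header)
        similarity = cosine(candidate, embedding)
        best_per_locus[locus] = max(similarity, best_per_locus.get(locus, -1.0))
    top = sorted(best_per_locus.values(), reverse=True)[:top_n]
    if not top:
        raise ValueError("Reference set is empty.")
    return sum(top) / len(top)
```

Under this reading, a header without locus= cannot be assigned to a group, which is why scoring fails early rather than silently treating each sequence as its own locus.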
Use the helper script to append curated sequences without touching generated files:
.venv/bin/python scripts/add_validated_precursors.py \
--input data/lab_dataset/newlassos.txt \
--format newlassos \
--output-dir data/precursors \
--merge

The script also accepts TSV/CSV files with these columns (case-insensitive):
- sequence (or precursor / full_precursor)
- optional core (or core_aa)
- optional name
- optional orientation (leader_core / core_leader)
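
The case-insensitive column aliases can be resolved along these lines. This is a sketch using only the standard library, not the parser in scripts/add_validated_precursors.py; `ALIASES` and `read_curated_rows` are hypothetical names.

```python
import csv
import io

# Canonical column name -> accepted spellings (compared in lowercase).
ALIASES = {
    "sequence": ("sequence", "precursor", "full_precursor"),
    "core": ("core", "core_aa"),
    "name": ("name",),
    "orientation": ("orientation",),
}


def read_curated_rows(handle, delimiter="\t"):
    """Read a TSV/CSV handle and return rows keyed by canonical column names."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    lowered = {(col or "").strip().lower(): col for col in reader.fieldnames or []}
    resolved = {}
    for canonical, spellings in ALIASES.items():
        for spelling in spellings:
            if spelling in lowered:
                resolved[canonical] = lowered[spelling]
                break
    if "sequence" not in resolved:
        raise ValueError("No sequence/precursor/full_precursor column found.")
    rows = []
    for raw in reader:
        row = {key: (raw.get(col) or "").strip() for key, col in resolved.items()}
        if row["sequence"]:  # skip blank lines and rows without a sequence
            rows.append(row)
    return rows
```

Passing `io.StringIO("Full_Precursor\tCORE_AA\nMKKLT\tGG\n")` yields one row with the keys `sequence` and `core`; optional columns that are absent from the file are simply left out of each row.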
Example:
.venv/bin/python scripts/add_validated_precursors.py \
--input data/lab_dataset/curated_pairs.tsv \
--format tsv \
--output-dir data/precursors \
--merge

This updates:
- precursor_proteins_curated.faa
- precursor_proteins_curated.tsv
- precursor_proteins_verified.faa (merged set used by default)
- precursor_proteins_verified.tsv (merged metadata table)
Beyond precursor retrieval, the codebase includes an experimental alt‑frame conservation analysis. It looks for alt‑frame ORFs overlapping annotated genes that show unexpected conservation patterns. The current implementation lives in:
lasso_workbench/altframe/experimental/constraint_conservation
We anchor on CDS position (not ORF), bin the gene into regions, and ask:
- Survival: does the alt‑frame avoid stop codons?
- Identity: when it survives, is the peptide conserved?
A codon‑preserving null shuffles synonymous codons while keeping the primary protein intact, then re‑checks alt‑frame survival and identity across permutations (default: 200).
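
A codon-preserving null of this kind can be sketched as follows. The sketch permutes the codons that already occur in the CDS among positions encoding the same amino acid, so both the primary protein and the codon usage are unchanged. It scores a single sequence only, whereas the real analysis works across genomes and gene bins; the function names and the nucleotide `offset` convention are assumptions and may not match lasso_workbench/altframe.

```python
import random
from collections import defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate in frame 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def shuffle_synonymous(cds, rng):
    """Permute codons among positions that encode the same amino acid.

    Assumes an uppercase ACGT-only CDS; trailing bases that do not form
    a full codon are appended unchanged.
    """
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    positions = defaultdict(list)
    for index, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(index)
    shuffled = list(codons)
    for indices in positions.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return "".join(shuffled) + cds[usable:]


def altframe_survives(cds, offset):
    """True if the frame starting `offset` nucleotides in has no stop codon."""
    return "*" not in translate(cds[offset:])


def null_survival_rate(cds, offset=1, n_perm=200, seed=0):
    """Fraction of synonymous shuffles whose alt-frame stays stop-free."""
    rng = random.Random(seed)
    hits = sum(
        altframe_survives(shuffle_synonymous(cds, rng), offset)
        for _ in range(n_perm)
    )
    return hits / n_perm
```

Comparing the observed alt-frame survival against `null_survival_rate` separates what the primary protein sequence forces from what synonymous codon choice adds or removes in the overlapping frame.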
Example single locus from results/altframe_constraint_tests/altframe_constraint_conservation.tsv
(phnU, out_of_frame, +2, bin 0–0):
| Metric | Observed | Null | Interpretation |
|---|---|---|---|
| Survival | ~47% | ~94.6% | lower than null → stops are favored |
| Identity | ~69% | ~27% | higher than null → survivors are conserved |
| Z‑score | 70 | — | very large separation |
Translation: Some lineages appear to turn the alt‑frame “OFF” (introduce stops) while others keep it highly conserved (ON). This is compatible with a binary regulatory switch hypothesis, but the statistics are null‑model‑dependent.
Important: p‑values in the TSV are empirical with a finite number of permutations (with 200 iterations, the minimum is ~0.005). Treat Z‑scores as effect sizes, not exact tail probabilities.
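
The floor on empirical p-values follows directly from the permutation count. The sketch below assumes the common add-one estimator, (k + 1) / (n + 1), and a Z-score computed against the null mean and standard deviation; the TSV may use slightly different conventions, and the function names are illustrative.

```python
import statistics


def empirical_p(observed, null_values, tail="lower"):
    """Add-one empirical p-value for a one-sided permutation test."""
    if tail == "lower":
        extreme = sum(value <= observed for value in null_values)
    else:
        extreme = sum(value >= observed for value in null_values)
    return (extreme + 1) / (len(null_values) + 1)


def z_score(observed, null_values):
    """Distance of the observation from the null mean, in null standard deviations."""
    spread = statistics.pstdev(null_values)
    if spread == 0:
        return float("nan")
    return (observed - statistics.fmean(null_values)) / spread
```

With 200 permutations and no null value as extreme as the observation, `empirical_p` returns 1/201, about 0.005, no matter how far the observation lies from the null. The Z-score keeps growing with the separation, which is why it is more useful as an effect size than as a tail probability.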
Low survival + High identity = High cost
This is a practical heuristic: a peptide that is costly to maintain will be absent in many genomes (OFF) but highly conserved when present (ON). It is not a formal likelihood.
results/altframe_constraint_tests/altframe_constraint_conservation.tsv
The streamlined “essence” path is an ORF indexer that scans only antiSMASH lasso regions and stores 20–120 aa peptides in a local SQLite database for fast conservation queries.
Build the DB:
.venv/bin/python scripts/build_lasso_orf_db.py \
--gbk-dir data/antismash_lasso/gbk \
--db-path results/lasso_orf_index.sqlite \
--min-aa 20 \
--max-aa 120 \
--reset

The pipeline/UI will automatically attach genome_count (found-in-X-genomes) to candidates
when results/lasso_orf_index.sqlite exists.
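
A genome_count lookup could look like the sketch below. The schema shown here (a single `orfs` table with `genome_id` and `peptide` columns) is hypothetical and exists only to make the example runnable; the real layout is defined by scripts/build_lasso_orf_db.py and the spec referenced below.

```python
import sqlite3


def build_demo_index():
    """Create an in-memory index with a hypothetical two-column schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orfs (genome_id TEXT NOT NULL, peptide TEXT NOT NULL)")
    conn.execute("CREATE INDEX idx_orfs_peptide ON orfs (peptide)")
    conn.executemany(
        "INSERT INTO orfs (genome_id, peptide) VALUES (?, ?)",
        [
            ("genome_1", "MKKLTGG"),
            ("genome_2", "MKKLTGG"),
            ("genome_2", "MKKLTGG"),  # same peptide twice in one genome
            ("genome_3", "MSEQAA"),
        ],
    )
    return conn


def genome_count(conn, peptide):
    """Number of distinct genomes containing an exact copy of the peptide."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT genome_id) FROM orfs WHERE peptide = ?",
        (peptide,),
    ).fetchone()
    return row[0]
```

Counting distinct genomes rather than rows keeps duplicated ORFs within one genome from inflating the found-in-X-genomes figure, and the index on `peptide` keeps exact-match lookups fast.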
Spec reference:
lasso_workbench/altframe/specs/orf_indexer.md
MIT License
Made by Magnus Ohle
See NOTICE for attribution and reuse guidance.