A GUI application for semantic lasso peptide precursor discovery and dataset curation.
The Lasso Workbench implements a semantic discovery pipeline combining 6-frame translation with ESM-2 embeddings to identify lasso peptide precursors in genomic data.
Key Features:
- Semantic Discovery: Encodes candidate ORFs using ESM-2 (t6/t12/t30/t33) and ranks them by cosine similarity to validated precursors.
- Exhaustive Search: Extracts all potential ORFs (within configurable min/max amino-acid length bounds) from 6-frame translation, ensuring no candidates are missed due to annotation errors.
- Rule-Based Ranking: Filters candidates using customizable heuristic constraints (leader motifs, core length) on top of semantic scores.
- Dataset Management: Utilities for curating validated precursor datasets.
Gaps:
- No tail handling: the current pipeline does not perform tail pruning during core prediction.
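The exhaustive extraction step can be sketched in pure Python. This is a minimal illustration, not the Workbench implementation (which uses Biopython); the function names, length defaults, and return format are assumptions:

```python
BASES = "TCAG"
# NCBI standard genetic code, codons enumerated in TCAG order ("*" = stop)
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AA[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate_dna(dna):
    """Translate a DNA string in frame 0, one letter per codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def extract_orfs(dna, min_aa=20, max_aa=100):
    """Return (strand, frame, peptide) for every Met-to-stop ORF
    within the length bounds, across all six reading frames."""
    orfs = []
    for strand, seq in (("+", dna), ("-", dna.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            protein = translate_dna(seq[frame:])
            for segment in protein.split("*"):      # stop-delimited stretches
                start = segment.find("M")
                while start != -1:                  # every internal Met, too
                    peptide = segment[start:]
                    if min_aa <= len(peptide) <= max_aa:
                        orfs.append((strand, frame, peptide))
                    start = segment.find("M", start + 1)
    return orfs
```

Every internal Met is treated as a potential start, which is what makes the search exhaustive rather than annotation-dependent.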
```bash
# Install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -e .
```
```bash
# Launch the application
lasso-workbench
```

- Pipeline: Upload GBK files (MiBIG, antiSMASH, or GenBank).
- Configuration: Select an ESM-2 model (default: `esm2_t6_8M_UR50D`) and a validated precursor reference set.
- Execution: The pipeline performs:
  - Extraction: 6-frame translation of input sequences.
  - Embedding: Batch inference via ESM-2 (masked padding).
  - Scoring: Compute pairwise cosine similarity against references.
  - Ranking: Sort by Top-N mean similarity and apply rule filters.
- Analysis: Review ranked candidates in the tabular view and HTML visualisation, and export to JSON/CSV/FASTA.
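The scoring and ranking steps above can be sketched with plain NumPy. A minimal sketch, assuming row-wise candidate and reference embedding matrices with nonzero rows; `top_n_mean_scores` and its default `n` are hypothetical names, not the pipeline API:

```python
import numpy as np


def top_n_mean_scores(candidates, references, n=5):
    """Score each candidate embedding by the mean of its n highest
    cosine similarities against the reference embeddings."""
    a = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    b = references / np.linalg.norm(references, axis=1, keepdims=True)
    sims = a @ b.T                            # (n_candidates, n_references)
    n = min(n, sims.shape[1])
    top_n = np.sort(sims, axis=1)[:, -n:]     # n best references per row
    return top_n.mean(axis=1)


# rank candidates, best first
scores = top_n_mean_scores(np.random.rand(10, 320), np.random.rand(30, 320))
order = np.argsort(scores)[::-1]
```

Averaging over the N closest references rather than taking the single best makes the ranking less sensitive to one unusually similar reference sequence.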
- Core Logic: `lasso_workbench/core` contains embedding scoring (NumPy/scikit-learn) and the rule engine.
- Pipeline: `lasso_workbench/pipeline` contains ORF extraction (Biopython) and ESM-2 embedding (Hugging Face Transformers).
- UI: `lasso_workbench/ui` provides the Gradio-based interface.
- Data: JSON-based persistence for easy portability.
Validated precursor reference sets live in `data/precursors/`.
Generated artifacts (do not edit by hand):
- `precursor_proteins_multi_strict.faa` — multi-candidate validated set generated by `scripts/generate_precursor_dataset_multi.py`
- `lab_core_candidates_multi_strict.tsv`
- `lab_core_loci_multi_strict.tsv`
- `lab_core_cases_summary.tsv`
- `lab_core_dataset_summary.json`
Persistent curated additions (safe to edit via script):
- `precursor_proteins_curated.faa`
- `precursor_proteins_curated.tsv`
Default validated set used by the UI:
- `precursor_proteins_verified.faa` — merged, de-duplicated set combining `multi_strict` + `curated`
- `precursor_proteins_verified.tsv` — merged metadata table (read-only; used by the Dataset tab)
Any reference FASTA you drop into the Pipeline or Bench UI must include a `locus=` token in the header. This is required for grouped Top-N mean scoring.
Minimal example:
```
>name=foo|locus=foo_001
MSEQ...
```
Lasso benchmark (peptidase holdout) expects:
```
>pep=WP_012345678|locus=lab_0001_locus_01|case=lab_0001|orf=0001
MSEQ...
```
Beta‑lactamase benchmark expects:
```
>prot=AAC09015|locus=AAC09015|name=AER-1|class=A|source=ambler_table
MSEQ...
```
If `locus=` is missing, scoring fails with:

```
Missing locus= token in reference id.
```
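A minimal sketch of how such a header check might work, assuming `|`-delimited `key=value` tokens as in the examples above; `parse_locus`, `group_by_locus`, and the regex are illustrative, not the scorer's actual code:

```python
import re
from collections import defaultdict

# locus= must appear at the start of the id or after a "|" separator
LOCUS_RE = re.compile(r"(?:^|\|)locus=([^|\s]+)")


def parse_locus(header):
    """Extract the locus= token from a FASTA header (without ">"),
    raising the error the scorer reports when it is missing."""
    m = LOCUS_RE.search(header)
    if m is None:
        raise ValueError("Missing locus= token in reference id.")
    return m.group(1)


def group_by_locus(records):
    """Group (header, sequence) pairs by locus, e.g. for the grouped
    Top-N mean scoring described above."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[parse_locus(header)].append(seq)
    return groups
```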
Use the helper script to append curated sequences without touching generated files:
```bash
.venv/bin/python scripts/add_validated_precursors.py \
  --input data/lab_dataset/newlassos.txt \
  --format newlassos \
  --output-dir data/precursors \
  --merge
```

The script also accepts TSV/CSV files with these columns (case-insensitive):

- `sequence` (or `precursor` / `full_precursor`)
- optional `core` (or `core_aa`)
- optional `name`
- optional `orientation` (`leader_core` / `core_leader`)
Example:
```bash
.venv/bin/python scripts/add_validated_precursors.py \
  --input data/lab_dataset/curated_pairs.tsv \
  --format tsv \
  --output-dir data/precursors \
  --merge
```

This updates:

- `precursor_proteins_curated.faa`
- `precursor_proteins_curated.tsv`
- `precursor_proteins_verified.faa` (merged set used by default)
- `precursor_proteins_verified.tsv` (merged metadata table)
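The merge into the verified set can be sketched as follows. This is an illustrative dict-based sketch, assuming de-duplication by exact sequence with the first occurrence winning; the real script's dedup key and precedence rules may differ:

```python
def merge_fasta(primary, curated):
    """Merge two {header: sequence} mappings into one verified set,
    dropping exact duplicate sequences (first occurrence wins)."""
    merged, seen = {}, set()
    for source in (primary, curated):
        for header, seq in source.items():
            if seq not in seen:
                seen.add(seq)
                merged[header] = seq
    return merged
```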
MIT License
Made by Magnus Ohle
See NOTICE for attribution and reuse guidance.