A GUI application for semantic lasso peptide precursor discovery and dataset curation.
The Lasso Workbench implements a semantic discovery pipeline combining 6-frame translation with ESM-2 embeddings to identify lasso peptide precursors in genomic data.
Key Features:
- Semantic Discovery: Encodes candidate ORFs using ESM-2 (t6/t12/t30/t33) and ranks them by cosine similarity to validated precursors.
- Exhaustive Search: Extracts all potential ORFs (within configurable min/max amino-acid length bounds) from 6-frame translation, ensuring no candidates are missed due to annotation errors.
- Rule-Based Ranking: Filters candidates using customizable heuristic constraints (leader motifs, core length) on top of semantic scores.
- Dataset Management: Utilities for curating validated precursor datasets.
Gaps:
- No tail handling: the current pipeline does not perform tail pruning during core prediction.
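The exhaustive extraction step can be sketched in pure Python. This is a minimal illustration, not the Workbench implementation (which uses Biopython); the function names, length defaults, and return format are assumptions:

```python
BASES = "TCAG"
# NCBI standard genetic code, codons enumerated in TCAG order ("*" = stop)
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AA[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate_dna(dna):
    """Translate a DNA string in frame 0, one letter per codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def extract_orfs(dna, min_aa=20, max_aa=100):
    """Return (strand, frame, peptide) for every Met-to-stop ORF
    within the length bounds, across all six reading frames."""
    orfs = []
    for strand, seq in (("+", dna), ("-", dna.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            protein = translate_dna(seq[frame:])
            for segment in protein.split("*"):      # stop-delimited stretches
                start = segment.find("M")
                while start != -1:                  # every internal Met, too
                    peptide = segment[start:]
                    if min_aa <= len(peptide) <= max_aa:
                        orfs.append((strand, frame, peptide))
                    start = segment.find("M", start + 1)
    return orfs
```

Every internal Met is treated as a potential start, which is what makes the search exhaustive rather than annotation-dependent.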
```bash
# Install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -e .
```
```bash
# Launch the application
lasso-workbench
```

- Pipeline: Upload GBK files (MiBIG, antiSMASH, or GenBank).
- Configuration: Select an ESM-2 model (default: `esm2_t6_8M_UR50D`) and a validated precursor reference set.
- Execution: The pipeline performs:
  - Extraction: 6-frame translation of input sequences.
  - Embedding: Batch inference via ESM-2 (masked padding).
  - Scoring: Compute pairwise cosine similarity against references.
  - Ranking: Sort by Top-N mean similarity and apply rule filters.
- Analysis: Review ranked candidates in the tabular view and HTML visualisation, and export to JSON/CSV/FASTA.
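The scoring and ranking steps above can be sketched with plain NumPy. A minimal sketch, assuming row-wise candidate and reference embedding matrices with nonzero rows; `top_n_mean_scores` and its default `n` are hypothetical names, not the pipeline API:

```python
import numpy as np


def top_n_mean_scores(candidates, references, n=5):
    """Score each candidate embedding by the mean of its n highest
    cosine similarities against the reference embeddings."""
    a = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    b = references / np.linalg.norm(references, axis=1, keepdims=True)
    sims = a @ b.T                            # (n_candidates, n_references)
    n = min(n, sims.shape[1])
    top_n = np.sort(sims, axis=1)[:, -n:]     # n best references per row
    return top_n.mean(axis=1)


# rank candidates, best first
scores = top_n_mean_scores(np.random.rand(10, 320), np.random.rand(30, 320))
order = np.argsort(scores)[::-1]
```

Averaging over the N closest references rather than taking the single best makes the ranking less sensitive to one unusually similar reference sequence.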
- Core Logic: `lasso_workbench/core` contains embedding scoring (NumPy/scikit-learn) and the rule engine.
- Pipeline: `lasso_workbench/pipeline` contains ORF extraction (Biopython) and ESM-2 embedding (Hugging Face Transformers).
- UI: `lasso_workbench/ui` provides the Gradio-based interface.
- Data: JSON-based persistence for easy portability.
Validated precursor reference sets live in `data/precursors/`.
Generated artifacts (do not edit by hand):
- `precursor_proteins_multi_strict.faa` — multi-candidate validated set generated by `scripts/generate_precursor_dataset_multi.py`
- `lab_core_candidates_multi_strict.tsv`
- `lab_core_loci_multi_strict.tsv`
- `lab_core_cases_summary.tsv`
- `lab_core_dataset_summary.json`
Persistent curated additions (safe to edit via script):
- `precursor_proteins_curated.faa`
- `precursor_proteins_curated.tsv`
Default validated set used by the UI:
- `precursor_proteins_verified.faa` — merged, de-duplicated set combining `multi_strict` + `curated`
- `precursor_proteins_verified.tsv` — merged metadata table (read-only; used by the Dataset tab)
Any reference FASTA you drop into the Pipeline or Bench UI must include a `locus=` token in the header. This is required for grouped Top-N mean scoring.
Minimal example:
```
>name=foo|locus=foo_001
MSEQ...
```
Lasso benchmark (peptidase holdout) expects:
```
>pep=WP_012345678|locus=lab_0001_locus_01|case=lab_0001|orf=0001
MSEQ...
```
Beta‑lactamase benchmark expects:
```
>prot=AAC09015|locus=AAC09015|name=AER-1|class=A|source=ambler_table
MSEQ...
```
If `locus=` is missing, scoring fails with:

```
Missing locus= token in reference id.
```
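A minimal sketch of how such a header check might work, assuming `|`-delimited `key=value` tokens as in the examples above; `parse_locus`, `group_by_locus`, and the regex are illustrative, not the scorer's actual code:

```python
import re
from collections import defaultdict

# locus= must appear at the start of the id or after a "|" separator
LOCUS_RE = re.compile(r"(?:^|\|)locus=([^|\s]+)")


def parse_locus(header):
    """Extract the locus= token from a FASTA header (without ">"),
    raising the error the scorer reports when it is missing."""
    m = LOCUS_RE.search(header)
    if m is None:
        raise ValueError("Missing locus= token in reference id.")
    return m.group(1)


def group_by_locus(records):
    """Group (header, sequence) pairs by locus, e.g. for the grouped
    Top-N mean scoring described above."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[parse_locus(header)].append(seq)
    return groups
```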
Use the helper script to append curated sequences without touching generated files:
```bash
.venv/bin/python scripts/add_validated_precursors.py \
  --input data/lab_dataset/newlassos.txt \
  --format newlassos \
  --output-dir data/precursors \
  --merge
```

The script also accepts TSV/CSV files with these columns (case-insensitive):

- `sequence` (or `precursor` / `full_precursor`)
- optional `core` (or `core_aa`)
- optional `name`
- optional `orientation` (`leader_core` / `core_leader`)
Example:
```bash
.venv/bin/python scripts/add_validated_precursors.py \
  --input data/lab_dataset/curated_pairs.tsv \
  --format tsv \
  --output-dir data/precursors \
  --merge
```

This updates:

- `precursor_proteins_curated.faa`
- `precursor_proteins_curated.tsv`
- `precursor_proteins_verified.faa` (merged set used by default)
- `precursor_proteins_verified.tsv` (merged metadata table)
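The merge into the verified set can be sketched as follows. This is an illustrative dict-based sketch, assuming de-duplication by exact sequence with the first occurrence winning; the real script's dedup key and precedence rules may differ:

```python
def merge_fasta(primary, curated):
    """Merge two {header: sequence} mappings into one verified set,
    dropping exact duplicate sequences (first occurrence wins)."""
    merged, seen = {}, set()
    for source in (primary, curated):
        for header, seq in source.items():
            if seq not in seen:
                seen.add(seq)
                merged[header] = seq
    return merged
```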
MIT License
Made by Magnus Ohle
See NOTICE for attribution and reuse guidance.