ZiemertLab/Lasso

 
 

Lasso Workbench

A GUI application for semantic lasso peptide precursor discovery and dataset curation.

Overview

The Lasso Workbench implements a semantic discovery pipeline combining 6-frame translation with ESM-2 embeddings to identify lasso peptide precursors in genomic data.

Key Features:

  • Semantic Discovery: Encodes candidate ORFs using ESM-2 (t6/t12/t30/t33) and ranks them by cosine similarity to validated precursors.
  • Exhaustive Search: Extracts all potential ORFs (limited by min/max AA length) from 6-frame translation, ensuring no candidates are missed due to annotation errors.
  • Rule-Based Ranking: Filters candidates using customizable heuristic constraints (leader motifs, core length) on top of semantic scores.
  • Dataset Management: Utilities for curating sets of validated precursors.
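To make the exhaustive search idea concrete, here is a minimal sketch of 6-frame ORF extraction with a length filter. It is an illustration, not the project's code: the function name `six_frame_orfs`, the default length bounds, and the ORF definition (first Met to the next stop, with a run-off segment at the end of a frame also accepted) are all assumptions. The Architecture section lists BioPython for this step; the sketch uses only the standard library so it can run on its own.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_aa=20, max_aa=120):
    """Return (strand, frame, peptide) for every Met-initiated ORF in range.

    Hypothetical helper: frames on the "-" strand are counted from the start
    of the reverse complement. Length bounds are illustrative defaults.
    """
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, seq in strands.items():
        for frame in range(3):
            protein = translate(seq[frame:])
            # Stops split the frame into segments; keep from the first Met on.
            for segment in protein.split("*"):
                start = segment.find("M")
                if start == -1:
                    continue
                peptide = segment[start:]
                if min_aa <= len(peptide) <= max_aa:
                    orfs.append((strand, frame, peptide))
    return orfs


if __name__ == "__main__":
    # Toy input: ATG AAA TTT TAG encodes M-K-F followed by a stop.
    print(six_frame_orfs("ATGAAATTTTAG", min_aa=2, max_aa=10))
```

Because every frame on both strands is scanned, the candidate list does not depend on existing gene annotations; the min/max bounds are what keep it manageable.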

Gaps:

  • No tail handling: The current pipeline does not perform tail pruning during core prediction.

Quick Start

Installation

# Create a virtual environment and install the package with its dependencies
python -m venv .venv
source .venv/bin/activate
pip install -e .

# Launch Application
lasso-workbench

Usage

  1. Pipeline: Upload GBK files (MiBIG, antiSMASH, or GenBank).
  2. Configuration: Select an ESM-2 model (default: esm2_t6_8M_UR50D) and a validated precursor reference set.
  3. Execution: The pipeline performs:
    • Extraction: 6-frame translation of input sequences.
    • Embedding: Batch inference via ESM-2 (masked padding).
    • Scoring: Compute pairwise cosine similarity against references.
    • Ranking: Sort by Top-N Mean similarity and apply rule filters.
  4. Analysis: Review ranked candidates in the tabular view or the HTML visualisation, then export the results to JSON, CSV, or FASTA.
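The embedding and scoring steps above can be sketched in a few lines. This is one plausible reading, not the project's implementation. It assumes that "masked padding" means padding positions are excluded when token embeddings are averaged, and that "Top-N Mean" means taking the best similarity per reference locus and averaging the N highest. The function names and the `top_n=3` default are hypothetical. The Architecture section lists NumPy/Scikit-learn for scoring; plain Python is used here for readability.

```python
import math


def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("sequence has no unmasked tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def grouped_top_n_mean(candidate, references, top_n=3):
    """Score one candidate against (locus, vector) reference pairs.

    Assumed grouping: keep the best similarity per locus so that a locus
    with many reference sequences does not dominate, then average the
    top_n locus scores.
    """
    best_per_locus = {}
    for locus, vector in references:
        score = cosine(candidate, vector)
        if locus not in best_per_locus or score > best_per_locus[locus]:
            best_per_locus[locus] = score
    top = sorted(best_per_locus.values(), reverse=True)[:top_n]
    return sum(top) / len(top) if top else 0.0


if __name__ == "__main__":
    # Toy 2-D vectors stand in for ESM-2 embeddings.
    candidate = masked_mean_pool([[1.0, 0.0], [1.0, 0.0], [9.0, 9.0]], [1, 1, 0])
    references = [("a", [1.0, 0.0]), ("a", [0.0, 1.0]), ("b", [0.0, 1.0])]
    print(grouped_top_n_mean(candidate, references, top_n=2))
```

Candidates would then be sorted by this score in descending order before the rule filters are applied.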

Architecture

  • Core Logic: lasso_workbench/core - Embedding scoring (NumPy/scikit-learn), Rule Engine.
  • Pipeline: lasso_workbench/pipeline - ORF extraction (Biopython), ESM-2 Embedding (Hugging Face Transformers).
  • UI: lasso_workbench/ui - Gradio-based interface.
  • Data: JSON-based persistence for easy portability.

Precursor Datasets

Validated precursor reference sets live in data/precursors/.

Generated artifacts (do not edit by hand):

  • precursor_proteins_multi_strict.faa — multi‑candidate validated set generated by scripts/generate_precursor_dataset_multi.py
  • lab_core_candidates_multi_strict.tsv
  • lab_core_loci_multi_strict.tsv
  • lab_core_cases_summary.tsv
  • lab_core_dataset_summary.json

Persistent curated additions (safe to edit via script):

  • precursor_proteins_curated.faa
  • precursor_proteins_curated.tsv

Default validated set used by the UI:

  • precursor_proteins_verified.faa — merged, de‑duplicated set combining multi_strict + curated
  • precursor_proteins_verified.tsv — merged metadata table (read‑only; used by the Dataset tab)

Reference FASTA format (Pipeline + Bench UI)

Any reference FASTA you drop into the Pipeline or Bench UI must include a locus= token in the header. This is required for grouped top‑N mean scoring.

Minimal example:

>name=foo|locus=foo_001
MSEQ...

Lasso benchmark (peptidase holdout) expects:

>pep=WP_012345678|locus=lab_0001_locus_01|case=lab_0001|orf=0001
MSEQ...

Beta‑lactamase benchmark expects:

>prot=AAC09015|locus=AAC09015|name=AER-1|class=A|source=ambler_table
MSEQ...

If locus= is missing, scoring will error with: Missing locus= token in reference id.
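A quick way to check a reference FASTA before loading it is to parse each header into its `key=value` tokens. The sketch below assumes tokens are separated by `|`, as in the examples above; `parse_reference_header` is a hypothetical helper, and the project's own parser may behave differently.

```python
def parse_reference_header(header):
    """Split a FASTA header such as '>name=foo|locus=foo_001' into a dict.

    Hypothetical helper: raises ValueError when no locus= token is present,
    mirroring the error message quoted above.
    """
    fields = {}
    for token in header.lstrip(">").strip().split("|"):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key.strip()] = value.strip()
    if not fields.get("locus"):
        raise ValueError("Missing locus= token in reference id")
    return fields


if __name__ == "__main__":
    print(parse_reference_header(">name=foo|locus=foo_001"))
```

Running this over every header line (those starting with `>`) surfaces missing tokens before the scoring step does.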

Add new curated precursors

Use the helper script to append curated sequences without touching generated files:

.venv/bin/python scripts/add_validated_precursors.py \
  --input data/lab_dataset/newlassos.txt \
  --format newlassos \
  --output-dir data/precursors \
  --merge

The script also accepts TSV/CSV files with these columns (case-insensitive):

  • sequence (or precursor / full_precursor)
  • optional core (or core_aa)
  • optional name
  • optional orientation (leader_core / core_leader)

Example:

.venv/bin/python scripts/add_validated_precursors.py \
  --input data/lab_dataset/curated_pairs.tsv \
  --format tsv \
  --output-dir data/precursors \
  --merge

This updates:

  • precursor_proteins_curated.faa
  • precursor_proteins_curated.tsv
  • precursor_proteins_verified.faa (merged set used by default)
  • precursor_proteins_verified.tsv (merged metadata table)
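For TSV/CSV input, the case-insensitive column aliases listed earlier could be resolved roughly as follows. This is a sketch under assumptions, not the behaviour of scripts/add_validated_precursors.py: the function name `read_curated_rows` is hypothetical, and the sample row in the usage example is made up for illustration.

```python
import csv
import io

# Accepted column names, taken from the list in this README.
SEQUENCE_ALIASES = ("sequence", "precursor", "full_precursor")
CORE_ALIASES = ("core", "core_aa")


def read_curated_rows(text, delimiter="\t"):
    """Parse delimited text into normalised rows; sequence is mandatory."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        # Lower-case the headers so lookups are case-insensitive.
        row = {
            key.strip().lower(): (value or "").strip()
            for key, value in raw.items()
            if key is not None  # DictReader stores surplus fields under None
        }
        sequence = next((row[a] for a in SEQUENCE_ALIASES if row.get(a)), None)
        if not sequence:
            raise ValueError("row has no sequence/precursor/full_precursor value")
        rows.append({
            "sequence": sequence,
            "core": next((row[a] for a in CORE_ALIASES if row.get(a)), None),
            "name": row.get("name") or None,
            "orientation": row.get("orientation") or None,
        })
    return rows


if __name__ == "__main__":
    # Made-up example row, not a real precursor.
    sample = "Precursor\tCore_AA\tName\nMKKAAA\tAAA\tfoo\n"
    print(read_curated_rows(sample))
```

Pass `delimiter=","` for CSV files.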

LICENSE

MIT License

Made by Magnus Ohle

NOTICE

See NOTICE for attribution and reuse guidance.
