Identification of CLE signal peptides using protein Language Models

sales-lab/uncleash

CLE Peptide Discovery Pipeline

An AI-driven pipeline for the identification of CLE signaling peptides in plant proteomes using protein language model embeddings.

Overview

This pipeline uses two Protein Language Models (ESM2 and ProtT5) to discover novel CLE peptides directly from plant proteomes. Sequence embeddings from both models are combined with unsupervised clustering and supervised machine learning (XGBoost), so the dual-model approach can capture features of the CLE family that traditional sequence alignment methods miss.
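As a rough illustration of how per-residue embeddings from two models can become one feature vector per protein, the sketch below mean-pools each model's residue vectors and concatenates the results. This page does not state how the embeddings are pooled or combined, so mean pooling, concatenation, and the names `mean_pool` and `combine_models` are assumptions made for illustration. The toy numbers stand in for real ESM2 and ProtT5 outputs.

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-length protein vector.

    Assumption: mean pooling is one common choice; the pipeline may pool differently.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(vec) != dim for vec in residue_embeddings):
        raise ValueError("all residue embeddings must have the same dimension")
    n = len(residue_embeddings)
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


def combine_models(esm2_vec: Sequence[float], prott5_vec: Sequence[float]) -> List[float]:
    """Concatenate the two pooled vectors (hypothetical way to merge both models)."""
    return list(esm2_vec) + list(prott5_vec)


# Toy per-residue embeddings for a two-residue sequence (not real model output).
esm2_residues = [[1.0, 3.0], [3.0, 5.0]]
prott5_residues = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]]

features = combine_models(mean_pool(esm2_residues), mean_pool(prott5_residues))
print(features)  # [2.0, 4.0, 1.0, 3.0, 5.0]
```

Pooling to a fixed length matters because proteins differ in length, while clustering and tree-based classifiers expect one vector of constant size per protein.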

Pipeline Architecture

  • Step 1: Embedding extraction (ESM2 + ProtT5):

    • Embeddings_ESM2.py
    • Embeddings_T5.py
  • Step 2: Clustering analysis:

    • Cluster_maps.py
  • Step 3: XGBoost training:

    • XGB_training.py
  • Step 4: XGBoost cluster prediction:

    • Cluster_prediction.py
  • Step 5: Sequence extraction:

    • Sequence_MEME_analysis.py
    • extract_all_candidates.py (optional; candidates ≥ 0.5)
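
The optional extraction step keeps candidates at or above 0.5. Below is a minimal sketch of such a filter, assuming the value is a per-protein score such as a classifier probability. The function `select_candidates` and the example IDs and scores are hypothetical and are not taken from `extract_all_candidates.py`.

```python
from typing import Dict, List, Tuple


def select_candidates(
    scores: Dict[str, float], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (protein_id, score) pairs with score >= threshold, best first."""
    kept = [(pid, score) for pid, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical scores; in practice these would come from the prediction step.
example_scores = {"protA": 0.91, "protB": 0.50, "protC": 0.12}
selected = select_candidates(example_scores)
print(selected)  # [('protA', 0.91), ('protB', 0.5)]
```

The comparison is inclusive, matching the "≥ 0.5" wording, so a score of exactly 0.5 is retained.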

Visualization

  • Bokeh_visualization_positives.py
