An AI-driven pipeline for the identification of CLE signaling peptides in plant proteomes using protein language model embeddings.
This pipeline uses two Protein Language Models, ESM2 and ProtT5, to discover novel CLE peptides directly from plant proteomes. It couples the sequence embeddings from both models with unsupervised clustering and a supervised XGBoost classifier, so that it can pick up features of the CLE family that traditional sequence alignment methods can miss.
-
Step 1: Embedding extraction (ESM2 + ProtT5):
- Embeddings_ESM2.py
- Embeddings_T5.py
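
Both ESM2 and ProtT5 produce one vector per residue, so a per-protein representation needs some pooling step. The sketch below shows mean pooling, a common choice; whether `Embeddings_ESM2.py` and `Embeddings_T5.py` pool this way is an assumption, and the function name `mean_pool` is hypothetical. It is written in plain Python so it runs without dependencies; real embeddings would normally be tensors or arrays.

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-length protein vector.

    Hypothetical helper: the actual scripts may pool differently.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(row) != dim for row in residue_embeddings):
        raise ValueError("all residue embeddings must have the same dimension")
    n_residues = len(residue_embeddings)
    # Average each embedding dimension across all residues.
    return [
        sum(row[i] for row in residue_embeddings) / n_residues
        for i in range(dim)
    ]


# Toy input: a 3-residue peptide with 2-dimensional embeddings.
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
protein_vector = mean_pool(toy_embeddings)
print(protein_vector)
```

The pooled vector has the same length regardless of peptide length, which is what lets sequences of different sizes be clustered and classified together.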
-
Step 2: Clustering analysis:
- Cluster_maps.py
-
Step 3: XGBoost training:
- XGB_training.py
-
Step 4: XGBoost cluster prediction:
- Cluster_prediction.py
-
Step 5: Sequence extraction:
- Sequence_MEME_analysis.py
- extract_all_candidates.py (optional; extracts all candidates with a score ≥ 0.5)
- Bokeh_visualization_positives.py
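
The five steps above are meant to run in order. A minimal sketch of a runner follows, assuming each script can be launched from the repository root without command-line arguments (the README does not say which arguments, if any, the scripts take). The names `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical and not part of the repository.

```python
import subprocess
import sys
from typing import List, Tuple

# (step, script, is_optional) in the order listed in this README.
PIPELINE: List[Tuple[str, str, bool]] = [
    ("Step 1", "Embeddings_ESM2.py", False),
    ("Step 1", "Embeddings_T5.py", False),
    ("Step 2", "Cluster_maps.py", False),
    ("Step 3", "XGB_training.py", False),
    ("Step 4", "Cluster_prediction.py", False),
    ("Step 5", "Sequence_MEME_analysis.py", False),
    ("Step 5", "extract_all_candidates.py", True),
    ("Step 5", "Bokeh_visualization_positives.py", False),
]


def build_commands(
    include_optional: bool = False, python: str = sys.executable
) -> List[List[str]]:
    """Return one command per script, skipping optional ones by default."""
    return [
        [python, script]
        for _step, script, optional in PIPELINE
        if include_optional or not optional
    ]


def run_pipeline(include_optional: bool = False) -> None:
    """Run each script in order and stop at the first failure."""
    for command in build_commands(include_optional):
        print("Running:", " ".join(command))
        # Assumption: scripts need no arguments; check=True raises on error.
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` runs the seven required scripts; `run_pipeline(include_optional=True)` also runs `extract_all_candidates.py`. Stopping at the first failure keeps later steps from consuming incomplete outputs of earlier ones.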