[NAACL 2025] Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text
Proteins are fundamental to biological processes, yet 99.7% of the 227 million known protein sequences remain uncharacterized due to experimental limitations. Protein2Text is a multimodal large language model designed to bridge this gap by interpreting protein sequences and generating informative text-based answers to open-ended scientific questions.
Built on an adapted LLaVA framework with an advanced resampling mechanism, Protein2Text effectively translates protein sequences into a language-compatible space, allowing it to handle diverse and complex biological queries.
✔ Multimodal Capability: Interprets protein sequences and generates textual insights.
✔ LLaVA-Based Architecture: Integrates a resampling mechanism for improved protein-to-text mapping.
✔ Curated Training Data: Derived from PubMed articles to enhance domain-specific knowledge.
✔ Comprehensive Benchmarking: Evaluated on six rigorous benchmarks, including in-domain, cross-domain, zero-shot, and classification benchmarks.
✔ Open-Ended Q&A: Outperforms existing models in generating informative and relevant protein function insights.
✔ Publicly Available: Model weights, evaluation datasets, and evaluation scripts are fully open-source.
Protein2Text has been rigorously evaluated using multiple benchmarks and has demonstrated superior performance in open-ended question-answering compared to existing models.
🔹 Outperforms several baselines in biological Q&A tasks.
🔹 Addresses limitations in existing evaluation metrics for template-based methods.
🔹 Proposes improved assessment strategies to reduce bias in evaluating protein-related NLP models.
- [03/7/25] The evaluation datasets of Protein2Text are released here, evaluation scripts are released here, and LoRA-based model weights 🤗 are released here.
- [02/12/25] Protein2Text is accepted to the NAACL 2025 Industry Track.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to LLaVA and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-3.1, LLaMA-2, and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
1. Clone this repository and navigate to the Protein2Text folder:

   ```bash
   git clone https://github.com/alaaj27/Protein2Text.git
   cd Protein2Text
   ```

2. Download model weights:

   ```bash
   mkdir checkpoints
   cd checkpoints
   git clone https://huggingface.co/tumorailab/protein2text-llama3.1-8B-instruct-esm2-650M
   cd ..
   ```
3. Install packages:

   3.1. Using conda:

   ```bash
   conda create -n protein2text_env python=3.10 -y
   conda activate protein2text_env
   pip install --upgrade pip
   pip install -e .
   ```

   3.2. Using python venv (note: this will create a new directory called protein2text_env):

   ```bash
   python -m venv protein2text_env
   source protein2text_env/bin/activate
   pip install --upgrade pip
   pip install -e .
   ```
1. Download checkpoint:

   ```bash
   mkdir checkpoints
   cd checkpoints
   git clone https://huggingface.co/tumorailab/protein2text-llama3.1-8B-instruct-esm2-650M
   ```

2. Run inference on a batch:

   ```bash
   cd ..
   python evaluation/inference_model.py \
       --input_file data/[FILE_NAME].json \
       --model_path checkpoints/protein2text-llama3.1-8B-instruct-esm2-650M/ \
       --model_base meta-llama/Meta-Llama-3.1-8B-Instruct
   ```
Protein2Text training consists of two stages:
- Projector and Resampler Alignment Stage (Pretraining): We collected a dataset of 394,000 protein amino acid sequences paired with function descriptions from UniProt. The entire dataset is used during pretraining to train the resampler and the projector, connecting a frozen pretrained protein encoder to a frozen pretrained LLM (a minimal sketch of this wiring follows the list).
- Sequence Instruction Tuning Stage (Fine-tuning): We generated a comprehensive question-answering dataset (Protein2Text-QA) of approximately 210,000 QA pairs to fine-tune the model parameters. Questions and answers are derived from protein research published in the PubMed Central (PMC) database: articles mentioning a protein's name are located and fed to LLaMA-3.1 to generate a series of QA pairs that focus only on the given protein.
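A minimal sketch of the alignment stage, assuming a generic PyTorch setup: only the resampler and projector receive gradients, while the pretrained protein encoder and LLM stay frozen. Module names, dimensions, and the loss handling are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class AlignmentModel(nn.Module):
    """Connects a frozen protein encoder to a frozen LLM via trainable modules."""

    def __init__(self, protein_encoder, resampler, projector, llm):
        super().__init__()
        self.protein_encoder = protein_encoder  # e.g., a pretrained ESM2 encoder (frozen)
        self.resampler = resampler              # trainable: compresses residue tokens to a fixed length
        self.projector = projector              # trainable: maps features into the LLM embedding space
        self.llm = llm                          # pretrained decoder-only LLM (frozen)

        # Freeze the pretrained backbones; only resampler + projector are updated.
        for p in self.protein_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, protein_tokens, text_embeds, labels):
        # Residue-level features from the frozen encoder: (B, L, d_enc)
        with torch.no_grad():
            residue_feats = self.protein_encoder(protein_tokens)
        latents = self.resampler(residue_feats)       # (B, num_latents, d_enc)
        protein_embeds = self.projector(latents)      # (B, num_latents, d_llm)
        # Prepend the projected protein tokens to the text embeddings; labels are
        # assumed to carry ignore_index (-100) over the protein positions so the
        # next-token loss is computed only on the function description.
        inputs = torch.cat([protein_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs, labels=labels).loss
```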
Model training and inference were performed mainly on two NVIDIA H100 PCIe GPUs with 80 GB of VRAM each. Training time depends on the number of trainable parameters, the batch sizes, and other configuration choices such as gradient checkpointing, LoRA parameters, and the resampler configuration; in our runs, the pretraining stage took roughly 8 to 13 hours and the fine-tuning stage 12 to 20 hours.
We aligned the hyperparameters of Protein2Text with those provided by LLaVA as closely as possible. The hyperparameters used in pretraining and fine-tuning are listed below.
| Phase | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay | Precision | Optimizer | Gradient Accumulation Steps | Warmup Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Pretraining | 256 | 2 × 10⁻³ | 1 | 2048 | 0 | bf16 (Mixed Precision) | AdamW | 1 step | 0.03 |
| Fine-tuning | 128 | 8 × 10⁻⁶ | 5 | 2048 | 0 | bf16 (Mixed Precision) | AdamW | 1 step | 0.03 |
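For orientation, the pretraining row of the table above can be expressed as Hugging Face `TrainingArguments` roughly as follows; the argument names are standard `transformers` options, but the launcher, scheduler, and per-device batch split are assumptions rather than the repository's exact configuration.

```python
from transformers import TrainingArguments

# Rough transcription of the pretraining hyperparameters from the table above.
pretrain_args = TrainingArguments(
    output_dir="checkpoints/pretrain",
    per_device_train_batch_size=128,    # assumes 2 GPUs: 2 x 128 = global batch size 256
    gradient_accumulation_steps=1,
    learning_rate=2e-3,
    num_train_epochs=1,
    weight_decay=0.0,
    warmup_ratio=0.03,
    bf16=True,                          # bf16 mixed precision
    optim="adamw_torch",                # AdamW optimizer
    lr_scheduler_type="cosine",         # assumption, mirroring LLaVA's default schedule
)
```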
- Protein Encoder:
  - Model: ESM2-650M
  - Output Tokens: All (i.e., no truncation)
  - Feature Layer: -2 (second to last)
- Language Model:
  - Model: LLaMA-3.1-8B-Instruct
  - LoRA Rank: 64
  - LoRA Alpha: 16
  - Context Length: 2048
- Projector:
  - Number of Layers: 2
  - Activation: GELU
  - Hidden Dimensions: 4096
- Perceiver Resampler (an illustrative sketch follows this list):
  - Number of Attention Layers: 4096
  - Attention Heads: 8
  - Dimension of Attention Heads: 4
  - Multiplication Factor of Hidden State: 2
  - Number of Latent Tokens: 128
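To make the projector and resampler configuration concrete, here is a minimal PyTorch sketch of a Perceiver-style resampler (learned latent tokens cross-attending to the residue features) and a two-layer GELU projector. Layer counts, dimensions, and module structure are simplified assumptions loosely based on the list above, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compresses variable-length residue features into a fixed set of latent tokens."""

    def __init__(self, dim=1280, num_latents=128, num_heads=8, num_layers=2):
        super().__init__()
        # Learnable latent tokens that cross-attend to the protein encoder outputs.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))

    def forward(self, residue_feats):                       # (B, L, dim), L varies per protein
        x = self.latents.unsqueeze(0).expand(residue_feats.size(0), -1, -1)
        for attn, norm in zip(self.attn_layers, self.norms):
            out, _ = attn(query=x, key=residue_feats, value=residue_feats)
            x = norm(x + out)
        return x                                            # fixed size: (B, num_latents, dim)

class Projector(nn.Module):
    """Two-layer MLP with GELU mapping encoder features to the LLM embedding size."""

    def __init__(self, in_dim=1280, hidden_dim=4096, out_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# Example: one protein with 350 residues is compressed to 128 latent tokens
# and projected into a 4096-dimensional LLM embedding space.
feats = torch.randn(1, 350, 1280)
tokens = Projector()(PerceiverResampler()(feats))   # -> shape (1, 128, 4096)
```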
The Protein2Text-QA dataset was created in two main steps: retrieving relevant abstracts and generating question-answer (QA) pairs using LLaMA-3.1.
Protein-related abstracts were collected using systematic queries in the PubMed Central (PMC) database via the Entrez library. Abstracts containing specific protein keywords were fetched using their PMC IDs, ensuring relevance by including only those explicitly mentioning the queried proteins. The abstracts were preprocessed to remove redundant text and formatting inconsistencies before being passed to the QA generation pipeline.
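As a rough illustration of the retrieval step, the snippet below uses Biopython's Entrez module to search PMC and fetch article XML for a protein keyword; the query term, batch size, and downstream abstract parsing are assumptions rather than the exact pipeline used to build Protein2Text-QA.

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact email for Entrez queries

def search_pmc(protein_name, retmax=100):
    """Return PMC IDs of articles matching the protein keyword."""
    handle = Entrez.esearch(db="pmc", term=protein_name, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]

def fetch_articles_xml(pmc_ids):
    """Fetch article XML for a batch of PMC IDs; abstracts are extracted downstream,
    keeping only articles that explicitly mention the queried protein."""
    handle = Entrez.efetch(db="pmc", id=",".join(pmc_ids), retmode="xml")
    xml_text = handle.read()
    handle.close()
    return xml_text

ids = search_pmc("TP53")
print(f"Found {len(ids)} candidate PMC articles")
```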
The cleaned abstracts, along with protein names and roles, were input into LLaMA3.1-8B-Instruct to generate conversation-style QAs. The model was prompted to create QAs focusing solely on the protein’s function and attributes while ignoring unrelated information. Each abstract produced up to 10 QA pairs, which were further filtered to remove irrelevant or uninformative responses. The dataset was refined to ensure questions remained general and protein-specific rather than abstract-specific.
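The QA-generation step could look roughly like the sketch below, which prompts LLaMA-3.1-8B-Instruct with a cleaned abstract and asks for protein-focused QA pairs; the prompt wording, decoding settings, and subsequent filtering are assumptions, with only the overall procedure mirroring the description above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

def generate_qa_pairs(protein_name, abstract, max_pairs=10):
    """Ask the instruct model for up to `max_pairs` QA pairs focused on one protein."""
    messages = [
        {"role": "system", "content": "You write question-answer pairs about a single protein."},
        {"role": "user", "content": (
            f"Protein: {protein_name}\nAbstract: {abstract}\n\n"
            f"Write up to {max_pairs} QA pairs about this protein's function and attributes. "
            "Keep questions general and protein-specific; ignore information unrelated to "
            "the protein and avoid abstract-specific details."
        )},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
```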
The final dataset includes structured QA pairs, covering unique proteins and their associated attributes. The complete data pipeline, including extraction and preprocessing, is outlined in the repository, along with dataset statistics such as the number of QA pairs, unique proteins, and sequence lengths.
Please refer to the following pages, LLaMA 3.1 8B and ESM-2 3B, for instructions and requirements on how to download the model weights for the base models.
If you find Protein2Text useful for your research and applications, please cite using this BibTeX:
```bibtex
@inproceedings{jararweh2025protein2text,
  title     = {Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text},
  author    = {Jararweh, Ala and Macaulay, Oladimeji and Arredondo, David and Hu, Yue and Tafoya, Luis E and Virupakshappa, Kushal and Sahu, Avinash},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)},
  pages     = {918--937},
  year      = {2025}
}
```

