This repository contains the code accompanying the paper "Towards life with a 19 amino acid alphabet via generative AI design." With this pipeline, you can generate and evaluate protein sequences that use only 19 amino acids, leveraging state-of-the-art generative models. The workflow integrates language models (ESM-2 and MSA Transformer) and structure-based methods (AlphaFold2 and ProteinMPNN) for protein design, assessment, and ranking.
To set up the environment and dependencies for this pipeline, follow these steps:
It is recommended to use mamba (a faster drop-in replacement for conda), but you may also use conda itself.
cd ec19
mamba env create -f environment.yml
mamba activate ec19_env
Clone the ProteinMPNN repository next to this project's root directory (not inside it):
cd ..
git clone https://github.com/dauparas/ProteinMPNN.git
Download structure files and multiple sequence alignments (MSAs) from Zenodo:
curl -L https://zenodo.org/api/records/18155920/files-archive -o data.zip
unzip data.zip -d data_dir/
Note: Replace data_dir/ with your preferred data directory name or location. This creates the directory and extracts the data files required to run the pipeline.
Your data directory should be organized as follows:
data_dir/
    genes.fasta                              # Input FASTA file with wild-type protein sequences
    {UNIPROT_ID}_nearby_protein_I.pdb        # Protein structure file
    {UNIPROT_ID}.i90c90qid30diff384.a3m      # MSA file for the protein
    output/                                  # Output directory (auto-created)
    logs/                                    # Logs directory (auto-created)
Notes:
- The FASTA file is needed only when designing with protein language models (ESM-2 or MSA Transformer).
- Protein structure (.pdb) and multiple sequence alignment (.a3m) files are only needed if you use structure-based methods, design with spatial neighbors, or the MSA Transformer.
- The output and logs subdirectories are created automatically at runtime if they do not exist.
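Before a long run, it can help to confirm that the per-protein files are in place. The following is a minimal sketch, not part of the repository: the helper `missing_inputs` is hypothetical, and it assumes the file names follow exactly the two patterns shown in the layout above.

```python
from pathlib import Path

# File-name patterns copied from the directory layout above.
# The helper itself is illustrative and not shipped with the pipeline.
PDB_PATTERN = "{uid}_nearby_protein_I.pdb"
MSA_PATTERN = "{uid}.i90c90qid30diff384.a3m"


def missing_inputs(data_dir, uniprot_ids, need_pdb=True, need_msa=True):
    """Return a sorted list of expected input files that are absent from data_dir."""
    root = Path(data_dir)
    missing = []
    for uid in uniprot_ids:
        pdb_name = PDB_PATTERN.format(uid=uid)
        msa_name = MSA_PATTERN.format(uid=uid)
        if need_pdb and not (root / pdb_name).is_file():
            missing.append(pdb_name)
        if need_msa and not (root / msa_name).is_file():
            missing.append(msa_name)
    return sorted(missing)
```

For an ESM-2 run without spatial neighbors, neither file type is needed, so the check can be skipped; for the MSA Transformer alone, calling it with `need_pdb=False` mirrors the notes above.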
Your input FASTA file should contain records with headers in the format:
>{Uniprot_ID}_{gene_name}
SEQUENCE
Example:
>P0A7R9_rpsA
MAVQK... (sequence)
- You can include multiple records (genes) in a single FASTA file.
- Each sequence should be the wild-type amino acid sequence for the gene of interest.
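The header convention can be checked before running the pipeline. The sketch below is an illustration only, not the parser used in the repository: the function names `parse_fasta` and `_make_record` are hypothetical, and it assumes the UniProt ID contains no underscore, so the header is split at the first underscore.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (uniprot_id, gene_name, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def _make_record(header, chunks):
    # Split on the first underscore only, so gene names may contain underscores.
    uniprot_id, sep, gene_name = header.partition("_")
    if not sep or not uniprot_id or not gene_name:
        raise ValueError(
            f"Header does not match {{Uniprot_ID}}_{{gene_name}}: {header!r}"
        )
    return uniprot_id, gene_name, "".join(chunks)
```

Multi-line sequences are joined into one string, and a header without an underscore raises a `ValueError` rather than being silently accepted.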
We provide a sample input file:
data/ecoli_ribosomal_genes.fasta
This contains the 50 E. coli ribosomal genes designed in the study.
You can use one of the supported protein language models:
- esm2: runs recoding using the ESM-2 sequence-only language model.
- msa_trans: runs recoding using the MSA Transformer, which requires a multiple sequence alignment (MSA) in .a3m format for each gene.
Example usage:
python recode_lm.py --infile <genes.fasta> --model esm2 --outfile <output.fasta> [additional parameters]

- --infile: Path to your input FASTA file. (required)
- --outfile: Path for saving recoded sequences. (required)
- --model: Choose which PLM to use. (required) Options:
  - esm2 (ESM-2 language model)
  - msa_trans (MSA Transformer; requires --msa_dir)
- --gpu: GPU index to use (default: 0)
- --residue: Residue to recode (default: I; options: I or V)
- --scheme: Recoding strategy. (required) Options:
  - simultaneous: Modify all relevant residues at once (default for most use-cases)
  - autoregressive: Sequentially modify residues
  - gibbs: Iteratively re-sample residue identities (Gibbs sampling)
- --msa_dir: Directory with MSA .a3m files (required if using msa_trans)
- --structure_dir: Directory with PDB files (optional; required if using spatial neighbors)
- --seq_neighbors: If set, also recode residues adjacent to target residue in the sequence
- --spatial_neighbors: If set, recode residues spatially close to the target (requires PDBs)
- --evo_neighbors: If set, recode residues showing evolutionary coupling to target
- --min_3d_dist / --max_3d_dist: Set thresholds for what counts as a spatial neighbor (in Angstroms)
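To make the distance thresholds concrete, here is a minimal sketch of selecting spatial neighbors inside a distance window. It is not the implementation in recode_lm.py: the function `spatial_neighbors` is hypothetical, and it assumes one representative atom per residue (for example the alpha carbon) and inclusive bounds, neither of which is confirmed by this README.

```python
import math


def spatial_neighbors(coords, target_idx, min_dist, max_dist):
    """Return indices of residues whose distance to the target lies in [min_dist, max_dist].

    coords: list of (x, y, z) tuples in Angstroms, one representative atom
            per residue (an assumption made for this sketch).
    """
    target = coords[target_idx]
    hits = []
    for i, xyz in enumerate(coords):
        if i == target_idx:
            continue  # the target residue is not its own neighbor
        d = math.dist(target, xyz)  # Euclidean distance, Python 3.8+
        if min_dist <= d <= max_dist:
            hits.append(i)
    return hits
```

With a lower bound of 3.0 and an upper bound of 5.0, a residue 4 Angstroms from the target is selected, while residues 1 or 10 Angstroms away are not.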
Example with spatial neighbors:
python recode_lm.py --infile data/ecoli_ribosomal_genes.fasta --model esm2 --outfile my_recoded_seqs.fasta --structure_dir data_dir/ --spatial_neighbors --min_3d_dist 3.0 --max_3d_dist 5.0
Example for MSA Transformer:
python recode_lm.py --infile data/ecoli_ribosomal_genes.fasta --model msa_trans --outfile my_recoded_msa.fasta --msa_dir data_dir/
For more information about the arguments and their defaults, see src/recode_lm.py or run:
python recode_lm.py --help
For structure-based sequence design, you have two main options:
- mpnn: Uses ProteinMPNN to design sequences compatible with a given 3D structure.
- afdesign_mpnn_bias: Runs AFDesign with MPNN-based bias for enhanced sequence diversity and structure compatibility.
Quick start/example ("dry run"):
python design_and_rank.py data/test.fasta --method mpnn --dry_run --data_dir /path/to/data
This runs a configuration check and validates input/output paths without performing full computations, which makes debugging easier.
Generate and rank designs using ProteinMPNN across multiple GPUs:
python design_and_rank.py data/ecoli_ribosomal_genes.fasta --method mpnn --gpus 4 --data_dir /path/to/data
To use AFDesign + MPNN:
python design_and_rank.py data/ecoli_ribosomal_genes.fasta --method afdesign_mpnn_bias --gpus 4 --data_dir /path/to/data

- --method: Structure-based recoding method. (required)
  - mpnn: Use ProteinMPNN
  - afdesign_mpnn_bias: AFDesign with MPNN-biased sampling
- infile: Input FASTA file with gene sequences to design for (positional argument)
- --data_dir: Path to data directory (see the directory layout above)
- --gpus: Number of GPUs to use for parallel design/ranking (default: 1). Use a single GPU (e.g., --gpus 1) if running locally, or increase the number for cluster environments.
- --dry_run: If set, validate setup without performing computation
- Other options: See python design_and_rank.py --help for advanced controls (e.g., specifying custom structure/model files, output controls, chain selection, etc.)
- Ranked and designed FASTA files are written to: data_dir/output/ranked_designs_{method}_I_{date}.fasta
- Log files and intermediate data are written to data_dir/logs/ and subfolders.
- Note: Due to randomness in neural models and differences in environment or hardware, output sequences may differ across runs and may not exactly match those from the associated publication.
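Because outputs can vary between runs, a quick sanity check on the ranked FASTA file may be useful. The sketch below counts how often the recoded residue still appears in each design. It is an illustration, not part of the repository: the function `residue_counts` is hypothetical, and it assumes that a finished design is expected to contain none of the recoded residue (I by default).

```python
def residue_counts(fasta_text, residue="I"):
    """Map each FASTA header to the number of times `residue` occurs in its sequence."""
    counts = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            counts[header] = 0
        elif header is not None:
            # Sequences may span several lines; accumulate across them.
            counts[header] += line.upper().count(residue)
    return counts
```

Reading the output file with `open(path).read()` and passing the text to `residue_counts` gives a per-record count; any nonzero value marks a design that still contains the residue.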
For full details about arguments and defaults, see src/design_and_rank.py or run:
python design_and_rank.py --help