Target-aware 3D Molecular Generation Based on Guided Equivariant Diffusion Model

Official implementation of DiffGui, a guided diffusion model for de novo structure-based drug design and lead optimization, by Qiaoyu Hu¹,#, Changzhi Sun¹, Huang He¹, Jiazheng Xu¹, Danlin Liu¹, Wenqing Zhang, Sumeng Shi, Kang Zhang, and Honglin Li#.

Fig 1. DiffGui framework

Fig 2. Animation of molecule generation by DiffGui

Installation

Install conda environment via yaml file

# Create the environment
conda env create -f env.yml
# Activate the environment
conda activate diffgui

Install Vina Docking

pip install meeko==0.1.dev3 scipy pdb2pqr vina==1.2.2
python -m pip install git+https://github.com/Valdes-Tresanco-MS/AutoDockTools_py3

Install other required software

pip install diffusers==0.21.4 docutils==0.17.1 filelock==3.12.2 fsspec==2023.1.0
pip install softwares/torch_cluster-1.6.1+pt113cu116-cp37-cp37m-linux_x86_64.whl
pip install softwares/torch_scatter-2.1.1+pt113cu116-cp37-cp37m-linux_x86_64.whl

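Adjust the package versions to match your own environment. The wheel files above target Python 3.7 (cp37), PyTorch 1.13 (pt113), and CUDA 11.6 (cu116) on Linux x86_64.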
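
The wheel file names encode the build they were compiled for, which helps when picking replacements for a different setup. The sketch below splits such a name into its tags. `parse_wheel_tags` is a hypothetical helper that is not part of DiffGui, and it assumes the `+ptXXXcuYYY` naming pattern seen in the two wheels listed above.

```python
import re


def parse_wheel_tags(filename):
    """Split a wheel file name into package, version, PyTorch/CUDA and Python tags.

    Assumes names shaped like the wheels listed above, e.g.
    torch_cluster-1.6.1+pt113cu116-cp37-cp37m-linux_x86_64.whl
    """
    pattern = (
        r"(?P<package>[A-Za-z0-9_]+)-(?P<version>[0-9.]+)"
        r"\+pt(?P<torch>\d+)cu(?P<cuda>\d+)"
        r"-(?P<python>cp\d+)-(?P<abi>\w+)-(?P<platform>\w+)\.whl"
    )
    match = re.fullmatch(pattern, filename)
    if match is None:
        raise ValueError(f"unrecognized wheel name: {filename}")
    return match.groupdict()


tags = parse_wheel_tags("torch_cluster-1.6.1+pt113cu116-cp37-cp37m-linux_x86_64.whl")
print(tags["python"], tags["torch"], tags["cuda"])
```

If the printed tags do not match your interpreter, PyTorch, and CUDA versions, a different wheel is needed.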

Datasets

The benchmark datasets utilized in this project are PDBbind and CrossDocked.

PDBbind

To train the model from scratch, you need to download the preprocessed lmdb file and split file from here:

PDBbind_v2020_pocket10_processed_final.lmdb
PDBbind_pocket10_split.pt

To process the dataset from scratch, you need to download PDBbind_v2020 from here, save it in data, unzip it, and run the following scripts in data:

clean_pdbbind.py cleans the original dataset, extracts the binding affinity, and calculates the QED, SA, LogP, and TPSA of the ligands. It generates an index.pkl file and saves it in the data/PDBbind_v2020 folder.
```
python clean_pdbbind.py --source PDBbind_v2020
```
extract_pockets.py will extract the pocket file from a 10 Å region around the binding ligand in the original protein file.
```
python extract_pockets.py --source PDBbind_v2020 --desti PDBbind_v2020_pocket10
```
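
To make the 10 Å pocket definition concrete, here is a minimal sketch of a distance-based selection. It is not the code in extract_pockets.py: the function name, the input layout, and the choice to keep a whole residue as soon as one of its atoms lies within the radius are all assumptions made for illustration.

```python
def select_pocket_residues(protein_atoms, ligand_coords, radius=10.0):
    """Return ids of residues with at least one atom within `radius` Å of a ligand atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples (hypothetical layout).
    ligand_coords: iterable of (x, y, z) tuples.
    """
    ligand_coords = list(ligand_coords)
    radius_sq = radius * radius
    selected = []
    seen = set()
    for residue_id, (x, y, z) in protein_atoms:
        if residue_id in seen:
            continue  # residue already kept, no need to test more of its atoms
        for lx, ly, lz in ligand_coords:
            dist_sq = (x - lx) ** 2 + (y - ly) ** 2 + (z - lz) ** 2
            if dist_sq <= radius_sq:
                seen.add(residue_id)
                selected.append(residue_id)
                break
    return selected


# Toy coordinates, not taken from any real structure.
toy_protein = [
    ("A:1", (0.0, 0.0, 0.0)),
    ("A:1", (1.5, 0.0, 0.0)),
    ("A:2", (12.0, 0.0, 0.0)),
    ("A:3", (30.0, 0.0, 0.0)),
]
toy_ligand = [(3.0, 0.0, 0.0)]
print(select_pocket_residues(toy_protein, toy_ligand))
```

Comparing squared distances avoids a square root per atom pair; for real structures a spatial index would scale better than this double loop.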

split_dataset.py splits the dataset into training, validation, and test sets.

python split_dataset.py --path PDBbind_v2020_pocket10 --desti PDBbind_pocket10_split.pt --train 17327 --val 1825 --test 100
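
The three sizes in the command add up to 17327 + 1825 + 100 = 19252 complexes. A random split with fixed sizes can be sketched as follows. This is only an illustration under assumptions: `split_indices` and the `seed` parameter are hypothetical, and the actual script may select or order entries differently.

```python
import random


def split_indices(n_total, n_train, n_val, n_test, seed=0):
    """Shuffle indices 0..n_total-1 and cut them into three disjoint subsets."""
    if n_train + n_val + n_test > n_total:
        raise ValueError("requested split sizes exceed the dataset size")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:n_train + n_val + n_test],
    }


split = split_indices(19252, n_train=17327, n_val=1825, n_test=100)
print({name: len(idx) for name, idx in split.items()})
```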

CrossDocked

To train the model from scratch, you need to download the preprocessed lmdb file and split file from here:

crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb
crossdocked_pocket10_pose_split.pt

To process the dataset from scratch, you need to download CrossDocked2020 v1.1 from here, save it in data, unzip it, and run the following scripts in data:

clean_crossdocked.py filters the original dataset and retains the poses with RMSD < 1.0 Å. It generates an index.pkl file and creates a new directory containing the filtered data.
```
python clean_crossdocked.py --source CrossDocked2020 --dest crossdocked_v1.1_rmsd1.0 --rmsd_thr 1.0
```
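
For reference, the RMSD criterion can be written out as below. This is a generic sketch, not the logic of clean_crossdocked.py: it assumes two poses with the same atom ordering and no prior alignment, and the README does not say whether the script computes RMSD itself or reads it from the dataset's metadata.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


def keep_pose(pose_rmsd, rmsd_thr=1.0):
    """Mirror the `--rmsd_thr 1.0` filter: keep poses strictly below the threshold."""
    return pose_rmsd < rmsd_thr


# Toy poses: every atom is shifted by 0.6 Å along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked = [(0.0, 0.0, 0.6), (1.0, 0.0, 0.6)]
print(keep_pose(rmsd(reference, docked)))
```
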
extract_pockets.py will extract the pocket file from a 10 Å region around the binding ligand in the original protein file.
```
python extract_pockets.py --source crossdocked_v1.1_rmsd1.0 --dest crossdocked_v1.1_rmsd1.0_pocket10
```

split_dataset.py splits the dataset into training and test sets. We use the split file split_by_name.pt.

python split_dataset.py --path data/crossdocked_v1.1_rmsd1.0_pocket10 --dest data/crossdocked_pocket10_pose_split.pt --fixed_split data/split_by_name.pt

Training

Trained model checkpoint

The trained model checkpoint files are stored here.

trained.pt is the checkpoint file trained on the PDBbind dataset with labeling.
bond_trained.pt is the checkpoint file of the bond predictor trained on the PDBbind dataset. It should be used as guidance during the sampling process.

Training from scratch

python scripts/train.py --config configs/train/train.yml

To resume training, revise the train.yml file.
Set resume to True and set resume_ckpt to the checkpoint you want to resume from, e.g., 100000.pt. In addition, log_dir should be taken from args.logdir (the previous training directory) instead of being created by the get_new_log_dir function.
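
The log directory choice described above can be sketched like this. The names `resume`, `args.logdir`, and `get_new_log_dir` come from the text, but the function bodies and signatures here are assumptions: `get_new_log_dir` is a hypothetical stand-in for the repository's helper, and `select_log_dir` does not exist in the code base.

```python
import os
import time


def get_new_log_dir(root="logs", prefix="train"):
    """Hypothetical stand-in: build a fresh, timestamped directory name."""
    stamp = time.strftime("%Y_%m_%d__%H_%M_%S")
    return os.path.join(root, f"{prefix}_{stamp}")


def select_log_dir(resume, logdir, root="logs"):
    """Reuse the previous run's directory when resuming, otherwise create a new name."""
    if resume:
        if not logdir:
            raise ValueError("resuming requires the previous training directory")
        return logdir
    return get_new_log_dir(root)


print(select_log_dir(resume=True, logdir="logs/previous_run"))
print(select_log_dir(resume=False, logdir=None))
```

Reusing the old directory keeps checkpoints and logs of the resumed run next to the earlier ones.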

Training bond predictor

python scripts/train_bond.py --config configs/train/train_bond.yml

Inference

python scripts/sample.py --config configs/sample/sample.yml

Place the downloaded or self-trained checkpoint files in the ckpt folder.

The values of logp, tpsa, sa, qed, and aff can be adjusted to generate molecules with the desired properties.

logp is the octanol-water partition coefficient. It ranges from -2.0 to 5.0. A high value indicates hydrophobicity and a low value indicates hydrophilicity. Most drugs have logp values between 1.0 and 3.0.
tpsa is the topological polar surface area. A high value indicates high water solubility and a low value indicates high lipid solubility. Most drugs have tpsa values between 20 and 60 Å².
sa is the synthetic accessibility. It ranges from 0.0 to 10.0. The lower the sa value, the easier the organic synthesis. Reasonable sa values for most drugs lie between 0.0 and 5.0.
qed is the quantitative estimate of drug-likeness. It ranges from 0.0 to 1.0. The higher the qed value, the greater the drug-likeness.
aff is the binding affinity, calculated as -log₁₀(K_d, K_i, or IC₅₀) with the constant expressed in mol/L. A high value indicates high binding affinity. For instance, 8.0 corresponds to a K_d, K_i, or IC₅₀ of 10 nM.
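
As a worked example of the aff definition and the stated ranges, the sketch below converts a nanomolar K_d, K_i, or IC₅₀ into an aff value and checks a property against its range. Both helper names are hypothetical and not part of DiffGui; tpsa and aff are left out of the range table because the text gives no hard bounds for them.

```python
import math

# Ranges taken from the descriptions above.
PROPERTY_RANGES = {
    "logp": (-2.0, 5.0),
    "sa": (0.0, 10.0),
    "qed": (0.0, 1.0),
}


def affinity_from_nanomolar(value_nm):
    """Convert K_d, K_i, or IC50 in nM to -log10 of the molar value.

    -log10(x nM * 1e-9 M/nM) simplifies to 9 - log10(x).
    """
    if value_nm <= 0:
        raise ValueError("the constant must be positive")
    return 9.0 - math.log10(value_nm)


def in_range(name, value):
    """Check whether a requested property value lies inside its stated range."""
    low, high = PROPERTY_RANGES[name]
    return low <= value <= high


print(affinity_from_nanomolar(10.0))  # 10 nM, the example given above
print(in_range("qed", 0.7), in_range("logp", 6.5))
```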

Sample molecules for given protein pocket

Revise the sample.yml file to sample molecules for any given protein pocket.
Set target to the pocket file (e.g., sample/3ztx_pocket.pdb), set gen_mode to denovo, and set mode to pocket.
In the sample folder, you can use the protein file and ligand file to extract the pocket file. For example:

python extract_pockets.py --protein 3ztx_protein.pdb --ligand 3ztx_ligand.sdf --radius 10 --pocket 3ztx_pocket.pdb

If you encounter an RDKit error, use Open Babel (obabel) to normalize the ligand file.

obabel 3ztx_ligand.sdf -O 3ztx_ligand.sdf

Sample molecules for all pockets in test set

Revise the sample.yml file to sample molecules for all pockets in the test set.
Set target to None and set mode to test.

If you want to sample for the PDBbind test set, then set path to data/PDBbind_v2020_pocket10, set split to data/PDBbind_pocket10_split.pt, set protein_root to data/PDBbind_v2020, and set dataset to pdbbind.
If you want to sample for the CrossDocked test set, then set path to data/crossdocked_v1.1_rmsd1.0_pocket10, set split to data/crossdocked_pocket10_pose_split.pt, set protein_root to data/crossdocked_v1.1_rmsd1.0, and set dataset to crossdocked.
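
The two sets of values can be kept side by side as presets. The sketch below is only a convenient way to view them: the key names and values are those given above, but the flat dictionary layout and the helper `overrides_for_test_set` are assumptions, and the real sample.yml may nest these keys differently.

```python
TEST_SET_PRESETS = {
    "pdbbind": {
        "path": "data/PDBbind_v2020_pocket10",
        "split": "data/PDBbind_pocket10_split.pt",
        "protein_root": "data/PDBbind_v2020",
        "dataset": "pdbbind",
    },
    "crossdocked": {
        "path": "data/crossdocked_v1.1_rmsd1.0_pocket10",
        "split": "data/crossdocked_pocket10_pose_split.pt",
        "protein_root": "data/crossdocked_v1.1_rmsd1.0",
        "dataset": "crossdocked",
    },
}


def overrides_for_test_set(name):
    """Return the values to enter in sample.yml for sampling a whole test set."""
    settings = dict(TEST_SET_PRESETS[name])  # copy so the preset stays untouched
    settings["target"] = None  # no single pocket file in test mode
    settings["mode"] = "test"
    return settings


for key, value in overrides_for_test_set("crossdocked").items():
    print(f"{key}: {value}")
```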

Sample molecules based on given fragments (lead optimization)

Revise the sample.yml file to sample molecules based on given fragments.
Set target to the pocket file (e.g., sample/3ztx_pocket.pdb), set frag to the fragment file (e.g., sample/3ztx_frag.sdf), set gen_mode to frag_cond or frag_diff, and set mode to pocket.
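
The three sampling setups differ only in a few settings, so a quick consistency check can catch mismatched combinations before a long run. The sketch below encodes the rules stated in this section. `check_sample_settings` is a hypothetical helper, and it assumes the settings are available as a flat dictionary, which may not match how sample.yml is actually structured.

```python
GEN_MODES = {"denovo", "frag_cond", "frag_diff"}
MODES = {"pocket", "test"}


def check_sample_settings(settings):
    """Return a list of problems; an empty list means the combination is consistent."""
    problems = []
    mode = settings.get("mode")
    gen_mode = settings.get("gen_mode")
    target = settings.get("target")
    frag = settings.get("frag")

    if mode not in MODES:
        problems.append(f"mode must be one of {sorted(MODES)}")
    if gen_mode is not None and gen_mode not in GEN_MODES:
        problems.append(f"gen_mode must be one of {sorted(GEN_MODES)}")
    if mode == "pocket" and not target:
        problems.append("mode 'pocket' needs target set to a pocket file")
    if mode == "test" and target is not None:
        problems.append("mode 'test' expects target to be None")
    if gen_mode in {"frag_cond", "frag_diff"} and not frag:
        problems.append(f"gen_mode '{gen_mode}' needs frag set to a fragment file")
    return problems


lead_optimization = {
    "target": "sample/3ztx_pocket.pdb",
    "frag": "sample/3ztx_frag.sdf",
    "gen_mode": "frag_cond",
    "mode": "pocket",
}
print(check_sample_settings(lead_optimization))
```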

Evaluate

python scripts/evaluate.py --config configs/eval/eval.yml

The docking mode can be chosen from {qvina, vina_score, vina_dock, none}.

Citation

Please consider citing our paper if you find it helpful. Thank you!

Qiaoyu Hu, Changzhi Sun, Huang He, Jiazheng Xu, Danlin Liu, Wenqing Zhang, Sumeng Shi, Kang Zhang, and Honglin Li. Target-aware 3D molecular generation based on guided equivariant diffusion. Nat. Commun., 2025, 16, 7928. https://doi.org/10.1038/s41467-025-63245-0
