Official implementation of DiffGui, a guided diffusion model for de novo structure-based drug design and lead optimization, by Qiaoyu Hu, Changzhi Sun, Huang He, Jiazheng Xu, Danlin Liu, Wenqing Zhang, Sumeng Shi, Kang Zhang, and Honglin Li.

Fig 2. Animation of molecule generation by DiffGui
# Create the environment
conda env create -f env.yml
# Activate the environment
conda activate diffgui
pip install meeko==0.1.dev3 scipy pdb2pqr vina==1.2.2
python -m pip install git+https://github.com/Valdes-Tresanco-MS/AutoDockTools_py3
pip install diffusers==0.21.4 docutils==0.17.1 filelock==3.12.2 fsspec==2023.1.0
pip install softwares/torch_cluster-1.6.1+pt113cu116-cp37-cp37m-linux_x86_64.whl
pip install softwares/torch_scatter-2.1.1+pt113cu116-cp37-cp37m-linux_x86_64.whl
The package versions should be changed to match your environment.
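The torch_cluster and torch_scatter wheel names encode the build they target: pt113 stands for PyTorch 1.13, cu116 for CUDA 11.6, and cp37 for CPython 3.7. As a minimal sketch for checking a wheel against your own setup, the helper below parses those tags from a file name. The names describe_wheel and matches_current_python are hypothetical and not part of this repository.

```python
import re
import sys

# Pattern for wheel names of the form
# torch_cluster-1.6.1+pt113cu116-cp37-cp37m-linux_x86_64.whl
WHEEL_RE = re.compile(
    r"(?P<package>[A-Za-z0-9_]+)-(?P<version>[\d.]+)"
    r"\+pt(?P<torch>\d+)cu(?P<cuda>\d+)"
    r"-cp(?P<python>\d+)-"
)


def describe_wheel(filename):
    """Return the package, version, and build targets encoded in a wheel name."""
    match = WHEEL_RE.search(filename)
    if match is None:
        raise ValueError(f"unrecognized wheel name: {filename}")
    torch_tag = match["torch"]
    cuda_tag = match["cuda"]
    py_tag = match["python"]
    return {
        "package": match["package"],
        "version": match["version"],
        # pt113 -> 1.13: first digit is the major version, the rest the minor.
        "torch": f"{torch_tag[0]}.{torch_tag[1:]}",
        # cu116 -> 11.6: last digit is the minor version.
        "cuda": f"{cuda_tag[:-1]}.{cuda_tag[-1]}",
        # cp37 -> 3.7: first digit is the major version, the rest the minor.
        "python": f"{py_tag[0]}.{py_tag[1:]}",
    }


def matches_current_python(filename):
    """Check whether the wheel was built for the running Python version."""
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    return describe_wheel(filename)["python"] == current
```

If matches_current_python returns False for a wheel, a build matching your Python, PyTorch, and CUDA versions is needed instead of the one shipped in softwares.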
The benchmark datasets utilized in this project are PDBbind and CrossDocked.
To train the model from scratch on PDBbind, you need to download the preprocessed lmdb file and split file from here:
- PDBbind_v2020_pocket10_processed_final.lmdb
- PDBbind_pocket10_split.pt
To process the dataset from scratch, you need to download PDBbind_v2020 from here, save it in data, unzip it, and run the following scripts in data:
- clean_pdbbind.py will clean the original dataset, extract the binding affinity, and calculate QED, SA, LogP, and TPSA of ligands. It will generate an index.pkl file and save it in the data/PDBbind_v2020 folder.
python clean_pdbbind.py --source PDBbind_v2020
- extract_pockets.py will extract the pocket file from a 10 Å region around the binding ligand in the original protein file.
python extract_pockets.py --source PDBbind_v2020 --desti PDBbind_v2020_pocket10
- split_dataset.py will split the dataset into training, validation, and test sets.
python split_dataset.py --path PDBbind_v2020_pocket10 --desti PDBbind_pocket10_split.pt --train 17327 --val 1825 --test 100
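The pocket extraction step keeps the part of the protein within 10 Å of the bound ligand. A minimal sketch of that selection is given below, assuming whole residues are kept when any of their atoms falls within the radius; this README does not state whether extract_pockets.py works per residue or per atom. The function name and the input layout are hypothetical.

```python
import math


def _distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def select_pocket_residues(protein_atoms, ligand_coords, radius=10.0):
    """Return IDs of residues with at least one atom within `radius` of a ligand atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples, one per atom.
    ligand_coords: iterable of (x, y, z) tuples, one per ligand atom.
    Residue IDs are returned in order of first appearance.
    """
    ligand_coords = list(ligand_coords)
    selected = []
    seen = set()
    for residue_id, coord in protein_atoms:
        if residue_id in seen:
            continue  # residue already selected through another atom
        if any(_distance(coord, lig) <= radius for lig in ligand_coords):
            seen.add(residue_id)
            selected.append(residue_id)
    return selected
```

In practice the atom coordinates would come from a PDB parser and the ligand coordinates from the SDF file; this sketch only illustrates the distance criterion.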
To train the model from scratch on CrossDocked, you need to download the preprocessed lmdb file and split file from here:
- crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb
- crossdocked_pocket10_pose_split.pt
To process the dataset from scratch, you need to download CrossDocked2020 v1.1 from here, save it in data, unzip it, and run the following scripts in data:
- clean_crossdocked.py will filter the original dataset and retain the poses with RMSD < 1.0 Å. It will generate an index.pkl file and create a new directory containing the filtered data.
python clean_crossdocked.py --source CrossDocked2020 --dest crossdocked_v1.1_rmsd1.0 --rmsd_thr 1.0
- extract_pockets.py will extract the pocket file from a 10 Å region around the binding ligand in the original protein file.
python extract_pockets.py --source crossdocked_v1.1_rmsd1.0 --dest crossdocked_v1.1_rmsd1.0_pocket10
- split_dataset.py will split the dataset into training and test sets. We use the split file split_by_name.pt.
python split_dataset.py --path data/crossdocked_v1.1_rmsd1.0_pocket10 --dest data/crossdocked_pocket10_pose_split.pt --fixed_split data/split_by_name.pt
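The RMSD filter used for CrossDocked can be illustrated with a short sketch. It assumes the RMSD is computed between a docked pose and a reference pose with atoms in the same order, and that poses strictly below the threshold are kept. How clean_crossdocked.py actually obtains the RMSD values is not stated here, and the function names below are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


def filter_poses(poses, reference, rmsd_thr=1.0):
    """Keep the names of poses whose RMSD to the reference is below rmsd_thr.

    poses: dict mapping a pose name to its list of (x, y, z) coordinates.
    """
    return [name for name, coords in poses.items() if rmsd(coords, reference) < rmsd_thr]
```

With rmsd_thr set to 1.0, a pose displaced by exactly 1.0 Å on every atom is dropped, matching the strict RMSD < 1.0 Å criterion above.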
The trained model checkpoint files are stored here.
- trained.pt is the checkpoint file trained on the PDBbind dataset with labeling.
- bond_trained.pt is the checkpoint file of the bond predictor trained on the PDBbind dataset. This should be used as guidance during the sampling process.
To train the model:
python scripts/train.py --config configs/train/train.yml
If you want to resume training, you need to revise the train.yml file.
Set resume to True and set resume_ckpt to the checkpoint that you want to resume from, e.g. 100000.pt. In addition, log_dir should be set from args.logdir (the previous training directory) instead of calling the get_new_log_dir function.
To train the bond predictor:
python scripts/train_bond.py --config configs/train/train_bond.yml
To sample molecules:
python scripts/sample.py --config configs/sample/sample.yml
Place the downloaded or self-trained checkpoint files in the ckpt folder.
The values of logp, tpsa, sa, qed, aff can be adjusted to generate molecules with desired properties.
- logp is the octanol-water partition coefficient. It ranges from -2.0 to 5.0. A high value indicates hydrophobicity and a low value indicates hydrophilicity. The logp values of most drugs are located between 1.0 and 3.0.
- tpsa is the topological polar surface area. A high value indicates high water solubility and a low value indicates high lipid solubility. The tpsa values of most drugs are located between 20 and 60.
- sa is the synthetic accessibility. It ranges from 0.0 to 10.0. The lower the sa value, the easier the organic synthesis. The reasonable sa values of most drugs are located between 0.0 and 5.0.
- qed is the quantitative estimate of drug-likeness. It ranges from 0.0 to 1.0. The higher the qed value, the greater the drug-likeness.
- aff is the binding affinity. It is calculated as -log10(Kd or Ki or IC50), with the concentration in mol/L. A high value indicates high binding affinity. For instance, 8.0 corresponds to a Kd, Ki, or IC50 of 10 nM.
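To pick a value for aff, it can help to convert between a measured concentration and the -log10 scale. The sketch below does that for nanomolar inputs and also checks target values against the ranges stated above for logp, sa, and qed. The function names and the STATED_RANGES table are hypothetical helpers, not part of the repository.

```python
import math

# Ranges quoted in this README for the guidance properties that have one.
STATED_RANGES = {"logp": (-2.0, 5.0), "sa": (0.0, 10.0), "qed": (0.0, 1.0)}


def affinity_from_nanomolar(value_nm):
    """Convert a Kd, Ki, or IC50 in nM to -log10(concentration in mol/L)."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # -log10(value_nm * 1e-9) rewritten as 9 - log10(value_nm).
    return 9.0 - math.log10(value_nm)


def nanomolar_from_affinity(aff):
    """Convert an aff value back to a concentration in nM."""
    return 10.0 ** (9.0 - aff)


def out_of_range(targets):
    """Return the names of target properties that fall outside the stated ranges."""
    flagged = []
    for name, value in targets.items():
        if name in STATED_RANGES:
            low, high = STATED_RANGES[name]
            if not low <= value <= high:
                flagged.append(name)
    return flagged
```

For example, affinity_from_nanomolar(10) gives 8.0, in line with the 10 nM example above, and a 1 µM binder (1000 nM) corresponds to 6.0.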
Revise the sample.yml file to sample molecules for any given protein pocket.
Set target to the pocket file (e.g. sample/3ztx_pocket.pdb), set gen_mode to denovo, and set mode to pocket.
In the sample folder, you can use the protein file and ligand file to extract the pocket file. For example:
python extract_pockets.py --protein 3ztx_protein.pdb --ligand 3ztx_ligand.sdf --radius 10 --pocket 3ztx_pocket.pdb
If you encounter an RDKit error, use Open Babel to rewrite the ligand file in a standard form:
obabel 3ztx_ligand.sdf -O 3ztx_ligand.sdf
Revise the sample.yml file to sample molecules for all pockets in the test set.
Set target to None and set mode to test.
- If you want to sample for the PDBbind test set, then set path to data/PDBbind_v2020_pocket10, set split to data/PDBbind_pocket10_split.pt, set protein_root to data/PDBbind_v2020, and set dataset to pdbbind.
- If you want to sample for the CrossDocked test set, then set path to data/crossdocked_v1.1_rmsd1.0_pocket10, set split to data/crossdocked_pocket10_pose_split.pt, set protein_root to data/crossdocked_v1.1_rmsd1.0, and set dataset to crossdocked.
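The test-set settings for both datasets can be summarized as a small lookup. The sketch below collects them as flat key-value pairs; the keys and values are taken from the text above, but the flat layout is an assumption, since the actual nesting inside sample.yml is not shown here. The function name is hypothetical.

```python
# Dataset-specific values listed above; keys mirror the names used in sample.yml.
TEST_SET_SETTINGS = {
    "pdbbind": {
        "path": "data/PDBbind_v2020_pocket10",
        "split": "data/PDBbind_pocket10_split.pt",
        "protein_root": "data/PDBbind_v2020",
    },
    "crossdocked": {
        "path": "data/crossdocked_v1.1_rmsd1.0_pocket10",
        "split": "data/crossdocked_pocket10_pose_split.pt",
        "protein_root": "data/crossdocked_v1.1_rmsd1.0",
    },
}


def overrides_for_test_set(dataset):
    """Return the values to set when sampling for a whole test set."""
    if dataset not in TEST_SET_SETTINGS:
        raise ValueError(f"unknown dataset: {dataset}")
    # target is None and mode is test when sampling for every test pocket.
    overrides = {"target": None, "mode": "test", "dataset": dataset}
    overrides.update(TEST_SET_SETTINGS[dataset])
    return overrides
```

Calling overrides_for_test_set("crossdocked") returns the values to copy into the corresponding fields of sample.yml by hand.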
Revise the sample.yml file to sample molecules based on given fragments.
Set target to the pocket file (e.g. sample/3ztx_pocket.pdb), set frag to the fragment file (e.g. sample/3ztx_frag.sdf), set gen_mode to frag_cond or frag_diff, and set mode to pocket.
To evaluate the generated molecules:
python scripts/evaluate.py --config configs/eval/eval.yml
The docking mode can be chosen from {qvina, vina_score, vina_dock, none}.
Please consider citing our paper if you find it helpful. Thank you!
Qiaoyu Hu, Changzhi Sun, Huang He, Jiazheng Xu, Danlin Liu, Wenqing Zhang, Sumeng Shi, Kang Zhang, and Honglin Li. Target-aware 3D molecular generation based on guided equivariant diffusion. Nat. Commun., 2025, 16, 7928. https://doi.org/10.1038/s41467-025-63245-0
