VN-EGNN: E(3)-Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site Identification
Implementation of VN-EGNN, a state-of-the-art method for protein binding site identification, by Florian Sestak, Lisa Schneckenreiter, Johannes Brandstetter, Sepp Hochreiter, Andreas Mayr, and Günter Klambauer. This repository contains all code, instructions, and model weights necessary to run the method or retrain a model. If you have any questions, feel free to open an issue or reach out to sestak@ml.jku.at.
- Python 3.9+
- PyTorch 2.1+ (2.7+ recommended)
- CUDA 11.8+ or 12.x (for GPU support)
- PyTorch Geometric 2.4+
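Once the installation steps below are complete, the minimum versions listed above can be checked programmatically. The following is a minimal sketch using only the standard library; the helper names `parse_major_minor` and `check_requirements` are illustrative and not part of this repository, and the PyPI distribution names `torch` and `torch-geometric` are assumed.

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the requirements list above.
# The keys are the assumed PyPI distribution names.
MINIMUMS = {"torch": (2, 1), "torch-geometric": (2, 4)}


def parse_major_minor(version):
    """Turn a version string such as '2.7.0+cu128' into (2, 7)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_requirements():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python 3.9+ is required")
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if parse_major_minor(installed) < minimum:
            required = ".".join(str(part) for part in minimum)
            problems.append(f"{package} {installed} is older than {required}")
    return problems


if __name__ == "__main__":
    for line in check_requirements() or ["All version requirements met"]:
        print(line)
```

The check does not cover the CUDA toolkit version, which is easier to read from `torch.version.cuda` after PyTorch is installed.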
git clone https://github.com/ml-jku/vnegnn
cd vnegnn
conda env create -f environment.yaml
conda activate vnegnn
Choose the appropriate command based on your CUDA version. Visit PyTorch Get Started for other configurations.
For CUDA 12.x:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
The torch-scatter, torch-sparse, and torch-cluster packages are CUDA-version specific and must match your PyTorch and CUDA versions.
First, check your PyTorch and CUDA versions:
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.version.cuda}')"
Then install PyTorch Geometric and extensions. Replace ${TORCH} with your PyTorch version (e.g., 2.1.0, 2.7.0) and ${CUDA} with your CUDA version (e.g., cu121, cu118, cpu):
pip install torch-geometric
pip install torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html
Examples:
For PyTorch 2.7.0 with CUDA 12.8:
pip install torch-geometric
pip install torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-2.7.0+cu128.html
Note: See the PyTorch Geometric Installation Guide for the full list of available wheel versions.
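The substitution of ${TORCH} and ${CUDA} described above can also be derived from the values PyTorch reports. The following is a minimal sketch; `pyg_wheel_index` is a hypothetical helper, not part of this repository, and it assumes the URL pattern shown above. Not every PyTorch/CUDA combination has published wheels, so the resulting URL should still be checked against the installation guide.

```python
def pyg_wheel_index(torch_version, cuda_version):
    """Build the `-f` URL for the torch-scatter/sparse/cluster wheels.

    torch_version: value of torch.__version__, e.g. '2.7.0+cu128'
    cuda_version:  value of torch.version.cuda, e.g. '12.8', or None on CPU builds
    """
    # Drop the local build suffix: '2.7.0+cu128' -> '2.7.0'
    torch_release = torch_version.split("+", 1)[0]
    if cuda_version is None:
        cuda_tag = "cpu"
    else:
        # '12.8' -> 'cu128', '11.8' -> 'cu118'
        cuda_tag = "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{torch_release}+{cuda_tag}.html"
```

With PyTorch installed, calling `pyg_wheel_index(torch.__version__, torch.version.cuda)` yields the URL to pass to `pip install ... -f`.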
python -c "import torch; import torch_geometric; print(f'PyTorch: {torch.__version__}'); print(f'PyG: {torch_geometric.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
Set up the environment variables for logging in .env; a template is provided in .env.template.
source .env
The processed datasets can be downloaded from this link. Place them in the folder data/data.
Run the following command to set up the files used for training:
./process_data.sh
To rerun the Equipocket baseline, you need to specify the MSMS path for surface generation.
./process_data_equipocket.sh
The splits for each experiment are provided in the uploaded dataset, e.g. COACH420 (data/data/coach420/splits).
Experiments are logged via Weights and Biases; use the run's [RUN_ID] to evaluate the trained model. To reproduce the results in our publication, run the following commands for the individual experiments. The evaluation metrics are logged in wandb and can then be exported as CSV for further processing. When you run an experiment, our script saves the graphs as a dataset, as described in the PyTorch Geometric documentation.
# Train
python src/train.py experiment=vnegnn
# Eval
python src/eval.py wandb_run_id=[RUN_ID]
# Train
python src/train.py experiment=vnegnn_pdbbind2020
# Eval
python src/eval.py wandb_run_id=[RUN_ID]
We also compared VN-EGNN on a different scPDB dataset split proposed by GrASP. The datasets for this comparison can be found in their repository. (To rerun their experiments, place their datasets under data/grasp and run the data processing pipeline described above on this folder.)
# Train
python src/train.py experiment=vnegnn_pdbbind2020
# Eval
python src/eval.py wandb_run_id=[RUN_ID]
# Train
python src/train.py experiment=equipocket
# Eval
python src/eval.py wandb_run_id=[RUN_ID]
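The train and eval commands above all follow the same pattern, so they can be assembled in a small driver script. The following is a minimal sketch; the helper names are illustrative and not part of this repository, and only the `experiment=` and `wandb_run_id=` overrides shown above are used. The run ID has to be copied from Weights and Biases after training, so it stays a placeholder here.

```python
import subprocess
import sys


def train_command(experiment):
    """Command line for training one experiment config."""
    return [sys.executable, "src/train.py", f"experiment={experiment}"]


def eval_command(run_id):
    """Command line for evaluating a finished wandb run."""
    return [sys.executable, "src/eval.py", f"wandb_run_id={run_id}"]


def run(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Experiment names as they appear in the commands above.
    for experiment in ("vnegnn", "vnegnn_pdbbind2020", "equipocket"):
        run(train_command(experiment))
    # "RUN_ID" is a placeholder for the ID shown in the wandb UI.
    run(eval_command("RUN_ID"))
```

By default the script only prints the commands; pass `dry_run=False` to `run` from the repository root to execute them.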
Configuration is managed with Hydra configs, structured as follows.
configs/
├── callbacks/    # Callbacks (e.g. checkpointing, ...)
├── data/         # Dataset configs
├── debug/        # Debug configs
├── experiment/   # Contains all experiments reported in the publication.
├── extras/       # Extra configurations.
├── hydra/        # Hydra configurations.
├── local/        # Local setup files.
├── logger/       # Logger setup (wandb logger was used for all experiments)
├── model/        # Model configurations
├── paths/        # Paths setup.
├── trainer/      # Lightning trainer configuration
├── eval.yaml     # Eval config.
└── train.yaml    # Train config.
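In Hydra, each subfolder of the config root is a config group, and an override such as `experiment=vnegnn` selects the YAML file with that name inside the group. The following is a minimal sketch that lists the available options per group using only the standard library; `list_config_groups` is a hypothetical helper, not part of this repository, and it assumes the option files use the `.yaml` extension and sit directly inside each group folder.

```python
from pathlib import Path


def list_config_groups(config_root):
    """Map each config group (subfolder) to the option names it offers."""
    groups = {}
    for group_dir in sorted(Path(config_root).iterdir()):
        if group_dir.is_dir():
            # The file stem is the value used on the command line,
            # e.g. experiment/vnegnn.yaml -> experiment=vnegnn
            groups[group_dir.name] = sorted(
                path.stem for path in group_dir.glob("*.yaml")
            )
    return groups


if __name__ == "__main__":
    root = Path("configs")
    if root.is_dir():
        for group, options in list_config_groups(root).items():
            print(f"{group}: {', '.join(options) or '(no YAML files)'}")
```

Run from the repository root, this prints one line per group, which is a quick way to see which values can be passed as `group=option` overrides.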
The following shows the structure of the source code. The training pipeline is set up with PyTorch Lightning.
src/
├── datasets/                   # Dataset implementations
│   ├── binding_dataset.py      # Binding site dataset class
│   ├── equipocket_dataset.py   # Equipocket dataset class
│   └── utils.py                # Dataset utilities
├── models/                     # Model architectures
│   ├── equipocket/             # Equipocket baseline models
│   │   ├── baseline_models.py  # Baseline model implementations
│   │   ├── egnn_clean.py       # Clean EGNN implementation
│   │   ├── equipocket.py       # Equipocket model
│   │   └── surface_egnn.py     # Surface-based EGNN
│   └── vnegnn/                 # VN-EGNN models
│       ├── aggregation.py      # Aggregation layers
│       ├── utils.py            # Model utilities
│       └── vnegnn.py           # VN-EGNN implementation
├── modules/                    # Training components
│   ├── callbacks.py            # Custom Lightning callbacks
│   ├── cluster.py              # Clustering utilities
│   ├── ema.py                  # Exponential moving average
│   ├── losses.py               # Loss functions
│   ├── metrics.py              # Evaluation metrics
│   └── schedulers.py           # Learning rate schedulers
├── utils/                      # Utility functions
│   ├── constants.py            # Constants and definitions
│   ├── graph.py                # Graph processing utilities
│   ├── instantiators.py        # Hydra instantiation helpers
│   ├── logging_utils.py        # Logging utilities
│   ├── misc.py                 # Miscellaneous utilities
│   ├── protein.py              # Protein processing
│   ├── pylogger.py             # Python logger
│   ├── rich_utils.py           # Rich text formatting
│   ├── tensor_utils.py         # Tensor manipulation
│   ├── torch_utils.py          # PyTorch utilities
│   └── utils.py                # General utilities
├── wrappers/                   # Lightning module wrappers
│   ├── base.py                 # Base wrapper class
│   ├── bindingsites.py         # VN-EGNN wrapper
│   └── equipocket.py           # Equipocket wrapper
├── train.py                    # Training script
└── eval.py                     # Evaluation script
@misc{sestak2024vnegnn,
title={VN-EGNN: E(3)-Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site Identification},
author={Florian Sestak and Lisa Schneckenreiter and Johannes Brandstetter and Sepp Hochreiter and Andreas Mayr and Günter Klambauer},
year={2024},
eprint={2404.07194},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MIT
The ELLIS Unit Linz, the LIT AI Lab, the Institute for Machine Learning, are supported by the Federal State Upper Austria. We thank the projects AI-MOTION (LIT-2018-6-YOU-212), DeepFlood (LIT-2019-8-YOU-213), Medical Cognitive Computing Center (MC3), INCONTROL-RL (FFG-881064), PRIMAL (FFG-873979), S3AI (FFG-872172), DL for GranularFlow (FFG-871302), EPILEPSIA (FFG-892171), AIRI FG 9-N (FWF-36284, FWF-36235), AI4GreenHeatingGrids (FFG-899943), INTEGRATE (FFG-892418), ELISE (H2020-ICT-2019-3 ID: 951847), Stars4Waters (HORIZON-CL6-2021-CLIMATE-01-01). We thank Audi.JKU Deep Learning Center, TGW LOGISTICS GROUP GMBH, Silicon Austria Labs (SAL), FILL Gesellschaft mbH, Anyline GmbH, Google, ZF Friedrichshafen AG, Robert Bosch GmbH, UCB Biopharma SRL, Merck Healthcare KGaA, Verbund AG, GLS (Univ. Waterloo), Software Competence Center Hagenberg GmbH, TÜV Austria, Frauscher Sensonic, TRUMPF and the NVIDIA Corporation. We acknowledge EuroHPC Joint Undertaking for awarding us access to Karolina at IT4Innovations, Czech Republic; MeluXina at LuxProvide, Luxembourg; LUMI at CSC, Finland.

