VN-EGNN: E(3)-Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site Identification

Overview

Implementation of VN-EGNN, a state-of-the-art method for protein binding site identification, by Florian Sestak, Lisa Schneckenreiter, Johannes Brandstetter, Sepp Hochreiter, Andreas Mayr, Günter Klambauer. This repository contains all code, instructions, and model weights necessary to run the method or to retrain a model. If you have any questions, feel free to open an issue or reach out to sestak@ml.jku.at.

Installation

Requirements

  • Python 3.9+
  • PyTorch 2.1+ (2.7+ recommended)
  • CUDA 11.8+ or 12.x (for GPU support)
  • PyTorch Geometric 2.4+

Quick Setup

1. Clone the repository:

git clone https://github.com/ml-jku/vnegnn
cd vnegnn

2. Create and activate conda environment:

conda env create -f environment.yaml
conda activate vnegnn

3. Install PyTorch with CUDA support:

Choose the appropriate command based on your CUDA version. Visit PyTorch Get Started for other configurations.

For CUDA 12.x:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

4. Install PyTorch Geometric and CUDA-dependent extensions:

The torch-scatter, torch-sparse, and torch-cluster packages are CUDA-version specific and must match your PyTorch and CUDA versions.

First, check your PyTorch and CUDA versions:

python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.version.cuda}')"

Then install PyTorch Geometric and extensions. Replace ${TORCH} with your PyTorch version (e.g., 2.1.0, 2.7.0) and ${CUDA} with your CUDA version (e.g., cu121, cu118, cpu):

pip install torch-geometric
pip install torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html

Examples:

For PyTorch 2.7.0 with CUDA 12.8:

pip install torch-geometric
pip install torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-2.7.0+cu128.html
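The ${TORCH} and ${CUDA} substitution described above can be sketched in Python. This is a minimal illustration, not part of the repository: the helper name `pyg_wheel_url` is hypothetical, and it assumes the two inputs come from `torch.__version__` and `torch.version.cuda` as printed by the version check earlier in this step.

```python
from typing import Optional

PYG_WHEEL_BASE = "https://data.pyg.org/whl"


def pyg_wheel_url(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build the wheel index URL for a given PyTorch/CUDA pair (hypothetical helper)."""
    # torch.__version__ may carry a local suffix such as "2.7.0+cu128"; keep only "2.7.0".
    torch_tag = torch_version.split("+", 1)[0]
    # torch.version.cuda is None on CPU-only builds and e.g. "12.8" otherwise.
    cuda_tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"{PYG_WHEEL_BASE}/torch-{torch_tag}+{cuda_tag}.html"


# Values as they would be reported for PyTorch 2.7.0 built against CUDA 12.8.
print(pyg_wheel_url("2.7.0+cu128", "12.8"))
```

The resulting URL is what you would pass to `pip install ... -f`. The wheel index may not have a page for every PyTorch patch release, so check the computed URL against the PyTorch Geometric Installation Guide before relying on it.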

Note: See the PyTorch Geometric Installation Guide for the full list of available wheel versions.

Verify Installation

python -c "import torch; import torch_geometric; print(f'PyTorch: {torch.__version__}'); print(f'PyG: {torch_geometric.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
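To compare the printed versions against the minimums listed under Requirements, a small check along these lines may help. It is only a sketch: `parse_version`, `meets_minimum`, and `MINIMUMS` are hypothetical names, and the comparison looks only at the leading numeric part of each version string.

```python
import re
from typing import Tuple

# Minimum versions taken from the Requirements list above.
MINIMUMS = {"python": (3, 9), "torch": (2, 1), "torch_geometric": (2, 4)}


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. "2.7.0+cu128" -> (2, 7, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: Tuple[int, ...]) -> bool:
    """Return True if `version` is at least `minimum` (tuple comparison)."""
    return parse_version(version) >= minimum


# Example with literal version strings; in practice pass torch.__version__ etc.
for name, found in {"torch": "2.7.0+cu128", "torch_geometric": "2.4.0"}.items():
    status = "ok" if meets_minimum(found, MINIMUMS[name]) else "too old"
    print(f"{name} {found}: {status}")
```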

Set up environment variables

Set the environment variables used for logging in .env; a template is provided in .env.template.

source .env
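Before starting a run, it can be useful to confirm that the expected variables are actually set. The sketch below parses simple KEY=VALUE lines and reports missing keys. The function names are hypothetical, and the variable names in the example (`WANDB_PROJECT`, `WANDB_ENTITY`) are assumptions for illustration; use the keys that appear in .env.template. It does not reproduce shell expansion or multi-line values.

```python
import os
from typing import Dict, Iterable, List, Mapping, Optional


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment, skip it
        # Drop surrounding single or double quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(required: Iterable[str],
                 env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required keys that are unset or empty in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


# Hypothetical file content; the real keys are defined in .env.template.
example = "# logging\nexport WANDB_PROJECT=vnegnn\nWANDB_ENTITY=''\n"
print(missing_keys(["WANDB_PROJECT", "WANDB_ENTITY"], parse_env(example)))
```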

Data

The processed datasets can be downloaded from this link. Place the datasets in the folder data/data, then run the following command to set up the files used for training:

./process_data.sh

To rerun the Equipocket baseline, you need to specify the MSMS path for surface generation.

./process_data_equipocket.sh

The splits for each experiment are provided in the uploaded dataset, e.g. COACH420 (data/data/coach420/splits).

Experiments

Experiments are logged via Weights and Biases; use the [RUN_ID] of a training run to evaluate the model. To reproduce the results in our publication, run the following commands for the individual experiments. The evaluation metrics are logged in wandb and can then be exported as CSV for further processing. If you run an experiment, our script saves the graphs as a dataset, as described in PyTorch Geometric.
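One possible way to do the CSV export programmatically is through the W&B public API. The sketch below is an assumption about workflow, not code from this repository: `write_metrics_csv` and `fetch_history_rows` are hypothetical helpers, and the run path `<entity>/<project>/<run_id>` must be filled in with your own values. The `wandb` import is deferred so that the CSV writer can be used on its own.

```python
import csv
from typing import Dict, Iterable, List, Mapping


def write_metrics_csv(rows: Iterable[Mapping[str, object]], path: str) -> None:
    """Write metric dicts to CSV, using the sorted union of keys as columns."""
    rows = list(rows)
    columns = sorted({key for row in rows for key in row})
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)


def fetch_history_rows(run_path: str) -> List[Dict[str, object]]:
    """Fetch logged rows for one run; run_path is "<entity>/<project>/<run_id>"."""
    import wandb  # imported lazily; requires a W&B login and network access

    run = wandb.Api().run(run_path)
    return [dict(row) for row in run.scan_history()]


# Offline example with made-up rows; metric names here are placeholders.
write_metrics_csv([{"step": 0, "metric_a": 0.5}, {"step": 1}], "metrics_example.csv")
```

With a real run, `write_metrics_csv(fetch_history_rows("<entity>/<project>/<run_id>"), "metrics.csv")` would produce one row per logged step.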

VN-EGNN

# Train
python src/train.py experiment=vnegnn

# Eval
python src/eval.py wandb_run_id=[RUN_ID]

VN-EGNN (Train PDBBind2020)

# Train
python src/train.py experiment=vnegnn_pdbbind2020

# Eval
python src/eval.py wandb_run_id=[RUN_ID]

VN-EGNN (Train GRASP benchmark)

We also evaluated VN-EGNN on a different scPDB dataset split proposed by GrASP. The datasets for this can be found in their repository. (To rerun their experiments, place their datasets under data/grasp and run the data processing pipeline described above on this folder.)

# Train
python src/train.py experiment=vnegnn_pdbbind2020

# Eval
python src/eval.py wandb_run_id=[RUN_ID]

Baseline Equipocket

# Train
python src/train.py experiment=equipocket

# Eval
python src/eval.py wandb_run_id=[RUN_ID]
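All train and eval commands above follow the same pattern: a script path plus a single key=value override. If you want to launch them from a script, a minimal sketch could look like this. The helper names are hypothetical; the experiment names and the `wandb_run_id` key are the ones shown in the commands above.

```python
import shlex
from typing import List

# Experiment names as they appear in the commands above.
EXPERIMENTS = ["vnegnn", "vnegnn_pdbbind2020", "equipocket"]


def train_command(experiment: str) -> List[str]:
    """Argument list for training one experiment config."""
    return ["python", "src/train.py", f"experiment={experiment}"]


def eval_command(run_id: str) -> List[str]:
    """Argument list for evaluating a finished W&B run."""
    return ["python", "src/eval.py", f"wandb_run_id={run_id}"]


for name in EXPERIMENTS:
    # Print the shell form; pass the list to subprocess.run(..., check=True) to execute.
    print(shlex.join(train_command(name)))
```

Building argument lists rather than shell strings avoids quoting problems if a run ID or override ever contains special characters.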

Project structure

Configuration

Configuration is managed with Hydra configs, structured as follows.

📁 configs
├── 📁 callbacks                # Callbacks (e.g. checkpointing, ...)
├── 📁 data                     # Dataset configs
├── 📁 debug                    # Debug configs
├── 📁 experiment               # Contains all experiments reported in the publication.
├── 📁 extras                   # Extra configurations.
├── 📁 hydra                    # Hydra configurations.
├── 📁 local                    # Local setup files.
├── 📁 logger                   # Logger setup (wandb logger was used for all experiments)
├── 📁 model                    # Model configurations
├── 📁 paths                    # Paths setup.
├── 📁 trainer                  # Lightning trainer configuration
├── 📄 eval.yaml                # Eval config.
└── 📄 train.yaml               # Train config.
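Conceptually, selecting `experiment=<name>` layers the values of an experiment config on top of the defaults assembled from the other config groups. The following sketch illustrates that layering with a plain recursive dictionary merge. It is not Hydra's implementation, and the keys and values are hypothetical rather than taken from this repository's configs.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where `override` wins; nested dicts are merged recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults (as if composed from trainer/ and model/ groups).
defaults = {"trainer": {"max_epochs": 10, "accelerator": "gpu"}, "seed": 1}
# Hypothetical experiment config that changes only what differs.
experiment = {"trainer": {"max_epochs": 100}, "tags": ["example"]}

print(merge_config(defaults, experiment))
```

Keeping experiment files limited to the values that differ from the defaults makes each reported experiment easy to compare against the others.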

Source code

The following shows the structure of the source code. The training pipeline is set up with PyTorch Lightning.

📁 src
├── 📁 datasets                    # Dataset implementations
│   ├── 📄 binding_dataset.py      # Binding site dataset class
│   ├── 📄 equipocket_dataset.py   # Equipocket dataset class
│   └── 📄 utils.py                # Dataset utilities
├── 📁 models                      # Model architectures
│   ├── 📁 equipocket              # Equipocket baseline models
│   │   ├── 📄 baseline_models.py  # Baseline model implementations
│   │   ├── 📄 egnn_clean.py       # Clean EGNN implementation
│   │   ├── 📄 equipocket.py       # Equipocket model
│   │   └── 📄 surface_egnn.py     # Surface-based EGNN
│   └── 📁 vnegnn                  # VN-EGNN models
│       ├── 📄 aggregation.py      # Aggregation layers
│       ├── 📄 utils.py            # Model utilities
│       └── 📄 vnegnn.py           # VN-EGNN implementation
├── 📁 modules                     # Training components
│   ├── 📄 callbacks.py            # Custom Lightning callbacks
│   ├── 📄 cluster.py              # Clustering utilities
│   ├── 📄 ema.py                  # Exponential moving average
│   ├── 📄 losses.py               # Loss functions
│   ├── 📄 metrics.py              # Evaluation metrics
│   └── 📄 schedulers.py           # Learning rate schedulers
├── 📁 utils                       # Utility functions
│   ├── 📄 constants.py            # Constants and definitions
│   ├── 📄 graph.py                # Graph processing utilities
│   ├── 📄 instantiators.py        # Hydra instantiation helpers
│   ├── 📄 logging_utils.py        # Logging utilities
│   ├── 📄 misc.py                 # Miscellaneous utilities
│   ├── 📄 protein.py              # Protein processing
│   ├── 📄 pylogger.py             # Python logger
│   ├── 📄 rich_utils.py           # Rich text formatting
│   ├── 📄 tensor_utils.py         # Tensor manipulation
│   ├── 📄 torch_utils.py          # PyTorch utilities
│   └── 📄 utils.py                # General utilities
├── 📁 wrappers                    # Lightning module wrappers
│   ├── 📄 base.py                 # Base wrapper class
│   ├── 📄 bindingsites.py         # VNEGNN wrapper
│   └── 📄 equipocket.py           # Equipocket wrapper
├── 📄 train.py                    # Training script
└── 📄 eval.py                     # Evaluation script

Citation

@misc{sestak2024vnegnn,
    title={VN-EGNN: E(3)-Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site Identification},
    author={Florian Sestak and Lisa Schneckenreiter and Johannes Brandstetter and Sepp Hochreiter and Andreas Mayr and Günter Klambauer},
    year={2024},
    eprint={2404.07194},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

License

MIT

Acknowledgements

The ELLIS Unit Linz, the LIT AI Lab, the Institute for Machine Learning, are supported by the Federal State Upper Austria. We thank the projects AI-MOTION (LIT-2018-6-YOU-212), DeepFlood (LIT-2019-8-YOU-213), Medical Cognitive Computing Center (MC3), INCONTROL-RL (FFG-881064), PRIMAL (FFG-873979), S3AI (FFG-872172), DL for GranularFlow (FFG-871302), EPILEPSIA (FFG-892171), AIRI FG 9-N (FWF-36284, FWF-36235), AI4GreenHeatingGrids (FFG-899943), INTEGRATE (FFG-892418), ELISE (H2020-ICT-2019-3 ID: 951847), Stars4Waters (HORIZON-CL6-2021-CLIMATE-01-01). We thank Audi.JKU Deep Learning Center, TGW LOGISTICS GROUP GMBH, Silicon Austria Labs (SAL), FILL Gesellschaft mbH, Anyline GmbH, Google, ZF Friedrichshafen AG, Robert Bosch GmbH, UCB Biopharma SRL, Merck Healthcare KGaA, Verbund AG, GLS (Univ. Waterloo), Software Competence Center Hagenberg GmbH, TÜV Austria, Frauscher Sensonic, TRUMPF and the NVIDIA Corporation. We acknowledge EuroHPC Joint Undertaking for awarding us access to Karolina at IT4Innovations, Czech Republic; MeluXina at LuxProvide, Luxembourg; LUMI at CSC, Finland.
