Augmenting molecular structure representation learning using semantic biomedical knowledge
- Python ≥ 3.8.0;
- requirements.txt lists the required Python packages.
The data used are available in the following Box folder, which contains:
- data/ contains the pretraining dataset, the knowledge graphs together with the corresponding dictionary of entities and their ids, the classification datasets (datasets_valid_and_splits/ contains, for each assay, the tabular dataset with SMILES strings, MACCS keys, chemical names and labels, plus the training, validation and test indices for the 5 random runs), and tsne_2d_embeddings_all_chemicals_37tox21_emb.xlsx, a dataframe containing chemical names, MACCS keys, physical properties and the 2D t-SNE projections for all the n = 8541 chemicals that belong to the set of the 37 Tox21 assays considered;
- ckpt/ contains the pretrained GNN molecule encoder.
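As a rough illustration of how index-based splits like the ones described above are typically applied, the sketch below selects training, validation and test rows from a small table. The column names, the toy rows and the `select_split` helper are all hypothetical; the actual file layout in datasets_valid_and_splits/ may differ.

```python
from typing import Sequence

import pandas as pd

# Hypothetical toy table; real column names in the repository may differ.
assay = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCC"],
    "name": ["ethanol", "benzene", "acetic acid", "ethylamine", "butane"],
    "label": [0, 1, 0, 0, 1],
})

# Hypothetical split indices for one random run (seed); they are row positions.
splits = {"train": [0, 2, 4], "valid": [1], "test": [3]}


def select_split(table: pd.DataFrame, indices: Sequence[int]) -> pd.DataFrame:
    """Return the rows at the given positions, in the order listed."""
    return table.iloc[list(indices)].reset_index(drop=True)


train = select_split(assay, splits["train"])
valid = select_split(assay, splits["valid"])
test = select_split(assay, splits["test"])

# Sanity check: the three subsets are disjoint and cover every row once.
all_idx = splits["train"] + splits["valid"] + splits["test"]
assert sorted(all_idx) == list(range(len(assay)))
```

Storing indices rather than copies of the data keeps the 5 random runs reproducible against a single table per assay.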
- Machine learning: baseline ML models can be trained by running ML.py. The results are written to results/ML, with one directory per random run (seed).

python ML.py

- Finetune MolCLR: MolCLR can be fine-tuned by running MolCLR/finetune.py. The results are written to results/graph_structure_comptoxAI, with one directory per random run (seed).

python MolCLR/finetune.py

- Semantic GNN: the Semantic GNN model can be trained by running semantic.py. The results are written to results/semantic_gat, with one directory per random run (seed).

python semantic.py

- MolCLR+Sem: the MolCLR+Sem model can be trained by running semantic_and_MolCLR.py. The results are written to results/semantic_and_graph, with one directory per random run (seed).

python semantic_and_MolCLR.py

Explanations for positive chemicals can be generated with GNNExplainer by running the explain.py script. The results are written to results/gnn_xai, with one directory per random run.

python explain.py

The evaluation.py script contains code for:
- computing pretrained embeddings for all the chemicals involved in the Tox21 assays considered, projecting them in 2D with t-SNE and colouring them according to chemical and physical properties of the molecules (extracted from ComptoxAI or through the PubChem API);
- processing the classification results by computing the mean classification metrics for each model and each assay, producing a dataframe that can be used to draw the violin plot of the mean results across all assays and the heatmap of the single-assay results;
- processing the XAI results by thresholding the number of edges to keep, and creating images of the molecule graph and of the most important subgraph identified for a specific input compound.
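
The edge-thresholding step in the last item can be sketched as follows. This is a minimal illustration and not the code in the evaluation script: it assumes the explainer yields one importance score per edge, and the names `top_k_edges`, `subgraph_nodes` and the toy scores are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def top_k_edges(edge_scores: Dict[Edge, float], k: int) -> List[Edge]:
    """Return the k edges with the highest importance scores, best first."""
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(edge_scores.items(), key=lambda item: item[1], reverse=True)
    return [edge for edge, _ in ranked[:k]]


def subgraph_nodes(edges: List[Edge]) -> List[int]:
    """Return the sorted atom indices touched by the kept edges."""
    return sorted({node for edge in edges for node in edge})


# Hypothetical importance scores for the bonds of a five-atom molecule.
scores = {(0, 1): 0.10, (1, 2): 0.85, (2, 3): 0.60, (3, 4): 0.05}

kept = top_k_edges(scores, k=2)   # edges retained after thresholding
atoms = subgraph_nodes(kept)      # atoms to highlight in the molecule image
```

Keeping a fixed number of edges, rather than cutting at a fixed score, gives subgraphs of comparable size across molecules whose score ranges differ.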
python evaluate.py
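
The result-processing step of the evaluation script, averaging metrics over random runs for each model and assay, can be sketched with pandas as below. The column names (`model`, `assay`, `seed`, `auroc`) and all metric values are made up for illustration; they are not results from this repository, and the script's actual dataframe layout may differ.

```python
import pandas as pd

# Hypothetical long-format results: one row per (model, assay, seed).
runs = pd.DataFrame({
    "model": ["ML", "ML", "MolCLR", "MolCLR", "ML", "ML", "MolCLR", "MolCLR"],
    "assay": ["assay_A"] * 4 + ["assay_B"] * 4,
    "seed": [0, 1, 0, 1, 0, 1, 0, 1],
    "auroc": [0.70, 0.74, 0.80, 0.84, 0.60, 0.64, 0.66, 0.70],
})

# Mean over seeds for each (model, assay) pair. A violin plot per model
# would draw the distribution of these assay-level means.
per_assay = runs.groupby(["model", "assay"], as_index=False)["auroc"].mean()

# Wide layout (assays as rows, models as columns), convenient for a heatmap.
heatmap_table = per_assay.pivot(index="assay", columns="model", values="auroc")

# One summary number per model: the mean of its assay-level means.
per_model_mean = per_assay.groupby("model")["auroc"].mean()
```

Averaging over seeds first and over assays second gives each assay equal weight in the per-model summary, whatever the number of runs per assay.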