Predicting and explaining the impact of genetic disruptions and interactions on cell and organismal viability
This repository contains all the source code necessary to reproduce the results of our paper, "Predicting and explaining the impact of genetic disruptions and interactions on cell and organismal viability".
The following files are responsible for extracting features and tasks from raw bioinformatic data.
- create_ppc.py creates the protein-protein interaction networks for budding yeast, fission yeast, human, and fruit fly.
- create_tasks.py generates the single-, double-, and triple-mutant tasks studied in the paper. It assumes that create_ppc.py has already been executed.
- create_features.py creates the single and pairwise gene features for all four organisms. This requires the owltool application from geneontology.org to be present in ../tools, and requires an NCBI-Blast+ installation (on Ubuntu, this can be installed via sudo apt install ncbi-blast+).
- create_datasets.py combines features and tasks into one CSV file per task. GI and Triple GI tasks only include the pairwise features, as it would be too much to include the features of individual genes. For those tasks, the GI and Triple GI models require the single gene feature files as well.
- create_pseudo_triplets_task.py creates randomly sampled pseudo triplets within and across complexes.
- explore_hybrid_costanzo.ipynb examines the overlap between the Costanzo and BioGRID datasets.
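The dependency between the scripts above can be made explicit with a small runner. This is only a sketch: the order below follows the list, but the only ordering the text states outright is that create_tasks.py comes after create_ppc.py, so the rest is an assumption. The helper names `build_commands` and `run_all` are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Order taken from the list above. Only "create_tasks.py after create_ppc.py"
# is stated explicitly; the remaining order is an assumption.
EXTRACTION_SCRIPTS = [
    "create_ppc.py",
    "create_tasks.py",
    "create_features.py",
    "create_datasets.py",
    "create_pseudo_triplets_task.py",
]


def build_commands(scripts, python=sys.executable):
    """Return one argv list per script, preserving the given order."""
    return [[python, script] for script in scripts]


def run_all(scripts, dry_run=True):
    """Run each script in order, stopping at the first failure.

    With dry_run=True the commands are only printed, not executed.
    """
    for cmd in build_commands(scripts):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a script exits non-zero
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(EXTRACTION_SCRIPTS, dry_run=True)
```

Calling `run_all(EXTRACTION_SCRIPTS, dry_run=False)` from the directory containing the scripts would execute them one after another.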
The scripts above require the original third-party data files (e.g., BioGRID, UniProt) to be present in the ../data-sources directory. Because many such files are required and they are difficult to download and organize correctly, we provide a zip file containing all the processed data necessary to replicate the analyses and modeling experiments, so the user doesn't need to deal with the original third-party data. The file can be downloaded here.
After downloading, unzip the contents of the file into the ../generated-data directory.
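The extraction step can also be done from Python with the standard library. This is a minimal sketch; the function name `extract_processed_data` and the archive file name in the usage comment are hypothetical, since the actual name of the downloaded file is not given here.

```python
import zipfile
from pathlib import Path


def extract_processed_data(zip_path, target_dir="../generated-data"):
    """Extract every member of zip_path into target_dir and return their names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Example (hypothetical archive name):
# extract_processed_data("processed-data.zip")
```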
After downloading and extracting the processed data files, the following scripts should be executed.
- create_mn_datasets.py creates datasets for the MN models, based on those generated by create_datasets.py.
- create_splits.py creates all the cross-validation splits for all tasks studied in the paper. This includes the development/test splits for yeast.
- figures.ipynb produces all the non-modeling figures in the paper.
The following files reproduce the results of the modeling experiments of the paper. They can be run directly; there are no arguments or parameters to pass.
- exp_optimize_hyperparams.py runs the hyperparameter optimization experiments on the single- and double-gene neural network models. Produces Supplementary Table 1.
- exp_feature_selection.py runs feature selection experiments on the development portion of the budding yeast datasets. Produces Supplementary Tables 2, 4, 7.
- exp_yeast_smf.py evaluates the S-Full, S-Refined, S-MN, and null models on the development (CV) and test portions of the yeast SMF dataset. Produces Figure 1A.
- exp_yeast_gi_hybrid.py evaluates the D-Full, D-Refined, D-MN, and null models on the development (CV) and test portions of the yeast hybrid GI dataset. Produces Figure 2A.
- exp_yeast_tgi.py evaluates the T-Full, T-Refined, T-MN, and null models on the development (CV) and test portions of the yeast triple mutant GI dataset. Produces Figure 3A.
- exp_smf_binary.py evaluates the S-Refined, S-MN, and null models on the SMF datasets of all four organisms, as well as the multi-organismal lethal (MO) vs. viable (V) dataset of humans and fruit flies. Training and evaluation are done using CV. Produces Figure 4.
- exp_gi_binary.py evaluates the D-Refined, D-MN (with and without slim GO terms), and null models on the GI datasets of all four organisms. Produces Figure 5.
- exp_gi_costanzo_pombe.py evaluates the 4-way D-Full, D-Refined, D-MN, and null models on the yeast Costanzo GI dataset, and the D-Refined, D-MN, and null models on the pombe GI dataset. Produces Supplementary Figure 5.
- exp_smf_other_orgs.py evaluates the 3-way S-Refined, S-MN, and null models on the pombe, human, and fruit fly SMF datasets. Produces Supplementary Figure 7.
- exp_smf_ca_mo_v.py evaluates the S-Refined, S-MN, and null models on the task of predicting cellular autonomous lethality (CA) vs. multi-organismal lethality (MO) vs. viability (V) in humans and fruit flies. Produces Supplementary Figure 8.
- exp_lit.py compares the binary S-MN and D-MN models to other single-mutant fitness models from the literature on the yeast SMF and hybrid GI datasets. Produces Supplementary Figure 9.
- exp_cross_prediction.py trains the D-MN model on the GI prediction task on the yeast hybrid GI dataset, and evaluates it on the task of predicting GI, coprecipitation, phosphorylation, and transcription. Produces Supplementary Figure 10.
- exp_mn_feature_contribution.py computes the drop in balanced accuracy when each feature of the S-MN, D-MN, and T-MN models is removed. Produces Supplementary Figure 11.
- exp_generalization.py trains S-MN and D-MN models on the yeast SMF and hybrid GI datasets, and evaluates them on the other organisms' datasets. Produces Supplementary Figure 12.
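To illustrate what "drop in balanced accuracy when each feature is removed" means, here is a generic sketch. It is not the code in exp_mn_feature_contribution.py: balanced accuracy is computed by its standard definition (mean per-class recall), and `predict_with` is a hypothetical callable standing in for however a model is evaluated with a given subset of features.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (the standard definition of balanced accuracy)."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def feature_contributions(features, predict_with, y_true):
    """Drop in balanced accuracy when each feature is left out in turn.

    predict_with is a hypothetical callable: it takes the list of features
    to use and returns predicted labels for the evaluation examples.
    """
    baseline = balanced_accuracy(y_true, predict_with(list(features)))
    drops = {}
    for name in features:
        kept = [f for f in features if f != name]
        drops[name] = baseline - balanced_accuracy(y_true, predict_with(kept))
    return drops
```

A larger drop indicates that the model's performance depends more heavily on that feature.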
The configuration files reside under cfgs/ and specify the configuration of the NN and MN models used in the paper. Note that these files specify the models in their "full" form; the refined variants are created dynamically from them in the experiment scripts above.
TensorFlow model classes reside under models/. In addition to the neural network and MN models, a helper module, train_and_evaluate.py, is provided to carry out CV training and evaluation. The module takes advantage of multiple cores to run several splits at the same time.
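The idea of running several CV splits at once can be sketched with the standard library. This is not the implementation in train_and_evaluate.py; `evaluate_split` is a hypothetical stand-in for training and scoring a model on one split (it uses a majority-label baseline so the example runs without TensorFlow), and `run_splits` is likewise an illustrative name.

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate_split(split):
    """Hypothetical stand-in for training and scoring a model on one split.

    Returns the fraction of test labels equal to the majority training label.
    """
    train_labels, test_labels = split
    majority = max(set(train_labels), key=train_labels.count)
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


def run_splits(splits, worker=evaluate_split,
               executor_cls=ProcessPoolExecutor, max_workers=None):
    """Evaluate all splits concurrently; results come back in split order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(worker, splits))
```

With the default process pool, the worker must be a top-level function so it can be pickled, and on platforms that spawn new processes the call to `run_splits` should sit under an `if __name__ == "__main__":` guard.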
The code was tested with the following modules:
- sklearn 1.0.2
- dcor 0.5.3
- Bio 1.79
- igraph 0.9.10
- matplotlib 3.5.1
- networkx 2.8
- numpy 1.22.3
- obonet 0.3.0
- pandas 1.4.2
- scipy 1.8.0
- seaborn 0.11.2
- statsmodels 0.13.2
- tensorflow 2.8.0
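A quick way to compare an environment against the versions listed above is sketched below. The helper `compare_versions` is hypothetical, and it assumes each module exposes a `__version__` attribute, which may not hold for every package; in that case the installed version is reported as None.

```python
import importlib

# Module names and versions copied from the list above.
TESTED_VERSIONS = {
    "sklearn": "1.0.2",
    "dcor": "0.5.3",
    "Bio": "1.79",
    "igraph": "0.9.10",
    "matplotlib": "3.5.1",
    "networkx": "2.8",
    "numpy": "1.22.3",
    "obonet": "0.3.0",
    "pandas": "1.4.2",
    "scipy": "1.8.0",
    "seaborn": "0.11.2",
    "statsmodels": "0.13.2",
    "tensorflow": "2.8.0",
}


def compare_versions(expected, importer=importlib.import_module):
    """Map each module name to a (tested, installed) pair of version strings.

    installed is None when the module cannot be imported or has no
    __version__ attribute (assumption: not every module defines one).
    """
    report = {}
    for name, wanted in expected.items():
        try:
            module = importer(name)
        except ImportError:
            report[name] = (wanted, None)
            continue
        report[name] = (wanted, getattr(module, "__version__", None))
    return report


if __name__ == "__main__":
    for name, (tested, installed) in compare_versions(TESTED_VERSIONS).items():
        print(f"{name}: tested {tested}, installed {installed}")
```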