A balanced end-to-end deep learning model for interactome prediction from co-fractionation/mass-spectrometry (CF-MS) data
SPIFFED is modified from Elution Profile-Based Inference of Protein Complexes (EPIC), a widely used protein-protein interaction (PPI) predictor and protein complex inference tool. Unlike EPIC, SPIFFED applies a convolutional neural network directly to raw co-elution data, which removes the need for manual feature engineering. This approach enhances the accuracy of protein interaction predictions.
To install SPIFFED, first make sure you have Python 2.7, then run:
$ git clone https://github.com/bio-it-station/SPIFFED
$ cd SPIFFED
$ conda create -n "EPIC_test" python=2.7.16
$ conda activate EPIC_test
$ pip install -r requirements.txt
$ pip install beautifulsoup4
$ pip install tensorflow==1.13.1
$ pip install Keras==2.2.4
$ conda install rpy2
$ pip install scikit-plot
Here is a list of dependent packages:
1. scikit-learn
2. requests
3. beautifulsoup4
4. mock
5. kohonen
6. numpy
7. matplotlib
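After installation, a quick check that each listed package can be imported may save a failed run later. The sketch below is not part of SPIFFED; the mapping from package names to import names (for example, scikit-learn to sklearn and beautifulsoup4 to bs4) is an assumption based on the usual module names, and `check_imports` is a hypothetical helper.

```python
import importlib

# Import names are assumptions: the usual module names for the packages
# listed above, not something stated in this README.
IMPORT_NAMES = {
    "scikit-learn": "sklearn",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "mock": "mock",
    "kohonen": "kohonen",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
}


def check_imports(import_names):
    """Return {package: True/False} depending on whether each module imports."""
    status = {}
    for package, module in sorted(import_names.items()):
        try:
            importlib.import_module(module)
            status[package] = True
        except ImportError:
            status[package] = False
    return status


if __name__ == "__main__":
    # Print one line per package so missing ones are easy to spot.
    for package, ok in sorted(check_imports(IMPORT_NAMES).items()):
        print("%-15s %s" % (package, "ok" if ok else "MISSING"))
```

Run it inside the conda environment created above; any package reported as MISSING can then be installed with pip or conda.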
SPIFFED is run with a single command:
python ./src/main.py -s feature_selection input_directory -c gold_standard_file_path output_directory \
    -o output_filename_prefix -M training_method -n number_of_cores -m EXP -f STRING \
    --LEARNING_SELECTION learning_method_selection --K_D_TRAIN fold_or_direct_training \
    --FOLD_NUM number_of_folds --TRAIN_TEST_RATIO testing_data_ratio --POS_NEG_RATIO negative_PPIs_ratio \
    --NUM_EP number_of_elution_profiles --NUM_FRC number_of_fractions --CNN_ENSEMBLE ensemble_bool
- (-s feature_selection) or (--feature_selection feature_selection): Specifies which correlation scores SPIFFED uses. Eight correlation scores plus the raw elution profile are implemented, in this order: Mutual Information, Bayes Correlation, Euclidean Distance, Weighted Cross-Correlation, Jaccard Score, PCCN, Pearson Correlation Coefficient, Apex Score, and Raw elution profile. In the string, "0" means the score is not used and "1" means it is used. To specify the correlation scores to use:
  - If you want to run Convolutional Neural Network (CNN) or Label Spreading (LS), you must set this parameter to "-s 000000001" (note that there are 9 characters in the string).
  - If you want to run EPPC with SPIFFED scores, you can set this parameter to, for example, "-s 11101001" (note that there are 8 characters in the string). This example uses Mutual Information, Bayes Correlation, Euclidean Distance, Jaccard Score, and Apex Score.
- input_directory: The directory that contains your elution profile file. An absolute path is recommended over a relative path.
- (-c gold_standard_file_path) or (--cluster gold_standard_file_path): The path to the gold standard file that you curated.
- output_directory: The path to the output directory. Make sure the directory already exists before running the command. An absolute path is recommended over a relative path.
- (-o output_filename_prefix) or (--output_prefix output_filename_prefix): A prefix for all output files. The default is "Out".
- (-M training_method) or (--classifier training_method): The classifier to use. Possible options are RF, CNN, and LS. Note that RF must be combined with selected SPIFFED scores such as "-s 11101001" rather than the raw elution profile ("-s 000000001"). CNN and LS must be combined with the raw elution profile ("-s 000000001").
- (-n number_of_cores) or (--num_cores number_of_cores): The number of cores used to run EPPC. The default is 1. For example, to run SPIFFED on six cores, set "-n 6".
- --LEARNING_SELECTION learning_method_selection: Selects supervised or semi-supervised learning. For supervised learning, set "--LEARNING_SELECTION sl" (training_method can be RF or CNN); for semi-supervised learning, set "--LEARNING_SELECTION ssl" (training_method can be CNN or LS).
- --K_D_TRAIN fold_or_direct_training: Set d to train the model directly; set k to run k-fold training. (options: d and k; default: d)
- --FOLD_NUM number_of_folds: If you set "--K_D_TRAIN k", this parameter is the number of folds used to evaluate your model. Note that this parameter must be greater than 2. (default: 5)
- --TRAIN_TEST_RATIO testing_data_ratio: The ratio of testing data to all data. (default: 0.3)
- --POS_NEG_RATIO negative_PPIs_ratio: The ratio of negative PPIs to positive PPIs. (default: 1)
- --NUM_EP number_of_elution_profiles: The number of elution profiles inside each PPI. (default: 2)
- --NUM_FRC number_of_fractions: The number of fractions in the elution profile file. (default: 27)
- --CNN_ENSEMBLE ensemble_bool: A boolean value. If it is 0, you need to provide one elution profile; if it is 1, you need to provide multiple elution profiles.
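The feature-selection string and the classifier rules above are easy to get wrong by hand. The following is a minimal sketch, not part of SPIFFED, that builds the eight-character string from score names and checks the -s, -M, and --LEARNING_SELECTION combination against the rules listed above. The names `encode_scores`, `decode_scores`, and `check_combination` are hypothetical helpers introduced only for illustration.

```python
# Order of the positions in the -s string, as listed in this README.
SCORE_ORDER = [
    "Mutual Information",
    "Bayes Correlation",
    "Euclidean Distance",
    "Weighted Cross-Correlation",
    "Jaccard Score",
    "PCCN",
    "Pearson Correlation Coefficient",
    "Apex Score",
]

RAW_PROFILE_FLAG = "000000001"  # nine characters: raw elution profile only


def encode_scores(selected):
    """Build the eight-character -s string for a set of score names."""
    unknown = set(selected) - set(SCORE_ORDER)
    if unknown:
        raise ValueError("unknown score(s): %s" % ", ".join(sorted(unknown)))
    return "".join("1" if name in selected else "0" for name in SCORE_ORDER)


def decode_scores(flag):
    """List the score names switched on in an eight-character -s string."""
    if len(flag) != len(SCORE_ORDER) or set(flag) - set("01"):
        raise ValueError("expected %d characters of 0/1" % len(SCORE_ORDER))
    return [name for name, bit in zip(SCORE_ORDER, flag) if bit == "1"]


def check_combination(flag, classifier, learning):
    """Check -s, -M and --LEARNING_SELECTION against the rules listed above."""
    allowed = {"sl": ("RF", "CNN"), "ssl": ("CNN", "LS")}
    if learning not in allowed:
        raise ValueError("--LEARNING_SELECTION must be sl or ssl")
    if classifier not in allowed[learning]:
        raise ValueError("%s is not available with %s" % (classifier, learning))
    if classifier in ("CNN", "LS") and flag != RAW_PROFILE_FLAG:
        raise ValueError("CNN and LS require -s %s" % RAW_PROFILE_FLAG)
    if classifier == "RF" and flag == RAW_PROFILE_FLAG:
        raise ValueError("RF requires selected scores, not the raw profile")
    return True
```

For example, `encode_scores` applied to Mutual Information, Bayes Correlation, Euclidean Distance, Jaccard Score, and Apex Score reproduces the "11101001" string used in the example above, and `check_combination("000000001", "CNN", "sl")` passes while pairing CNN with "11101001" raises an error.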
To run SPIFFED:
python ./main.py -s 000000001 /ccb/salz3/kh.chao/SPIFFED/input/EPIC_DATA/beadsALF -c /ccb/salz3/kh.chao/SPIFFED/input/EPIC_DATA/Worm_reference_complexes.txt /ccb/salz3/kh.chao/SPIFFED/output/EPIC_DATA/beadsALF/TEST/CNN_SL/FOLDS/beadsALF__K_D__k__CNN_SL__fold_number_5__negative_ratio_5__test_ratio_30 -o TEST -M CNN -n 10 -m EXP -f STRING --LEARNING_SELECTION sl --K_D_TRAIN k --FOLD_NUM 5 --TRAIN_TEST_RATIO 0.3 --POS_NEG_RATIO 5 --CNN_ENSEMBLE 0
To run SPIFFED with the ensemble model:
python ./main.py -s 000000001 /home/kuan-hao/SPIFFED/input/OUR_DATA/intensity_HML_ensemble/ -c /home/kuan-hao/SPIFFED/input/OUR_DATA/gold_standard.tsv /home/kuan-hao/SPIFFED/output/SELF_DATA/intensity_HML_ensemble__negative_ratio_5/ -o out -M CNN -n 10 -m EXP -f STRING --LEARNING_SELECTION sl --K_D_TRAIN d --FOLD_NUM 5 --TRAIN_TEST_RATIO 0.7 --POS_NEG_RATIO 5 --NUM_EP 2 --NUM_FRC 27 --CNN_ENSEMBLE 1
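Because the command lines above are long, it can help to assemble them programmatically. The sketch below is one possible wrapper, not part of SPIFFED: `build_command` and `prepare_output_dir` are hypothetical helpers, the argument order simply mirrors the examples above, and the defaults for `classifier` and `cnn_ensemble` are assumptions (the other defaults are the ones stated in the parameter list). The --NUM_EP and --NUM_FRC options are omitted for brevity.

```python
import os


def build_command(input_dir, gold_standard, output_dir, prefix="Out",
                  classifier="CNN", scores="000000001", cores=1,
                  learning="sl", k_d_train="d", fold_num=5,
                  train_test_ratio=0.3, pos_neg_ratio=1, cnn_ensemble=0,
                  script="./main.py"):
    """Assemble the argument list in the same order as the examples above."""
    return [
        "python", script, "-s", scores, input_dir, "-c", gold_standard,
        output_dir, "-o", prefix, "-M", classifier, "-n", str(cores),
        "-m", "EXP", "-f", "STRING",
        "--LEARNING_SELECTION", learning,
        "--K_D_TRAIN", k_d_train,
        "--FOLD_NUM", str(fold_num),
        "--TRAIN_TEST_RATIO", str(train_test_ratio),
        "--POS_NEG_RATIO", str(pos_neg_ratio),
        "--CNN_ENSEMBLE", str(cnn_ensemble),
    ]


def prepare_output_dir(output_dir):
    """Create the output directory if missing, since it must exist beforehand."""
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    return output_dir
```

With placeholder paths of your own, the returned list can be passed to `subprocess.check_call` from the standard library after calling `prepare_output_dir` on the output path. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.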
