Analysis pipeline for TPCAV
You can use your own environment for the model. In addition, you need to install the following packages:
- captum 0.7
- seqchromloader 0.8.5
- scikit-learn 1.5.2
Since not every saved PyTorch model stores the computation graph, you need to manually add functions that tell the script how to get the activations of the intermediate layer and how to continue the forward pass from there.
There are 4 places where you need to insert your own code.
- Model class definition in `models.py`
  - Please first copy your class definition into `Model_Class` in the script. It already has several pre-defined class functions; you need to fill in the following two:
    - `forward_until_select_layer`: takes your model input and runs the forward pass up to the layer you want to compute the TPCAV score on
    - `resume_forward_from_select_layer`: starts from the activations of your selected layer and runs the forward pass all the way to the end
  - There are also functions necessary for TPCAV computation; don't change them:
    - `forward_from_start`: calls `forward_until_select_layer` and `resume_forward_from_select_layer` to do a full forward pass
    - `forward_from_projected_and_residual`: takes the PCA-projected activations and the unexplained residual and does the forward pass
    - `project_avs_to_pca`: takes care of the PCA projection
  - NOTE: you can modify your final output tensor to explain a specific transformation of your output. For example, you can take a weighted sum of the base-pair resolution signal prediction to emphasize high-signal regions.
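As a rough illustration of the two functions you fill in, here is a minimal sketch on a toy convolutional model. The layer sizes, the single-argument signatures, and the softmax-weighted output are assumptions made for this example; the real `Model_Class` in `models.py` may take additional inputs (for example a chromatin tensor), and `forward_from_start` is already provided by the script and is repeated here only to show how the two pieces compose.

```python
import torch
from torch import nn


class Model_Class(nn.Module):
    """Toy sequence model; layer sizes are placeholders, not TPCAV defaults."""

    def __init__(self):
        super().__init__()
        # Layers up to and including the layer selected for TPCAV.
        self.conv = nn.Conv1d(4, 8, kernel_size=5, padding=2)
        # Layers after the selected layer: one prediction per base pair.
        self.head = nn.Conv1d(8, 1, kernel_size=1)

    def forward_until_select_layer(self, seq):
        # Input (batch, 4, len) -> activations of the selected layer.
        return torch.relu(self.conv(seq))

    def resume_forward_from_select_layer(self, acts):
        # Activations -> base-pair resolution profile of shape (batch, len).
        profile = self.head(acts).squeeze(1)
        # Optional output transformation (an assumption for this sketch):
        # a softmax-weighted sum that emphasizes high-signal positions
        # and yields one scalar per example.
        weights = torch.softmax(profile, dim=-1)
        return (weights * profile).sum(dim=-1)

    def forward_from_start(self, seq):
        # Shown for context only; the script already defines this function.
        return self.resume_forward_from_select_layer(
            self.forward_until_select_layer(seq)
        )


model = Model_Class().eval()
seq = torch.zeros(2, 4, 64)
seq[:, 0, :] = 1.0  # two all-A one-hot sequences
with torch.no_grad():
    acts = model.forward_until_select_layer(seq)
    out = model.forward_from_start(seq)
```

The key property is that running the two halves back to back gives the same result as a full forward pass, so the script can intervene on the activations in between.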
- Function `load_model` in `utils.py`
  - Take care of the model initialization and load the saved parameters in `load_model`, then return the model instance.
  - NOTE: you need to use your own model class definition in `models.py`, as the pipeline needs the functions defined in the model class step above.
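A minimal sketch of what `load_model` could look like, assuming your checkpoint is a `state_dict` saved with `torch.save`. The `checkpoint_path` and `device` parameters and the stand-in `Model_Class` are hypothetical; keep whatever signature `utils.py` already defines and import your real class from `models.py`.

```python
import os
import tempfile

import torch
from torch import nn


class Model_Class(nn.Module):
    """Stand-in for the class defined in models.py."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(4, 8, kernel_size=5, padding=2)


def load_model(checkpoint_path, device="cpu"):
    # Hypothetical signature: build the model, then restore saved parameters.
    model = Model_Class()
    state_dict = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(state_dict)
    # Evaluation mode disables dropout and batch-norm statistic updates.
    return model.to(device).eval()


# Round trip with a throwaway checkpoint to show the assumed file format.
reference = Model_Class()
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_checkpoint.pt")
    torch.save(reference.state_dict(), path)
    model = load_model(path)
```

If your checkpoint stores a whole pickled model or wraps the parameters in a larger dictionary, the loading lines need to be adapted accordingly.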
- Function `seq_transform_fn` in `utils.py`
  - By default the dataloader provides a one-hot encoded DNA array of shape (batch_size, 4, len), with channels in the order [A, C, G, T]. If your model takes a different kind of input, modify `seq_transform_fn` to transform the input.
- Function `chrom_transform_fn` in `utils.py`
  - By default the dataloader provides a signal array from bigwig files of shape (batch_size, # bigwigs, len). If your model takes a different kind of chromatin input, modify `chrom_transform_fn` to transform the input. If your model is sequence-only, leave it to return None.
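The two input transforms can be sketched as follows. This assumes the dataloader yields `torch` tensors and that each function receives one batch and returns the transformed batch; the target layout (batch, len, 4) with channel order [A, G, C, T] and the `chrom_transform_log1p` variant are hypothetical examples, not requirements of the pipeline.

```python
import torch


def seq_transform_fn(seq):
    # Default input: one-hot DNA, shape (batch, 4, len), channels [A, C, G, T].
    # Hypothetical target model: shape (batch, len, 4), channels [A, G, C, T].
    reordered = seq[:, [0, 2, 1, 3], :]
    return reordered.permute(0, 2, 1).contiguous()


def chrom_transform_fn(chrom):
    # Sequence-only model: no chromatin input is passed on.
    return None


def chrom_transform_log1p(chrom):
    # Hypothetical alternative for a model trained on log-scaled signal,
    # input shape (batch, n_bigwigs, len).
    return torch.log1p(chrom)


# Encode the 3-bp sequence "CGT" in the default layout and transform it.
seq = torch.zeros(1, 4, 3)
seq[0, 1, 0] = 1.0  # C at position 0
seq[0, 2, 1] = 1.0  # G at position 1
seq[0, 3, 2] = 1.0  # T at position 2
out = seq_transform_fn(seq)
```

After the transform, each position is a row of four channel values, and C now sits at channel index 2 because of the reordering.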
- Compute CAVs on your model, example command:

      srun -n1 -c8 --gres=gpu:1 --mem=128G python scripts/run_tcav_sgd_pca.py \
      cavs_test 1024 data/hg19.fa data/hg19.fa.fai \
      --meme-motifs data/motif-clustering-v2.1beta_consensus_pwms.test.meme \
      --bed-chrom-concepts data/ENCODE_DNase_peaks.bed

- Then compute the layer attributions, example command:

      srun -n1 -c8 --gres=gpu:1 --mem=128G \
      python scripts/compute_layer_attrs_only.py cavs_test/tpcav_model.pt \
      data/ChIPseq.H1-hESC.MAX.conservative.all.shuf1k.narrowPeak \
      1024 data/hg19.fa data/hg19.fa.fai cavs_test/test

- Run the Jupyter notebook to generate a summary of your results:

      papermill -f scripts/compute_tcav_v2_pwm.example.yaml scripts/compute_tcav_v2_pwm.py.ipynb cavs_test/tcav_report.py.ipynb