InfereClaDR #68
base: master
…d Standard. This significantly reduces run time for yeast and gives the same results in terms of calculated AUPR (which is how I estimate RNA half-lives).
…en GS is split into prior and validation set, because in such a case, good predictions that are artificially removed from the validation set should be put at the end of the list of confidence scores. Although maybe the better way to do this would not be to copy/paste and modify that function, but to add an extra small function in the original function, and only change that small function.
… only need to replace the filtering step (creating filtered gold standard and confidences) the Gold Standard split class
…as well as the gene names for the 5 gene clusters for yeast. note that also expression data cluster files are required, but they probably will not fit on the GitHub server so they would have to be downloaded separately
… (previously it was only by condition cluster); started writing code that uses the xarray module to make a 4D DataArray of AUPRs as a function of condition and gene cluster, random seed, and tau. Next step is plotting AUPR as a function of tau for each bicluster and seed, and recording the optimal (median across seeds of max across taus) tau for each bicluster.
… every prior resample, and predicting the optimal half-life for each bicluster by taking the median across prior resamples
…workflow, which added the prior-splitting class, which is already added in the inferecladr class in the first place
…t can be modified by other versions; i.e. for the InfereCLaDR to have tau as a vector instead of a number
… compute_response_variable
…ld standard) to the gene clusters
…) from inside run() into general variables of class BBSR_TFA_Workflow, so that they can be modified in child classes (in particular, with the different calculation of tau)
…_TFA_Workflow(). Instead, I initiate different instances of BBSR_TFA_Workflow(), and one of them is modified to be with the GS_split. Also added PythonDRDriver_with_tau_vector that inherits from PythonDRDriver but allows tau to be a vector. Also now there is a run() function that first optimizes taus and then uses those predicted taus for a full run on each condition cluster
… process initiates with workflow.run()
…rices that accumulated at the end of every iteration of the loop in optimize_taus(). also removed an empty dimension that caused if you run this for one condition cluster or one gene cluster
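The tau-selection rule mentioned in the commit messages above (median across prior resamples of the best tau for each bicluster) can be sketched in plain Python. This is only an illustration, not the code in this PR: the PR uses an xarray DataArray, whereas this sketch uses nested dicts, and the function name, data layout, and AUPR numbers are all made up. It also assumes that "max across taus" means "the tau at which AUPR is highest".

```python
from statistics import median


def optimal_tau_per_bicluster(aupr, taus):
    """Hypothetical helper, not part of the Inferelator.

    aupr[bicluster][seed] is a list of AUPR values, one per entry of `taus`.
    Returns {bicluster: median over seeds of the tau that maximizes AUPR}.
    """
    best = {}
    for bicluster, by_seed in aupr.items():
        # For each prior resample (seed), find the tau with the highest AUPR.
        argmax_taus = [
            taus[max(range(len(taus)), key=lambda i: curve[i])]
            for curve in by_seed.values()
        ]
        # Summarize across resamples with the median.
        best[bicluster] = median(argmax_taus)
    return best


# Illustrative values only (not real results).
taus = [5, 10, 20, 40]
aupr = {
    ("cond1", "gene1"): {
        0: [0.10, 0.30, 0.20, 0.10],  # best tau: 10
        1: [0.10, 0.20, 0.30, 0.10],  # best tau: 20
        2: [0.10, 0.25, 0.35, 0.10],  # best tau: 20
    },
    ("cond1", "gene2"): {
        0: [0.40, 0.20, 0.10, 0.10],  # best tau: 5
        1: [0.10, 0.20, 0.10, 0.10],  # best tau: 10
        2: [0.30, 0.20, 0.10, 0.10],  # best tau: 5
    },
}
result = optimal_tau_per_bicluster(aupr, taus)
print(result)
```

Using the median rather than the mean keeps a single unusual prior resample from shifting the chosen tau.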
I finally finished running each cluster on the NYU HPC, but because of the KVS memory accumulation bug, I had to run each condition cluster separately, and was only able to go up to around 10 prior resamples (ideally it would be 20), 15 different taus (ideally it would be 17), and 30 expression data bootstraps (ideally it would be 50) for each of them. But the good news is that the bicluster-specific AUPR-vs-half-life curves, as well as the optimal taus for each bicluster, closely match the results I got in my InfereCLaDR paper using R. This means that the code was implemented correctly in principle and that the original predictions were robust. So now the main goal should be eliminating the memory accumulation bug and hopefully adding some unit tests.
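To make the accumulation pattern concrete, here is a toy sketch. `ToyStore` is a hypothetical in-memory stand-in, not the real KVS client, and its `put`/`view`/`get` semantics (view leaves a value in place, get removes it) are assumptions made for illustration. The per-iteration key names are also invented. It only shows why values that are written and viewed but never removed grow with the number of loop iterations, and one possible way to bound that growth.

```python
class ToyStore:
    """Toy key-value store for illustration; NOT the real KVS client."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def view(self, key):
        # Read without removing (assumed semantics).
        return self._data[key]

    def get(self, key):
        # Read and remove (assumed semantics).
        return self._data.pop(key)

    def size(self):
        return len(self._data)


def run(n_iterations, cleanup):
    """Simulate nested tau/seed loops sharing one matrix per iteration."""
    store = ToyStore()
    for i in range(n_iterations):
        key = f"mi_clr_{i}"           # hypothetical per-iteration key
        store.put(key, [[float(i)]])  # rank 0 publishes the matrix
        _ = store.view(key)           # other workers read it
        if cleanup:
            # In a real multi-worker run this would need to happen only
            # after every worker has read the value.
            store.get(key)
    return store.size()


leaked = run(17 * 20, cleanup=False)   # 17 taus x 20 seeds
cleaned = run(17 * 20, cleanup=True)
print(leaked, cleaned)
```

Whether an explicit removal like this is the right fix for the Inferelator depends on how its workers are synchronized, which this sketch does not model.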
This code now runs front to back for me on my yeast data set. It creates a version of the Inferelator that first fits optimal RNA half-lives for each gene cluster and condition cluster (the clusters currently have to be supplied by the user) and then uses the vector of predicted taus to predict a network of interactions for each condition cluster.
This is the first version that actually works, so feel free to look over the changes and even try running it yourself if you want. But there are still many issues that I need to address before it can be merged into master:
Because we use kvs.view to share mi_clr info from the rank==0 worker with the others, kvs.put keeps adding more and more mi_clr matrices to each worker. Because I have a few nested loops, I would never be able to do 17 taus, 20 random seeds, 50 bootstraps (although the number of bootstraps does not affect this), and 4 condition clusters, because the mi_clr variables accumulate. I think this is similar to the issue that one of Nick's pull requests tries to address, but I am confused about that one.
I need to add a last step where I merge the final predicted networks from all four clusters (with already optimized taus), and put all of the outputs in one folder (currently there is one folder for the output of the optimization step, and one folder for each condition cluster).
I need to make sure that I get the same results I got using R (Tchourine et al. 2018), because currently the results look different, and I think this might be because the leave-out set of the prior is handled incorrectly, or something like that.
I need to add unit tests for the new code (or maybe someone will later).
I need to figure out a way to make it easy to download the expression data for each condition cluster, since currently the expression files for the yeast dataset are too big to host on GitHub. But that information is also contained in the metadata, so one can just subset the full expression dataset using the conditions in each condition-cluster-specific metadata file.
Eventually the goal is to merge the last step with Dayanne's code. This should only require inheriting from the InfereCLaDR class and rewriting one function (run()).
Note that I'm using one more module that has not been used in the Inferelator before: xarray. So that probably needs to be added to the file that downloads all the necessary modules (which would be necessary for Travis to pass, for example)
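The subsetting idea from the list above (taking the full expression dataset and keeping only the conditions listed in a cluster-specific metadata file) could look roughly like the sketch below. The file layout is an assumption: a tab-separated expression table with genes as rows and conditions as columns, and a metadata table with a `condName` column. The function name and the tiny example tables are hypothetical and the values are made up.

```python
import csv
import io


def subset_expression(expression_tsv, meta_tsv, cond_column="condName"):
    """Hypothetical helper: keep only the expression columns whose
    condition names appear in the metadata table."""
    meta_rows = csv.DictReader(io.StringIO(meta_tsv), delimiter="\t")
    wanted = {row[cond_column] for row in meta_rows}

    reader = csv.reader(io.StringIO(expression_tsv), delimiter="\t")
    header = next(reader)
    # Always keep column 0 (gene names), then the matching conditions.
    keep = [0] + [i for i, name in enumerate(header) if i > 0 and name in wanted]
    return [[row[i] for i in keep] for row in [header] + list(reader)]


# Made-up example data.
expression = "gene\tc1\tc2\tc3\nYAL001C\t1.0\t2.0\t3.0\nYAL002W\t4.0\t5.0\t6.0\n"
cluster_meta = "condName\tisTs\nc1\tFALSE\nc3\tFALSE\n"

subset = subset_expression(expression, cluster_meta)
for row in subset:
    print("\t".join(row))
```

With something like this, only the full expression file and the small per-cluster metadata files would need to be distributed.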