Before selecting the cutoffs
Modifications for the current code:
- Add hits within the same genome and remove duplicates (e.g. the same BGC present in both NCBI and a personal database);
- Improve subclustering (matrix rules?) in order to remove multiple self-loops;
- Refine cluster borders by removing unique hypothetical genes;
New additions to the code:
- Calculate the average Jaccard index of the domains between all gene clusters in the network, creating a second similarity matrix. Then use DBSCAN to separate the gene clusters into groups;
- Output the network as an interactive chart (just like Numbers does), named the calibration graph, showing how the networks change across a range of cutoffs, highlighting families of "gold standard" BGCs (just like an "internal standard") and using the second DBSCAN groups to color nodes; PS: only include edges for biosynthetic or hypothetical (uncolored) genes
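The second similarity matrix could be built along these lines. This is only a minimal sketch: it assumes each gene cluster is represented as a set of domain identifiers and that one Jaccard index is computed per pair of clusters. The names `jaccard_index`, `jaccard_matrix`, and `domains_by_bgc`, as well as the example domain IDs, are hypothetical and not taken from the current code.

```python
from itertools import combinations


def jaccard_index(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_matrix(domains_by_bgc):
    """Return (names, matrix) with the pairwise Jaccard index of domain sets.

    The diagonal is fixed at 1.0 (a cluster compared with itself).
    """
    names = sorted(domains_by_bgc)
    n = len(names)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        score = jaccard_index(domains_by_bgc[names[i]], domains_by_bgc[names[j]])
        sim[i][j] = sim[j][i] = score
    return names, sim


# Hypothetical example data: three gene clusters and their domain sets.
domains_by_bgc = {
    "BGC_A": {"PF00109", "PF02801", "PF00550"},
    "BGC_B": {"PF00109", "PF02801", "PF00698"},
    "BGC_C": {"PF00501", "PF00668"},
}

names, sim = jaccard_matrix(domains_by_bgc)

# DBSCAN works on distances, so convert similarity to distance.
dist = [[1.0 - s for s in row] for row in sim]
```

If scikit-learn is used for the grouping step, a distance matrix like `dist` can be passed to `DBSCAN` with `metric="precomputed"`; the choice of `eps` and `min_samples` is left open here.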
After selecting the cutoffs
- Add a (better) filtering script, where the user specifies the best cutoff they could find using this calibration graph;
- Automatically generate output images (one with and one without regulatory/mobile/resistance genes) for the selected network (using NetworkX?), but also provide Cytoscape output;
- Add multiple gene alignment images upon clicking a family in the output;
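The filtering step could look roughly like the sketch below. It assumes the network is available as a list of `(source, target, score)` edges where a higher score means more similar, and that a family is a connected component of the filtered network; both are assumptions, and the function names are hypothetical. Self-loops are dropped during filtering.

```python
def filter_edges(edges, cutoff):
    """Keep edges scoring at least `cutoff`, dropping self-loops."""
    return [(u, v, s) for u, v, s in edges if s >= cutoff and u != v]


def families(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, _ in edges:
        parent[find(u)] = find(v)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical example network.
nodes = ["bgc1", "bgc2", "bgc3", "bgc4"]
edges = [
    ("bgc1", "bgc2", 0.9),
    ("bgc2", "bgc3", 0.6),
    ("bgc3", "bgc4", 0.3),
    ("bgc1", "bgc1", 1.0),  # self-loop, removed by filter_edges
]

# Scanning a range of cutoffs mimics what the calibration graph would show.
for cutoff in (0.2, 0.5, 0.8):
    print(cutoff, families(nodes, filter_edges(edges, cutoff)))
```

Once a cutoff is chosen, the same filtered edge list could be handed to whatever is used for drawing (NetworkX, Cytoscape export).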
For the future
- Run the analysis on multiple samples (multiCOMPASS module?). Suggestion: run the analysis for the genome with the most BGCs, then loop until all BGCs from the query are in the final network.
Remaining challenges
- How to improve subclustering rules?
- How to better select the best DBSCAN subclustering iteration?
- How to select the best eps for the Jaccard index DBSCAN?
- How to run multiple samples with different subclustering?
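For the eps question, one common starting point is the k-distance heuristic: compute each point's distance to its k-th nearest neighbour, sort those values, and look for the point where they jump. The sketch below assumes a precomputed square distance matrix (for example 1 minus the Jaccard index). Picking the value just before the largest jump is a crude stand-in for inspecting the plot by eye, not an established rule, and the function names and example matrix are hypothetical.

```python
def k_distances(dist, k):
    """Sorted distance from every point to its k-th nearest neighbour."""
    if not 1 <= k <= len(dist) - 1:
        raise ValueError("k must be between 1 and n - 1")
    out = []
    for i, row in enumerate(dist):
        others = sorted(d for j, d in enumerate(row) if j != i)
        out.append(others[k - 1])
    return sorted(out)


def suggest_eps(kdist):
    """Return the k-distance just before the largest jump (crude heuristic)."""
    if len(kdist) < 2:
        raise ValueError("need at least two k-distances")
    gaps = [b - a for a, b in zip(kdist, kdist[1:])]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    return kdist[i]


# Hypothetical distance matrix: three close gene clusters and one outlier.
dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.15, 0.8],
    [0.2, 0.15, 0.0, 0.85],
    [0.9, 0.8, 0.85, 0.0],
]

kdist = k_distances(dist, k=2)
eps_candidate = suggest_eps(kdist)
```

Here `k` plays the role of DBSCAN's minimum neighbourhood size; the candidate would still need to be checked against the gold standard BGC families.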