MetaMini_KG_Automation

This automation pipeline cleans node synonymization mapping results and produces basic performance statistics for them.

Note: This pipeline has been decommissioned due to changes in project direction and in the data schema. The input microbiome knowledge graph (MicrobiomeKG) was provided by the Glusman Lab at the Institute for Systems Biology (ISB).

Modify `config.yml` File If Needed

Before executing the pipeline, modify the global variables in the `config.yml` file as needed. The required variables are listed below:

  MICROBIOME_KG_VERSION: 'v0.2.1' # change to correct version of MicrobiomeKG data
  MICROBIOME_KG_DOWNLOAD_URL: 'https://db.systemsbiology.net/gestalt/KG/' # if web address is changed, input correct link
  NODE_SYN_DIR:  # input correct RTX Node Synonymizer path (git repo clone)
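
Since `NODE_SYN_DIR` is blank in the template above, it can help to check the parsed configuration before launching the pipeline. The following is a minimal sketch, not part of the repository: `missing_config_keys` is a hypothetical helper, and it assumes the config has already been parsed into a plain dictionary (for example by Snakemake's `configfile` directive).

```python
# Hypothetical pre-flight check; the key names come from the config listing above.
REQUIRED_KEYS = (
    "MICROBIOME_KG_VERSION",
    "MICROBIOME_KG_DOWNLOAD_URL",
    "NODE_SYN_DIR",
)


def missing_config_keys(config):
    """Return the required keys that are absent or empty in a parsed config mapping."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Example mirroring the template, where NODE_SYN_DIR has not been filled in yet.
example_config = {
    "MICROBIOME_KG_VERSION": "v0.2.1",
    "MICROBIOME_KG_DOWNLOAD_URL": "https://db.systemsbiology.net/gestalt/KG/",
    "NODE_SYN_DIR": None,
}

print(missing_config_keys(example_config))
```

An empty list means all three variables are set; any names returned still need a value.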

Schematic

The basic schematic of the Snakemake pipeline is as follows:

Step 1: Node Synonymization (Round 1)

This is the first round of node synonymization. Node IDs and/or names are passed through the ARAX Node Synonymizer (version 2.10.1). To obtain the ARAX Node Synonymizer, clone its repository and follow the instructions there. The node ID is tried first; if it yields no mapping result, the node name is tried instead. If neither the ID nor the name yields a result, the node is logged as failed_both. If either the ID or the name is successfully synonymized, the canonical ID and name are stored in a TSV file.
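
The ID-first, name-second fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's script: `synonymize_node` and the `lookup` callable are hypothetical stand-ins, and the real Node Synonymizer interface is not shown here.

```python
def synonymize_node(node_id, node_name, lookup):
    """Try the node ID first, then the node name; report failed_both if neither maps.

    `lookup` is a hypothetical callable that takes a term and returns a
    (canonical_id, canonical_name) tuple, or None when there is no mapping.
    """
    for matched_by, term in (("id", node_id), ("name", node_name)):
        if not term:
            continue  # nothing to look up for this field
        hit = lookup(term)
        if hit is not None:
            canonical_id, canonical_name = hit
            return {
                "status": "mapped",
                "matched_by": matched_by,
                "canonical_id": canonical_id,
                "canonical_name": canonical_name,
            }
    return {"status": "failed_both", "node_id": node_id, "node_name": node_name}


# Toy lookup table standing in for the synonymizer (illustrative values only).
toy_table = {"escherichia coli": ("NCBITaxon:562", "Escherichia coli")}


def toy_lookup(term):
    return toy_table.get(term.lower())


by_name = synonymize_node("UNKNOWN:1", "Escherichia coli", toy_lookup)
failed = synonymize_node("UNKNOWN:2", "not a real organism", toy_lookup)
print(by_name["matched_by"], failed["status"])
```

Recording which field produced the match keeps the later failure breakdown straightforward.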

Step 2: Performance Statistics

This step produces basic performance statistics for the previous mapping results: a summary file (NS_performance_summary.txt) and CSV files that break down the failures for later detailed analysis (wrong_name, no_name, curie_no_mapping, name_no_mapping).
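
As a rough illustration of this kind of summary, the sketch below tallies per-node outcome labels and reports the mapped fraction. It is an assumption-laden example: `summarize_outcomes` and the `"mapped"` label are hypothetical, and the actual contents of NS_performance_summary.txt are not reproduced here. The failure labels reuse the category names listed above.

```python
from collections import Counter


def summarize_outcomes(outcomes):
    """Count outcome labels and compute the fraction of nodes that mapped."""
    counts = Counter(outcomes)
    total = len(outcomes)
    mapped = counts.get("mapped", 0)
    return {
        "total": total,
        "mapped": mapped,
        "mapped_fraction": mapped / total if total else 0.0,
        # Everything other than "mapped" is treated as a failure category.
        "failures": {label: n for label, n in counts.items() if label != "mapped"},
    }


# Illustrative labels only; these are not real pipeline results.
summary = summarize_outcomes(["mapped", "mapped", "mapped", "no_name"])
print(summary)
```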

Step 3: Post Processing Step

We found that some failed_both nodes had prefixes that needed to be modified or incorrect gene names, both of which contributed to the node mapping failure rate. We therefore added a data cleaning step (correcting gene names and prefixes, removing strains, etc.) to maximize the mapping hit rate.
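
Two of the cleaning operations mentioned above, prefix correction and strain removal, might look like the sketch below. Everything specific in it is an assumption for illustration: the `PREFIX_FIXES` entries, the strain pattern, and the function names are hypothetical and do not reflect the actual rules used in the repository scripts.

```python
import re

# Hypothetical prefix corrections, keyed by the lowercased prefix.
PREFIX_FIXES = {"ncbitaxon": "NCBITaxon", "uniprot": "UniProtKB"}

# Hypothetical pattern: a trailing "strain X" or "str. X" designation.
STRAIN_SUFFIX = re.compile(r"\s+(?:strain|str\.)\s+\S+$", re.IGNORECASE)


def fix_prefix(curie):
    """Rewrite a known-bad CURIE prefix; leave anything else untouched."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return curie  # not a CURIE, nothing to fix
    return f"{PREFIX_FIXES.get(prefix.lower(), prefix)}:{local_id}"


def strip_strain(name):
    """Drop a trailing strain designation so the species name can be retried."""
    return STRAIN_SUFFIX.sub("", name).strip()


print(fix_prefix("ncbitaxon:562"))
print(strip_strain("Escherichia coli strain K-12"))
```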

Step 4: Failed Both NS

This is Round 2(-ish) of node synonymization. After cleaning the failed_both nodes, we ran them through node synonymization again to obtain mapping results.

The scripts for each step of the Snakemake pipeline are located in the microKG_NS folder in this repo. To execute the pipeline, use the following command:

  snakemake --cores 16 -s MetaMini_KG_pipeline.smk targets


stephwon/MetaMini_KG_Automation
