Parsing prediction tool output #45

Open
HobnobMancer wants to merge 122 commits into master from parsing_prediction_tool_output

@HobnobMancer HobnobMancer commented Nov 9, 2020

Notes:

  • the unit tests still need to be added
  • these functions were developed in the Jupyter notebook 'parsing_prediction_tools_output', which is stored in HTML and .ipynb format in the directory ./notebooks/planning_parsing_prediction_tool_output; that directory also contains the tool output files that were used to develop the new functions
  • no calls to these functions have been added to the main script (predict_cazymes.py) yet

Add the following new functions to pyrewton.cazymes.prediction.parse for standardising and parsing the output from the prediction tools:

  • parse_dbcan_output()
  • add_hotpep_ec_predictions()
  • parse_hmmer_output()
  • parse_hotpep_output()
  • parse_diamond_output()
  • get_dbcan_consensus()
  • parse_cupp_output()
  • parse_ecami_output()

A separate dataframe of the output is created for each of dbCAN, HMMER, Hotpep, DIAMOND, CUPP and eCAMI. The dbCAN dataframe contains the consensus result, defined as all CAZy families that at least 2 tools predict for a query protein sequence.
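The consensus rule can be illustrated with a minimal sketch. The function name `consensus_families`, its signature and the example families are hypothetical and chosen only for illustration; this is not the implementation of get_dbcan_consensus().

```python
from collections import Counter


def consensus_families(*tool_predictions, min_tools=2):
    """Return the CAZy families predicted by at least `min_tools` tools.

    Hypothetical helper: each positional argument is the list of families
    that one tool predicted for the same query protein.
    """
    counts = Counter()
    for families in tool_predictions:
        # Count a family once per tool, even if the tool lists it twice
        counts.update(set(families))
    return sorted(fam for fam, n in counts.items() if n >= min_tools)


# Illustrative predictions for a single protein (made-up values)
hmmer = ["GH5", "CBM1"]
hotpep = ["GH5"]
diamond = ["GH5", "CBM1", "GH3"]

print(consensus_families(hmmer, hotpep, diamond))
```

Here GH3 is dropped because only one tool predicts it, while GH5 and CBM1 are kept.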

For each prediction tool the following data is retrieved:

  • dbCAN: CAZy family, CAZy subfamily (can predict multiple domains per protein)
  • HMMER: CAZy family, CAZy subfamily (can predict multiple domains per protein), domain ranges (the first and last amino acid of each domain)
  • Hotpep: CAZy family, CAZy subfamily (can predict multiple domains per protein)
  • DIAMOND: CAZy family, CAZy subfamily (can predict multiple domains per protein)
  • CUPP: CAZy family, CAZy subfamily, predicted EC number and domain range
  • eCAMI: CAZy family, CAZy subfamily (can predict multiple domains per protein), EC number; the best result is listed under the CAZy family and subfamily headings, and any additional domains are listed under "additional_domains"
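As a rough illustration of how a family and subfamily might be separated from a single tool label, the sketch below assumes labels of the form `GH5` or `GH5_1` (class prefix, family number, optional subfamily number). The helper name `split_family` and the assumed label format are hypothetical; the actual tool outputs may differ.

```python
import re

# CAZy classes: GH, GT, PL, CE, AA and CBM, followed by a family number
# and an optional "_<subfamily number>" suffix (assumed label format).
CAZY_NAME = re.compile(r"(GH|GT|PL|CE|AA|CBM)(\d+)(?:_(\d+))?")


def split_family(name):
    """Return (family, subfamily) for a label, or None if it is malformed.

    The subfamily is None when the label carries no subfamily suffix.
    """
    label = name.strip()
    match = CAZY_NAME.fullmatch(label)
    if match is None:
        return None
    family = match.group(1) + match.group(2)
    subfamily = label if match.group(3) else None
    return family, subfamily


print(split_family("GH5_1"))
print(split_family("CBM1"))
print(split_family("XYZ3"))
```

Returning None for an unrecognised label leaves it to the caller to decide whether to log the irregularity or skip the entry.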

HobnobMancer and others added 5 commits November 3, 2020 13:16
…ut' jupyter notebook

For the Jupyter notebook that contains all the development of these functions, see the directory 'notebooks' within the root of the pyrewton repository

codecov bot commented Nov 9, 2020

Codecov Report

Merging #45 (70891eb) into master (d2e7066) will decrease coverage by 17.06%.
The diff coverage is 1.73%.

@@             Coverage Diff             @@
##           master      #45       +/-   ##
===========================================
- Coverage   93.67%   76.60%   -17.07%     
===========================================
  Files          20       20               
  Lines         759      932      +173     
===========================================
+ Hits          711      714        +3     
- Misses         48      218      +170     

…, statistical evaluation and writing summary reports from main() and then program closes
…voking each prediction tool is invoked by its own function
Add quality checking to the retrieval of domain ranges. Improve the retrieval of EC numbers by checking whether multiple are given and how they are given, then collecting all EC numbers and separating them with ', '. Factor out the many additional tasks into separate functions
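A minimal sketch of collecting several EC numbers into one ', '-separated string follows. The helper name `collect_ec_numbers` and the set of input delimiters ('|', ';', ',' and whitespace) are assumptions for illustration; the delimiters the tools actually emit are not stated here.

```python
import re


def collect_ec_numbers(raw):
    """Split a raw EC field on assumed delimiters and join with ', '.

    Duplicates are dropped while the original order is preserved.
    """
    parts = re.split(r"[|;,\s]+", raw.strip())
    seen = []
    for part in parts:
        if part and part not in seen:
            seen.append(part)
    return ", ".join(seen)


print(collect_ec_numbers("3.2.1.4|3.2.1.91"))
print(collect_ec_numbers("3.2.1.4; 3.2.1.4"))
```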
HobnobMancer and others added 30 commits November 25, 2020 09:35
Add quality checking: check that EC numbers are formatted correctly, and standardise EC numbers so that missing digits are represented by '-'. Standardise the domain range so that ranges are separated by '..'. Add checking of CAZy family and subfamily names. Log any irregularities
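The two standardisation rules in this commit message could be sketched as below. The function names `standardise_ec` and `standardise_range`, the accepted input variants (an optional "EC" prefix, '*' as a placeholder, any separator between the two range positions) and the choice to return None for malformed values are all assumptions, not the code in this pull request.

```python
import re


def standardise_ec(raw):
    """Return a four-field EC number with '-' for missing digits, or None."""
    text = raw.strip()
    if text.upper().startswith("EC"):
        text = text[2:].lstrip(" :")
    fields = text.split(".") if text else []
    if not fields or len(fields) > 4:
        return None
    cleaned = []
    for field in fields:
        if re.fullmatch(r"\d+", field):
            cleaned.append(field)
        elif field in ("", "-", "*"):
            cleaned.append("-")  # placeholder for a missing digit
        else:
            return None  # unexpected characters: report as malformed
    cleaned += ["-"] * (4 - len(cleaned))  # pad to four fields
    return ".".join(cleaned)


def standardise_range(raw):
    """Return a domain range as 'start..end', or None if it is malformed."""
    numbers = re.findall(r"\d+", raw)
    if len(numbers) != 2:
        return None
    start, end = int(numbers[0]), int(numbers[1])
    if start > end:
        return None
    return f"{start}..{end}"


print(standardise_ec("3.2.1"))
print(standardise_range("12-345"))
```

A caller could log each None result as an irregularity rather than raising, so that one malformed entry does not stop the parsing of a whole output file.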