Merge Stats Refactor and Error Processing Fixes #21

In the middle of last week, I came across some errors in the stats output and error processing code on the `qa_workflow` side.

Specifically, some code had been written around how autocrop QA handles output from the autocropper and needed to be adapted to how line extraction QA handles output from watershed line extraction. For watershed line extraction QA, the workflow is 'clear', 'run', 'collate' (specifically 'collate_errors'), and then 'output_stats'. The last two steps are reversed in autocropping QA. Revised error handling and a new stats merging function were necessary to produce the outputs we actually want from line extraction QA. Part of this work also readies line extraction QA (`qa_line_extraction.py`) for the upcoming new line extraction method.
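
As a rough illustration of the ordering difference described above, here is a minimal Python sketch. The step names come from this description; the constants, `run_steps`, and `errors_collated_before_stats` are hypothetical helpers for illustration, not the actual `qa_workflow` code.

```python
from typing import Callable, Dict, List

# Step order for watershed line extraction QA, as described above.
LINE_EXTRACTION_STEPS = ["clear", "run", "collate_errors", "output_stats"]
# Autocrop QA reverses the last two steps (assumed to use the same step names).
AUTOCROP_STEPS = ["clear", "run", "output_stats", "collate_errors"]


def errors_collated_before_stats(steps: List[str]) -> bool:
    """Return True if error collation runs before stats output."""
    return steps.index("collate_errors") < steps.index("output_stats")


def run_steps(steps: List[str], handlers: Dict[str, Callable[[], None]]) -> List[str]:
    """Call each step's handler in order and return the steps that completed."""
    completed = []
    for step in steps:
        handlers[step]()  # hypothetical handler; raises KeyError if a step is unknown
        completed.append(step)
    return completed
```

A check like `errors_collated_before_stats` could guard `output_stats` in line extraction QA, since the stats merge there depends on the collated errors already existing.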
 
Below is a checklist of the work needed for this and what has been done so far.

- [X] Refactor helper functions for line extraction QA's `output_stats` for computing metrics and tallying stats across books and pages.
- [X] Make sure this functionality is separated out as watershed-specific code, in preparation for the new line extraction method
- [X] Make sure error files are being properly written out by `run_dhsegment.py` and `watershed_line_extraction.py`
	- [X] Change their write location to a new, common folder such as 'qa_results' inside the input book directory
	- NOTES:
		- dhsegment error source: tbraddyll_R4267_duke_8_essaytoheraldry1684
		- watershed error source: tbraddyll_R4267_duke_8_essaytoheraldry1684
		- error output files, both written to `<book_directory>/qa_results`:
			- `le_dhsegment_errors_<run_uuid>.txt`
			- `le_watershed_errors_<run_uuid>.txt`

- [X] Make sure the two error files per book are being merged into one file
	- [X] Make sure error files are all read from the new 'qa_results' folder inside the input book directory
- [X] Write a new function `__correlate_errors_watershed_merge_all()` in `qa_line_extraction.py` to put all errors into one CSV file
	- [X] This should be called before `output_stats()`, just as `correlate_errors()` is already called before `output_stats()`
- [X] Reuse the necessary code from the `collate_results` functions in the merge stats function, then remove `collate_results` and its helpers
- [ ] Make sure errors are properly tallied for book-level and run-level results files
	- NOTES:
		- anon_R11260_wellcome_4_generalhistoryair1692 has no stats file; it errors out during watershed
		- tbraddyll has significant traceback errors
- [X] Put the outputs (book-level results, run-level summary results, and errors) into one Google Sheets file
- [X] Add a new config YAML variable, `ERROR_RUN_UUID`, for use in the `__tally_stats_for_book_watershed` function. This is a temporary way to coordinate runs of the QA script between 'collate_errors' and 'output_stats': it lets the script find the appropriate error file that watershed line extraction writes to the `qa_results` directory
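
The per-book error merge described in the checklist could look roughly like the sketch below. It is not the code in `qa_line_extraction.py`: the function name `merge_error_files`, the output file name, the CSV columns, and the one-error-per-line file format are all assumptions made for illustration (multi-line tracebacks would need different parsing). The `run_uuid` argument stands in for the value of `ERROR_RUN_UUID` from the config.

```python
import csv
from pathlib import Path

# The two stages that write error files into <book_directory>/qa_results.
ERROR_SOURCES = ("dhsegment", "watershed")


def merge_error_files(book_dir, run_uuid, out_name="le_errors_merged.csv"):
    """Merge a book's dhsegment and watershed error files into one CSV.

    Hypothetical helper: assumes each non-blank line in an error file is
    one error, and that a missing file means the stage logged no errors.
    """
    book_dir = Path(book_dir)
    qa_dir = book_dir / "qa_results"
    qa_dir.mkdir(parents=True, exist_ok=True)

    rows = []
    for source in ERROR_SOURCES:
        error_file = qa_dir / f"le_{source}_errors_{run_uuid}.txt"
        if not error_file.exists():
            continue  # assumption: no file means no errors for this stage
        for line in error_file.read_text(encoding="utf-8").splitlines():
            if line.strip():
                rows.append([book_dir.name, source, line.strip()])

    out_path = qa_dir / out_name
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["book", "source", "error"])  # assumed column layout
        writer.writerows(rows)
    return out_path
```

Keeping the source stage as its own column makes it possible to tally dhsegment and watershed errors separately later, at either the book or the run level.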
