Merge Stats Refactor and Error Processing Fixes #21

@jarmoza

Description

In the middle of last week, I came across some errors in the stats output and error processing code on the qa_workflow side.

Specifically, some code needed to be adapted to account for the difference between how autocrop QA handles output from the autocropper and how line extraction QA handles output from the watershed line extraction. For watershed line extraction QA, the workflow is: 'clear', 'run', 'collate' (specifically, 'collate_errors'), and then 'output_stats'. The latter two steps are reversed in autocropping QA. Revised error handling and a new stats merging function were necessary for the outputs we actually want from line extraction QA. Part of this work also includes readying line extraction QA (qa_line_extraction.py) for the upcoming new line extraction method.
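
To make the ordering difference concrete, here is a minimal sketch of the two step sequences. The step tables and the `run_workflow` helper are hypothetical names used only for illustration, and the autocrop step names are assumed to match the line extraction ones; the actual qa_workflow code may be organized differently.

```python
# Hypothetical step tables. Only the ordering is taken from the description
# above; the autocrop step names are assumed to mirror line extraction's.
LINE_EXTRACTION_STEPS = ["clear", "run", "collate_errors", "output_stats"]
AUTOCROP_STEPS = ["clear", "run", "output_stats", "collate_errors"]


def run_workflow(steps, handlers):
    """Call the handler for each step in order and return the names that ran."""
    completed = []
    for name in steps:
        handlers[name]()  # a missing handler raises KeyError, which is intended
        completed.append(name)
    return completed
```

With this shape, the line extraction run always has its error files collated before any stats are tallied, which is the property the new merge function relies on.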

Below is a checklist of the work that is necessary/has been done for this so far.

  • Refactor helper functions for line extraction QA's output_stats for computing metrics and tallying stats across books and pages.

  • Make sure this functionality is separated out as code for watershed line extraction in preparation for the new line extraction method

  • Make sure error files are being properly written out by run_dhsegment.py and watershed_line_extraction.py

    • Change their write location to a new, common folder like 'qa_results' inside the input book directory
      NOTES: dhsegment error source: tbraddyll_R4267_duke_8_essaytoheraldry1684
      watershed error source: tbraddyll_R4267_duke_8_essaytoheraldry1684
      dhsegment error output: le_dhsegment_errors_<run_uuid>.txt in <book_directory>/qa_results
      watershed error output: le_watershed_errors_<run_uuid>.txt in <book_directory>/qa_results
  • Make sure error files per book are being merged from two files into one file

    • Make sure error files are all read from the new 'qa_results' folder inside the input book directory
  • Write new function __correlate_errors_watershed_merge_all() in qa_line_extraction.py to put all errors into one csv file

    • This should be called before output_stats() - just as correlate_errors() is already called before output_stats()
  • Sample necessary code from collate_results functions for merge stats function and then remove collate_results and its helpers

  • Make sure errors are properly tallied for book and run level results files
    NOTES: anon_R11260_wellcome_4_generalhistoryair1692 has no stats file - it errors out during watershed
    tbraddyll has significant traceback errors

  • Take the outputs (book-level results, run-level summary results, and errors) and put them into one Google Sheets file

  • To temporarily coordinate runs of the QA script between 'collate_errors' and 'output_stats', a new config YAML variable, ERROR_RUN_UUID, was added for use in the __tally_stats_for_book_watershed function. This allows the script to find the appropriate error file in the qa_results directory output by watershed line extraction
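
As a rough illustration of the error file items in the checklist, the sketch below locates the two per-book error files in 'qa_results' and merges them into one CSV. The helper names, the CSV columns, and the choice of one row per error file are assumptions made for this example; this is not the actual `__correlate_errors_watershed_merge_all()` implementation. The `run_uuid` argument stands in for the value that would come from ERROR_RUN_UUID in the config.

```python
import csv
from pathlib import Path

QA_RESULTS_DIRNAME = "qa_results"  # common folder inside each input book directory


def error_file_paths(book_dir, run_uuid):
    """Return the expected dhsegment and watershed error file paths for one book."""
    qa_dir = Path(book_dir) / QA_RESULTS_DIRNAME
    return [
        qa_dir / f"le_dhsegment_errors_{run_uuid}.txt",
        qa_dir / f"le_watershed_errors_{run_uuid}.txt",
    ]


def merge_errors_to_csv(book_dirs, run_uuid, out_csv):
    """Merge all non-empty error files into one CSV; return the number of rows."""
    rows = []
    for book_dir in book_dirs:
        for path in error_file_paths(book_dir, run_uuid):
            if not path.exists():
                continue  # a book may have failed in only one stage, or not at all
            text = path.read_text(encoding="utf-8").strip()
            if not text:
                continue  # skip error files that were created but left empty
            source = path.name.split("_")[1]  # "dhsegment" or "watershed"
            # Assumption: one row per file, keeping the full traceback text together.
            rows.append([Path(book_dir).name, source, text])
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["book", "source", "error"])
        writer.writerows(rows)
    return len(rows)
```

Keeping each traceback in a single quoted CSV field avoids guessing at the internal format of the error files, and the resulting file can be imported directly into a spreadsheet.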

Labels

bug, fix
