added parsing for html images and all supporting code by foroveralls · Pull Request #3 · guardian/google-ad-database-processing-scripts

foroveralls · 2022-04-26T23:33:35Z

What does this change?

This adds support for HTML image parsing. It has been tested with PNG images but should work with all image formats. The changes have been integrated across the necessary scripts, imageParser.py and ocrImages. A full functionality test still needs to be run.

Unfortunately, Tesseract is not picking up text from the parsed HTML images. This may be because they have transparent backgrounds; a possible solution could involve adding a white background.
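
A minimal sketch of the white-background idea, assuming Pillow is available for image handling. The function names `flatten_onto_white` and `ocr_image` are hypothetical and not part of this PR, and `pytesseract` is only an assumed binding, since the description does not say how Tesseract is invoked. Whether this actually fixes the OCR results is untested.

```python
from PIL import Image


def flatten_onto_white(image: Image.Image) -> Image.Image:
    """Composite an image that may have transparency onto opaque white."""
    # Normalise to RGBA so palette or greyscale images also carry an alpha channel.
    rgba = image.convert("RGBA")
    background = Image.new("RGBA", rgba.size, (255, 255, 255, 255))
    # alpha_composite blends the image over the background using its alpha,
    # so fully transparent pixels become white.
    return Image.alpha_composite(background, rgba).convert("RGB")


def ocr_image(path: str) -> str:
    """Hypothetical helper: flatten the image, then hand it to Tesseract."""
    # Assumption: pytesseract is the binding in use; imported lazily so the
    # flattening step can be used without it.
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(flatten_onto_white(img))
```

If flattening alone does not help, the images may also be too small or too low in contrast for Tesseract, which would need to be checked separately.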

added parsing for html images and all supporting code

5ee9829

foroveralls wants to merge 1 commit into guardian:main from foroveralls:main
