
added parsing for html images and all supporting code #3

Open
foroveralls wants to merge 1 commit into guardian:main from foroveralls:main

Conversation

@foroveralls

What does this change?

This adds support for HTML image parsing. It has been tested with PNG images but should work with all image formats. The changes have been integrated across the necessary scripts, imageParser.py and ocrImages. A full functionality test still needs to be run.
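For reviewers, here is a rough sketch of the kind of extraction step this refers to. It is not the code in imageParser.py: it assumes the images appear as `<img>` tags and only collects their `src` values, using the standard library `html.parser`. The names `ImgSrcCollector` and `extract_image_sources` are hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_image_sources(html):
    """Return the src values of all <img> tags, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

Each returned value may be a URL, a file path, or a data URI, so a real pipeline would still need to load the image bytes before handing them to OCR.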

Unfortunately, Tesseract is not picking up text from the parsed HTML images. This may be because they have transparent backgrounds; a possible solution is to add a white background before running OCR.
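A minimal sketch of the white-background idea, assuming the images are loaded with Pillow (this PR does not confirm which imaging library ocrImages uses). The function name `flatten_onto_white` is hypothetical. It composites the image over an opaque white canvas and drops the alpha channel.

```python
from PIL import Image


def flatten_onto_white(image):
    """Composite an image with transparency onto an opaque white background.

    Assumption: `image` is a Pillow Image; it is converted to RGBA first so
    that palette or greyscale images with transparency are handled too.
    """
    rgba = image.convert("RGBA")
    background = Image.new("RGBA", rgba.size, (255, 255, 255, 255))
    # alpha_composite places `rgba` over `background`, respecting alpha.
    flattened = Image.alpha_composite(background, rgba)
    # Drop the alpha channel so the OCR engine sees a plain RGB image.
    return flattened.convert("RGB")
```

The flattened image could then be passed to the OCR step in place of the original. Whether this actually fixes the missing text is untested; it only removes transparency as a possible cause.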

