Newspaper-OCR-LLM

Uses LLM to iterate through a collection of scanned pages of historical multi-column newspapers and OCR the output to extract: articles and headline, along with some additional metadata. It returns a collection of sticles identifies on each page in JSON format.

What it does

Reads scanned page images (JPG) from an input directory
Sends them to an OpenAI model (e.g. o4-mini-2025-04-16) with a structured prompt optimised externally on OpenAI Playground for layout‑aware OCR
Returns per‑page JSON with:
- articles[] (title/headline, content/article body)
- page/issue metadata
Writes outputs to an output directory (one JSON file per scanned page image)

How to run

Regiter and get an OpenAI API Key;
Configure input and output folders;
Add scanned page images as JPG;
Run the Cell on Jupyter notebook.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
extractions		extractions
pages		pages
LICENSE		LICENSE
OpenAI-API.ipynb		OpenAI-API.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Newspaper-OCR-LLM

What it does

How to run

About

Uh oh!

Releases

Packages

Languages

License

kstepanyan/Newspaper-OCR-LLM

Folders and files

Latest commit

History

Repository files navigation

Newspaper-OCR-LLM

What it does

How to run

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages