Uses LLM to iterate through a collection of scanned pages of historical multi-column newspapers and OCR the output to extract: articles and headline, along with some additional metadata. It returns a collection of sticles identifies on each page in JSON format.
- Reads scanned page images (JPG) from an input directory
- Sends them to an OpenAI model (e.g. o4-mini-2025-04-16) with a structured prompt optimised externally on OpenAI Playground for layout‑aware OCR
- Returns per‑page JSON with:
articles[](title/headline, content/article body)- page/issue metadata
- Writes outputs to an output directory (one JSON file per scanned page image)
- Regiter and get an OpenAI API Key;
- Configure input and output folders;
- Add scanned page images as JPG;
- Run the Cell on Jupyter notebook.
