Skip to content

Enhance PDF Parsing Pipeline with GPT-4o Image Analysis for Rich Content Extraction #81

@aliamerj

Description

@aliamerj

We currently extracting raw text from PDFs, which works fine for straightforward documents but struggles with slide decks, diagrams, tables, and other visually rich content. To deliver a more comprehensive retrieval-augmented generation (RAG) experience, let’s augment our pipeline by:

1. Dual Extraction Strategy

  • Text Layer: Continue using pdfminer (via plumpdf) to grab the “machine” text.
  • Image Layer: Convert each page into an image (pdf2image) and feed it into GPT-4o’s vision mode to describe diagrams, tables, and layout nuances.

2. Integration Outline

  • Add a preprocessing step that checks page complexity (e.g., presence of images or tables).
  • Route simple pages through text-only extraction; route complex pages through GPT-4o image analysis.
  • Merge the two outputs into a unified chunked corpus (e.g., interleave or concatenate based on page order).

3. Benefits & Use Cases

  • Slide Decks: Extract slide titles, bullet points, and visuals in context.
  • Exported Web Pages: Preserve both textual and graphical insights (e.g., charts).
  • Enhanced QA: Users querying on “How does the process flow?” will get both narrative and diagrammatic explanations.

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions