-
-
Notifications
You must be signed in to change notification settings - Fork 12
Open
Open
Copy link
Labels
Milestone
Description
We currently extracting raw text from PDFs, which works fine for straightforward documents but struggles with slide decks, diagrams, tables, and other visually rich content. To deliver a more comprehensive retrieval-augmented generation (RAG) experience, let’s augment our pipeline by:
1. Dual Extraction Strategy
- Text Layer: Continue using pdfminer (via plumpdf) to grab the “machine” text.
- Image Layer: Convert each page into an image (pdf2image) and feed it into GPT-4o’s vision mode to describe diagrams, tables, and layout nuances.
2. Integration Outline
- Add a preprocessing step that checks page complexity (e.g., presence of images or tables).
- Route simple pages through text-only extraction; route complex pages through GPT-4o image analysis.
- Merge the two outputs into a unified chunked corpus (e.g., interleave or concatenate based on page order).
3. Benefits & Use Cases
- Slide Decks: Extract slide titles, bullet points, and visuals in context.
- Exported Web Pages: Preserve both textual and graphical insights (e.g., charts).
- Enhanced QA: Users querying on “How does the process flow?” will get both narrative and diagrammatic explanations.