Repository: Raghavan-04/RAG_model


Intelligent Multimodal PDF Summarizer & Captioning Engine

**Project Title:** Intelligent Multimodal PDF Summarizer & Captioning Engine
**Domain:** Natural Language Processing (NLP) & Computer Vision
**Device Support:** MPS (Apple Silicon), CUDA, CPU


💡 Project Summary & Technical Need

This work presents a multimodal information extraction pipeline designed for the automated analysis of complex PDF documents. Unlike traditional summarizers that only process text, this system bridges the gap between visual and textual data by integrating Optical Character Recognition (OCR) principles with Generative AI.

The primary innovation is the Dual-Stream Processing Architecture—simultaneously extracting raw text via PyMuPDF while interpreting embedded figures using the BLIP Image Captioning Model, ensuring that diagrams and charts contribute to the final summary rather than being ignored.

🔬 Core Inference Models: Transformers

Unlike simple frequency-based summarization (e.g., TF-IDF), this project utilizes deep learning models based on the Transformer architecture:

  • Visual Model: BLIP (Bootstrapping Language-Image Pre-training).
  • Text Model: BART (Bidirectional and Auto-Regressive Transformers).
  • Principle: The system treats the document as a composite sequence of text tokens (T) and image tensors (I).
  • Key Relationship: The final summary (S) is a function of the combined textual and visual context:

S = BART(T ⊕ C)

Where T ⊕ C represents the concatenation of the extracted text T and the image captions C generated from I, creating a unified context window for the summarizer.
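The fusion step can be sketched as a pure helper that builds the unified context window; the function name and exact formatting below are illustrative, not the repository's actual API:

```python
def fuse_context(raw_text: str, captions: dict[int, str]) -> str:
    """Build the unified context window: raw extracted text followed by
    per-page image captions in the "Page X: [Caption]" pattern."""
    parts = [raw_text.strip()]
    for page, caption in sorted(captions.items()):
        parts.append(f"Page {page}: [{caption}]")
    return "\n".join(parts)
```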

📡 System Architecture & Signal Chain

The platform integrates PDF parsing, tensor processing, and sequence generation into a unified pipeline.

1. Data Extraction Module (Input Layer)

| Component | Function | Detail |
| --- | --- | --- |
| PyMuPDF (`fitz`) | Text & Asset Extraction | Scrapes raw string data and extracts binary image streams from the PDF structure. |
| pdf2image | Rasterization | Converts PDF pages into PIL-compatible images for visual analysis if direct extraction fails. |
| Time Tracking | Performance Logging | Logs elapsed time per page to monitor processing latency. |
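A minimal sketch of the extraction loop, assuming PyMuPDF and Pillow are installed; the function names here are illustrative, not the repository's actual API:

```python
import io
import time


def extract_page_content(pdf_path: str):
    """Yield (page_number, text, images, elapsed_seconds) for each page.

    Requires PyMuPDF (`pip install pymupdf`) and Pillow; `fitz` is
    imported lazily so the logging helper below stays importable without it.
    """
    import fitz  # PyMuPDF
    from PIL import Image

    doc = fitz.open(pdf_path)
    for page_num, page in enumerate(doc, start=1):
        start = time.perf_counter()
        text = page.get_text()
        images = []
        for xref, *_ in page.get_images(full=True):
            data = doc.extract_image(xref)["image"]  # raw binary image stream
            images.append(Image.open(io.BytesIO(data)))
        yield page_num, text, images, time.perf_counter() - start


def format_page_log(page_num: int, elapsed: float) -> str:
    """Per-page timing line in the format the CLI section shows."""
    return f"Processed page {page_num} in {elapsed:.4f} seconds"
```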

2. AI Inference Engine (Processing Layer)

| Component | Function | Detail |
| --- | --- | --- |
| BLIP Processor | Visual Encoding | Converts raw image data into tensor inputs and generates semantic captions (e.g., "A graph showing sales trends"). |
| BART Tokenizer | Sequence Compression | Analyzes the combined text + captions to generate a concise abstractive summary. |
| Hardware Accelerator | MPS/CUDA | Dynamically assigns tensor operations to the GPU (Apple Metal or NVIDIA) for rapid inference. |
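The two inference calls can be sketched with the Hugging Face `transformers` library. The checkpoint names (`Salesforce/blip-image-captioning-base`, `facebook/bart-large-cnn`) and the generation parameters are assumptions, not confirmed by the repository:

```python
def caption_image(image, device: str = "cpu") -> str:
    """Generate a semantic caption for a PIL image with BLIP.

    Checkpoint name is an assumption; any BLIP captioning checkpoint works.
    """
    from transformers import BlipForConditionalGeneration, BlipProcessor

    name = "Salesforce/blip-image-captioning-base"
    processor = BlipProcessor.from_pretrained(name)
    model = BlipForConditionalGeneration.from_pretrained(name).to(device)
    inputs = processor(images=image, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)


def summarize_text(text: str, device: str = "cpu") -> str:
    """Abstractive summary with BART; no_repeat_ngram_size curbs repetition."""
    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn",
                          device=device)
    result = summarizer(text, max_length=150, min_length=40,
                        no_repeat_ngram_size=3)
    return result[0]["summary_text"]
```

Both functions download model weights on first use, so they are defined here but not executed.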

3. Logic & Control (Robustness)

| Feature | Logic Applied | Purpose |
| --- | --- | --- |
| Chunking Algorithm | Sliding Window | Splits the text into manageable character blocks to prevent Transformer token-limit errors. |
| Recursive Summarization | Two-Pass Method | If the initial summary is still too long, it is recursively fed back into BART for a second pass. |
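The chunking and two-pass logic can be sketched as pure functions, assuming a 1000-character window and an injected summarizer callable; the names are illustrative:

```python
def chunk_text(text: str, size: int = 1000) -> list[str]:
    """Sliding-window split into blocks of at most `size` characters,
    keeping each block under BART's 1024-token input limit."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def recursive_summarize(text, summarize_fn, size=1000, max_len=1000):
    """Two-pass method: summarize each chunk, then re-summarize the
    concatenated draft if it is still too long."""
    draft = " ".join(summarize_fn(chunk) for chunk in chunk_text(text, size))
    return summarize_fn(draft) if len(draft) > max_len else draft
```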

🧪 Proposed Operational Methodology

The execution flow is designed to maximize accuracy while managing memory constraints:

  1. Initialization: The system detects the hardware backend (checking torch.backends.mps.is_available()) to optimize tensor allocation.
  2. Multimodal Extraction: The document is iterated page-by-page. Text is appended to a global string, while images are converted from bytes to PIL objects.
  3. Visual Captioning: The BLIP model performs conditional generation on the extracted images, converting visual information into textual descriptions.
  4. Context Fusion: The extracted text and the generated captions are merged:
  • Context = Raw Text + "Page X: [Generated Caption]"
  5. Summarization: The combined context is passed to the BART model. If the text length exceeds the model's max input (1024 tokens), the chunking strategy is automatically triggered.
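Step 1's hardware detection can be sketched as follows; the selection order is split into a pure helper so the fallback logic is testable without a GPU (helper names are illustrative):

```python
def pick_device(mps_ok: bool, cuda_ok: bool) -> str:
    """Prefer Apple Metal, then CUDA, falling back to CPU."""
    if mps_ok:
        return "mps"
    if cuda_ok:
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch's standard backend-availability checks."""
    import torch
    return pick_device(torch.backends.mps.is_available(),
                       torch.cuda.is_available())
```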

⚠️ Computational Challenges and Robustness

The design explicitly addresses common bottlenecks in local LLM/Model execution:

| Error Category | Challenge Addressed | Solution/Design Feature |
| --- | --- | --- |
| Memory Errors | Token Limit Exceeded: large PDFs exceed BART's 1024-token input limit. | Automated Chunking: text is sliced into 1000-character segments, summarized individually, and then concatenated. |
| Performance | Inference Latency: CPU processing for Transformers is slow. | Device-Agnostic Code: auto-switches to `mps` (Mac) or `cuda` (Linux/Windows) if available, reducing runtime by ~60%. |
| Data Loss | Image Ignorance: standard OCR ignores charts/graphs. | BLIP Integration: visual data is converted to text, ensuring the summary includes information from charts. |
| Output Errors | Hallucination/Repetition: models repeating phrases. | Parameter Tuning: `no_repeat_ngram_size` and `min`/`max_length` constraints are hardcoded into the pipeline. |

🌐 User Interface (CLI)

The tool operates via a robust Command Line Interface for ease of automation.

  • Input: Prompts user for target filename (input("Enter PDF name...")).
  • Real-Time Feedback: Displays processing speed per page in seconds (e.g., Processed page 1 in 0.4502 seconds).
  • Draft Output: Prints the intermediate captions and draft summaries before the final result.
  • Final Output: Delivers a refined, cohesive summary of the entire document.

🧑‍💻 Project Maintainers

| Role | Name |
| --- | --- |
| Developer | Raghavan |
