Project Title: Intelligent Multimodal PDF Summarizer & Captioning Engine
Domain: Natural Language Processing (NLP) & Computer Vision
Device Support: MPS (Apple Silicon), CUDA, CPU
This work presents a multimodal information extraction pipeline designed for the automated analysis of complex PDF documents. Unlike traditional summarizers that only process text, this system bridges the gap between visual and textual data by integrating Optical Character Recognition (OCR) principles with Generative AI.
The primary innovation is the Dual-Stream Processing Architecture—simultaneously extracting raw text via PyMuPDF while interpreting embedded figures using the BLIP Image Captioning Model, ensuring that diagrams and charts contribute to the final summary rather than being ignored.
Unlike simple frequency-based summarization (e.g., TF-IDF), this project utilizes deep learning models based on the Transformer architecture:
- Visual Model: BLIP (Bootstrapping Language-Image Pre-training).
- Text Model: BART (Bidirectional and Auto-Regressive Transformers).
- Principle: The system treats the document as a composite sequence of text tokens (T) and image tensors (I).
- Key Relationship: The final summary (S) is a function of the combined textual and visual context: S = f(C), where C = T + Captions(I)
Where C represents the concatenation of extracted text and generated image captions, creating a unified context window for the summarizer.
The platform integrates PDF parsing, tensor processing, and sequence generation into a unified pipeline.
| Component | Function | Detail |
|---|---|---|
| PyMuPDF (Fitz) | Text & Asset Extraction | Scrapes raw string data and extracts binary image streams from the PDF structure. |
| PDF2Image | Rasterization | Converts PDF pages into PIL-compatible images for visual analysis if direct extraction fails. |
| Time Tracking | Performance Logging | Logs elapsed time per page to monitor processing latency. |
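The extraction stage described in the table above can be sketched as follows. This is an illustrative reconstruction, not the project's exact code: the function name `extract_text_and_images` is assumed, and it requires PyMuPDF (`fitz`) and Pillow to be installed.

```python
import io
import time


def extract_text_and_images(pdf_path):
    """Return concatenated page text and a list of (page_number, PIL.Image)."""
    import fitz  # PyMuPDF; imported lazily inside the function
    from PIL import Image

    text, images = "", []
    doc = fitz.open(pdf_path)
    for page_num, page in enumerate(doc, start=1):
        start = time.time()
        text += page.get_text()
        for img in page.get_images(full=True):
            xref = img[0]  # cross-reference id of the embedded image
            raw = doc.extract_image(xref)  # dict with raw bytes + metadata
            images.append((page_num, Image.open(io.BytesIO(raw["image"]))))
        print(f"Processed page {page_num} in {time.time() - start:.4f} seconds")
    doc.close()
    return text, images
```

The per-page timing print mirrors the "Performance Logging" row; real code might route this through `logging` instead.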
| Component | Function | Detail |
|---|---|---|
| BLIP Processor | Visual Encoding | Converts raw image data into tensor inputs and generates semantic captions (e.g., "A graph showing sales trends"). |
| BART Tokenizer | Sequence Compression | Compresses the combined text + captions into a concise abstractive summary. |
| Hardware Accelerator | MPS/CUDA | Dynamically assigns tensor operations to the GPU (Apple Metal or NVIDIA) for rapid inference. |
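The "Hardware Accelerator" row reduces to a small device-selection routine. A minimal sketch, assuming `torch` may or may not be installed (the helper name `pick_device` is illustrative):

```python
def pick_device():
    """Prefer CUDA, then Apple MPS, then CPU; degrade gracefully without torch."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

Models and tensors are then moved with `.to(pick_device())` before inference.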
| Feature | Logic Applied | Purpose |
|---|---|---|
| Chunking Algorithm | Sliding Window | Splits text characters into manageable blocks to prevent Transformer token-limit errors. |
| Recursive Summarization | Two-Pass Method | If the initial summary is still too long, it is recursively fed back into BART for a second pass. |
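The sliding-window chunking in the table can be sketched in pure Python. The chunk size follows the 1000-character figure used elsewhere in this document; the overlap value is an assumption added to preserve context across chunk boundaries.

```python
CHUNK_SIZE = 1000  # matches the 1000-char segments described in the design
OVERLAP = 100      # assumed overlap so sentences spanning a boundary survive


def chunk_text(text, size=CHUNK_SIZE, overlap=OVERLAP):
    """Split text into overlapping windows no longer than `size` characters."""
    if len(text) <= size:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # slide forward, keeping shared context
    return chunks


parts = chunk_text("a" * 2500)
print([len(p) for p in parts])  # three windows, each within the budget
```

Each chunk is summarized individually, and the partial summaries are concatenated; the recursive second pass then runs only if that concatenation is still too long.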
The execution flow is designed to maximize accuracy while managing memory constraints:
- Initialization: The system detects the hardware backend (checking torch.backends.mps.is_available()) to optimize tensor allocation.
- Multimodal Extraction: The document is iterated page-by-page. Text is appended to a global string, while images are converted from bytes to PIL objects.
- Visual Captioning: The BLIP model performs conditional generation on the extracted images, converting visual information into textual descriptions.
- Context Fusion: The extracted text and the generated captions are merged:
- Context = Raw Text + "Page X: [Generated Caption]"
- Summarization: The combined context is passed to the BART model. If the text length exceeds the model's max input (1024 tokens), the "Chunking Strategy" is automatically triggered.
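Steps 4 and 5 above can be sketched as two small helpers. The function names, the token budget constant, and the chars-per-token heuristic are illustrative assumptions, not the project's exact code:

```python
MAX_INPUT_TOKENS = 1024      # BART's input limit, per the pipeline description
APPROX_CHARS_PER_TOKEN = 4   # rough heuristic standing in for a real tokenizer


def build_context(raw_text, captions):
    """Fuse raw text with BLIP captions: captions is a list of (page, text)."""
    caption_lines = [f"Page {n}: [{c}]" for n, c in captions]
    return raw_text + "\n" + "\n".join(caption_lines)


def needs_chunking(context):
    """Trigger the chunking strategy when the context likely exceeds the limit."""
    return len(context) > MAX_INPUT_TOKENS * APPROX_CHARS_PER_TOKEN


ctx = build_context("Quarterly report text...",
                    [(1, "A graph showing sales trends")])
print(ctx)
print(needs_chunking(ctx))
```

In the real pipeline the length check would use the BART tokenizer's actual token count rather than a character heuristic.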
The design explicitly addresses common bottlenecks in local LLM/Model execution:
| Error Category | Challenge Addressed | Solution/Design Feature |
|---|---|---|
| Memory Errors | Token Limit Exceeded: Large PDFs exceed the 1024 token limit of BART. | Automated Chunking: Text is sliced into 1000-char segments, summarized individually, and then concatenated. |
| Performance | Inference Latency: CPU processing for Transformers is slow. | Device Agnostic Code: Auto-switches to mps (Mac) or cuda (Linux/Win) if available, reducing time by ~60%. |
| Data Loss | Image Ignorance: Standard OCR ignores charts/graphs. | BLIP Integration: Visual data is converted to text, ensuring the summary includes info from charts. |
| Output Errors | Hallucination/Repetition: Models repeating phrases. | Parameter Tuning: no_repeat_ngram_size and min/max_length constraints are hardcoded into the pipeline. |
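The "Parameter Tuning" row corresponds to decoding constraints passed to the Hugging Face `generate()`/pipeline call. The specific values below are assumptions for illustration; only the parameter names `no_repeat_ngram_size`, `min_length`, and `max_length` come from this document:

```python
# Illustrative decoding constraints; the exact values are assumptions.
GEN_KWARGS = {
    "max_length": 150,          # cap the summary length
    "min_length": 40,           # avoid degenerate one-line outputs
    "no_repeat_ngram_size": 3,  # block repeated 3-grams (repetition guard)
    "num_beams": 4,             # beam search for more coherent summaries
    "early_stopping": True,
}

# Usage sketch (requires transformers and a downloaded BART checkpoint):
# summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# summary = summarizer(context, **GEN_KWARGS)[0]["summary_text"]
print(sorted(GEN_KWARGS))
```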
The tool operates via a robust Command Line Interface for ease of automation.
- Input: Prompts the user for the target filename (input("Enter PDF name...")).
- Real-Time Feedback: Displays processing speed per page in seconds (e.g., Processed page 1 in 0.4502 seconds).
- Draft Output: Prints the intermediate captions and draft summaries before the final result.
- Final Output: Delivers a refined, cohesive summary of the entire document.
| Developer | Raghavan |