A Java-based PDF parsing and extraction framework built on top of Apache PDFBox. It extracts structured content while preserving the document layout, which makes it a good fit for document processing, data mining, and content analysis applications.
- Hierarchical Content Structure: Extracts text at multiple granularity levels (words, lines, chunks)
- Layout Preservation: Maintains original document layout and formatting
- Style Detection: Identifies text styles, fonts, and formatting information
- Bounding Box Tracking: Precise spatial positioning of all extracted elements
- Text Content: Structured extraction with confidence scores and font information
- Image Extraction: High-quality image capture with metadata and spatial context
- Table Recognition: Table detection and extraction of table structure
- Metadata Extraction: Comprehensive document metadata and custom properties
- Flexible Extraction Config: Enable/disable specific content types (text, images, tables, metadata)
- Quality Settings: Adjustable image DPI, size limits, and processing parameters
- Layout Options: Choose between layout preservation or raw text extraction
- JSON: Structured data with spatial relationships and metadata
- XML: Standardized document representation
- HTML: Web-friendly format with layout preservation
- TEXT: Clean text extraction
- CSV: Tabular data export
- PDF: Processed document output
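
To make the hierarchical structure and bounding box tracking more concrete, here is a minimal, self-contained sketch of how words might be grouped into lines and serialized to JSON. It does not use this framework's API: `LayoutSketch`, `Box`, `Word`, `Line`, `groupIntoLines`, and the `yTolerance` parameter are all hypothetical names chosen for illustration, and the coordinate convention (top-left origin, y growing downward) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustrative only: these types are hypothetical, not the framework's classes.
public class LayoutSketch {

    /** Axis-aligned box; (x, y) is the top-left corner, y grows downward (assumed). */
    static final class Box {
        final double x, y, width, height;

        Box(double x, double y, double width, double height) {
            this.x = x; this.y = y; this.width = width; this.height = height;
        }

        /** Smallest box that contains both this box and the other one. */
        Box union(Box other) {
            double left = Math.min(x, other.x);
            double top = Math.min(y, other.y);
            double right = Math.max(x + width, other.x + other.width);
            double bottom = Math.max(y + height, other.y + other.height);
            return new Box(left, top, right - left, bottom - top);
        }
    }

    /** Finest granularity level: one word and its position. */
    static final class Word {
        final String text;
        final Box box;

        Word(String text, Box box) { this.text = text; this.box = box; }
    }

    /** A line owns its words; its box is the union of the word boxes. */
    static final class Line {
        final List<Word> words;
        final Box box;

        Line(List<Word> members) {
            // Restore left-to-right reading order within the line.
            words = new ArrayList<>(members);
            words.sort(Comparator.comparingDouble((Word w) -> w.box.x));
            Box merged = words.get(0).box;
            for (Word w : words) merged = merged.union(w.box);
            box = merged;
        }

        String text() {
            return words.stream().map(w -> w.text).collect(Collectors.joining(" "));
        }

        /** Hand-rolled JSON that escapes only backslashes and quotes; real code would use a JSON library. */
        String toJson() {
            String escaped = text().replace("\\", "\\\\").replace("\"", "\\\"");
            return String.format(Locale.ROOT,
                    "{\"text\":\"%s\",\"bbox\":[%.1f,%.1f,%.1f,%.1f]}",
                    escaped, box.x, box.y, box.width, box.height);
        }
    }

    /** Puts a word on the current line while its top edge is within yTolerance of the line's first word. */
    static List<Line> groupIntoLines(List<Word> words, double yTolerance) {
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble((Word w) -> w.box.y));

        List<Line> lines = new ArrayList<>();
        List<Word> current = new ArrayList<>();
        for (Word w : sorted) {
            if (!current.isEmpty() && w.box.y - current.get(0).box.y > yTolerance) {
                lines.add(new Line(current));
                current = new ArrayList<>();
            }
            current.add(w);
        }
        if (!current.isEmpty()) lines.add(new Line(current));
        return lines;
    }

    public static void main(String[] args) {
        List<Word> words = Arrays.asList(
                new Word("world", new Box(45, 100.5, 32, 12)),
                new Word("Second", new Box(10, 120, 40, 12)),
                new Word("Hello", new Box(10, 100, 30, 12)));
        for (Line line : groupIntoLines(words, 2.0)) {
            System.out.println(line.toJson());
        }
    }
}
```

The same union step could be repeated one level up to build chunks from lines. Grouping by a fixed vertical tolerance is a deliberately simple heuristic; rotated text or multi-column pages would need something more careful.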