YRL-AIDA/precision_pdf

Precision PDF Parser

Precision PDF Parser is a high-precision, Java-based PDF parsing and extraction framework built on top of Apache PDFBox. It provides structured content extraction with advanced layout preservation, which makes it well suited to document processing, data mining, and content analysis applications.

🎯 Key Features

Precision Text Extraction

  • Hierarchical Content Structure: Extracts text at multiple granularity levels (words, lines, chunks)
  • Layout Preservation: Maintains original document layout and formatting
  • Style Detection: Identifies text styles, fonts, and formatting information
  • Bounding Box Tracking: Precise spatial positioning of all extracted elements
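
The hierarchy and bounding-box tracking described above can be pictured with a small sketch. The types below (`BoundingBox`, `Word`, `Line`, `Chunk`) are hypothetical names chosen for illustration and are not taken from the Precision PDF API. The sketch assumes each level derives its box as the union of the boxes of its children.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical types for illustration only; not the Precision PDF API.
record BoundingBox(double x, double y, double width, double height) {
    // Smallest box that covers both this box and the other one.
    BoundingBox union(BoundingBox other) {
        double minX = Math.min(x, other.x);
        double minY = Math.min(y, other.y);
        double maxX = Math.max(x + width, other.x + other.width);
        double maxY = Math.max(y + height, other.y + other.height);
        return new BoundingBox(minX, minY, maxX - minX, maxY - minY);
    }
}

// Finest granularity: a single word and where it sits on the page.
record Word(String text, BoundingBox box) {}

// A line groups words; its text and box are derived from them.
record Line(List<Word> words) {
    String text() {
        return words.stream().map(Word::text).collect(Collectors.joining(" "));
    }

    BoundingBox box() {
        return words.stream().map(Word::box).reduce(BoundingBox::union).orElseThrow();
    }
}

// A chunk groups lines (for example, a paragraph).
record Chunk(List<Line> lines) {
    String text() {
        return lines.stream().map(Line::text).collect(Collectors.joining("\n"));
    }

    BoundingBox box() {
        return lines.stream().map(Line::box).reduce(BoundingBox::union).orElseThrow();
    }
}
```

Deriving the parent box from its children keeps the spatial data consistent at every level. A real implementation may instead store boxes explicitly for speed.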

📊 Multi-Content Extraction

  • Text Content: Structured extraction with confidence scores and font information
  • Image Extraction: High-quality image capture with metadata and spatial context
  • Table Recognition: Advanced table detection and structure extraction
  • Metadata Extraction: Comprehensive document metadata and custom properties

โš™๏ธ Configurable Processing

  • Flexible Extraction Config: Enable/disable specific content types (text, images, tables, metadata)
  • Quality Settings: Adjustable image DPI, size limits, and processing parameters
  • Layout Options: Choose between layout preservation or raw text extraction
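
To make the options above concrete, here is a minimal sketch of what such a configuration object could look like. The class name `ExtractionConfig`, its builder methods, and the defaults (all content types enabled, 150 DPI) are assumptions made for illustration. The project's actual configuration API may differ.

```java
// Hypothetical configuration sketch; names and defaults are assumptions,
// not the Precision PDF API.
final class ExtractionConfig {
    final boolean extractText;
    final boolean extractImages;
    final boolean extractTables;
    final boolean extractMetadata;
    final boolean preserveLayout;
    final int imageDpi;

    private ExtractionConfig(Builder b) {
        this.extractText = b.extractText;
        this.extractImages = b.extractImages;
        this.extractTables = b.extractTables;
        this.extractMetadata = b.extractMetadata;
        this.preserveLayout = b.preserveLayout;
        this.imageDpi = b.imageDpi;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        // Assumed defaults: everything on, layout preserved, 150 DPI images.
        private boolean extractText = true;
        private boolean extractImages = true;
        private boolean extractTables = true;
        private boolean extractMetadata = true;
        private boolean preserveLayout = true;
        private int imageDpi = 150;

        Builder extractText(boolean v) { extractText = v; return this; }
        Builder extractImages(boolean v) { extractImages = v; return this; }
        Builder extractTables(boolean v) { extractTables = v; return this; }
        Builder extractMetadata(boolean v) { extractMetadata = v; return this; }
        Builder preserveLayout(boolean v) { preserveLayout = v; return this; }

        Builder imageDpi(int dpi) {
            // Reject values that cannot produce a rendered image.
            if (dpi <= 0) {
                throw new IllegalArgumentException("imageDpi must be positive: " + dpi);
            }
            imageDpi = dpi;
            return this;
        }

        ExtractionConfig build() {
            return new ExtractionConfig(this);
        }
    }
}
```

A caller that only needs raw text might write `ExtractionConfig.builder().extractImages(false).extractTables(false).preserveLayout(false).build()`.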

📤 Multiple Export Formats

  • JSON: Structured data with spatial relationships and metadata
  • XML: Standardized document representation
  • HTML: Web-friendly format with layout preservation
  • TEXT: Clean text extraction
  • CSV: Tabular data export
  • PDF: Processed document output
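
As an illustration of the CSV export format listed above, the following sketch serializes extracted table rows using the quoting rules of RFC 4180. The `CsvExport` class is a hypothetical helper written for this example and does not show how Precision PDF implements its exporter.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not the Precision PDF exporter.
final class CsvExport {
    private CsvExport() {}

    // Quote a cell when it contains a comma, a quote, or a line break,
    // doubling any embedded quotes (RFC 4180).
    static String escape(String cell) {
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Join cells with commas and rows with CRLF.
    static String toCsv(List<List<String>> rows) {
        return rows.stream()
                .map(row -> row.stream().map(CsvExport::escape).collect(Collectors.joining(",")))
                .collect(Collectors.joining("\r\n"));
    }
}
```

Quoting matters for tables extracted from PDFs because cells often contain commas, such as amounts or addresses. Without it, columns shift when the file is opened in a spreadsheet.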
