A collection of CLI utilities for working with PDF files.
A tiny CLI utility to stream large PDF files into plain text without loading the entire file into memory. It wraps pdfminer.six with page-based iteration, configurable LAParams, a friendly CLI spinner, and safe logging so you can batch-process enormous PDFs. When conversion finishes, the CLI prints a summary showing file sizes and elapsed time.
Replace text in PDF files with support for case-sensitive/insensitive search, whole word matching, regular expressions, and custom font sizing. The tool preserves font, size, and style formatting when not using a custom font size.
Key Features:
- Simple text replacement with formatting preservation
- Case-insensitive search option
- Whole word matching
- Regular expression support
- Custom font size specification (--size)
- Batch processing of multiple occurrences
- Enhanced debugging with detailed logging
A Python utility to replace a page in a PDF file with an image. The image is automatically scaled to fit the dimensions of the page being replaced while maintaining its aspect ratio.
Key Features:
- Replace any page in a PDF with an image
- Automatically scales images to match page dimensions
- Maintains image aspect ratio
- Preserves all other pages in the PDF
- Supports various image formats (PNG, JPEG, BMP, etc.)
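The "scale to fit while maintaining aspect ratio" behavior comes down to a small piece of arithmetic. The following is a minimal sketch, not the code in replace_page.py: the function name `fit_within` is hypothetical, and centering the image on the page is an assumption, since the description above only promises that the aspect ratio is kept.

```python
def fit_within(image_w, image_h, page_w, page_h):
    """Return (width, height, x_offset, y_offset) for an image scaled to fit a page."""
    if min(image_w, image_h, page_w, page_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # Use the smaller of the two ratios so neither dimension overflows the page.
    scale = min(page_w / image_w, page_h / image_h)
    new_w, new_h = image_w * scale, image_h * scale
    # Center the scaled image in the leftover space (assumption, see above).
    return new_w, new_h, (page_w - new_w) / 2, (page_h - new_h) / 2


# A 1200x600 landscape image placed on a 600x800 portrait page.
print(fit_within(1200, 600, 600, 800))  # (600.0, 300.0, 0.0, 250.0)
```

Taking the minimum ratio is what prevents distortion: the image is limited by whichever page dimension it would hit first, and the other dimension is left with a margin.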
- Create or activate your Python virtual environment (the repository already contains .venv/).
- Install the requirements:
pip install -r requirements.txt
python pdf_to_text.py INPUT_PDF [-o OUTPUT_TXT] [OPTIONS]
Convert an entire PDF:
python pdf_to_text.py documents/manual.pdf
Extract a page range, overwriting the output file if it already exists:
python pdf_to_text.py big-output.pdf --page-range 50-150 \
--output extracted.txt --overwrite
- --page-range: specify start-end to control the page window (e.g., 10- for everything after page 10)
- --encoding: control the output text encoding (default utf-8)
- --char-margin, --line-margin, --word-margin, --boxes-flow, --detect-vertical: customize pdfminer.six layout heuristics
- --quiet / --log-level: mute or raise logging verbosity
- --no-spinner: disable the CLI animation
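To make the --page-range syntax concrete, here is one way a start-end value could be parsed. This is an illustrative sketch, not the parser in pdf_to_text.py: the function name `parse_page_range` is hypothetical, and it assumes 1-indexed, inclusive bounds where an omitted side means "from the first page" or "to the last page". The real tool may treat the boundaries differently.

```python
def parse_page_range(spec, page_count):
    """Turn 'start-end' (assumed 1-indexed, inclusive) into 0-based page indices."""
    start_text, sep, end_text = spec.partition("-")
    if not sep:
        raise ValueError(f"expected start-end, got {spec!r}")
    # An empty side is treated as open-ended (assumption).
    start = int(start_text) if start_text.strip() else 1
    end = int(end_text) if end_text.strip() else page_count
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    # Clamp the upper bound so a range past the last page does not fail.
    return range(start - 1, min(end, page_count))


print(list(parse_page_range("10-", 12)))  # [9, 10, 11]
```

Returning 0-based indices keeps the result ready to pass to libraries that number pages from zero.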
python replace_text.py INPUT_PDF SEARCH_TEXT REPLACE_TEXT [OPTIONS]
Replace text while preserving formatting:
python replace_text.py document.pdf "old text" "new text"
Replace with custom font size:
python replace_text.py document.pdf "11-21-2020" "11-21-2025" --size 14
Case-insensitive replacement:
python replace_text.py document.pdf "Old Text" "New Text" --ignore-case
Regex replacement (email redaction):
python replace_text.py document.pdf "\b[\w.-]+@[\w.-]+\.\w+\b" "[EMAIL]" --regex
Phone number redaction:
python replace_text.py document.pdf "\b\d{3}-\d{3}-\d{4}\b" "[PHONE]" --regex
- --ignore-case (-i): Case-insensitive search
- --whole-word (-w): Match whole words only
- --regex (-r): Treat search text as a regular expression
- --size SIZE: Font size to use for replacement text (preserves original size if not specified)
- --overwrite: Overwrite the output file if it exists
- --output (-o): Specify output file path
- --quiet: Suppress informational logging
- --log-level: Set logging level (DEBUG, INFO, WARNING, ERROR)
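It can help to preview what a search will match before running it against a PDF. The sketch below shows one plausible way the --ignore-case, --whole-word, and --regex options could combine into a single compiled pattern using Python's standard `re` module. The helper `build_pattern` is hypothetical and is not taken from replace_text.py; it only mirrors the option descriptions above.

```python
import re


def build_pattern(search, ignore_case=False, whole_word=False, regex=False):
    """Compile a search string the way the CLI options describe (illustrative only)."""
    # Literal searches are escaped so characters like "." match themselves.
    body = search if regex else re.escape(search)
    if whole_word:
        # The non-capturing group keeps alternations such as "a|b" inside the boundaries.
        body = rf"\b(?:{body})\b"
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(body, flags)


# Preview the email redaction pattern from the example above on plain text.
email = build_pattern(r"\b[\w.-]+@[\w.-]+\.\w+\b", regex=True)
print(email.sub("[EMAIL]", "Contact alice@example.com or bob.smith@mail.example.org."))
# Contact [EMAIL] or [EMAIL].
```

Trying a pattern on a plain-text sample first is a cheap way to catch over-matching before any PDF is modified.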
Limitations:
- Works best with PDFs that have selectable text
- Scanned PDFs (images) require OCR preprocessing
- Complex layouts may not be perfectly preserved
- Encrypted PDFs require a password
python replace_page.py INPUT_PDF IMAGE_FILE [OPTIONS]
Replace the first page (cover):
python replace_page.py document.pdf new_cover.png
Replace a specific page:
python replace_page.py report.pdf diagram.jpg --page 3 --output updated_report.pdf
Replace with overwrite:
python replace_page.py document.pdf image.png --page 5 --overwrite
- -p, --page PAGE: Page number to replace (1-indexed, default: 1)
- -o, --output OUTPUT: Path to the output PDF file
- --overwrite: Overwrite the output file if it exists
- --quiet: Suppress informational logging
- --log-level LOG_LEVEL: Set logging level (DEBUG, INFO, WARNING, ERROR)
Supported Image Formats: PNG, JPEG, BMP, GIF, TIFF
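Because --page is 1-indexed while Python PDF libraries such as PyMuPDF and PyPDF2 index pages from zero, the value has to be converted and range-checked somewhere. A minimal sketch of that step follows; the helper name `to_page_index` and the exact error message are assumptions, not the behavior of replace_page.py.

```python
def to_page_index(page_number, page_count):
    """Convert a 1-indexed --page value to a 0-based index, validating the bounds."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is out of range 1-{page_count}")
    return page_number - 1


print(to_page_index(3, 10))  # 2
```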
Run the CLIs with --help to verify the scripts start without errors:
python pdf_to_text.py --help
python replace_text.py --help
python replace_page.py --help
Run the test suite:
pytest tests/ -v
- Python 3.8+
- PyMuPDF (fitz) >= 1.23.0
- PyPDF2 >= 4.0
- Pillow >= 10.0
- pdfminer.six
See requirements.txt for complete dependencies.
- Always test on a copy of your PDF first
- Complex PDFs with multiple layers may not work perfectly
- The tools preserve images and non-text content when possible
- Text replacement preserves formatting when custom font size is not specified