fetcharoo

A Python library for downloading PDF files from webpages with support for recursive link following, PDF merging, and security hardening.

Features

  • Download PDF files from a specified webpage
  • Recursive crawling with configurable depth (up to 5 levels)
  • Merge downloaded PDFs into a single file or save separately
  • Smart merge ordering: Sort PDFs numerically, alphabetically, or with custom sort keys
  • Automatic deduplication: Remove duplicate PDF URLs across pages
  • Custom output filenames: Name your merged PDF files
  • Rich result reporting: Get detailed download statistics with ProcessResult
  • Command-line interface for quick downloads
  • Quiet/verbose modes: Control output verbosity with -q and -v flags
  • robots.txt compliance for ethical web crawling
  • Custom User-Agent support
  • Dry-run mode to preview downloads
  • Progress bars with tqdm integration
  • PDF filtering by filename, URL patterns, and size
  • Security hardening: Domain restriction, path traversal protection, rate limiting
  • Configurable timeouts and request delays

Requirements

  • Python 3.10 or higher
  • Dependencies: requests, pymupdf, beautifulsoup4, tqdm

Installation

Using pip

pip install fetcharoo

From GitHub (latest)

pip install git+https://github.com/MALathon/fetcharoo.git

Using Poetry

poetry add fetcharoo

From source

git clone https://github.com/MALathon/fetcharoo.git
cd fetcharoo
poetry install

Command-Line Interface

fetcharoo includes a CLI for quick PDF downloads:

# Download PDFs from a webpage
fetcharoo https://example.com

# Download with recursion and merge into one file
fetcharoo https://example.com -d 2 -m

# Merge with custom output filename and numeric sorting
fetcharoo https://example.com -m --output-name "textbook.pdf" --sort-by numeric

# List PDFs without downloading (dry run)
fetcharoo https://example.com --dry-run

# Download with custom options
fetcharoo https://example.com -o my_pdfs --delay 1.0 --progress

# Filter PDFs by pattern
fetcharoo https://example.com --include "report*.pdf" --exclude "*draft*"

# Quiet mode (less output) or verbose mode (more output)
fetcharoo https://example.com -q     # Quieter
fetcharoo https://example.com -qq    # Even quieter
fetcharoo https://example.com -v     # More verbose
fetcharoo https://example.com -vv    # Debug level

CLI Options

Option Description
-o, --output DIR Output directory (default: output)
-d, --depth N Recursion depth (default: 0)
-m, --merge Merge all PDFs into a single file
--output-name FILENAME Custom filename for merged PDF (with --merge)
--sort-by STRATEGY Sort PDFs before merging: numeric, alpha, alpha_desc, none
--dry-run List PDFs without downloading
--delay SECONDS Delay between requests (default: 0.5)
--timeout SECONDS Request timeout (default: 30)
--user-agent STRING Custom User-Agent string
--respect-robots Respect robots.txt rules
--progress Show progress bars
-q, --quiet Reduce output verbosity (use -qq for even quieter)
-v, --verbose Increase output verbosity (use -vv for debug)
--include PATTERN Include PDFs matching pattern
--exclude PATTERN Exclude PDFs matching pattern
--min-size BYTES Minimum PDF size
--max-size BYTES Maximum PDF size
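
These options compose; for example, a polite recursive crawl that also skips very small or very large files (the values below are illustrative):

fetcharoo https://example.com -d 1 -o reports \
    --respect-robots --user-agent "MyBot/1.0 (contact@example.com)" \
    --timeout 60 --min-size 10000 --max-size 50000000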

Quick Start

from fetcharoo import download_pdfs_from_webpage

# Download PDFs from a webpage and merge them into a single file
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    mode='merge',
    write_dir='output'
)

Usage

Basic Usage

from fetcharoo import download_pdfs_from_webpage

# Download and save PDFs as separate files
download_pdfs_from_webpage(
    url='https://example.com/documents',
    recursion_depth=0,  # Only search the specified page
    mode='separate',
    write_dir='downloads'
)

With robots.txt Compliance

from fetcharoo import download_pdfs_from_webpage

# Respect robots.txt rules
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    respect_robots=True,
    user_agent='MyBot/1.0'
)

Dry-Run Mode

from fetcharoo import download_pdfs_from_webpage

# Preview what would be downloaded
result = download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    dry_run=True
)

print(f"Found {result['count']} PDFs:")
for url in result['urls']:
    print(f"  - {url}")

With Progress Bars

from fetcharoo import download_pdfs_from_webpage

# Show progress during download
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    write_dir='output',
    show_progress=True
)

PDF Filtering

from fetcharoo import download_pdfs_from_webpage, FilterConfig

# Filter by filename patterns and size
filter_config = FilterConfig(
    filename_include=['report*.pdf', 'annual*.pdf'],
    filename_exclude=['*draft*', '*temp*'],
    min_size=10000,  # 10KB minimum
    max_size=50000000  # 50MB maximum
)

download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    write_dir='output',
    filter_config=filter_config
)

With Security Options

from fetcharoo import download_pdfs_from_webpage

# Restrict crawling to specific domains
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    allowed_domains={'example.com', 'docs.example.com'},
    request_delay=1.0,  # 1 second between requests
    timeout=60  # 60 second timeout
)

Sorting and Merging

from fetcharoo import download_pdfs_from_webpage

# Merge chapters in numeric order (chapter_1.pdf, chapter_2.pdf, chapter_10.pdf)
download_pdfs_from_webpage(
    url='https://example.com/book',
    mode='merge',
    write_dir='output',
    sort_by='numeric',
    output_name='complete_book.pdf'
)

# Custom sort key function
from fetcharoo import process_pdfs, find_pdfs_from_webpage

pdf_urls = find_pdfs_from_webpage('https://example.com')
process_pdfs(
    pdf_urls,
    write_dir='output',
    mode='merge',
    sort_key=lambda url: url.split('/')[-1]  # Sort by filename
)

Using ProcessResult

from fetcharoo import download_pdfs_from_webpage

# Get detailed results from download operation
result = download_pdfs_from_webpage(
    url='https://example.com',
    mode='separate',
    write_dir='output'
)

# ProcessResult provides detailed information
print(f"Success: {result.success}")
print(f"Downloaded: {result.downloaded_count}")
print(f"Failed: {result.failed_count}")
print(f"Files created: {result.files_created}")
print(f"Errors: {result.errors}")

# ProcessResult is truthy when successful
if result:
    print("Download completed!")

Finding PDFs Without Downloading

from fetcharoo import find_pdfs_from_webpage

# Just get the list of PDF URLs (deduplicated by default)
pdf_urls = find_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1
)

for url in pdf_urls:
    print(url)

Custom User-Agent

from fetcharoo import download_pdfs_from_webpage, set_default_user_agent

# Set a global default User-Agent
set_default_user_agent('MyCompanyBot/1.0 (contact@example.com)')

# Or use per-request User-Agent
download_pdfs_from_webpage(
    url='https://example.com',
    user_agent='SpecificBot/2.0'
)

API Reference

download_pdfs_from_webpage()

Main function to find and download PDFs from a webpage.

Parameter Type Default Description
url str required The webpage URL to search
recursion_depth int 0 How many levels of links to follow (max 5)
mode str 'separate' 'merge' or 'separate'
write_dir str 'output' Output directory for PDFs
allowed_domains set None Restrict crawling to these domains
request_delay float 0.5 Seconds between requests
timeout int 30 Request timeout in seconds
respect_robots bool False Whether to respect robots.txt
user_agent str None Custom User-Agent (uses default if None)
dry_run bool False Preview URLs without downloading
show_progress bool False Show progress bars
filter_config FilterConfig None PDF filtering configuration
sort_by str None Sort strategy: 'numeric', 'alpha', 'alpha_desc', 'none'
sort_key callable None Custom sort key function
output_name str None Custom filename for merged PDF

Returns: ProcessResult object with download statistics, or dict in dry-run mode.
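
Several of these parameters can be combined in a single call; for example, using values from the table above (all values are illustrative):

from fetcharoo import download_pdfs_from_webpage, FilterConfig

result = download_pdfs_from_webpage(
    url='https://example.com/reports',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    allowed_domains={'example.com'},
    respect_robots=True,
    user_agent='MyBot/1.0',
    filter_config=FilterConfig(filename_include=['*.pdf'], min_size=10000),
    sort_by='alpha',
    output_name='reports.pdf',
    show_progress=True
)
print(f"Downloaded {result.downloaded_count} PDFs")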

find_pdfs_from_webpage()

Find PDF URLs without downloading.

Parameter Type Default Description
url str required The webpage URL to search
recursion_depth int 0 How many levels of links to follow
deduplicate bool True Remove duplicate PDF URLs
... (plus other parameters from above)

process_pdfs()

Download and save a list of PDF URLs.

Parameter Type Default Description
pdf_links list required List of PDF URLs to download
write_dir str required Output directory
mode str 'separate' 'merge' or 'separate'
sort_by str None Sort strategy for merging
sort_key callable None Custom sort key function
output_name str None Custom merged filename

Returns: ProcessResult object with download statistics.

ProcessResult

Dataclass returned by download operations:

from fetcharoo import ProcessResult

# result below is a ProcessResult returned by download_pdfs_from_webpage() or process_pdfs()
# Attributes:
result.success           # bool: True if any PDFs were processed
result.files_created     # List[str]: Paths to created files
result.downloaded_count  # int: Number of successful downloads
result.filtered_count    # int: Number of PDFs filtered out
result.failed_count      # int: Number of failed downloads
result.errors            # List[str]: Error messages

# ProcessResult is truthy when successful:
if result:
    print("Success!")

FilterConfig

Configuration for PDF filtering:

from fetcharoo import FilterConfig

config = FilterConfig(
    filename_include=['*.pdf'],      # Patterns to include
    filename_exclude=['*draft*'],    # Patterns to exclude
    url_include=['*/reports/*'],     # URL patterns to include
    url_exclude=['*/temp/*'],        # URL patterns to exclude
    min_size=1000,                   # Minimum size in bytes
    max_size=100000000               # Maximum size in bytes
)

Utility Functions

  • merge_pdfs() - Merge multiple PDF documents
  • is_valid_url() - Validate URL format and scheme
  • is_safe_domain() - Check if domain is allowed
  • sanitize_filename() - Prevent path traversal attacks
  • check_robots_txt() - Check robots.txt permissions
  • set_default_user_agent() - Set default User-Agent
  • get_default_user_agent() - Get current default User-Agent
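
A short sketch using a few of these together; the exact signatures here are assumptions based on the names and descriptions above, so check the source for details:

from fetcharoo import is_valid_url, sanitize_filename, set_default_user_agent

# set_default_user_agent is shown earlier in this README; the other two signatures are assumed
set_default_user_agent('MyBot/1.0 (contact@example.com)')

url = 'https://example.com/files/annual%20report.pdf'
if is_valid_url(url):  # assumed to accept only http/https URLs
    print(sanitize_filename('../../annual report.pdf'))  # assumed to return a safe filename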

Security Features

fetcharoo includes several security measures:

  • Domain restriction: Limit recursive crawling to specified domains (SSRF protection)
  • Path traversal protection: Sanitizes filenames to prevent directory escape
  • Rate limiting: Configurable delays between requests
  • Timeout handling: Prevents hanging on slow servers
  • URL validation: Only allows http/https schemes
  • robots.txt compliance: Optional respect for crawling rules
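
To illustrate the path traversal idea, here is a minimal standalone sketch using only the standard library (this is not fetcharoo's actual implementation) that reduces a URL-derived name to a safe basename:

import os
import re
from urllib.parse import unquote, urlparse

def safe_pdf_name(url):
    # Keep only the final path component of the URL path
    name = os.path.basename(unquote(urlparse(url).path))
    # Replace anything outside a conservative character whitelist
    name = re.sub(r'[^A-Za-z0-9._-]', '_', name)
    # Reject empty or dot/underscore-only names that could still be unsafe
    if not name.strip('._'):
        return 'download.pdf'
    return name

print(safe_pdf_name('https://example.com/files/../../etc/report%201.pdf'))  # report_1.pdf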

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a pull request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Developed by Mark A. Lifson, Ph.D.
