fetcharoo

A Python library for downloading PDF files from webpages with support for recursive link following, PDF merging, and security hardening.

Features

  • Download PDF files from a specified webpage
  • Recursive crawling with configurable depth (up to 5 levels)
  • Merge downloaded PDFs into a single file or save separately
  • Smart merge ordering: Sort PDFs numerically, alphabetically, or with custom sort keys
  • Automatic deduplication: Remove duplicate PDF URLs across pages
  • Custom output filenames: Name your merged PDF files
  • Rich result reporting: Get detailed download statistics with ProcessResult
  • Command-line interface for quick downloads
  • Quiet/verbose modes: Control output verbosity with -q and -v flags
  • robots.txt compliance for ethical web crawling
  • Custom User-Agent support
  • Dry-run mode to preview downloads
  • Progress bars with tqdm integration
  • PDF filtering by filename, URL patterns, and size
  • Security hardening: Domain restriction, path traversal protection, rate limiting
  • Configurable timeouts and request delays

Requirements

  • Python 3.10 or higher
  • Dependencies: requests, pymupdf, beautifulsoup4, tqdm

Installation

Using pip

pip install fetcharoo

From GitHub (latest)

pip install git+https://github.com/MALathon/fetcharoo.git

Using Poetry

poetry add fetcharoo

From source

git clone https://github.com/MALathon/fetcharoo.git
cd fetcharoo
poetry install

Command-Line Interface

fetcharoo includes a CLI for quick PDF downloads:

# Download PDFs from a webpage
fetcharoo https://example.com

# Download with recursion and merge into one file
fetcharoo https://example.com -d 2 -m

# Merge with custom output filename and numeric sorting
fetcharoo https://example.com -m --output-name "textbook.pdf" --sort-by numeric

# List PDFs without downloading (dry run)
fetcharoo https://example.com --dry-run

# Download with custom options
fetcharoo https://example.com -o my_pdfs --delay 1.0 --progress

# Filter PDFs by pattern
fetcharoo https://example.com --include "report*.pdf" --exclude "*draft*"

# Quiet mode (less output) or verbose mode (more output)
fetcharoo https://example.com -q     # Quieter
fetcharoo https://example.com -qq    # Even quieter
fetcharoo https://example.com -v     # More verbose
fetcharoo https://example.com -vv    # Debug level

CLI Options

Option Description
-o, --output DIR Output directory (default: output)
-d, --depth N Recursion depth (default: 0)
-m, --merge Merge all PDFs into a single file
--output-name FILENAME Custom filename for merged PDF (with --merge)
--sort-by STRATEGY Sort PDFs before merging: numeric, alpha, alpha_desc, none
--dry-run List PDFs without downloading
--delay SECONDS Delay between requests (default: 0.5)
--timeout SECONDS Request timeout (default: 30)
--user-agent STRING Custom User-Agent string
--respect-robots Respect robots.txt rules
--progress Show progress bars
-q, --quiet Reduce output verbosity (use -qq for even quieter)
-v, --verbose Increase output verbosity (use -vv for debug)
--include PATTERN Include PDFs matching pattern
--exclude PATTERN Exclude PDFs matching pattern
--min-size BYTES Minimum PDF size
--max-size BYTES Maximum PDF size
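
These options compose; for example, a polite recursive crawl that also skips very small or very large files (the values below are illustrative):

fetcharoo https://example.com -d 1 -o reports \
    --respect-robots --user-agent "MyBot/1.0 (contact@example.com)" \
    --timeout 60 --min-size 10000 --max-size 50000000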

Quick Start

from fetcharoo import download_pdfs_from_webpage

# Download PDFs from a webpage and merge them into a single file
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    mode='merge',
    write_dir='output'
)

Usage

Basic Usage

from fetcharoo import download_pdfs_from_webpage

# Download and save PDFs as separate files
download_pdfs_from_webpage(
    url='https://example.com/documents',
    recursion_depth=0,  # Only search the specified page
    mode='separate',
    write_dir='downloads'
)

With robots.txt Compliance

from fetcharoo import download_pdfs_from_webpage

# Respect robots.txt rules
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    respect_robots=True,
    user_agent='MyBot/1.0'
)

Dry-Run Mode

from fetcharoo import download_pdfs_from_webpage

# Preview what would be downloaded
result = download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    dry_run=True
)

print(f"Found {result['count']} PDFs:")
for url in result['urls']:
    print(f"  - {url}")

With Progress Bars

from fetcharoo import download_pdfs_from_webpage

# Show progress during download
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    write_dir='output',
    show_progress=True
)

PDF Filtering

from fetcharoo import download_pdfs_from_webpage, FilterConfig

# Filter by filename patterns and size
filter_config = FilterConfig(
    filename_include=['report*.pdf', 'annual*.pdf'],
    filename_exclude=['*draft*', '*temp*'],
    min_size=10000,  # 10KB minimum
    max_size=50000000  # 50MB maximum
)

download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    write_dir='output',
    filter_config=filter_config
)

With Security Options

from fetcharoo import download_pdfs_from_webpage

# Restrict crawling to specific domains
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    allowed_domains={'example.com', 'docs.example.com'},
    request_delay=1.0,  # 1 second between requests
    timeout=60  # 60 second timeout
)

Sorting and Merging

from fetcharoo import download_pdfs_from_webpage

# Merge chapters in numeric order (chapter_1.pdf, chapter_2.pdf, chapter_10.pdf)
download_pdfs_from_webpage(
    url='https://example.com/book',
    mode='merge',
    write_dir='output',
    sort_by='numeric',
    output_name='complete_book.pdf'
)

# Custom sort key function
from fetcharoo import process_pdfs, find_pdfs_from_webpage

pdf_urls = find_pdfs_from_webpage('https://example.com')
process_pdfs(
    pdf_urls,
    write_dir='output',
    mode='merge',
    sort_key=lambda url: url.split('/')[-1]  # Sort by filename
)

Using ProcessResult

from fetcharoo import download_pdfs_from_webpage

# Get detailed results from download operation
result = download_pdfs_from_webpage(
    url='https://example.com',
    mode='separate',
    write_dir='output'
)

# ProcessResult provides detailed information
print(f"Success: {result.success}")
print(f"Downloaded: {result.downloaded_count}")
print(f"Failed: {result.failed_count}")
print(f"Files created: {result.files_created}")
print(f"Errors: {result.errors}")

# ProcessResult is truthy when successful
if result:
    print("Download completed!")

Finding PDFs Without Downloading

from fetcharoo import find_pdfs_from_webpage

# Just get the list of PDF URLs (deduplicated by default)
pdf_urls = find_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1
)

for url in pdf_urls:
    print(url)

Custom User-Agent

from fetcharoo import download_pdfs_from_webpage, set_default_user_agent

# Set a global default User-Agent
set_default_user_agent('MyCompanyBot/1.0 (contact@example.com)')

# Or use per-request User-Agent
download_pdfs_from_webpage(
    url='https://example.com',
    user_agent='SpecificBot/2.0'
)

API Reference

download_pdfs_from_webpage()

Main function to find and download PDFs from a webpage.

Parameter Type Default Description
url str required The webpage URL to search
recursion_depth int 0 How many levels of links to follow (max 5)
mode str 'separate' 'merge' or 'separate'
write_dir str 'output' Output directory for PDFs
allowed_domains set None Restrict crawling to these domains
request_delay float 0.5 Seconds between requests
timeout int 30 Request timeout in seconds
respect_robots bool False Whether to respect robots.txt
user_agent str None Custom User-Agent (uses default if None)
dry_run bool False Preview URLs without downloading
show_progress bool False Show progress bars
filter_config FilterConfig None PDF filtering configuration
sort_by str None Sort strategy: 'numeric', 'alpha', 'alpha_desc', 'none'
sort_key callable None Custom sort key function
output_name str None Custom filename for merged PDF

Returns: ProcessResult object with download statistics, or dict in dry-run mode.
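
Several of these parameters can be combined in a single call; for example, using values from the table above (all values are illustrative):

from fetcharoo import download_pdfs_from_webpage, FilterConfig

result = download_pdfs_from_webpage(
    url='https://example.com/reports',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    allowed_domains={'example.com'},
    respect_robots=True,
    user_agent='MyBot/1.0',
    filter_config=FilterConfig(filename_include=['*.pdf'], min_size=10000),
    sort_by='alpha',
    output_name='reports.pdf',
    show_progress=True
)
print(f"Downloaded {result.downloaded_count} PDFs")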

find_pdfs_from_webpage()

Find PDF URLs without downloading.

Parameter Type Default Description
url str required The webpage URL to search
recursion_depth int 0 How many levels of links to follow
deduplicate bool True Remove duplicate PDF URLs
... (plus other parameters from above)

process_pdfs()

Download and save a list of PDF URLs.

Parameter Type Default Description
pdf_links list required List of PDF URLs to download
write_dir str required Output directory
mode str 'separate' 'merge' or 'separate'
sort_by str None Sort strategy for merging
sort_key callable None Custom sort key function
output_name str None Custom merged filename

Returns: ProcessResult object with download statistics.

ProcessResult

Dataclass returned by download operations:

from fetcharoo import ProcessResult

# result below is a ProcessResult returned by download_pdfs_from_webpage() or process_pdfs()
# Attributes:
result.success           # bool: True if any PDFs were processed
result.files_created     # List[str]: Paths to created files
result.downloaded_count  # int: Number of successful downloads
result.filtered_count    # int: Number of PDFs filtered out
result.failed_count      # int: Number of failed downloads
result.errors            # List[str]: Error messages

# ProcessResult is truthy when successful:
if result:
    print("Success!")

FilterConfig

Configuration for PDF filtering:

from fetcharoo import FilterConfig

config = FilterConfig(
    filename_include=['*.pdf'],      # Patterns to include
    filename_exclude=['*draft*'],    # Patterns to exclude
    url_include=['*/reports/*'],     # URL patterns to include
    url_exclude=['*/temp/*'],        # URL patterns to exclude
    min_size=1000,                   # Minimum size in bytes
    max_size=100000000               # Maximum size in bytes
)

Utility Functions

  • merge_pdfs() - Merge multiple PDF documents
  • is_valid_url() - Validate URL format and scheme
  • is_safe_domain() - Check if domain is allowed
  • sanitize_filename() - Prevent path traversal attacks
  • check_robots_txt() - Check robots.txt permissions
  • set_default_user_agent() - Set default User-Agent
  • get_default_user_agent() - Get current default User-Agent
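
A short sketch using a few of these together; the exact signatures here are assumptions based on the names and descriptions above, so check the source for details:

from fetcharoo import is_valid_url, sanitize_filename, set_default_user_agent

# set_default_user_agent is shown earlier in this README; the other two signatures are assumed
set_default_user_agent('MyBot/1.0 (contact@example.com)')

url = 'https://example.com/files/annual%20report.pdf'
if is_valid_url(url):  # assumed to accept only http/https URLs
    print(sanitize_filename('../../annual report.pdf'))  # assumed to return a safe filename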

Security Features

fetcharoo includes several security measures:

  • Domain restriction: Limit recursive crawling to specified domains (SSRF protection)
  • Path traversal protection: Sanitizes filenames to prevent directory escape
  • Rate limiting: Configurable delays between requests
  • Timeout handling: Prevents hanging on slow servers
  • URL validation: Only allows http/https schemes
  • robots.txt compliance: Optional respect for crawling rules
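
To illustrate the path traversal idea, here is a minimal standalone sketch using only the standard library (this is not fetcharoo's actual implementation) that reduces a URL-derived name to a safe basename:

import os
import re
from urllib.parse import unquote, urlparse

def safe_pdf_name(url):
    # Keep only the final path component of the URL path
    name = os.path.basename(unquote(urlparse(url).path))
    # Replace anything outside a conservative character whitelist
    name = re.sub(r'[^A-Za-z0-9._-]', '_', name)
    # Reject empty or dot/underscore-only names that could still be unsafe
    if not name.strip('._'):
        return 'download.pdf'
    return name

print(safe_pdf_name('https://example.com/files/../../etc/report%201.pdf'))  # report_1.pdf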

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a pull request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Developed by Mark A. Lifson, Ph.D.
