A Python library for downloading PDF files from webpages with support for recursive link following, PDF merging, and security hardening.
- Download PDF files from a specified webpage
- Recursive crawling with configurable depth (up to 5 levels)
- Merge downloaded PDFs into a single file or save separately
- Smart merge ordering: Sort PDFs numerically, alphabetically, or with custom sort keys
- Automatic deduplication: Remove duplicate PDF URLs across pages
- Custom output filenames: Name your merged PDF files
- Rich result reporting: Get detailed download statistics with `ProcessResult`
- Command-line interface for quick downloads
- Quiet/verbose modes: Control output verbosity with `-q` and `-v` flags
- robots.txt compliance for ethical web crawling
- Custom User-Agent support
- Dry-run mode to preview downloads
- Progress bars with tqdm integration
- PDF filtering by filename, URL patterns, and size
- Security hardening: Domain restriction, path traversal protection, rate limiting
- Configurable timeouts and request delays
- Python 3.10 or higher
- Dependencies: `requests`, `pymupdf`, `beautifulsoup4`, `tqdm`
```bash
# From PyPI
pip install fetcharoo

# Latest from GitHub
pip install git+https://github.com/MALathon/fetcharoo.git

# With Poetry
poetry add fetcharoo

# From source
git clone https://github.com/MALathon/fetcharoo.git
cd fetcharoo
poetry install
```

fetcharoo includes a CLI for quick PDF downloads:
```bash
# Download PDFs from a webpage
fetcharoo https://example.com

# Download with recursion and merge into one file
fetcharoo https://example.com -d 2 -m

# Merge with custom output filename and numeric sorting
fetcharoo https://example.com -m --output-name "textbook.pdf" --sort-by numeric

# List PDFs without downloading (dry run)
fetcharoo https://example.com --dry-run

# Download with custom options
fetcharoo https://example.com -o my_pdfs --delay 1.0 --progress

# Filter PDFs by pattern
fetcharoo https://example.com --include "report*.pdf" --exclude "*draft*"

# Quiet mode (less output) or verbose mode (more output)
fetcharoo https://example.com -q    # Quieter
fetcharoo https://example.com -qq   # Even quieter
fetcharoo https://example.com -v    # More verbose
fetcharoo https://example.com -vv   # Debug level
```

| Option | Description |
|---|---|
| `-o, --output DIR` | Output directory (default: `output`) |
| `-d, --depth N` | Recursion depth (default: 0) |
| `-m, --merge` | Merge all PDFs into a single file |
| `--output-name FILENAME` | Custom filename for merged PDF (with `--merge`) |
| `--sort-by STRATEGY` | Sort PDFs before merging: `numeric`, `alpha`, `alpha_desc`, `none` |
| `--dry-run` | List PDFs without downloading |
| `--delay SECONDS` | Delay between requests (default: 0.5) |
| `--timeout SECONDS` | Request timeout (default: 30) |
| `--user-agent STRING` | Custom User-Agent string |
| `--respect-robots` | Respect robots.txt rules |
| `--progress` | Show progress bars |
| `-q, --quiet` | Reduce output verbosity (use `-qq` for even quieter) |
| `-v, --verbose` | Increase output verbosity (use `-vv` for debug) |
| `--include PATTERN` | Include PDFs matching pattern |
| `--exclude PATTERN` | Exclude PDFs matching pattern |
| `--min-size BYTES` | Minimum PDF size |
| `--max-size BYTES` | Maximum PDF size |
```python
from fetcharoo import download_pdfs_from_webpage

# Download PDFs from a webpage and merge them into a single file
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    mode='merge',
    write_dir='output'
)
```

```python
from fetcharoo import download_pdfs_from_webpage

# Download and save PDFs as separate files
download_pdfs_from_webpage(
    url='https://example.com/documents',
    recursion_depth=0,  # Only search the specified page
    mode='separate',
    write_dir='downloads'
)
```

```python
from fetcharoo import download_pdfs_from_webpage

# Respect robots.txt rules
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    respect_robots=True,
    user_agent='MyBot/1.0'
)
```

```python
from fetcharoo import download_pdfs_from_webpage

# Preview what would be downloaded
result = download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    dry_run=True
)
print(f"Found {result['count']} PDFs:")
for url in result['urls']:
    print(f"  - {url}")
```

```python
from fetcharoo import download_pdfs_from_webpage

# Show progress during download
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    write_dir='output',
    show_progress=True
)
```

```python
from fetcharoo import download_pdfs_from_webpage, FilterConfig

# Filter by filename patterns and size
filter_config = FilterConfig(
    filename_include=['report*.pdf', 'annual*.pdf'],
    filename_exclude=['*draft*', '*temp*'],
    min_size=10000,      # 10KB minimum
    max_size=50000000    # 50MB maximum
)
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    write_dir='output',
    filter_config=filter_config
)
```

```python
from fetcharoo import download_pdfs_from_webpage

# Restrict crawling to specific domains
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    allowed_domains={'example.com', 'docs.example.com'},
    request_delay=1.0,  # 1 second between requests
    timeout=60          # 60 second timeout
)
```

```python
from fetcharoo import download_pdfs_from_webpage

# Merge chapters in numeric order (chapter_1.pdf, chapter_2.pdf, chapter_10.pdf)
download_pdfs_from_webpage(
    url='https://example.com/book',
    mode='merge',
    write_dir='output',
    sort_by='numeric',
    output_name='complete_book.pdf'
)

# Custom sort key function
from fetcharoo import process_pdfs, find_pdfs_from_webpage

pdf_urls = find_pdfs_from_webpage('https://example.com')
process_pdfs(
    pdf_urls,
    write_dir='output',
    mode='merge',
    sort_key=lambda url: url.split('/')[-1]  # Sort by filename
)
```

```python
from fetcharoo import download_pdfs_from_webpage

# Get detailed results from download operation
result = download_pdfs_from_webpage(
    url='https://example.com',
    mode='separate',
    write_dir='output'
)

# ProcessResult provides detailed information
print(f"Success: {result.success}")
print(f"Downloaded: {result.downloaded_count}")
print(f"Failed: {result.failed_count}")
print(f"Files created: {result.files_created}")
print(f"Errors: {result.errors}")

# ProcessResult is truthy when successful
if result:
    print("Download completed!")
```

```python
from fetcharoo import find_pdfs_from_webpage

# Just get the list of PDF URLs (deduplicated by default)
pdf_urls = find_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1
)
for url in pdf_urls:
    print(url)
```

```python
from fetcharoo import download_pdfs_from_webpage, set_default_user_agent

# Set a global default User-Agent
set_default_user_agent('MyCompanyBot/1.0 (contact@example.com)')

# Or use a per-request User-Agent
download_pdfs_from_webpage(
    url='https://example.com',
    user_agent='SpecificBot/2.0'
)
```

`download_pdfs_from_webpage()`: Main function to find and download PDFs from a webpage.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | str | required | The webpage URL to search |
| `recursion_depth` | int | 0 | How many levels of links to follow (max 5) |
| `mode` | str | 'separate' | 'merge' or 'separate' |
| `write_dir` | str | 'output' | Output directory for PDFs |
| `allowed_domains` | set | None | Restrict crawling to these domains |
| `request_delay` | float | 0.5 | Seconds between requests |
| `timeout` | int | 30 | Request timeout in seconds |
| `respect_robots` | bool | False | Whether to respect robots.txt |
| `user_agent` | str | None | Custom User-Agent (uses default if None) |
| `dry_run` | bool | False | Preview URLs without downloading |
| `show_progress` | bool | False | Show progress bars |
| `filter_config` | FilterConfig | None | PDF filtering configuration |
| `sort_by` | str | None | Sort strategy: 'numeric', 'alpha', 'alpha_desc', 'none' |
| `sort_key` | callable | None | Custom sort key function |
| `output_name` | str | None | Custom filename for merged PDF |
Returns: ProcessResult object with download statistics, or dict in dry-run mode.
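Because the return type differs between a normal run (a `ProcessResult`) and a dry run (a dict), here is a small sketch of handling both cases; the helper function name is illustrative, everything else uses the parameters and attributes documented above:

```python
from fetcharoo import download_pdfs_from_webpage

def preview_then_download(url: str, write_dir: str = 'output') -> None:
    """Illustrative helper: dry-run first, then download only if PDFs were found."""
    # Dry-run mode returns a plain dict with 'count' and 'urls' keys
    preview = download_pdfs_from_webpage(url=url, recursion_depth=1, dry_run=True)
    if preview['count'] == 0:
        print("No PDFs found; nothing to download.")
        return

    # A normal run returns a ProcessResult, which is truthy on success
    result = download_pdfs_from_webpage(
        url=url,
        recursion_depth=1,
        mode='separate',
        write_dir=write_dir,
    )
    if result:
        print(f"Downloaded {result.downloaded_count} PDFs to {write_dir}")
    else:
        print(f"Download failed: {result.errors}")
```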
`find_pdfs_from_webpage()`: Find PDF URLs without downloading.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | str | required | The webpage URL to search |
| `recursion_depth` | int | 0 | How many levels of links to follow |
| `deduplicate` | bool | True | Remove duplicate PDF URLs |
| ... | | | Plus the other parameters listed above |
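The `deduplicate` flag is not exercised in the usage examples above; here is a brief sketch (behavior assumed from the parameter description) comparing the deduplicated default against the raw link list:

```python
from fetcharoo import find_pdfs_from_webpage

# Default: duplicate PDF URLs found on multiple pages are removed
unique_links = find_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
)

# deduplicate=False keeps every occurrence, useful for auditing link coverage
all_links = find_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    deduplicate=False,
)
print(f"{len(all_links)} links found, {len(unique_links)} unique")
```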
`process_pdfs()`: Download and save a list of PDF URLs.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `pdf_links` | list | required | List of PDF URLs to download |
| `write_dir` | str | required | Output directory |
| `mode` | str | 'separate' | 'merge' or 'separate' |
| `sort_by` | str | None | Sort strategy for merging |
| `sort_key` | callable | None | Custom sort key function |
| `output_name` | str | None | Custom merged filename |
Returns: ProcessResult object with download statistics.
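For a list of URLs you already have (from `find_pdfs_from_webpage` or elsewhere), a sketch of merging them with a built-in sort strategy and a named output file, using only the parameters listed above; the URLs are placeholders:

```python
from fetcharoo import process_pdfs

pdf_links = [
    'https://example.com/docs/chapter_2.pdf',
    'https://example.com/docs/chapter_10.pdf',
    'https://example.com/docs/chapter_1.pdf',
]

# 'numeric' sorting merges chapter_1, chapter_2, chapter_10 in natural order
result = process_pdfs(
    pdf_links,
    write_dir='output',
    mode='merge',
    sort_by='numeric',
    output_name='chapters.pdf',
)
print(f"Created: {result.files_created}")
```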
`ProcessResult` is the dataclass returned by download operations:
```python
from fetcharoo import ProcessResult

# Attributes:
result.success           # bool: True if any PDFs were processed
result.files_created     # List[str]: Paths to created files
result.downloaded_count  # int: Number of successful downloads
result.filtered_count    # int: Number of PDFs filtered out
result.failed_count      # int: Number of failed downloads
result.errors            # List[str]: Error messages

# ProcessResult is truthy when successful:
if result:
    print("Success!")
```

`FilterConfig` holds the configuration for PDF filtering:
```python
from fetcharoo import FilterConfig

config = FilterConfig(
    filename_include=['*.pdf'],     # Patterns to include
    filename_exclude=['*draft*'],   # Patterns to exclude
    url_include=['*/reports/*'],    # URL patterns to include
    url_exclude=['*/temp/*'],       # URL patterns to exclude
    min_size=1000,                  # Minimum size in bytes
    max_size=100000000              # Maximum size in bytes
)
```

Utility functions (a short usage sketch follows this list):
- `merge_pdfs()`: Merge multiple PDF documents
- `is_valid_url()`: Validate URL format and scheme
- `is_safe_domain()`: Check if domain is allowed
- `sanitize_filename()`: Prevent path traversal attacks
- `check_robots_txt()`: Check robots.txt permissions
- `set_default_user_agent()`: Set default User-Agent
- `get_default_user_agent()`: Get current default User-Agent
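A sketch of a few of these helpers. `set_default_user_agent`/`get_default_user_agent` follow the usage shown earlier; the single-string argument shapes assumed for `is_valid_url` and `sanitize_filename` below are guesses and should be checked against the library's docstrings:

```python
from fetcharoo import (
    get_default_user_agent,
    is_valid_url,
    sanitize_filename,
    set_default_user_agent,
)

# Global default User-Agent, as in the earlier example
set_default_user_agent('MyCompanyBot/1.0 (contact@example.com)')
print(get_default_user_agent())

# Assumed signature: accepts a URL string; non-http/https schemes are rejected
print(is_valid_url('file:///etc/passwd'))         # expected: False
print(is_valid_url('https://example.com/a.pdf'))  # expected: True

# Assumed signature: accepts an untrusted filename and strips path traversal
print(sanitize_filename('../../etc/passwd.pdf'))
```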
fetcharoo includes several security measures (a combined example follows this list):
- Domain restriction: Limit recursive crawling to specified domains (SSRF protection)
- Path traversal protection: Sanitizes filenames to prevent directory escape
- Rate limiting: Configurable delays between requests
- Timeout handling: Prevents hanging on slow servers
- URL validation: Only allows http/https schemes
- robots.txt compliance: Optional respect for crawling rules
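Here is a combined sketch of the security-related options in one call, using only parameters from the API reference above; the domain set and delay value are illustrative:

```python
from fetcharoo import download_pdfs_from_webpage

result = download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='separate',
    write_dir='output',
    allowed_domains={'example.com'},  # SSRF protection: never crawl off-domain
    respect_robots=True,              # honor robots.txt rules
    request_delay=1.0,                # rate limiting between requests
    timeout=30,                       # avoid hanging on slow servers
    user_agent='MyCompanyBot/1.0 (contact@example.com)',
)
print(f"Downloaded: {result.downloaded_count}, errors: {result.errors}")
```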
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a pull request
This project is licensed under the MIT License. See the LICENSE file for details.
Developed by Mark A. Lifson, Ph.D.