Summary
Create a registry of site-specific "recipes" or schemas that encapsulate the best way to download PDFs from different publishers/websites. This allows fetcharoo to intelligently handle different site structures without users needing to figure out the right parameters each time.
Motivation
Different sites have different PDF structures:
- Springer: Chapters named 978-3-xxx_N.pdf, need numeric sorting
- arXiv: Usually single PDF, straightforward
- IEEE/ACM: May require authentication, have supplements
- University sites: Wildly varying structures
Currently users must manually specify sort_by, include/exclude patterns, etc. A schema registry would provide sensible defaults that "just work."
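For contrast, a minimal before/after sketch; the include_patterns and exclude_patterns parameter names in the "today" call are assumptions about the current API, not confirmed signatures:

from fetcharoo import download_pdfs_from_webpage

# Today: the caller must know the site's quirks up front
# (parameter names assumed for illustration)
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/10.1007/978-3-031-41026-0',
    include_patterns=['*_*.pdf'],
    exclude_patterns=['*bbm*', '*fm*'],
    sort_by='numeric',
)

# With a schema registry: the same defaults come from the matched schema
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/10.1007/978-3-031-41026-0',
    schema='auto',
)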
Proposed Design
1. SiteSchema Base Class
from dataclasses import dataclass
from typing import Callable, List, Optional
import re


@dataclass
class SiteSchema:
    """Base class for site-specific download configurations."""
    name: str
    url_pattern: str  # Regex to match URLs

    # PDF discovery
    include_patterns: Optional[List[str]] = None
    exclude_patterns: Optional[List[str]] = None
    pdf_selector: Optional[str] = None  # CSS selector for PDF links

    # Ordering
    sort_by: Optional[str] = None  # 'numeric', 'alpha', 'alpha_desc', 'none'
    sort_key: Optional[Callable] = None

    # Output
    default_output_name: Optional[str] = None  # e.g., "{title}.pdf"

    # Behavior
    recommended_depth: int = 1
    request_delay: float = 1.0

    def matches(self, url: str) -> bool:
        return bool(re.match(self.url_pattern, url))

    def extract_metadata(self, url: str, html: str) -> dict:
        """Override to extract title, author, etc. from page."""
        return {}
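As a quick sanity check of the base class above, a minimal sketch that instantiates a generic schema directly; the name and pattern are made up for illustration:

generic = SiteSchema(
    name="generic",
    url_pattern=r"https?://.*",
    include_patterns=["*.pdf"],
    sort_by="alpha",
)
assert generic.matches("https://example.com/papers/")
assert generic.extract_metadata("https://example.com/papers/", "<html></html>") == {}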
2. Built-in Schemas

# fetcharoo/schemas/springer.py
import re

from bs4 import BeautifulSoup

from fetcharoo.schemas import SiteSchema


class SpringerBook(SiteSchema):
    name = "springer_book"
    url_pattern = r"https?://link\.springer\.com/book/.*"
    include_patterns = ["*_*.pdf"]  # Chapter PDFs have underscore
    exclude_patterns = ["*bbm*", "*fm*"]  # Exclude front/back matter
    sort_by = "numeric"
    recommended_depth = 1
    request_delay = 1.0

    def sort_key(self, url):
        # Extract chapter number from 978-3-xxx_N.pdf
        match = re.search(r'_(\d+)\.pdf$', url)
        return int(match.group(1)) if match else float('inf')

    def extract_metadata(self, url, html):
        # Extract book title from page
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.select_one('h1.c-article-title')
        return {'title': title.text if title else 'book'}


class SpringerChaptersOnly(SpringerBook):
    """Download only individual chapters, not the full-book PDF."""
    name = "springer_chapters"
    exclude_patterns = ["*bbm*", "*fm*", "*978-3-*-*-?.pdf"]  # Exclude full book
# fetcharoo/schemas/arxiv.py
from fetcharoo.schemas import SiteSchema


class ArxivPaper(SiteSchema):
    name = "arxiv"
    url_pattern = r"https?://arxiv\.org/(abs|pdf)/.*"
    sort_by = "none"
    recommended_depth = 0

    def extract_metadata(self, url, html):
        # Extract paper ID and title
        ...
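The body of extract_metadata is left open in the proposal; one hedged way to fill it, covering only new-style arXiv IDs, might look like this (the arxiv_id key is an assumption):

import re

def extract_metadata(self, url, html):
    # New-style IDs are the path segment after /abs/ or /pdf/,
    # e.g. https://arxiv.org/abs/2401.12345 -> "2401.12345"
    match = re.search(r'arxiv\.org/(?:abs|pdf)/([^/?#]+?)(?:\.pdf)?$', url)
    return {'arxiv_id': match.group(1)} if match else {}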
3. Schema Registry

# fetcharoo/schemas/registry.py
from typing import Dict, List, Optional

from fetcharoo.schemas import SiteSchema

_SCHEMAS: Dict[str, SiteSchema] = {}


def register_schema(schema: SiteSchema):
    """Register a schema in the global registry."""
    _SCHEMAS[schema.name] = schema


def get_schema(name: str) -> Optional[SiteSchema]:
    """Get schema by name."""
    return _SCHEMAS.get(name)


def detect_schema(url: str) -> Optional[SiteSchema]:
    """Auto-detect schema from URL."""
    for schema in _SCHEMAS.values():
        if schema.matches(url):
            return schema
    return None


def list_schemas() -> List[str]:
    """List all registered schema names."""
    return list(_SCHEMAS.keys())


# Decorator for easy registration
def schema(cls):
    register_schema(cls())
    return cls
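A short usage sketch of the registry functions above; the registered instance and the URL are illustrative:

# Register an instance, then resolve it by name or URL
register_schema(SiteSchema(
    name="arxiv",
    url_pattern=r"https?://arxiv\.org/(abs|pdf)/.*",
    sort_by="none",
    recommended_depth=0,
))

print(list_schemas())                                # ['arxiv']
found = detect_schema("https://arxiv.org/abs/2401.12345")
print(found.name if found else "no schema matched")  # arxiv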
4. Integration with Main API

from fetcharoo import download_pdfs_from_webpage
# Auto-detect schema from URL
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/10.1007/978-3-031-41026-0',
    schema='auto'  # Detects SpringerBook
)

# Explicitly use a schema
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/...',
    schema='springer_chapters'
)

# Use schema instance with overrides
from fetcharoo.schemas import SpringerBook
result = download_pdfs_from_webpage(
    url='...',
    schema=SpringerBook(request_delay=2.0)
)

# Schema parameters are defaults; explicit params override
result = download_pdfs_from_webpage(
    url='...',
    schema='springer_book',
    sort_by='alpha'  # Overrides schema's 'numeric'
)
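One hedged way to implement the "schema defaults, explicit overrides" rule inside download_pdfs_from_webpage; _resolve_options is a hypothetical helper, not part of the current code:

from typing import Optional

def _resolve_options(schema: Optional[SiteSchema], **explicit) -> dict:
    # Hypothetical helper: schema values seed the options, then any
    # explicitly passed (non-None) keyword argument wins.
    options = {}
    if schema is not None:
        options.update(
            include_patterns=schema.include_patterns,
            exclude_patterns=schema.exclude_patterns,
            sort_by=schema.sort_by,
            request_delay=schema.request_delay,
        )
    options.update({k: v for k, v in explicit.items() if v is not None})
    return options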
5. CLI Integration

# Auto-detect
fetcharoo https://link.springer.com/book/... --schema auto
# Explicit schema
fetcharoo https://link.springer.com/book/... --schema springer_chapters
# List available schemas
fetcharoo --list-schemas
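A hedged sketch of the flag wiring, assuming an argparse-based entry point (the parser below is illustrative, not fetcharoo's actual CLI code):

import argparse

from fetcharoo.schemas.registry import list_schemas

parser = argparse.ArgumentParser(prog='fetcharoo')
parser.add_argument('url', nargs='?')
parser.add_argument('--schema', default='auto',
                    help="Schema name, or 'auto' to detect from the URL")
parser.add_argument('--list-schemas', action='store_true',
                    help='Print registered schema names and exit')

args = parser.parse_args()
if args.list_schemas:
    print('\n'.join(list_schemas()))
    raise SystemExit(0)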
6. User-Defined Schemas

Users can add custom schemas in a config file:
# ~/.config/fetcharoo/schemas.yaml
my_university:
  url_pattern: "https?://library\\.myuni\\.edu/.*"
  include_patterns: ["*.pdf"]
  exclude_patterns: ["*thumbnail*", "*preview*"]
  sort_by: alpha
  request_delay: 2.0
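A hedged sketch of loading that file into the registry; load_user_schemas is a hypothetical helper and assumes PyYAML is available:

from pathlib import Path

import yaml

from fetcharoo.schemas import SiteSchema, register_schema

def load_user_schemas(path=Path.home() / '.config/fetcharoo/schemas.yaml'):
    # Hypothetical loader: each top-level key becomes a schema name,
    # and its mapping is passed straight through to the SiteSchema fields.
    if not path.exists():
        return
    for name, options in (yaml.safe_load(path.read_text()) or {}).items():
        register_schema(SiteSchema(name=name, **options))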
Or programmatically:

from fetcharoo.schemas import SiteSchema, schema

@schema
class MyUniversityLibrary(SiteSchema):
    name = "myuni"
    url_pattern = r"https://library\.myuni\.edu/.*"
    sort_by = "alpha"
Implementation Plan

- Create SiteSchema base dataclass
- Implement schema registry with auto-detection
- Add built-in schemas: Springer, arXiv, generic
- Integrate schema parameter into download_pdfs_from_webpage
- Add CLI --schema flag and --list-schemas
- Support YAML config for user schemas
- Add tests for schema matching and behavior
- Document schema system in README
Open Questions
- Should schemas be able to define custom PDF extraction logic (beyond CSS selectors)?
- How to handle authentication requirements (just document, or provide hooks)?
- Should we ship schemas for sites that may have legal restrictions?
- Should schemas be versioned separately so they can be updated without a library release?
Related
- Builds on #3 Add sort_key parameter for merge mode ordering (sort ordering), #4 Deduplicate PDF URLs by default (deduplication), and #5 Add custom output filename for merge mode (output naming)