Add site-specific download schemas/recipes #10

@MALathon

Description

Summary

Create a registry of site-specific "recipes" or schemas that encapsulate the best way to download PDFs from different publishers/websites. This allows fetcharoo to intelligently handle different site structures without users needing to figure out the right parameters each time.

Motivation

Different sites have different PDF structures:

  • Springer: Chapters named 978-3-xxx_N.pdf, need numeric sorting
  • arXiv: Usually single PDF, straightforward
  • IEEE/ACM: May require authentication, have supplements
  • University sites: Wildly varying structures

Currently users must manually specify sort_by, include/exclude patterns, etc. A schema registry would provide sensible defaults that "just work."

Proposed Design

1. SiteSchema Base Class

from typing import Callable, List, Optional
import re

class SiteSchema:
    """Base class for site-specific download configurations.

    Subclasses override the class-level defaults below; callers can
    override individual options per instance, e.g. SiteSchema(request_delay=2.0).
    """
    name: str = ""
    url_pattern: str = ""  # Regex matched against the target URL
    
    # PDF discovery
    include_patterns: Optional[List[str]] = None
    exclude_patterns: Optional[List[str]] = None
    pdf_selector: Optional[str] = None  # CSS selector for PDF links
    
    # Ordering
    sort_by: Optional[str] = None  # 'numeric', 'alpha', 'alpha_desc', 'none'
    sort_key: Optional[Callable[[str], object]] = None
    
    # Output
    default_output_name: Optional[str] = None  # e.g., "{title}.pdf"
    
    # Behavior
    recommended_depth: int = 1
    request_delay: float = 1.0
    
    def __init__(self, **overrides):
        # Per-instance overrides of the class-level defaults
        for key, value in overrides.items():
            if not hasattr(type(self), key):
                raise TypeError(f"Unknown schema option: {key!r}")
            setattr(self, key, value)
    
    def matches(self, url: str) -> bool:
        return bool(re.match(self.url_pattern, url))
    
    def extract_metadata(self, url: str, html: str) -> dict:
        """Override to extract title, author, etc. from the page."""
        return {}
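The `matches` check is just an anchored regex test against the target URL. A minimal standalone sketch (the stand-in classes here are illustrative, not the final API):

```python
import re

# Minimal stand-in for the proposed SiteSchema, to illustrate URL matching.
class SiteSchema:
    name = ""
    url_pattern = ""

    def matches(self, url: str) -> bool:
        # re.match anchors at the start of the string
        return bool(re.match(self.url_pattern, url))

class SpringerBook(SiteSchema):
    name = "springer_book"
    url_pattern = r"https?://link\.springer\.com/book/.*"

springer = SpringerBook()
print(springer.matches("https://link.springer.com/book/10.1007/978-3-031-41026-0"))  # True
print(springer.matches("https://arxiv.org/abs/2301.00001"))  # False
```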

2. Built-in Schemas

# fetcharoo/schemas/springer.py
import re

from bs4 import BeautifulSoup

from fetcharoo.schemas import SiteSchema

class SpringerBook(SiteSchema):
    name = "springer_book"
    url_pattern = r"https?://link\.springer\.com/book/.*"
    
    include_patterns = ["*_*.pdf"]        # Chapter PDFs contain an underscore
    exclude_patterns = ["*bbm*", "*fm*"]  # Exclude front/back matter
    sort_by = "numeric"
    recommended_depth = 1
    request_delay = 1.0
    
    def sort_key(self, url):
        # Extract the chapter number from 978-3-xxx_N.pdf
        match = re.search(r'_(\d+)\.pdf$', url)
        return int(match.group(1)) if match else float('inf')
    
    def extract_metadata(self, url, html):
        # Extract the book title from the landing page
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.select_one('h1.c-article-title')
        return {'title': title.text.strip() if title else 'book'}

class SpringerChaptersOnly(SpringerBook):
    """Download only individual chapters, not the full-book PDF."""
    name = "springer_chapters"
    exclude_patterns = ["*bbm*", "*fm*", "*978-3-*-*-?.pdf"]  # Also exclude the full book

# fetcharoo/schemas/arxiv.py
from fetcharoo.schemas import SiteSchema

class ArxivPaper(SiteSchema):
    name = "arxiv"
    url_pattern = r"https?://arxiv\.org/(abs|pdf)/.*"
    sort_by = "none"
    recommended_depth = 0
    
    def extract_metadata(self, url, html):
        # Extract the paper ID and title
        ...
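The numeric sort key matters because lexicographic ordering puts `_10.pdf` before `_2.pdf`. A standalone sketch of the same logic (`springer_sort_key` is a hypothetical free-function version of `SpringerBook.sort_key`):

```python
import re

def springer_sort_key(url: str):
    # Chapter PDFs end in _N.pdf; URLs without a chapter number sort last.
    match = re.search(r'_(\d+)\.pdf$', url)
    return int(match.group(1)) if match else float('inf')

urls = [
    "https://link.springer.com/content/pdf/978-3-031-41026-0_10.pdf",
    "https://link.springer.com/content/pdf/978-3-031-41026-0_2.pdf",
    "https://link.springer.com/content/pdf/978-3-031-41026-0_1.pdf",
]

# A plain alphabetical sort would put chapter 10 before chapter 2;
# the numeric key restores reading order.
ordered = sorted(urls, key=springer_sort_key)
print([springer_sort_key(u) for u in ordered])  # [1, 2, 10]
```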

3. Schema Registry

# fetcharoo/schemas/registry.py
from typing import Dict, List, Optional

from fetcharoo.schemas import SiteSchema

_SCHEMAS: Dict[str, SiteSchema] = {}

def register_schema(schema: SiteSchema) -> None:
    """Register a schema instance in the global registry."""
    _SCHEMAS[schema.name] = schema

def get_schema(name: str) -> Optional[SiteSchema]:
    """Get a schema by name."""
    return _SCHEMAS.get(name)

def detect_schema(url: str) -> Optional[SiteSchema]:
    """Auto-detect a schema from the URL (first registered match wins)."""
    for schema in _SCHEMAS.values():
        if schema.matches(url):
            return schema
    return None

def list_schemas() -> List[str]:
    """List all registered schema names."""
    return list(_SCHEMAS.keys())

# Class decorator for easy registration
def schema(cls):
    register_schema(cls())
    return cls

4. Integration with Main API

from fetcharoo import download_pdfs_from_webpage

# Auto-detect schema from URL
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/10.1007/978-3-031-41026-0',
    schema='auto'  # Detects SpringerBook
)

# Explicitly use a schema
result = download_pdfs_from_webpage(
    url='https://link.springer.com/book/...',
    schema='springer_chapters'
)

# Use schema instance with overrides
from fetcharoo.schemas import SpringerBook
result = download_pdfs_from_webpage(
    url='...',
    schema=SpringerBook(request_delay=2.0)
)

# Schema parameters are defaults; explicit params override
result = download_pdfs_from_webpage(
    url='...',
    schema='springer_book',
    sort_by='alpha'  # Overrides schema's 'numeric'
)
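The precedence rule (schema values are defaults, explicit parameters win) could be implemented with a small merge step; `SchemaDefaults` and `resolve_options` below are hypothetical names sketching only that logic, not fetcharoo's actual internals:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical sketch of the precedence rule: explicit keyword
# arguments always override the schema's defaults.
@dataclass
class SchemaDefaults:
    sort_by: Optional[str] = None
    request_delay: float = 1.0

def resolve_options(schema: SchemaDefaults, **explicit) -> dict:
    options = asdict(schema)  # start from the schema's defaults
    # Only caller-supplied (non-None) values override
    options.update({k: v for k, v in explicit.items() if v is not None})
    return options

springer = SchemaDefaults(sort_by="numeric", request_delay=1.0)
print(resolve_options(springer)["sort_by"])                   # numeric
print(resolve_options(springer, sort_by="alpha")["sort_by"])  # alpha
```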

5. CLI Integration

# Auto-detect
fetcharoo https://link.springer.com/book/... --schema auto

# Explicit schema
fetcharoo https://link.springer.com/book/... --schema springer_chapters

# List available schemas
fetcharoo --list-schemas
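The flags above map naturally onto argparse; this is a hypothetical sketch of the wiring, and the real fetcharoo CLI may structure it differently:

```python
import argparse

# Hypothetical argparse wiring for --schema and --list-schemas.
parser = argparse.ArgumentParser(prog="fetcharoo")
parser.add_argument("url", nargs="?", help="Page URL to crawl for PDFs")
parser.add_argument("--schema", default="auto",
                    help="Schema name, or 'auto' to detect from the URL")
parser.add_argument("--list-schemas", action="store_true",
                    help="List registered schema names and exit")

args = parser.parse_args(
    ["https://link.springer.com/book/x", "--schema", "springer_chapters"])
print(args.schema)  # springer_chapters
```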

6. User-Defined Schemas

Users can add custom schemas in a config file:

# ~/.config/fetcharoo/schemas.yaml
my_university:
  url_pattern: "https?://library\\.myuni\\.edu/.*"
  include_patterns: ["*.pdf"]
  exclude_patterns: ["*thumbnail*", "*preview*"]
  sort_by: alpha
  request_delay: 2.0
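A loader could map each YAML entry onto a schema instance via keyword overrides. In this sketch, `parsed` stands in for the result of `yaml.safe_load()` on the config above, and the inlined `SiteSchema` is a minimal stand-in:

```python
# Minimal stand-in for SiteSchema: class-level defaults plus
# per-instance keyword overrides.
class SiteSchema:
    name = ""
    url_pattern = ""
    include_patterns = None
    exclude_patterns = None
    sort_by = None
    request_delay = 1.0

    def __init__(self, **overrides):
        for key, value in overrides.items():
            if not hasattr(type(self), key):
                raise TypeError(f"Unknown schema option: {key!r}")
            setattr(self, key, value)

parsed = {  # what yaml.safe_load() would produce for the config above
    "my_university": {
        "url_pattern": r"https?://library\.myuni\.edu/.*",
        "include_patterns": ["*.pdf"],
        "exclude_patterns": ["*thumbnail*", "*preview*"],
        "sort_by": "alpha",
        "request_delay": 2.0,
    }
}

schemas = {name: SiteSchema(name=name, **opts) for name, opts in parsed.items()}
print(schemas["my_university"].sort_by)  # alpha
```

Unknown keys raise a `TypeError`, which surfaces typos in user config files early.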

Or programmatically:

from fetcharoo.schemas import SiteSchema, schema

@schema
class MyUniversityLibrary(SiteSchema):
    name = "myuni"
    url_pattern = r"https?://library\.myuni\.edu/.*"
    sort_by = "alpha"

Implementation Plan

  1. Create SiteSchema base class
  2. Implement schema registry with auto-detection
  3. Add built-in schemas: Springer, arXiv, generic
  4. Integrate schema parameter into download_pdfs_from_webpage
  5. Add CLI --schema flag and --list-schemas
  6. Support YAML config for user schemas
  7. Add tests for schema matching and behavior
  8. Document schema system in README

Open Questions

  1. Should schemas be able to define custom PDF extraction logic (beyond CSS selectors)?
  2. How to handle authentication requirements (just document, or provide hooks)?
  3. Should we ship schemas for sites that may have legal restrictions?
  4. Version schemas separately to update without library release?
