Add built-in schemas for common sites (Springer, arXiv) #13

@MALathon

Summary

Create pre-built schemas for commonly used academic/document sites.

Schemas to Implement

1. Springer Books

# fetcharoo/schemas/springer.py
import re

from .base import SiteSchema
from .registry import schema

@schema
class SpringerBook(SiteSchema):
    name = "springer_book"
    url_pattern = r"https?://link\.springer\.com/book/10\.\d+/.*"
    description = "Springer book with chapters"
    
    include_patterns = ["*.pdf"]
    exclude_patterns = ["*bbm*", "*bfm*"]  # Back/front matter
    sort_by = "numeric"
    recommended_depth = 1
    request_delay = 1.0
    
    test_url = "https://link.springer.com/book/10.1007/978-3-031-41026-0"
    expected_min_pdfs = 5
    
    @staticmethod
    def sort_key(url: str) -> tuple:
        """Sort by chapter number in filename (e.g., 978-3-xxx_5.pdf)."""
        match = re.search(r'_(\d+)\.pdf$', url)
        # URLs without a chapter number sort after all numbered chapters.
        return (int(match.group(1)),) if match else (float('inf'),)
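
The effect of sort_key is easiest to see on a small list. The sketch below copies the function out of the class so it runs on its own; the example.org URLs are made up for illustration and are not real Springer links.

```python
import re


def sort_key(url: str) -> tuple:
    """Standalone copy of SpringerBook.sort_key from the snippet above."""
    match = re.search(r'_(\d+)\.pdf$', url)
    # URLs without a chapter number sort after all numbered chapters.
    return (int(match.group(1)),) if match else (float('inf'),)


# Hypothetical chapter URLs, for illustration only.
urls = [
    "https://example.org/content/pdf/book_10.pdf",
    "https://example.org/content/pdf/book_2.pdf",
    "https://example.org/content/pdf/book.pdf",
    "https://example.org/content/pdf/book_1.pdf",
]

# Numeric ordering: _1, _2, _10, then the URL with no chapter number.
ordered = sorted(urls, key=sort_key)
for url in ordered:
    print(url)
```

A plain string sort would put book_10.pdf before book_2.pdf, which is the case the numeric key is meant to avoid.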

2. arXiv Papers

# fetcharoo/schemas/arxiv.py
from .base import SiteSchema
from .registry import schema

@schema
class ArxivPaper(SiteSchema):
    name = "arxiv"
    url_pattern = r"https?://arxiv\.org/(abs|pdf)/\d+\.\d+.*"
    description = "arXiv preprint paper"
    
    sort_by = "none"
    recommended_depth = 0
    request_delay = 0.5
    
    test_url = "https://arxiv.org/abs/2301.07041"
    expected_min_pdfs = 1
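
A quick check of what this url_pattern accepts may be useful when writing tests. The sketch below applies the pattern with re.match, which is an assumption about how the registry will use it. The helper name matches_arxiv is hypothetical, and the old-style URL is only an example of the archive/number identifier form.

```python
import re

# Pattern copied from ArxivPaper.url_pattern above.
ARXIV_PATTERN = r"https?://arxiv\.org/(abs|pdf)/\d+\.\d+.*"


def matches_arxiv(url: str) -> bool:
    """Hypothetical helper: True if the URL matches the arXiv pattern."""
    return re.match(ARXIV_PATTERN, url) is not None


# New-style identifiers match on both the abstract and the PDF path.
new_abs = matches_arxiv("https://arxiv.org/abs/2301.07041")
new_pdf = matches_arxiv("https://arxiv.org/pdf/2301.07041")

# Old-style identifiers (archive/number) do not start with digits,
# so this pattern rejects them and they would fall through to the fallback.
old_style = matches_arxiv("https://arxiv.org/abs/hep-th/9901001")

print(new_abs, new_pdf, old_style)
```

Whether old-style identifiers should be covered is not stated in this issue; if they should, the pattern would need a second alternative.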

3. Generic/Fallback

# fetcharoo/schemas/generic.py
from .base import SiteSchema
from .registry import schema

@schema
class GenericSite(SiteSchema):
    name = "generic"
    url_pattern = r".*"  # Matches anything (lowest priority)
    description = "Generic fallback for unknown sites"
    
    sort_by = "none"  # Preserve discovery order
    recommended_depth = 0
    request_delay = 0.5
    
    # No test_url - this is the fallback

Directory Structure

fetcharoo/schemas/
├── __init__.py      # Exports all schemas
├── base.py          # SiteSchema dataclass
├── registry.py      # Registry functions
├── springer.py      # Springer schemas
├── arxiv.py         # arXiv schema
└── generic.py       # Generic fallback

Tasks

  • Create springer.py with SpringerBook schema
  • Create arxiv.py with ArxivPaper schema
  • Create generic.py with fallback schema
  • Auto-register all schemas on import
  • Ensure detection priority (specific before generic)
  • Add integration tests with mocked responses
  • Document each schema's behavior

Acceptance Criteria

  • detect_schema("https://link.springer.com/book/...") returns SpringerBook
  • detect_schema("https://arxiv.org/abs/...") returns ArxivPaper
  • detect_schema("https://random-site.com") returns GenericSite
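  • Each schema's sort_key works correctly; for SpringerBook, chapter PDFs sort numerically (..._2.pdf before ..._10.pdf) and URLs without a chapter number sort last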

Dependencies

Part of

Parent issue: #10
