Summary
Create pre-built schemas for commonly used academic/document sites.
Schemas to Implement
1. Springer Books
# fetcharoo/schemas/springer.py
from .base import SiteSchema
from .registry import schema
import re

@schema
class SpringerBook(SiteSchema):
    name = "springer_book"
    url_pattern = r"https?://link\.springer\.com/book/10\.\d+/.*"
    description = "Springer book with chapters"
    include_patterns = ["*.pdf"]
    exclude_patterns = ["*bbm*", "*bfm*"]  # Back/front matter
    sort_by = "numeric"
    recommended_depth = 1
    request_delay = 1.0
    test_url = "https://link.springer.com/book/10.1007/978-3-031-41026-0"
    expected_min_pdfs = 5

    @staticmethod
    def sort_key(url: str) -> tuple:
        """Sort by chapter number in filename (e.g., 978-3-xxx_5.pdf)."""
        match = re.search(r'_(\d+)\.pdf$', url)
        # URLs without a chapter number sort last.
        return (int(match.group(1)),) if match else (float('inf'),)
2. arXiv Papers
# fetcharoo/schemas/arxiv.py
from .base import SiteSchema
from .registry import schema

@schema
class ArxivPaper(SiteSchema):
    name = "arxiv"
    url_pattern = r"https?://arxiv\.org/(abs|pdf)/\d+\.\d+.*"
    description = "arXiv preprint paper"
    sort_by = "none"
    recommended_depth = 0
    request_delay = 0.5
    test_url = "https://arxiv.org/abs/2301.07041"
    expected_min_pdfs = 1
3. Generic/Fallback
# fetcharoo/schemas/generic.py
from .base import SiteSchema
from .registry import schema

@schema
class GenericSite(SiteSchema):
    name = "generic"
    url_pattern = r".*"  # Matches anything (lowest priority)
    description = "Generic fallback for unknown sites"
    sort_by = "none"  # Preserve discovery order
    recommended_depth = 0
    request_delay = 0.5
    # No test_url - this is the fallback
Directory Structure
fetcharoo/schemas/
├── __init__.py # Exports all schemas
├── base.py # SiteSchema dataclass
├── registry.py # Registry functions
├── springer.py # Springer schemas
├── arxiv.py # arXiv schema
└── generic.py # Generic fallback
Tasks
- Create springer.py with SpringerBook schema
- Create arxiv.py with ArxivPaper schema
- Create generic.py with fallback schema
- Auto-register all schemas on import
- Ensure detection priority (specific before generic)
- Add integration tests with mocked responses
- Document each schema's behavior
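The detection-priority task could be approached roughly as follows. This is only a sketch: the `Schema` dataclass, the `priority` field, the `register` helper, and this body of `detect_schema` are hypothetical stand-ins, since the real `SiteSchema` and registry are specified in #11 and #12. It assumes higher priority values are tried first and that the generic fallback is given the lowest value.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Schema:
    """Hypothetical stand-in for SiteSchema with only the fields used here."""
    name: str
    url_pattern: str
    priority: int = 0  # assumed field: higher values are tried first


# Hypothetical in-memory registry; the real one belongs to issue #12.
_REGISTRY: List[Schema] = []


def register(schema: Schema) -> Schema:
    """Add a schema to the registry and return it unchanged."""
    _REGISTRY.append(schema)
    return schema


def detect_schema(url: str) -> Optional[Schema]:
    """Return the highest-priority schema whose url_pattern matches the URL."""
    for candidate in sorted(_REGISTRY, key=lambda s: s.priority, reverse=True):
        if re.match(candidate.url_pattern, url):
            return candidate
    return None


# The fallback is registered first on purpose, to show that registration
# order does not decide which schema wins.
register(Schema("generic", r".*", priority=-1))
register(Schema("springer_book", r"https?://link\.springer\.com/book/10\.\d+/.*"))
register(Schema("arxiv", r"https?://arxiv\.org/(abs|pdf)/\d+\.\d+.*"))

print(detect_schema("https://link.springer.com/book/10.1007/978-3-031-41026-0").name)
print(detect_schema("https://arxiv.org/abs/2301.07041").name)
print(detect_schema("https://random-site.com").name)
```

An explicit priority value keeps the result independent of import order, which matters if schemas are auto-registered when the package is imported. Registering the fallback last and scanning in registration order would be a simpler alternative, at the cost of making import order significant.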
Acceptance Criteria
- detect_schema("https://link.springer.com/book/...") returns SpringerBook
- detect_schema("https://arxiv.org/abs/...") returns ArxivPaper
- detect_schema("https://random-site.com") returns GenericSite
- Each schema's sort_key works correctly
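As an illustration of the sort_key criterion, the SpringerBook key function can be exercised on its own. The sketch below copies the regular expression from the Springer schema in this issue; the URLs are made up for the example and are not real Springer links.

```python
import re
from typing import List, Tuple


def sort_key(url: str) -> Tuple[float]:
    """Sort by the chapter number before '.pdf'; unnumbered URLs go last."""
    match = re.search(r"_(\d+)\.pdf$", url)
    return (int(match.group(1)),) if match else (float("inf"),)


# Made-up URLs for illustration only.
urls: List[str] = [
    "https://example.org/content/978-3-000-00000-0_10.pdf",
    "https://example.org/content/978-3-000-00000-0_2.pdf",
    "https://example.org/content/front-matter.pdf",
    "https://example.org/content/978-3-000-00000-0_1.pdf",
]

ordered = sorted(urls, key=sort_key)
for url in ordered:
    print(url)
```

Because the key is an integer rather than a string, chapter 10 sorts after chapter 2, which plain lexicographic ordering of the filenames would get wrong.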
Dependencies
- Create SiteSchema base dataclass #11 (SiteSchema base class)
- Implement schema registry with auto-detection #12 (Schema registry)
Part of
Parent issue: #10