@MALathon
Owner

Summary

Create the foundational SiteSchema dataclass for site-specific PDF download configurations. This is the first step in building the schema system outlined in #10.

Changes

New Files

  • fetcharoo/schemas/__init__.py - Package exports
  • fetcharoo/schemas/base.py - SiteSchema dataclass implementation
  • tests/test_schemas_base.py - 25 comprehensive tests

SiteSchema Features

from fetcharoo.schemas import SiteSchema

schema = SiteSchema(
    name='springer_book',
    url_pattern=r'https?://link\.springer\.com/book/.*',
    description='Springer book with chapters',
    include_patterns=['*.pdf'],
    exclude_patterns=['*bbm*', '*bfm*'],
    sort_by='numeric',
    recommended_depth=1,
    request_delay=1.0,
    test_url='https://link.springer.com/book/10.1007/978-3-031-41026-0',
    expected_min_pdfs=5
)

# URL matching
schema.matches('https://link.springer.com/book/10.1007/978-3-031-41026-0')  # True

# Get FilterConfig for integration
filter_config = schema.get_filter_config()
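
The PR does not show how `include_patterns` and `exclude_patterns` are evaluated once they reach `FilterConfig`. As a rough illustration only, the sketch below assumes glob-style matching on filenames via the standard library's `fnmatch` module. The helper name `filter_filenames` and the sample filenames are hypothetical and are not part of fetcharoo.

```python
from fnmatch import fnmatchcase
from typing import List


def filter_filenames(
    filenames: List[str],
    include_patterns: List[str],
    exclude_patterns: List[str],
) -> List[str]:
    """Keep names that match any include pattern and no exclude pattern.

    Assumption: an empty include list means "include everything".
    """
    kept = []
    for name in filenames:
        # fnmatchcase keeps the comparison case-sensitive on every platform.
        included = not include_patterns or any(
            fnmatchcase(name, pattern) for pattern in include_patterns
        )
        excluded = any(fnmatchcase(name, pattern) for pattern in exclude_patterns)
        if included and not excluded:
            kept.append(name)
    return kept


# Hypothetical filenames, filtered with the patterns from the example above.
candidates = ['chapter_1.pdf', '1bbm.pdf', 'cover.jpg']
selected = filter_filenames(candidates, ['*.pdf'], ['*bbm*', '*bfm*'])
print(selected)
```

Under these assumptions only `chapter_1.pdf` survives: `1bbm.pdf` is dropped by the `*bbm*` exclude pattern and `cover.jpg` never matches `*.pdf`.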

Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| `name` | `str` | Unique identifier |
| `url_pattern` | `str` | Regex pattern for URL matching |
| `description` | `str` | Human-readable description |
| `include_patterns` | `List[str]` | Filename patterns to include |
| `exclude_patterns` | `List[str]` | Filename patterns to exclude |
| `sort_by` | `str` | Sort strategy: `numeric`, `alpha`, `alpha_desc`, `none` |
| `sort_key` | `Callable` | Custom sort key function |
| `default_output_name` | `str` | Default merged filename |
| `recommended_depth` | `int` | Suggested recursion depth |
| `request_delay` | `float` | Suggested delay between requests |
| `test_url` | `str` | URL for validation testing |
| `expected_min_pdfs` | `int` | Minimum PDFs expected in validation |
| `version` | `str` | Schema version string |
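
For readers who want to see how these attributes could fit together, here is a minimal sketch of a dataclass with the same field names. It is not the code in `fetcharoo/schemas/base.py`: the class name `SiteSchemaSketch`, every default value, and the choice of `re.match` over `re.search` are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SiteSchemaSketch:
    # Field names follow the attribute table; all defaults are assumptions.
    name: str
    url_pattern: str
    description: str = ''
    include_patterns: List[str] = field(default_factory=list)
    exclude_patterns: List[str] = field(default_factory=list)
    sort_by: str = 'none'
    sort_key: Optional[Callable[[str], object]] = None
    default_output_name: Optional[str] = None
    recommended_depth: int = 0
    request_delay: float = 0.0
    test_url: Optional[str] = None
    expected_min_pdfs: int = 0
    version: str = '1.0.0'

    def __post_init__(self) -> None:
        # Compile once so repeated matches() calls reuse the same pattern object.
        self._compiled = re.compile(self.url_pattern)

    def matches(self, url: str) -> bool:
        # Assumption: the pattern is anchored at the start of the URL (re.match).
        return self._compiled.match(url) is not None


sketch = SiteSchemaSketch(
    name='springer_book',
    url_pattern=r'https?://link\.springer\.com/book/.*',
)
print(sketch.matches('https://link.springer.com/book/10.1007/978-3-031-41026-0'))
```

Compiling in `__post_init__` also means an invalid `url_pattern` raises `re.error` at construction time rather than on the first call to `matches`.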

Test Plan

  • 25 tests covering all functionality
  • All 269 tests pass (244 existing + 25 new)

Next Steps

This PR enables:

Closes

Closes #11

Create the foundational schema system for site-specific PDF download
configurations. The SiteSchema dataclass encapsulates:

- URL pattern matching with compiled regex
- PDF filtering patterns (include/exclude for filenames and URLs)
- Sort strategy configuration (numeric, alpha, alpha_desc, none)
- Custom sort key function support
- Recommended depth and request delay settings
- Validation settings (test_url, expected_min_pdfs)
- Version tracking for schema updates

Also includes:
- Integration with existing FilterConfig class
- Comprehensive test suite (25 tests)
- Full documentation and examples
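
The four sort strategies named above are not spelled out in this PR. One plausible reading, sketched below, treats `numeric` as ordering by the integers embedded in each filename and lets a custom sort key take precedence over `sort_by`. The helper names `numeric_key` and `sort_names` and the sample filenames are hypothetical.

```python
import re
from typing import Callable, List, Optional


def numeric_key(name: str) -> List[int]:
    # Assumption: "numeric" compares the integers found in the name, in order.
    return [int(chunk) for chunk in re.findall(r'\d+', name)]


def sort_names(
    names: List[str],
    sort_by: str = 'none',
    sort_key: Optional[Callable[[str], object]] = None,
) -> List[str]:
    # Assumption: a custom key, when given, overrides the named strategy.
    if sort_key is not None:
        return sorted(names, key=sort_key)
    if sort_by == 'numeric':
        return sorted(names, key=numeric_key)
    if sort_by == 'alpha':
        return sorted(names)
    if sort_by == 'alpha_desc':
        return sorted(names, reverse=True)
    if sort_by == 'none':
        return list(names)  # keep discovery order
    raise ValueError(f'unknown sort strategy: {sort_by!r}')


names = ['chapter10.pdf', 'chapter2.pdf', 'chapter1.pdf']
print(sort_names(names, 'numeric'))
print(sort_names(names, 'alpha'))
```

The two calls show why a numeric strategy is useful for chapter files: plain alphabetical order places `chapter10.pdf` before `chapter2.pdf`.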

Closes #11
@codecov-commenter
Codecov Report

❌ Patch coverage is 97.97980% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| tests/test_schemas_base.py | 97.27% | 4 Missing ⚠️ |

@MALathon MALathon merged commit 5ab1a67 into main Dec 16, 2025
4 checks passed
@MALathon MALathon deleted the feature/site-schema-base branch December 16, 2025 01:20
Development

Successfully merging this pull request may close these issues.

Create SiteSchema base dataclass