Summary
Allow users to define custom schemas in a YAML configuration file without writing Python code.
Config File Locations
Search order (first found wins):
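1. Path given by the FETCHAROO_SCHEMAS_PATH environment variable (overrides the default locations)
2. ./fetcharoo.yaml (project-local)
3. ~/.config/fetcharoo/schemas.yaml (user config)
An explicit path passed via --schemas-config bypasses this search.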
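The lookup can also be expressed as a small pure function, which makes the precedence easy to unit test. This is only a sketch: it follows the order used by the implementation in this issue (explicit path, then FETCHAROO_SCHEMAS_PATH, then the default locations), and the name resolve_config_path and its parameters are hypothetical, not part of fetcharoo.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Sequence


def resolve_config_path(
    explicit: Optional[str],
    env: Mapping[str, str],
    default_locations: Sequence[Path],
) -> Optional[Path]:
    """Return the config file to load, or None if no candidate applies.

    Hypothetical helper: the returned path is not checked for existence
    when it comes from an explicit argument or the environment variable,
    so the caller still has to handle a missing file.
    """
    # 1. An explicit path (e.g. from --schemas-config) always wins.
    if explicit:
        return Path(explicit)
    # 2. The environment variable overrides the default locations.
    env_path = env.get("FETCHAROO_SCHEMAS_PATH")
    if env_path:
        return Path(env_path)
    # 3. Otherwise use the first default location that exists.
    for location in default_locations:
        if location.exists():
            return location
    return None


if __name__ == "__main__":
    defaults = [
        Path("./fetcharoo.yaml"),
        Path.home() / ".config" / "fetcharoo" / "schemas.yaml",
    ]
    print(resolve_config_path(None, os.environ, defaults))
```

Passing the environment as a mapping instead of reading os.environ inside the function keeps tests free of global state.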
YAML Schema Format
# ~/.config/fetcharoo/schemas.yaml
schemas:
  my_university:
    url_pattern: "https?://library\\.myuni\\.edu/.*"
    description: "My university library"
    include_patterns:
      - "*.pdf"
    exclude_patterns:
      - "*thumbnail*"
      - "*preview*"
    sort_by: alpha
    recommended_depth: 1
    request_delay: 2.0
    test_url: "https://library.myuni.edu/example"
    expected_min_pdfs: 1
  company_docs:
    url_pattern: "https://docs\\.mycompany\\.com/.*"
    description: "Internal company documentation"
    sort_by: numeric
    recommended_depth: 2
    # No test_url - won't be validated externally
Implementation
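As background for the loader, here is a sketch of how the pattern fields in the YAML format might be interpreted. The issue does not say how SiteSchema applies them, so two things are assumptions: that url_pattern is a regular expression matched from the start of the URL, and that include_patterns and exclude_patterns are shell-style globs matched against file names. The helper names url_matches and filter_filenames are hypothetical.

```python
import re
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence

# Values from the my_university example. In a double-quoted YAML string
# "\\." is an escaped backslash followed by a dot, so the parsed value
# is the regex escape "\." shown here.
URL_PATTERN = r"https?://library\.myuni\.edu/.*"
INCLUDE_PATTERNS = ["*.pdf"]
EXCLUDE_PATTERNS = ["*thumbnail*", "*preview*"]


def url_matches(url: str, pattern: str = URL_PATTERN) -> bool:
    """Assumption: url_pattern is a regex anchored at the start of the URL."""
    return re.match(pattern, url) is not None


def filter_filenames(
    names: Iterable[str],
    include: Sequence[str] = INCLUDE_PATTERNS,
    exclude: Sequence[str] = EXCLUDE_PATTERNS,
) -> List[str]:
    """Assumption: a name is kept if it matches any include glob (or the
    include list is empty) and matches no exclude glob."""
    kept = []
    for name in names:
        included = not include or any(fnmatchcase(name, p) for p in include)
        excluded = any(fnmatchcase(name, p) for p in exclude)
        if included and not excluded:
            kept.append(name)
    return kept


if __name__ == "__main__":
    print(url_matches("https://library.myuni.edu/books/1"))
    print(filter_filenames(["ch1.pdf", "ch1_thumbnail.pdf", "notes.txt"]))
```

fnmatchcase is used instead of fnmatch so that matching is case-sensitive on every platform.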
# fetcharoo/schemas/config.py
import logging
import os
from pathlib import Path
from typing import Optional

import yaml

from .base import SiteSchema
from .registry import register_schema

logger = logging.getLogger(__name__)

# Default locations, searched in order (project-local first, then user config)
CONFIG_LOCATIONS = [
    Path('./fetcharoo.yaml'),
    Path.home() / '.config' / 'fetcharoo' / 'schemas.yaml',
]


def load_user_schemas(config_path: Optional[str] = None) -> int:
    """
    Load schemas from YAML config file.

    Args:
        config_path: Explicit path, or searches default locations

    Returns:
        Number of schemas loaded
    """
    # Find config file: explicit path, then env var, then default locations
    if config_path:
        path = Path(config_path)
    elif os.environ.get('FETCHAROO_SCHEMAS_PATH'):
        path = Path(os.environ['FETCHAROO_SCHEMAS_PATH'])
    else:
        path = None
        for loc in CONFIG_LOCATIONS:
            if loc.exists():
                path = loc
                break

    if not path or not path.exists():
        return 0

    # Parse YAML
    with open(path, encoding='utf-8') as f:
        config = yaml.safe_load(f)

    # An empty file, a non-mapping document, or an empty 'schemas' key loads nothing
    if not isinstance(config, dict) or not config.get('schemas'):
        return 0

    # Register schemas
    count = 0
    for name, schema_dict in config['schemas'].items():
        # url_pattern is the only required key; fail with a readable message
        if not isinstance(schema_dict, dict) or 'url_pattern' not in schema_dict:
            raise ValueError(
                f"Schema '{name}' in {path} must be a mapping with a 'url_pattern' key"
            )
        schema = SiteSchema(
            name=name,
            url_pattern=schema_dict['url_pattern'],
            description=schema_dict.get('description'),
            include_patterns=schema_dict.get('include_patterns', []),
            exclude_patterns=schema_dict.get('exclude_patterns', []),
            sort_by=schema_dict.get('sort_by'),
            recommended_depth=schema_dict.get('recommended_depth', 1),
            request_delay=schema_dict.get('request_delay', 0.5),
            test_url=schema_dict.get('test_url'),
            expected_min_pdfs=schema_dict.get('expected_min_pdfs', 1),
        )
        register_schema(schema)
        count += 1

    return count


# Auto-load on import (optional, can be explicit)
def init_user_schemas():
    """Called during package init to load user schemas."""
    try:
        count = load_user_schemas()
        if count > 0:
            logger.debug(f"Loaded {count} user-defined schema(s)")
    except Exception as e:
        # Never let a broken user config prevent the package from importing
        logger.warning(f"Failed to load user schemas: {e}")
CLI Support
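One possible way to wire up the flags used in this section, sketched with the standard-library argparse. How fetcharoo's existing CLI is built is not shown in this issue, so the parser layout, the attribute names, and the reading of -v as a verbosity switch are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags used in this issue."""
    parser = argparse.ArgumentParser(prog="fetcharoo")
    # The URL is optional so that --list-schemas can run without one.
    parser.add_argument("url", nargs="?", help="page to fetch PDFs from")
    parser.add_argument(
        "--schemas-config",
        metavar="PATH",
        default=None,
        help="load user-defined schemas from this YAML file",
    )
    parser.add_argument(
        "--list-schemas",
        action="store_true",
        help="list available schemas and exit",
    )
    # Assumption: -v asks for extra detail such as each schema's source.
    parser.add_argument("-v", dest="verbose", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # args.schemas_config would be passed to load_user_schemas(config_path=...)
    print(args)
```

Keeping --schemas-config as a plain string lets it be handed directly to load_user_schemas(), which already accepts an optional path.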
# Load schemas from specific file
fetcharoo --schemas-config ./my-schemas.yaml https://example.com
# Show where schemas were loaded from
fetcharoo --list-schemas -v
Available schemas:
  Built-in:
    springer_book    Springer book with chapters
    arxiv            arXiv preprint paper
  User-defined (~/.config/fetcharoo/schemas.yaml):
    my_university    My university library
Tasks
- Add pyyaml as optional dependency
- Create config.py module
- Implement load_user_schemas()
- Add config search path logic
- Support FETCHAROO_SCHEMAS_PATH env var
- Add --schemas-config CLI argument
- Show schema source in --list-schemas output
- Add tests for YAML loading
- Document config format in README
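For the task of showing each schema's source in --list-schemas, one option is to record a source label when a schema is registered and group the listing by that label. The sketch below only covers the formatting step. The function name format_schema_listing and the (source, name, description) tuples are hypothetical; the registry's real data model is defined in the dependency issues, not here.

```python
from typing import Dict, Iterable, List, Tuple

# (source label, schema name, description); hypothetical shape
Entry = Tuple[str, str, str]


def format_schema_listing(entries: Iterable[Entry]) -> str:
    """Group schemas by source and align descriptions in one column."""
    groups: Dict[str, List[Tuple[str, str]]] = {}
    for source, name, description in entries:
        # dicts preserve insertion order, so sources appear as first seen
        groups.setdefault(source, []).append((name, description))

    # Pad every name to the longest one so descriptions line up
    width = max(
        (len(name) for rows in groups.values() for name, _ in rows),
        default=0,
    )

    lines = ["Available schemas:"]
    for source, rows in groups.items():
        lines.append(f"  {source}:")
        for name, description in rows:
            lines.append(f"    {name.ljust(width)}  {description}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_schema_listing([
        ("Built-in", "springer_book", "Springer book with chapters"),
        ("Built-in", "arxiv", "arXiv preprint paper"),
        ("User-defined (~/.config/fetcharoo/schemas.yaml)",
         "my_university", "My university library"),
    ]))
```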
Acceptance Criteria
- Schemas load from ~/.config/fetcharoo/schemas.yaml
- Project-local fetcharoo.yaml takes precedence over the user config file
- Environment variable overrides default locations
- Invalid YAML shows clear error message
- --list-schemas shows source of each schema
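The "clear error message" criterion has two parts. Syntax errors surface from yaml.safe_load as yaml.YAMLError and can be reported with the file path. A file can also parse cleanly and still contain unusable entries, which is what the sketch below checks on the already-parsed data. Only the requirement that url_pattern be present comes from the implementation in this issue. The other checks, the message wording, and the name validate_schema_entry are assumptions.

```python
import re
from typing import Any, List


def validate_schema_entry(name: str, entry: Any) -> List[str]:
    """Return human-readable problems for one schema entry (empty if valid)."""
    if not isinstance(entry, dict):
        return [
            f"schema '{name}': expected a mapping, got {type(entry).__name__}"
        ]

    problems: List[str] = []

    # url_pattern is the only key the loader reads unconditionally
    pattern = entry.get("url_pattern")
    if not isinstance(pattern, str):
        problems.append(
            f"schema '{name}': 'url_pattern' is required and must be a string"
        )
    else:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(
                f"schema '{name}': 'url_pattern' is not a valid regex ({exc})"
            )

    # Assumption: both pattern lists must be lists of strings when present
    for key in ("include_patterns", "exclude_patterns"):
        value = entry.get(key, [])
        if not isinstance(value, list) or not all(
            isinstance(item, str) for item in value
        ):
            problems.append(f"schema '{name}': '{key}' must be a list of strings")

    # Assumption: request_delay is a non-negative number of seconds
    delay = entry.get("request_delay", 0.5)
    if isinstance(delay, bool) or not isinstance(delay, (int, float)) or delay < 0:
        problems.append(
            f"schema '{name}': 'request_delay' must be a non-negative number"
        )

    return problems
```

Collecting every problem before reporting lets a user fix the whole file in one pass instead of rerunning after each error.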
Dependencies
- Create SiteSchema base dataclass #11 (SiteSchema base class)
- Implement schema registry with auto-detection #12 (Schema registry)
- Add CLI support for schemas (--schema, --list-schemas) #15 (CLI schema support)
Part of
Parent issue: #10