Support user-defined schemas via YAML config file #19

@MALathon

Summary

Allow users to define custom schemas in a YAML configuration file without writing Python code.

Config File Locations

Search order (first match wins):

  1. Path passed explicitly (--schemas-config, or the config_path argument)
  2. Environment variable FETCHAROO_SCHEMAS_PATH
  3. ./fetcharoo.yaml (project-local)
  4. ~/.config/fetcharoo/schemas.yaml (user config)

YAML Schema Format

# ~/.config/fetcharoo/schemas.yaml

schemas:
  my_university:
    url_pattern: "https?://library\\.myuni\\.edu/.*"
    description: "My university library"
    include_patterns:
      - "*.pdf"
    exclude_patterns:
      - "*thumbnail*"
      - "*preview*"
    sort_by: alpha
    recommended_depth: 1
    request_delay: 2.0
    test_url: "https://library.myuni.edu/example"
    expected_min_pdfs: 1

  company_docs:
    url_pattern: "https://docs\\.mycompany\\.com/.*"
    description: "Internal company documentation"
    sort_by: numeric
    recommended_depth: 2
    # No test_url - won't be validated externally
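
Since the acceptance criteria ask for clear error messages, each entry could be checked before it is turned into a SiteSchema. The following is only a sketch: validate_schema_entry and ALLOWED_KEYS are hypothetical names, the key list is copied from the example above, and treating url_pattern as the only required key is an assumption based on the company_docs entry.

```python
import re

# Keys copied from the YAML example in this issue. Treating only
# 'url_pattern' as required is an assumption, not a confirmed rule.
ALLOWED_KEYS = {
    "url_pattern", "description", "include_patterns", "exclude_patterns",
    "sort_by", "recommended_depth", "request_delay", "test_url",
    "expected_min_pdfs",
}


def validate_schema_entry(name, entry):
    """Return a list of readable problems; an empty list means the entry looks fine."""
    if not isinstance(entry, dict):
        return [f"{name}: expected a mapping, got {type(entry).__name__}"]

    problems = []

    # url_pattern must be present and must compile as a regular expression.
    pattern = entry.get("url_pattern")
    if not isinstance(pattern, str):
        problems.append(f"{name}: 'url_pattern' is required and must be a string")
    else:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"{name}: 'url_pattern' is not a valid regex ({exc})")

    # Pattern lists are optional, but must be lists of strings when given.
    for key in ("include_patterns", "exclude_patterns"):
        value = entry.get(key, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{name}: '{key}' must be a list of strings")

    # Flag likely typos such as 'exclude_pattern' instead of 'exclude_patterns'.
    unknown = sorted(str(k) for k in set(entry) - ALLOWED_KEYS)
    if unknown:
        problems.append(f"{name}: unknown key(s): {', '.join(unknown)}")

    return problems
```

Collecting every problem instead of raising on the first one lets a user fix a config file in one pass.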

Implementation

# fetcharoo/schemas/config.py
import logging
import os
from pathlib import Path
from typing import Optional

import yaml

from .base import SiteSchema
from .registry import register_schema

# Used by init_user_schemas() below
logger = logging.getLogger(__name__)

CONFIG_LOCATIONS = [
    Path('./fetcharoo.yaml'),
    Path.home() / '.config' / 'fetcharoo' / 'schemas.yaml',
]

def load_user_schemas(config_path: Optional[str] = None) -> int:
    """
    Load schemas from YAML config file.
    
    Args:
        config_path: Explicit path, or searches default locations
        
    Returns:
        Number of schemas loaded
    """
    # Find config file
    if config_path:
        path = Path(config_path)
    elif os.environ.get('FETCHAROO_SCHEMAS_PATH'):
        path = Path(os.environ['FETCHAROO_SCHEMAS_PATH'])
    else:
        path = None
        for loc in CONFIG_LOCATIONS:
            if loc.exists():
                path = loc
                break
    
    if not path or not path.exists():
        return 0
    
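    # Parse YAML
    with open(path, encoding='utf-8') as f:
        config = yaml.safe_load(f)

    # An empty file, a non-mapping document, or an empty 'schemas' key means nothing to load
    if not isinstance(config, dict) or not isinstance(config.get('schemas'), dict):
        return 0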
    
    # Register schemas
    count = 0
    for name, schema_dict in config['schemas'].items():
        schema = SiteSchema(
            name=name,
            url_pattern=schema_dict['url_pattern'],
            description=schema_dict.get('description'),
            include_patterns=schema_dict.get('include_patterns', []),
            exclude_patterns=schema_dict.get('exclude_patterns', []),
            sort_by=schema_dict.get('sort_by'),
            recommended_depth=schema_dict.get('recommended_depth', 1),
            request_delay=schema_dict.get('request_delay', 0.5),
            test_url=schema_dict.get('test_url'),
            expected_min_pdfs=schema_dict.get('expected_min_pdfs', 1),
        )
        register_schema(schema)
        count += 1
    
    return count

# Auto-load on import (optional, can be explicit)
def init_user_schemas():
    """Called during package init to load user schemas."""
    try:
        count = load_user_schemas()
        if count > 0:
            logger.debug(f"Loaded {count} user-defined schema(s)")
    except Exception as e:
        logger.warning(f"Failed to load user schemas: {e}")

CLI Support

# Load schemas from specific file
fetcharoo --schemas-config ./my-schemas.yaml https://example.com

# Show where schemas were loaded from
fetcharoo --list-schemas -v
Available schemas:
  Built-in:
    springer_book        Springer book with chapters
    arxiv                arXiv preprint paper
  User-defined (~/.config/fetcharoo/schemas.yaml):
    my_university        My university library
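
The flags above could be wired up with argparse roughly as follows. This is a sketch, not the existing fetcharoo CLI: build_parser is a hypothetical name, the long form --verbose for -v is an assumption, and the positional URL is optional here only so that --list-schemas can be used without one.

```python
import argparse


def build_parser():
    """Build a parser covering only the flags discussed in this issue."""
    parser = argparse.ArgumentParser(prog="fetcharoo")
    # Optional so that '--list-schemas' works without a URL (assumption).
    parser.add_argument("url", nargs="?", help="Page to fetch PDFs from")
    parser.add_argument(
        "--schemas-config",
        metavar="PATH",
        default=None,
        help="Load user-defined schemas from this YAML file",
    )
    parser.add_argument(
        "--list-schemas",
        action="store_true",
        help="List available schemas and exit",
    )
    # '--verbose' as the long form of '-v' is an assumption.
    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="With --list-schemas, also show where each schema came from",
    )
    return parser
```

The parsed value of args.schemas_config would then be passed to load_user_schemas() as config_path.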

Tasks

  • Add pyyaml as optional dependency
  • Create config.py module
  • Implement load_user_schemas()
  • Add config search path logic
  • Support FETCHAROO_SCHEMAS_PATH env var
  • Add --schemas-config CLI argument
  • Show schema source in --list-schemas output
  • Add tests for YAML loading
  • Document config format in README
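
For the first task, one way to keep pyyaml optional is to guard the import and fail with an actionable message only when a config is actually parsed. A minimal sketch, assuming a hypothetical helper named parse_schema_config; it also wraps YAML syntax errors so the "clear error message" criterion has something concrete to show.

```python
try:
    import yaml  # provided by the optional PyYAML dependency
except ImportError:
    yaml = None


def parse_schema_config(text):
    """Parse YAML text and return its 'schemas' mapping (empty if absent)."""
    if yaml is None:
        raise RuntimeError(
            "User-defined schemas require PyYAML; install it with 'pip install pyyaml'"
        )
    try:
        data = yaml.safe_load(text)
    except yaml.YAMLError as exc:
        # PyYAML errors usually include the line and column of the problem.
        raise ValueError(f"Invalid YAML in schema config: {exc}") from exc

    if not isinstance(data, dict):
        return {}
    schemas = data.get("schemas")
    return schemas if isinstance(schemas, dict) else {}
```

With this shape, importing the package never fails just because PyYAML is missing; only users who supply a config file see the install hint.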

Acceptance Criteria

  • Schemas load from ~/.config/fetcharoo/schemas.yaml
  • Project-local fetcharoo.yaml takes precedence over the user config file
  • Environment variable overrides default locations
  • Invalid YAML shows clear error message
  • --list-schemas shows source of each schema
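
The precedence criteria are easier to test if path resolution is a pure function that receives the environment and directories as arguments. A sketch under that assumption; resolve_config_path and its parameters are hypothetical, and the ordering mirrors the Implementation section (explicit path, then environment variable, then project-local, then user config).

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_path(
    explicit: Optional[str],
    env: Mapping[str, str],
    cwd: Path,
    home: Path,
) -> Optional[Path]:
    """Return the config path that should be used, or None if there is none."""
    # 1. An explicit path (e.g. from --schemas-config) always wins.
    if explicit:
        return Path(explicit)

    # 2. The environment variable overrides the default locations.
    env_path = env.get("FETCHAROO_SCHEMAS_PATH")
    if env_path:
        return Path(env_path)

    # 3. Project-local file first, then the per-user config file.
    candidates = (
        cwd / "fetcharoo.yaml",
        home / ".config" / "fetcharoo" / "schemas.yaml",
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

In production code the call would be resolve_config_path(config_path, os.environ, Path.cwd(), Path.home()); tests can pass temporary directories and a plain dict instead.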

Dependencies

Part of

Parent issue: #10
