Add remote schema registry for community schemas #20

@MALathon

Summary

Create a mechanism to fetch and update schemas from a remote registry, allowing schema fixes without library releases.

Design Overview

Remote Registry

Host community-maintained schemas at:

  • GitHub repo: MALathon/fetcharoo-schemas
  • Or JSON endpoint: https://fetcharoo.io/schemas/v1/registry.json

Registry Format

{
  "version": "1.0.0",
  "updated": "2025-01-15T00:00:00Z",
  "schemas": {
    "springer_book": {
      "version": "1.2.0",
      "url_pattern": "https?://link\\.springer\\.com/book/.*",
      "description": "Springer book with chapters",
      "include_patterns": ["*.pdf"],
      "exclude_patterns": ["*bbm*", "*bfm*"],
      "sort_by": "numeric",
      "recommended_depth": 1,
      "request_delay": 1.0,
      "test_url": "https://link.springer.com/book/10.1007/978-3-031-41026-0",
      "expected_min_pdfs": 5
    }
  }
}
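As a rough illustration of how `url_pattern` could drive schema selection (for example with `--schema auto`), the sketch below matches a URL against a trimmed copy of the registry above. The helper name `match_schema` is hypothetical and not part of fetcharoo; it assumes `url_pattern` is a Python-compatible regular expression anchored at the start of the URL.

```python
import re
from typing import Optional

# Trimmed copy of the registry example above
REGISTRY = {
    "version": "1.0.0",
    "schemas": {
        "springer_book": {
            "version": "1.2.0",
            "url_pattern": r"https?://link\.springer\.com/book/.*",
        }
    },
}


def match_schema(url: str, registry: dict) -> Optional[str]:
    """Return the name of the first schema whose url_pattern matches url.

    Hypothetical helper for illustration; not an existing fetcharoo function.
    """
    for name, entry in registry.get("schemas", {}).items():
        if re.match(entry["url_pattern"], url):
            return name
    return None


print(match_schema("https://link.springer.com/book/10.1007/978-3-031-41026-0", REGISTRY))
print(match_schema("https://example.com/paper.pdf", REGISTRY))
```

The first call prints `springer_book` and the second prints `None`. If several patterns can match the same URL, the registry would also need a rule for choosing between them, which this sketch does not address.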

Local Cache

~/.cache/fetcharoo/
├── schemas.json          # Cached remote schemas
└── schemas.meta.json     # Cache metadata (last updated, etag)
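A minimal sketch of how the metadata file might be used to decide whether the cache is still fresh. The function name `cache_is_fresh` is an assumption for illustration, and the sketch assumes `cached_at` is stored as a naive ISO 8601 timestamp. Missing or unreadable metadata is treated as stale so that a damaged cache triggers a refetch instead of an exception.

```python
import json
import tempfile
from datetime import datetime, timedelta
from pathlib import Path


def cache_is_fresh(meta_file: Path, max_age: timedelta, now: datetime) -> bool:
    """Return True only if meta_file holds a readable 'cached_at' newer than max_age."""
    try:
        meta = json.loads(meta_file.read_text())
        cached_at = datetime.fromisoformat(meta["cached_at"])
        return now - cached_at < max_age
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, invalid JSON, or malformed timestamp: treat as stale
        return False


# Demo against a throwaway directory instead of ~/.cache/fetcharoo/
with tempfile.TemporaryDirectory() as tmp:
    meta_file = Path(tmp) / "schemas.meta.json"
    now = datetime(2025, 1, 15)
    print(cache_is_fresh(meta_file, timedelta(days=7), now))  # metadata missing
    meta_file.write_text(json.dumps({"cached_at": "2025-01-10T00:00:00", "etag": None}))
    print(cache_is_fresh(meta_file, timedelta(days=7), now))  # five days old
```

Passing `now` explicitly keeps the check easy to test without patching the clock.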

Update Flow

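# fetcharoo/schemas/remote.py

import json
from datetime import datetime, timedelta
from pathlib import Path

import requests

# get_schema, register_schema, SiteSchema, and UpdateResult are assumed to come
# from the existing fetcharoo schema module.

REMOTE_REGISTRY_URL = "https://raw.githubusercontent.com/MALathon/fetcharoo-schemas/main/registry.json"
CACHE_DIR = Path.home() / '.cache' / 'fetcharoo'
CACHE_MAX_AGE = timedelta(days=7)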

def update_schemas(force: bool = False) -> UpdateResult:
    """
    Fetch latest schemas from remote registry.
    
    Args:
        force: Update even if cache is fresh
        
    Returns:
        UpdateResult with counts of new/updated schemas
    """
    cache_file = CACHE_DIR / 'schemas.json'
    meta_file = CACHE_DIR / 'schemas.meta.json'
    
    # Check cache freshness (both files must exist; otherwise fall through and refetch)
    if not force and cache_file.exists() and meta_file.exists():
        meta = json.loads(meta_file.read_text())
        cached_at = datetime.fromisoformat(meta['cached_at'])
        if datetime.now() - cached_at < CACHE_MAX_AGE:
            return UpdateResult(from_cache=True)
    
    # Fetch remote
    response = requests.get(REMOTE_REGISTRY_URL, timeout=30)
    response.raise_for_status()
    registry = response.json()
    
    # Save cache
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(registry))
    meta_file.write_text(json.dumps({
        'cached_at': datetime.now().isoformat(),
        'etag': response.headers.get('etag')
    }))
    
    # Register schemas
    new_count = 0
    updated_count = 0
    for name, schema_data in registry['schemas'].items():
        existing = get_schema(name)
        if existing:
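            # Compare versions numerically: as strings, '1.10.0' sorts below '1.2.0'.
            # Assumes plain dotted-integer versions such as '1.2.0'.
            remote_version = tuple(int(p) for p in schema_data.get('version', '0').split('.'))
            local_version = tuple(int(p) for p in existing.version.split('.'))
            if remote_version > local_version:
                # Update (re-register)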
                register_schema(SiteSchema(**schema_data), overwrite=True)
                updated_count += 1
        else:
            register_schema(SiteSchema(**schema_data))
            new_count += 1
    
    return UpdateResult(new=new_count, updated=updated_count)

def load_cached_schemas() -> int:
    """Load schemas from local cache without network request."""
    cache_file = CACHE_DIR / 'schemas.json'
    if not cache_file.exists():
        return 0
    
    registry = json.loads(cache_file.read_text())
    count = 0
    for name, schema_data in registry['schemas'].items():
        if not get_schema(name):  # Don't overwrite built-in
            register_schema(SiteSchema(**schema_data))
            count += 1
    return count
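The update flow stores the response `etag` in the cache metadata but never sends it back. One possible use, sketched below, is a conditional request with `If-None-Match` so that an unchanged registry is not downloaded again. The function `fetch_if_changed` is hypothetical; `http_get` stands for any callable with the same call shape as `requests.get`, and the sketch assumes the server supports ETag validation.

```python
from typing import Callable, Optional, Tuple


def fetch_if_changed(
    url: str, etag: Optional[str], http_get: Callable
) -> Tuple[Optional[dict], Optional[str]]:
    """Return (registry, new_etag), or (None, etag) on 304 Not Modified.

    Hypothetical helper; http_get is expected to behave like requests.get.
    """
    # Only send the validator when a previous ETag is known
    headers = {"If-None-Match": etag} if etag else {}
    response = http_get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None, etag  # cached copy is still current
    response.raise_for_status()
    return response.json(), response.headers.get("etag")
```

In `update_schemas()`, this could be called as `fetch_if_changed(REMOTE_REGISTRY_URL, meta.get('etag'), requests.get)`. A `None` registry would then mean "keep the cached file and only refresh `cached_at`".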

CLI Commands

# Update schemas from remote
$ fetcharoo --update-schemas
Fetching schemas from remote registry...
Updated 2 schemas, added 1 new schema.

# Force update (ignore cache)
$ fetcharoo --update-schemas --force

# Use remote schemas (auto-loads from cache)
$ fetcharoo --use-remote-schemas https://example.com --schema auto

Opt-in Behavior

Remote schemas should be opt-in:

# Explicit in code
from fetcharoo.schemas import load_remote_schemas
load_remote_schemas()  # Loads from cache, updates if stale

# Or via environment
FETCHAROO_USE_REMOTE_SCHEMAS=1 fetcharoo https://...

# Or CLI flag
fetcharoo --use-remote-schemas https://...
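The three opt-in routes above could be resolved in one place. The sketch below is one way to do that; the helper name `remote_schemas_enabled` and the set of accepted truthy values beyond `1` are assumptions, not something this issue specifies.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings; the issue itself only shows "1"
_TRUTHY = {"1", "true", "yes", "on"}


def remote_schemas_enabled(
    cli_flag: bool = False, environ: Optional[Mapping[str, str]] = None
) -> bool:
    """Return True if remote schemas were requested via CLI flag or environment.

    Hypothetical helper; defaults to False so the feature stays opt-in.
    """
    env = os.environ if environ is None else environ
    value = env.get("FETCHAROO_USE_REMOTE_SCHEMAS", "")
    return cli_flag or value.strip().lower() in _TRUTHY


print(remote_schemas_enabled(environ={}))
print(remote_schemas_enabled(environ={"FETCHAROO_USE_REMOTE_SCHEMAS": "1"}))
print(remote_schemas_enabled(cli_flag=True, environ={}))
```

This prints `False`, `True`, `True`. Accepting the environment as a parameter keeps the function testable without modifying `os.environ`.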

Community Contribution

Separate repo for schema contributions:

fetcharoo-schemas/
├── registry.json         # Combined registry
├── schemas/
│   ├── springer.yaml
│   ├── arxiv.yaml
│   ├── ieee.yaml
│   └── ...
├── tests/                # Schema validation tests
└── .github/
    └── workflows/
        └── validate.yml  # Auto-validate on PR
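As a sketch of what the checks in `tests/` might cover, the function below validates a single registry entry: required keys are present, `url_pattern` compiles, the pattern matches the entry's own `test_url`, and `expected_min_pdfs` is positive. The name `check_entry` and the set of required keys are assumptions for illustration.

```python
import re
from typing import List

# Assumed minimum set of keys; the issue does not define which are mandatory
REQUIRED_KEYS = {"version", "url_pattern", "test_url"}


def check_entry(name: str, entry: dict) -> List[str]:
    """Return a list of human-readable problems for one registry entry."""
    missing = REQUIRED_KEYS - set(entry)
    if missing:
        return [f"{name}: missing keys {sorted(missing)}"]
    try:
        pattern = re.compile(entry["url_pattern"])
    except re.error as exc:
        return [f"{name}: url_pattern does not compile ({exc})"]
    problems = []
    if not pattern.match(entry["test_url"]):
        problems.append(f"{name}: url_pattern does not match test_url")
    if entry.get("expected_min_pdfs", 1) < 1:
        problems.append(f"{name}: expected_min_pdfs must be at least 1")
    return problems


entry = {
    "version": "1.2.0",
    "url_pattern": r"https?://link\.springer\.com/book/.*",
    "test_url": "https://link.springer.com/book/10.1007/978-3-031-41026-0",
    "expected_min_pdfs": 5,
}
print(check_entry("springer_book", entry))
```

For the Springer entry from the registry example this prints `[]`. A CI workflow could run the same check over every entry and fail the PR when any list is non-empty.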

Tasks

  • Design registry JSON format
  • Create fetcharoo-schemas repository
  • Implement update_schemas() with caching
  • Implement load_cached_schemas()
  • Add --update-schemas CLI command
  • Add --use-remote-schemas flag
  • Support FETCHAROO_USE_REMOTE_SCHEMAS env var
  • Document contribution process
  • Add tests for remote loading

Acceptance Criteria

  • Can fetch schemas from remote URL
  • Caches schemas locally with TTL
  • --update-schemas refreshes cache
  • Built-in schemas take precedence over remote
  • Works offline with cached schemas
  • Clear opt-in mechanism (not auto-enabled)

Security Considerations

  • Only fetch from trusted URLs (configurable)
  • Validate schema format before registering
  • Don't execute arbitrary code from remote schemas
  • Consider signing/checksums for integrity
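Two of these points can be sketched with the standard library alone: an allow-list check on the registry URL and a SHA-256 comparison of the downloaded bytes. The allow-list contents, the function names, and the idea that an expected digest is published somewhere alongside the registry are all assumptions; the issue does not yet say how checksums would be distributed.

```python
import hashlib
from urllib.parse import urlparse

# Illustrative allow-list built from the hosts mentioned in this issue
TRUSTED_HOSTS = frozenset({"raw.githubusercontent.com", "fetcharoo.io"})


def is_trusted_url(url: str, trusted_hosts=TRUSTED_HOSTS) -> bool:
    """Accept only https URLs whose host is on the allow-list."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in trusted_hosts


def verify_sha256(payload: bytes, expected_hex: str) -> bool:
    """Compare the SHA-256 digest of payload with an expected hex digest."""
    return hashlib.sha256(payload).hexdigest() == expected_hex.strip().lower()


url = "https://raw.githubusercontent.com/MALathon/fetcharoo-schemas/main/registry.json"
print(is_trusted_url(url))
print(is_trusted_url("http://fetcharoo.io/schemas/v1/registry.json"))
```

This prints `True` then `False`, since the second URL is not https. A checksum only detects corruption or tampering in transit if the expected digest comes from a separate trusted channel; signing would be needed to protect against a compromised registry host.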

Dependencies

Part of

Parent issue: #10
