Summary
Create a mechanism to fetch and update schemas from a remote registry, allowing schema fixes without library releases.
Design Overview
Remote Registry
Host community-maintained schemas at:
- GitHub repo: MALathon/fetcharoo-schemas
- Or JSON endpoint: https://fetcharoo.io/schemas/v1/registry.json
Registry Format
{
  "version": "1.0.0",
  "updated": "2025-01-15T00:00:00Z",
  "schemas": {
    "springer_book": {
      "version": "1.2.0",
      "url_pattern": "https?://link\\.springer\\.com/book/.*",
      "description": "Springer book with chapters",
      "include_patterns": ["*.pdf"],
      "exclude_patterns": ["*bbm*", "*bfm*"],
      "sort_by": "numeric",
      "recommended_depth": 1,
      "request_delay": 1.0,
      "test_url": "https://link.springer.com/book/10.1007/978-3-031-41026-0",
      "expected_min_pdfs": 5
    }
  }
}
Local Cache
~/.cache/fetcharoo/
├── schemas.json # Cached remote schemas
└── schemas.meta.json # Cache metadata (last updated, etag)
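The metadata file is what lets the client decide whether to hit the network at all. The following is a minimal sketch of that decision, assuming the metadata holds the `cached_at` and `etag` keys written by the update flow; the helper names `is_cache_fresh` and `conditional_headers` are hypothetical and not part of fetcharoo.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_cache_fresh(meta: dict, max_age: timedelta, now: Optional[datetime] = None) -> bool:
    """Return True if meta['cached_at'] is younger than max_age."""
    cached_at_raw = meta.get("cached_at")
    if not cached_at_raw:
        return False  # no timestamp recorded: treat the cache as stale
    try:
        cached_at = datetime.fromisoformat(cached_at_raw)
    except (ValueError, TypeError):
        return False  # unreadable timestamp: treat the cache as stale
    now = now or datetime.now()
    return now - cached_at < max_age


def conditional_headers(meta: dict) -> dict:
    """Build an If-None-Match header from the stored etag, if there is one."""
    etag = meta.get("etag")
    return {"If-None-Match": etag} if etag else {}
```

Passing `conditional_headers(meta)` to the HTTP request would let a server that honours `If-None-Match` answer `304 Not Modified`, so an unchanged registry does not need to be downloaded again.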
Update Flow
# fetcharoo/schemas/remote.py
import json
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import Path

import requests

# get_schema, register_schema and SiteSchema come from the schema system (#11-#19).

REMOTE_REGISTRY_URL = "https://raw.githubusercontent.com/MALathon/fetcharoo-schemas/main/registry.json"
CACHE_DIR = Path.home() / '.cache' / 'fetcharoo'
CACHE_MAX_AGE = timedelta(days=7)


@dataclass
class UpdateResult:
    # Minimal result type; fields inferred from how it is used below
    new: int = 0
    updated: int = 0
    from_cache: bool = False


def _version_key(version: str) -> tuple:
    """Parse '1.10.0' into (1, 10, 0) so versions compare numerically, not as strings.

    Assumes purely numeric dotted versions.
    """
    return tuple(int(part) for part in version.split('.'))


def update_schemas(force: bool = False) -> UpdateResult:
    """
    Fetch latest schemas from remote registry.

    Args:
        force: Update even if cache is fresh

    Returns:
        UpdateResult with counts of new/updated schemas
    """
    cache_file = CACHE_DIR / 'schemas.json'
    meta_file = CACHE_DIR / 'schemas.meta.json'

    # Check cache freshness (a missing meta file counts as stale)
    if not force and cache_file.exists() and meta_file.exists():
        meta = json.loads(meta_file.read_text())
        cached_at = datetime.fromisoformat(meta['cached_at'])
        if datetime.now() - cached_at < CACHE_MAX_AGE:
            return UpdateResult(from_cache=True)

    # Fetch remote
    response = requests.get(REMOTE_REGISTRY_URL, timeout=30)
    response.raise_for_status()
    registry = response.json()

    # Save cache
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(registry))
    meta_file.write_text(json.dumps({
        'cached_at': datetime.now().isoformat(),
        'etag': response.headers.get('etag')
    }))

    # Register schemas
    new_count = 0
    updated_count = 0
    for name, schema_data in registry['schemas'].items():
        existing = get_schema(name)
        if existing:
            # Re-register only if the remote version is strictly newer
            if _version_key(schema_data.get('version', '0')) > _version_key(existing.version):
                register_schema(SiteSchema(**schema_data), overwrite=True)
                updated_count += 1
        else:
            register_schema(SiteSchema(**schema_data))
            new_count += 1

    return UpdateResult(new=new_count, updated=updated_count)
def load_cached_schemas() -> int:
    """Load schemas from local cache without network request."""
    cache_file = CACHE_DIR / 'schemas.json'
    if not cache_file.exists():
        return 0

    registry = json.loads(cache_file.read_text())
    count = 0
    for name, schema_data in registry['schemas'].items():
        if not get_schema(name):  # Don't overwrite built-in
            register_schema(SiteSchema(**schema_data))
            count += 1
    return count
CLI Commands
# Update schemas from remote
$ fetcharoo --update-schemas
Fetching schemas from remote registry...
Updated 2 schemas, added 1 new schema.
# Force update (ignore cache)
$ fetcharoo --update-schemas --force
# Use remote schemas (auto-loads from cache)
$ fetcharoo --use-remote-schemas https://example.com --schema auto
Opt-in Behavior
Remote schemas should be opt-in:
# Explicit in code
from fetcharoo.schemas import load_remote_schemas
load_remote_schemas() # Loads from cache, updates if stale
# Or via environment
FETCHAROO_USE_REMOTE_SCHEMAS=1 fetcharoo https://...
# Or CLI flag
fetcharoo --use-remote-schemas https://...
Community Contribution
Separate repo for schema contributions:
fetcharoo-schemas/
├── registry.json # Combined registry
├── schemas/
│ ├── springer.yaml
│ ├── arxiv.yaml
│ ├── ieee.yaml
│ └── ...
├── tests/ # Schema validation tests
└── .github/
└── workflows/
└── validate.yml # Auto-validate on PR
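The validation step that `validate.yml` runs on each PR could start with a structural check of `registry.json`. The sketch below is one possible shape for it, not the project's actual test suite: `validate_registry` and `REQUIRED_FIELDS` are hypothetical names, and treating `version`, `url_pattern` and `test_url` as mandatory is an assumption based on the registry example above.

```python
import re

# Assumed minimum set of fields; the real registry spec may require more or fewer.
REQUIRED_FIELDS = ("version", "url_pattern", "test_url")


def validate_registry(registry: dict) -> list:
    """Return a list of human-readable problems; an empty list means the registry passed."""
    schemas = registry.get("schemas")
    if not isinstance(schemas, dict):
        return ["registry has no 'schemas' object"]

    errors = []
    for name, entry in schemas.items():
        if not isinstance(entry, dict):
            errors.append(f"{name}: entry is not an object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            errors.append(f"{name}: missing fields {missing}")
            continue
        if not all(isinstance(entry[field], str) for field in REQUIRED_FIELDS):
            errors.append(f"{name}: required fields must be strings")
            continue
        try:
            pattern = re.compile(entry["url_pattern"])
        except re.error as exc:
            errors.append(f"{name}: invalid url_pattern ({exc})")
            continue
        # A schema should at least match the URL it names as its own test case.
        if not pattern.match(entry["test_url"]):
            errors.append(f"{name}: url_pattern does not match test_url")
    return errors
```

A CI job could load `registry.json`, call `validate_registry`, and fail the build when the returned list is non-empty. Checking `expected_min_pdfs` against the live `test_url` would need network access and is left out of this sketch.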
Tasks
- Design registry JSON format
- Create fetcharoo-schemas repository
- Implement update_schemas() with caching
- Implement load_cached_schemas()
- Add --update-schemas CLI command
- Add --use-remote-schemas flag
- Support FETCHAROO_USE_REMOTE_SCHEMAS env var
- Document contribution process
- Add tests for remote loading
Acceptance Criteria
- Can fetch schemas from remote URL
- Caches schemas locally with TTL
- --update-schemas refreshes cache
- Built-in schemas take precedence over remote
- Works offline with cached schemas
- Clear opt-in mechanism (not auto-enabled)
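Two of these criteria, the opt-in switch and built-in precedence, are small enough to pin down in code. The sketch below shows one way they might be expressed; `remote_schemas_enabled` and `merge_schemas` are hypothetical helpers, and treating only the value `1` as "enabled" is an assumption taken from the environment-variable example above.

```python
import os
from typing import Mapping, Optional


def remote_schemas_enabled(cli_flag: bool = False,
                           env: Optional[Mapping[str, str]] = None) -> bool:
    """Remote schemas stay off unless the CLI flag or the env var asks for them."""
    env = os.environ if env is None else env
    # Assumption: only the literal value "1" turns the feature on.
    return cli_flag or env.get("FETCHAROO_USE_REMOTE_SCHEMAS") == "1"


def merge_schemas(builtin: dict, remote: dict) -> dict:
    """Combine schema dicts so a built-in entry wins over a remote one of the same name."""
    merged = dict(remote)
    merged.update(builtin)  # built-in entries overwrite remote duplicates
    return merged
```

For example, `merge_schemas({"a": "builtin"}, {"a": "remote", "b": "remote"})` keeps the built-in `a` and adds the remote-only `b`.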
Security Considerations
- Only fetch from trusted URLs (configurable)
- Validate schema format before registering
- Don't execute arbitrary code from remote schemas
- Consider signing/checksums for integrity
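The first and last points can be sketched with the standard library alone. This is an illustration under stated assumptions, not a settled design: `TRUSTED_HOSTS`, `is_trusted_url` and `verify_sha256` are hypothetical names, the allow-list simply reuses the two hosts mentioned in this issue, and where the expected checksum would be published is left open.

```python
import hashlib
from urllib.parse import urlparse

# Illustrative allow-list built from the hosts named in this issue; meant to be configurable.
TRUSTED_HOSTS = frozenset({"raw.githubusercontent.com", "fetcharoo.io"})


def is_trusted_url(url: str, trusted_hosts=TRUSTED_HOSTS) -> bool:
    """Accept only HTTPS URLs whose host is on the allow-list."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in trusted_hosts


def verify_sha256(payload: bytes, expected_hex: str) -> bool:
    """Compare the SHA-256 digest of the downloaded bytes with a published checksum."""
    actual = hashlib.sha256(payload).hexdigest()
    return actual == expected_hex.strip().lower()
```

A checksum served from the same host as the registry mainly catches corruption; protecting against a compromised host would need a signature verified against a key shipped with the library.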
Dependencies
- #11 (Create SiteSchema base dataclass) through #19 (Support user-defined schemas via YAML config file): full schema system
Part of
Parent issue: #10