RFC: Session Templates for Common Crawling Patterns #9

@mudassaralichouhan

Summary

This RFC proposes a reusable template system for common crawling patterns (e.g., news, e-commerce). The goals are to improve developer experience, reduce repetitive configuration, and standardize crawler behavior across similar site types.

Motivation

  • Reduce repetitive configuration for common site types
  • Standardize crawling patterns across the project
  • Improve developer productivity

Detailed Design

Template Structure

{
  "name": "news-site",
  "description": "Template for news websites",
  "config": {
    "maxPages": 500,
    "maxDepth": 3,
    "spaRenderingEnabled": true,
    "extractTextContent": true,
    "politenessDelay": 1000
  },
  "patterns": {
    "articleSelectors": ["article", ".post", ".story"],
    "titleSelectors": ["h1", ".headline", ".title"],
    "contentSelectors": [".content", ".body", ".article-body"]
  }
}
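As a rough illustration of how a template in this shape could be loaded and checked, here is a minimal Python sketch. The names `CrawlTemplate`, `load_template`, and `REQUIRED_KEYS` are hypothetical and not part of the proposal; the validation rules are an assumption about what "well-formed" would mean.

```python
import json
from dataclasses import dataclass

# Assumption: all four top-level keys from the example above are mandatory.
REQUIRED_KEYS = {"name", "description", "config", "patterns"}


@dataclass(frozen=True)
class CrawlTemplate:
    """Hypothetical in-memory form of a template document."""
    name: str
    description: str
    config: dict
    patterns: dict


def load_template(raw: str) -> CrawlTemplate:
    """Parse a JSON template and reject obviously malformed input."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("template must be a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"template is missing keys: {sorted(missing)}")
    if not isinstance(data["config"], dict):
        raise ValueError("config must be an object")
    if not isinstance(data["patterns"], dict):
        raise ValueError("patterns must be an object")
    # Each pattern entry is a list of CSS selector strings.
    for key, selectors in data["patterns"].items():
        if not isinstance(selectors, list) or not all(
            isinstance(s, str) for s in selectors
        ):
            raise ValueError(f"patterns.{key} must be a list of strings")
    return CrawlTemplate(**{k: data[k] for k in REQUIRED_KEYS})


NEWS_TEMPLATE_JSON = """
{
  "name": "news-site",
  "description": "Template for news websites",
  "config": {
    "maxPages": 500,
    "maxDepth": 3,
    "spaRenderingEnabled": true,
    "extractTextContent": true,
    "politenessDelay": 1000
  },
  "patterns": {
    "articleSelectors": ["article", ".post", ".story"],
    "titleSelectors": ["h1", ".headline", ".title"],
    "contentSelectors": [".content", ".body", ".article-body"]
  }
}
"""

template = load_template(NEWS_TEMPLATE_JSON)
print(template.name, template.config["maxDepth"])
```

Failing early at load time would keep a malformed template from reaching the crawler; whether unknown keys should also be rejected is left open here.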

API Endpoints

  • GET /api/templates - List available templates
  • POST /api/crawl/add-site-with-template - Add a site to crawl using a template
  • POST /api/templates - Create new template
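The behavior behind these three endpoints could look roughly like the framework-free Python sketch below. The request fields `url`, `template`, and `overrides` are assumptions, since the RFC does not define a request body; the function names and the in-memory `TEMPLATES` registry are also hypothetical stand-ins for whatever storage the project uses.

```python
# Hypothetical in-memory registry keyed by template name.
TEMPLATES = {}


def create_template(template: dict) -> dict:
    """Sketch of POST /api/templates: register a new template."""
    name = template["name"]
    if name in TEMPLATES:
        raise ValueError(f"template already exists: {name}")
    TEMPLATES[name] = template
    return template


def list_templates() -> list:
    """Sketch of GET /api/templates: return a summary of each template."""
    return [
        {"name": t["name"], "description": t["description"]}
        for t in TEMPLATES.values()
    ]


def add_site_with_template(request: dict) -> dict:
    """Sketch of POST /api/crawl/add-site-with-template.

    Assumed request shape: {"url": ..., "template": ..., "overrides": {...}}.
    Template config supplies defaults; per-request overrides take precedence.
    """
    template = TEMPLATES.get(request["template"])
    if template is None:
        raise KeyError(f"unknown template: {request['template']}")
    config = {**template["config"], **request.get("overrides", {})}
    return {
        "url": request["url"],
        "config": config,
        "patterns": template["patterns"],
    }


create_template({
    "name": "news-site",
    "description": "Template for news websites",
    "config": {"maxPages": 500, "maxDepth": 3, "politenessDelay": 1000},
    "patterns": {"titleSelectors": ["h1", ".headline", ".title"]},
})

job = add_site_with_template({
    "url": "https://example.com",
    "template": "news-site",
    "overrides": {"maxPages": 50},
})
print(job["config"])
```

Letting request-level overrides win over template defaults is one possible design; it keeps templates reusable while still allowing per-site tuning.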

Implementation Plan

  1. Phase 1: Core template system
  2. Phase 2: Template API endpoints
  3. Phase 3: Pre-built templates
  4. Phase 4: Template validation and testing

Testing Strategy

  • Unit tests for template loading/application
  • Integration tests with real websites
  • Performance benchmarks
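A unit test for template application might look like the following sketch, written with Python's standard `unittest` module. The `apply_template` helper is hypothetical and stands in for whatever function the core template system ends up exposing; the override semantics it tests are an assumption.

```python
import copy
import unittest


def apply_template(template: dict, overrides=None) -> dict:
    """Hypothetical helper: template config as defaults, overrides on top."""
    config = copy.deepcopy(template["config"])
    config.update(overrides or {})
    return config


NEWS_TEMPLATE = {
    "name": "news-site",
    "description": "Template for news websites",
    "config": {"maxPages": 500, "maxDepth": 3, "politenessDelay": 1000},
    "patterns": {"articleSelectors": ["article", ".post", ".story"]},
}


class TemplateApplicationTest(unittest.TestCase):
    def test_defaults_come_from_template(self):
        config = apply_template(NEWS_TEMPLATE)
        self.assertEqual(config["maxPages"], 500)
        self.assertEqual(config["maxDepth"], 3)

    def test_overrides_take_precedence(self):
        config = apply_template(NEWS_TEMPLATE, {"maxPages": 10})
        self.assertEqual(config["maxPages"], 10)
        # Keys that are not overridden keep the template value.
        self.assertEqual(config["politenessDelay"], 1000)

    def test_template_is_not_mutated(self):
        apply_template(NEWS_TEMPLATE, {"maxPages": 10})
        self.assertEqual(NEWS_TEMPLATE["config"]["maxPages"], 500)
```

Saved to a file, these tests can be run with `python -m unittest`. The mutation test matters because a shared template that is modified by one crawl would silently change the defaults for every later crawl.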

Migration Strategy

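  • Backward compatible with the existing API: templates are optional, and existing requests keep working unchanged
  • Gradual migration path: existing crawl configurations can adopt templates incrementally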
