High-throughput (100+ pages/sec), respectful C++ web crawler with smart frontier scheduling, robots.txt compliance, HTTP/2 multiplexing, and RocksDB-backed persistence for production-scale ingestion.

NotGyashu/Crawler

Hybrid Speed Crawler – A High-Performance Web Crawler[1]

A modular, production-grade, and aggressively optimized web crawler built in modern C++ with a multi-stage pipeline that couples HTTP/2 multiplexed fetching, robots-aware scheduling, sharded frontiers, SIMD-assisted parsing, content hashing, and RocksDB-backed persistence for crash-safe, sustained, compliant throughput at scale.1

This repository contains an end-to-end system that continuously discovers, fetches, filters, parses, deduplicates, and persists web content while enforcing domain politeness, honoring robots.txt, ingesting sitemaps and feeds, and coordinating graceful startup/shutdown with metrics and tracing hooks for operations.2

  • Stack: C++17, libcurl multi (HTTP/2), RocksDB (metadata, caches), RE2 (regex), TinyXML2 (XML), moodycamel::ConcurrentQueue (lock-free queues), Tracy (profiling), plus modular JSON configs for domains, seeds, patterns, and feeds.1

TL;DR3

Hybrid Speed Crawler is a high-performance, robots-aware web crawler. It combines multi-threaded curl multi handles with HTTP/2 multiplexing, domain-aware rate limiting, sharded smart frontiers, work-stealing queues, conditional GET caching, and SIMD-assisted parsing to sustain hundreds of pages per second on commodity hardware, backed by crash-safe RocksDB persistence and coherent shutdown semantics for production reliability.4


Table of Contents3

    1. Title, TL;DR & Table of Contents
    2. Problem Statement & Goals
    3. High-Level Architecture
    4. Core Features
    5. Detailed Pipeline Walkthroughs
    6. Concurrency & Scheduling Model
    7. Data Structures & Storage
    8. Networking Subsystem
    9. Robots.txt & Politeness
    10. Sitemaps & Feeds
    11. Caching: HTTP & Application-level
    12. Fault Tolerance & Recovery
    13. Observability
    14. Performance Engineering
    15. Security, Legal, and Ethics
    16. Configuration & DevOps
    17. Testing Strategy
    18. Roadmap & Future Work
    19. Glossary
    20. References

2. Problem Statement & Goals2

Modern content discovery demands crawling systems that remain fresh, respectful, and scalable in the face of dynamic websites, changing robots policies, and varied content quality, all while maintaining operational stability and crash safety.8

This crawler solves the acquisition side of ML-ready content pipelines by enforcing robots.txt, prioritizing by per-domain rules, using conditional GET to avoid redundant downloads, deduplicating via robust hashing, and exporting enriched metadata with timing and cacheable headers.4

  • Freshness goals: leverage sitemaps, RSS/Atom, and conditional GET to recrawl recent and changing content while minimizing redundant fetches.7
  • Scale goals: sustain high QPS via HTTP/2 multiplexing, connection pooling, work stealing, sharded disk queues, and priority-aware smart frontiers.4
  • Politeness goals: domain-aware rate limiting, robots.txt allow/disallow with crawl-delay, and shared per-domain queues for throttling.6
  • ML-ready: enriched storage embeds content hash, timing, status, and headers in JSON for downstream indexing, deduplication, and training.11

Non-goals:

  • This is not a browser engine and does not execute arbitrary JavaScript or render dynamic pages beyond HTTP fetch and parsing.4
  • This is not a search UI or ranking system; it exports enriched content and metadata for downstream indexing and retrieval systems.12

3. High-Level Architecture3

The system is organized into modular subsystems: networking (HTTP client + connection pool), policy (robots cache, rate limiter, domain configs), scheduling (smart frontier, work-stealing, sharded disk queue), parsing (SIMD/RE2/TinyXML2 pipeline), persistence (RocksDB metadata + JSON enriched storage), and coordination (monitoring, signal-safe shutdown).1

CMake lists dependencies, organizes modular sources, and enables profiling with Tracy plus aggressive optimization flags while keeping debug pathways for sanitizer builds.1

3.1 System Context (C4 L1)2

This diagram shows the crawler in relation to external systems such as DNS, origin servers, and optional downstream indexers.3

graph TD
  A[Operator / DevOps] -->|Configure & Run| B[Hybrid Speed Crawler]
  B -->|HTTP/2 GET| C[Origin Web Servers]
  B -->|DNS| D[DNS Resolvers]
  B -->|Robots/Sitemaps| C
  B -->|Feeds (RSS/Atom)| C
  B -->|Enriched JSON| E[Object Storage / S3-like]
  B -->|Metadata| F[RocksDB]
  B -->|Metrics & Logs| G[Monitoring Stack]
  E -->|Downstream| H[Indexers / ML Pipelines]

3.2 Container/Component (C4 L2)3

The container diagram breaks down core components and internal dataflows.2

graph LR
  subgraph Scheduler
    SF["Smart Frontier<br/>(partitioned PQ)"]
    WS["Work-Stealing<br/>Per-Worker Deques"]
    SD[Sharded Disk Queue]
  end
  subgraph Policy
    RL[Rate Limiter]
    RC[Robots Cache]
    DC[Domain Configs]
  end
  subgraph Network
    CP[Connection Pool]
    HC["HTTP Client<br/>(libcurl multi)"]
    CG[Conditional GET Cache]
  end
  subgraph Parsing
    UP[Ultra Parser\nSIMD+RE2]
    CF[Content Filter]
    LD[Language Detector]
    HD[HTML Document Sanity]
  end
  subgraph Persist
    MD["Metadata Store<br/>(RocksDB)"]
    ES["Enriched Storage<br/>(JSON batches)"]
  end
  subgraph Control
    MON[Monitoring & Shutdown]
  end
  SF --> WS
  WS --> HC
  HC --> UP
  UP --> CF
  CF --> ES
  HC --> CG
  HC --> RC
  RL --> WS
  RC --> Scheduler
  MD --> SF
  ES --> Persist
  DC --> Scheduler
  MON --> Scheduler

3.3 Dataflow Overview4

  • URLs originate from seeds, sitemaps, feeds, and page parsing, then are normalized, prioritized, and queued for fetching under domain-aware rate limiting.2
  • The fetch pipeline reuses HTTP/2 connections, applies conditional GET headers, decompresses as needed, and hands HTML to parsing, which extracts links and evaluates quality.4
  • Persisted outputs include metadata updates in RocksDB and enriched JSON content for downstream processing in object storage or analytics.12

3.4 Phase 1 / Phase 2 Pipeline2

Phase 1 focuses on robust high-speed crawling with smart scheduling and minimal per-page CPU, while Phase 2 adds feed and sitemap integrations to continuously inject fresh content sources and expand discovery.7


4. Core Features4

Each feature section describes motivation, algorithmic approach, complexity, and operational trade-offs with explicit MUST/SHOULD guidance for production.2

4.1 Domain-aware rate limiting & per-host concurrency13

  • Motivation: Prevent overload and comply with politeness by spacing requests per domain while maximizing overall throughput.13
  • Approach: A sharded rate limiter persists last-request timestamps per shard in RocksDB and computes gaps using chrono-based windows; per-host connections are capped via libcurl multi options for fairness.13
  • Complexity: O(1) per request/update; RocksDB writes are batched asynchronously in a background thread.13
  • Requirements: The crawler MUST enforce per-host concurrency and per-domain pacing before issuing requests and SHOULD persist limiter state to survive restarts without bursts.13

Pseudocode:

if (!rateLimiter.can_request_now(domain)) {
  sharedDomainQueues.try_queue_for_domain(domain, urlInfo);
  continue; // pick another available domain
}
rateLimiter.record_request(domain);
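
For illustration, a fuller in-memory version of the check used above could look like the following; the class and member names are hypothetical, and the real limiter shards this state and persists last-request timestamps to RocksDB:

#include <chrono>
#include <mutex>
#include <string>
#include <unordered_map>

// Simplified, hypothetical stand-in for the sharded, RocksDB-backed limiter.
class DomainRateLimiter {
public:
    explicit DomainRateLimiter(std::chrono::milliseconds min_gap) : min_gap_(min_gap) {}

    bool can_request_now(const std::string& domain) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = last_request_.find(domain);
        return it == last_request_.end() ||
               std::chrono::steady_clock::now() - it->second >= min_gap_;
    }

    void record_request(const std::string& domain) {
        std::lock_guard<std::mutex> lock(mu_);
        last_request_[domain] = std::chrono::steady_clock::now();
    }

private:
    std::chrono::milliseconds min_gap_;   // politeness window per domain
    std::mutex mu_;
    std::unordered_map<std::string, std::chrono::steady_clock::time_point> last_request_;
};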

4.2 robots.txt parser & policy engine6

  • Motivation: Legal and ethical crawling requires strict adherence to robots.txt directives including allow/disallow and crawl-delay for the configured user agent.6
  • Approach: Two-phase cache checks (memory then RocksDB) with wildcard patterns and per-agent rule selection, non-blocking with DEFERRED_FETCH_STARTED signals to trigger fetches on cache misses, and sitemap extraction from robots.6
  • Complexity: O(L) per check where L is length of robots text, with in-memory lookup O(1) average; fetch costs amortized and cached.6
  • Requirements: The crawler MUST disallow non-permitted paths per robots policy and SHOULD respect crawl-delay if provided, with cache invalidation heuristics for aged entries.6
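
A minimal sketch of that two-phase lookup, with hypothetical hooks standing in for the in-memory map, the RocksDB-backed cache, and the robots fetch queue:

#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

enum class RobotsDecision { ALLOWED, DISALLOWED, DEFERRED_FETCH_STARTED };

struct RobotsRules {
    std::function<bool(const std::string&)> allows;   // per-agent allow/disallow evaluation
};

RobotsDecision check_robots(
    std::unordered_map<std::string, RobotsRules>& memory_cache,
    const std::function<std::optional<RobotsRules>(const std::string&)>& load_from_rocksdb,
    const std::function<void(const std::string&)>& queue_robots_fetch,
    const std::string& domain, const std::string& path) {
  if (auto it = memory_cache.find(domain); it != memory_cache.end())   // phase 1: memory
    return it->second.allows(path) ? RobotsDecision::ALLOWED : RobotsDecision::DISALLOWED;
  if (auto rules = load_from_rocksdb(domain)) {                        // phase 2: RocksDB
    memory_cache.emplace(domain, *rules);
    return rules->allows(path) ? RobotsDecision::ALLOWED : RobotsDecision::DISALLOWED;
  }
  queue_robots_fetch(domain);                        // miss: fetch robots.txt asynchronously
  return RobotsDecision::DEFERRED_FETCH_STARTED;     // caller re-queues the URL for later
}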

4.3 Sitemap & RSS/Atom ingestion7

  • Motivation: Rapid discovery of high-priority and frequently changing content via sitemaps and feeds reduces crawl latency and amplifies coverage.14
  • Approach: TinyXML2-based parsing of sitemap index and URL sets, with domain-derived prioritization and recrawl intervals; RSS poller runs dedicated thread, filters recent entries, and emits URLs based on recency and priority.7
  • Complexity: XML parsing is O(N) in elements; scheduling is O(1) per enqueue with partitioning.14
  • Requirements: The crawler SHOULD enable conditional GET on feeds/sitemaps and MUST validate sitemap URLs to avoid poisoning the frontier.3
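
As a concrete illustration, the TinyXML2 pass over a leaf sitemap reduces to walking the <url>/<loc> elements; error handling, index expansion, and lastmod/priority mapping are omitted, and the function name is illustrative:

#include <string>
#include <vector>
#include <tinyxml2.h>

std::vector<std::string> extract_sitemap_urls(const std::string& xml) {
    std::vector<std::string> urls;
    tinyxml2::XMLDocument doc;
    if (doc.Parse(xml.c_str(), xml.size()) != tinyxml2::XML_SUCCESS) return urls;
    const auto* urlset = doc.FirstChildElement("urlset");
    if (!urlset) return urls;
    for (const auto* url = urlset->FirstChildElement("url"); url;
         url = url->NextSiblingElement("url")) {
        if (const auto* loc = url->FirstChildElement("loc"))
            if (const char* text = loc->GetText()) urls.emplace_back(text);
    }
    return urls;
}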

4.4 Smart frontier / priority queues / work stealing2

  • Motivation: Prioritize valuable, ready URLs while minimizing lock contention and maximizing CPU utilization across workers.2
  • Approach: Partitioned priority queues for ready-time and priority ordering, ready-partition index via lock-free queue, and per-worker deques for locality with stealing to balance load.3
  • Complexity: Enqueue O(log n) in partition heap; dequeue is O(1) amortized when using ready partition fast path, with stealing O(1) average.2
  • Requirements: The scheduler MUST enforce max depth and queue caps, SHOULD batch enqueues by partition to reduce lock traffic, and MAY fallback to disk sharding when memory thresholds are reached.3
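
A simplified sketch of batched enqueue into partitioned heaps, locking each touched partition once per batch; the real frontier additionally tracks ready times, per-partition seen-sets, and a ready-partition index:

#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

struct UrlEntry {
    std::string url;
    double priority;
    bool operator<(const UrlEntry& o) const { return priority < o.priority; }  // max-heap on priority
};

class PartitionedFrontier {
public:
    explicit PartitionedFrontier(std::size_t partitions) : parts_(partitions) {}

    void enqueue_batch(const std::vector<UrlEntry>& batch) {
        std::vector<std::vector<UrlEntry>> grouped(parts_.size());
        for (const auto& e : batch)
            grouped[std::hash<std::string>{}(e.url) % parts_.size()].push_back(e);
        for (std::size_t p = 0; p < parts_.size(); ++p) {
            if (grouped[p].empty()) continue;
            std::lock_guard<std::mutex> lock(parts_[p].mu);   // one lock per touched partition
            for (auto& e : grouped[p]) parts_[p].pq.push(std::move(e));
        }
    }

private:
    struct Partition { std::mutex mu; std::priority_queue<UrlEntry> pq; };
    std::vector<Partition> parts_;
};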

4.5 Proxy rotation & IP pool health monitoring1

  • Motivation: Corporate environments or geo-distributed crawling may require rotating egress IPs and monitoring their health to avoid blocks and balance load.1
  • Approach: Integrate an optional proxy list with per-proxy health scores, backoff on failures, and periodic re-tests; this can be wired into HttpClient options and per-request overrides.3
  • Complexity: O(1) per selection; periodic health checks O(P), where P is proxies count.1
  • Requirements: Proxy rotation MAY be configured; when enabled, the crawler SHOULD bound concurrent connections per proxy and MUST respect robots and rate limiting irrespective of proxy identity.4

4.6 Content hashing & dedup12

  • Motivation: Reduce storage and downstream processing by identifying unchanged or duplicate content, enabling recrawl scheduling and stable identifiers.12
  • Approach: Fast content hashing (e.g., stable hash over key parts of content), stored in metadata for conditional recrawl and enriched output; optional simhash/shingling can be added for near-dup detection.11
  • Complexity: O(n) per page content length; RocksDB updates batched asynchronously.5
  • Requirements: The crawler SHOULD compute content hashes on successful 200 responses and MUST update metadata to adapt recrawl intervals based on change patterns.5
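
For example, a fast deterministic 64-bit FNV-1a over the page body can serve as the equality hash; the project's actual hash function may differ:

#include <cstdint>
#include <string>

std::uint64_t fnv1a_64(const std::string& content) {
    std::uint64_t hash = 14695981039346656037ULL;   // FNV offset basis
    for (unsigned char byte : content) {
        hash ^= byte;
        hash *= 1099511628211ULL;                   // FNV prime
    }
    return hash;
}

// If the new hash equals the stored hash, the page is unchanged and the recrawl
// interval can be stretched; if it differs, the interval is tightened.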

4.7 Snapshot vs. continuous recrawl2

  • Motivation: Different data consumers need one-time snapshots or continuous freshness; the crawler must support both strategies.2
  • Approach: Regular mode deep crawl with disk queues versus fresh mode with RSS-driven ingestion and shallower depth, both using consistent policy and persistence layers.2
  • Complexity: Equivalent per-request; scheduler parameters differ for depth and queue sizes.2
  • Requirements: Fresh mode SHOULD run indefinitely with bounded queues and emergency seed injection; regular mode MAY use a max runtime window with graceful shutdown.2

4.8 Adaptive backoff & retry4

  • Motivation: Transient network/server errors demand adaptive strategies to avoid hammering and to recover quickly.4
  • Approach: Exponential backoff with jitter driven by the rate limiter's per-domain failure counts, domain blacklisting for repeated timeouts, and a retry fallback from HTTP to HTTPS (or vice versa) on protocol errors; a sketch follows this list.15
  • Complexity: O(1) bookkeeping per failure; blacklist cleanup O(K) where K is number of temporary entries.15
  • Requirements: The crawler SHOULD back off on 429/503 and timeouts, and MAY switch protocols on SSL connect failures with a one-time retry.4
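
A sketch of exponential backoff with full jitter keyed off the failure count; the base delay, cap, and jitter policy shown here are illustrative defaults rather than the project's tuned values:

#include <algorithm>
#include <chrono>
#include <random>

std::chrono::milliseconds backoff_delay(int failure_count) {
    using namespace std::chrono;
    const milliseconds base{500};
    const milliseconds cap = minutes{5};
    // 500 ms, 1 s, 2 s, 4 s, ... capped at 5 minutes
    const milliseconds raw = std::min(base * (1 << std::min(failure_count, 10)), cap);
    // Full jitter: pick uniformly in [0, raw] so retries across domains do not synchronize.
    static thread_local std::mt19937_64 rng{std::random_device{}()};
    return milliseconds{std::uniform_int_distribution<long long>(0, raw.count())(rng)};
}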

4.9 Compression, HTTP/2, TLS reuse, connection pooling3

  • Motivation: Maximize bandwidth efficiency and minimize handshake/connection overhead for high throughput.3
  • Approach: libcurl multi with HTTP/2, accept-encoding for gzip/deflate, connection pooling of easy handles, and CURLOPT_PIPEWAIT to bias multiplex reuse instead of opening new connections.3
  • Complexity: O(1) per transfer management; overall gains are multiplicative with concurrency.16
  • Requirements: The crawler MUST enable certificate verification and SHOULD enable HTTP/2 multiplexing with pipe-wait to keep FD count low and reuse TLS sessions.10
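
The per-transfer options involved look roughly as follows; the option constants are real libcurl options, while the wrapper function and timeout values are illustrative:

#include <curl/curl.h>

void configure_transfer(CURL* easy, const char* url) {
    curl_easy_setopt(easy, CURLOPT_URL, url);
    curl_easy_setopt(easy, CURLOPT_HTTP_VERSION, (long)CURL_HTTP_VERSION_2TLS); // HTTP/2 over TLS
    curl_easy_setopt(easy, CURLOPT_PIPEWAIT, 1L);        // wait to multiplex instead of opening a new connection
    curl_easy_setopt(easy, CURLOPT_ACCEPT_ENCODING, ""); // advertise all built-in encodings (gzip, deflate, ...)
    curl_easy_setopt(easy, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(easy, CURLOPT_MAXREDIRS, 5L);
    curl_easy_setopt(easy, CURLOPT_CONNECTTIMEOUT_MS, 5000L);
    curl_easy_setopt(easy, CURLOPT_TIMEOUT_MS, 30000L);
    curl_easy_setopt(easy, CURLOPT_TCP_NODELAY, 1L);
    curl_easy_setopt(easy, CURLOPT_TCP_KEEPALIVE, 1L);
    curl_easy_setopt(easy, CURLOPT_SSL_VERIFYPEER, 1L);  // certificate verification stays on
    curl_easy_setopt(easy, CURLOPT_SSL_VERIFYHOST, 2L);
}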

4.10 Content-type sniffing, encoding normalization, charset detection3

  • Motivation: Many pages omit or mislabel headers; sensible defaults and normalization improve parsing correctness.3
  • Approach: Header parsing captures content-type and heuristics can be applied to detect HTML versus binary, normalizing encoding and skipping unsuitable media.3
  • Complexity: O(1) for header processing; optional content scan O(n).3
  • Requirements: The crawler SHOULD skip non-HTML-like responses for the HTML pipeline and MAY store binary metadata for inventory use cases.3

4.11 Snippet extraction, boilerplate removal, readability scoring17

  • Motivation: Produce cleaner content for downstream indexing and training by filtering scripts, navs, and ads while scoring textual content density.17
  • Approach: RE2-based structural stripping with density metrics and sentence heuristics; SIMD prefilter assists scanning and reduces overhead before deeper extraction.9
  • Complexity: O(n) per page with vectorized primitives; configurable thresholds to bound CPU per page.17
  • Requirements: The crawler SHOULD drop low-quality pages early and MAY store a quality score alongside content for downstream filtering.11
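
A toy version of the RE2-based stripping and density check; the patterns and the threshold are illustrative, and the real filter also applies sentence heuristics and the SIMD prefilter:

#include <re2/re2.h>
#include <string>

bool passes_quality_filter(std::string html) {
    static const RE2 kScript("(?is)<script.*?</script>");
    static const RE2 kStyle("(?is)<style.*?</style>");
    static const RE2 kNav("(?is)<nav.*?</nav>");
    static const RE2 kTags("<[^>]*>");
    const auto original = html.size();
    RE2::GlobalReplace(&html, kScript, "");
    RE2::GlobalReplace(&html, kStyle, "");
    RE2::GlobalReplace(&html, kNav, "");
    RE2::GlobalReplace(&html, kTags, " ");   // strip remaining markup, keep text
    const double density = original ? static_cast<double>(html.size()) / original : 0.0;
    return density > 0.1;                    // drop pages that are mostly markup
}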

4.12 Structured data extraction (Schema.org, OpenGraph, JSON-LD)12

  • Motivation: Enrich downstream applications with structured hints like title, description, and canonical links.12
  • Approach: Lightweight scanning for meta tags and JSON-LD blocks can be integrated into the parser or a dedicated pass feeding enriched storage.11
  • Complexity: O(n) per page scan; negligible impact if batched.11
  • Requirements: The crawler MAY include structured data fields in enriched JSON when available without blocking critical path.11

4.13 Politeness, legal & compliance6

  • Motivation: Respect site owners and legal constraints while crawling at scale.6
  • Approach: Strict robots adherence, rate limiting, blacklists/allowlists, and user-agent transparency via headers with clear contact info.4
  • Requirements: The crawler MUST follow robots rules, MUST rate limit by domain, and SHOULD provide a clear User-Agent with operator contact endpoint.18

4.14 Fault tolerance, persistence & recovery5

  • Motivation: Keep progress under crashes and shutdowns with minimal loss and fast recovery.5
  • Approach: Asynchronous RocksDB writers for metadata and limiter states, disk-sharded queues for overflow and restarts, graceful shutdown that flushes storage and coordinates thread termination.19
  • Requirements: The crawler MUST drain and flush queues on shutdown and SHOULD use disk sharding for large frontier sizes beyond RAM.2

4.15 Metrics, tracing, logging, and profiling8

  • Motivation: Operability demands visibility into throughput, latency, and failures with low overhead.8
  • Approach: Periodic stats logging with queue depths and p/s, error tracking per domain, Tracy profiling zones in hot paths, and hooks for metrics exporters.9
  • Requirements: The crawler SHOULD emit queue size, p/s, error counts, and backpressure indicators at steady intervals and MAY integrate with Prometheus/OpenTelemetry.8

5. Detailed Pipeline Walkthroughs2

This section includes end-to-end flows and sequence diagrams for the most important processes to aid onboarding, debugging, and performance tuning.4

5.1 URL lifecycle (state machine)2

A URL transitions through states from discovery to scheduled recrawl or drop depending on policy, quality, and success.3

stateDiagram-v2
  [*] --> DISCOVERED
  DISCOVERED --> NORMALIZED
  NORMALIZED --> DEDUP_CHECK
  DEDUP_CHECK --> FETCH_QUEUED
  FETCH_QUEUED --> FETCHING
  FETCHING --> ROBOTS_BLOCKED: robots disallow
  FETCHING --> RATE_LIMITED: not ready window
  FETCHING --> FETCH_OK: 2xx
  FETCHING --> FETCH_SOFT_FAIL: 3xx/429/503
  FETCHING --> FETCH_HARD_FAIL: other 4xx/5xx
  FETCH_OK --> PARSED
  PARSED --> QUALITY_OK
  PARSED --> QUALITY_REJECT
  QUALITY_OK --> INDEXED
  INDEXED --> SCHEDULED_FOR_RECRAWL
  QUALITY_REJECT --> DROPPED
  ROBOTS_BLOCKED --> DROPPED
  FETCH_SOFT_FAIL --> RETRY_BUDGET
  RETRY_BUDGET --> FETCH_QUEUED
  RETRY_BUDGET --> DROPPED: exhausted
  SCHEDULED_FOR_RECRAWL --> FETCH_QUEUED

5.2 HTTP fetch pipeline (sequence)4

This sequence shows DNS, connect, TLS, request, response, decompress, and handoff.3

sequenceDiagram
  participant F as Frontier
  participant W as Worker
  participant M as libcurl multi
  participant S as Server
  F->>W: dequeue UrlInfo
  W->>M: add easy handle (HTTP/2, headers)
  M->>S: DNS + TCP + TLS handshake
  S-->>M: TLS cert (validated)
  M-->>S: GET with If-None-Match/If-Modified-Since
  S-->>M: 200/304 + gzip payload
  M-->>W: response body + headers
  W->>W: decompress + header parse
  W->>Parser: HTML content
  Parser-->>W: links + quality decision
  W->>Storage: metadata update + enriched JSON

5.3 robots.txt fetch + decision engine6

Robots evaluation is always performed before issuing a content request to a path.4

flowchart TD
  A[URL] --> B{Cache has robots?}
  B -- yes --> C{Path allowed?}
  B -- no --> D[Queue robots fetch]
  D --> E[DEFERRED_FETCH_STARTED]
  C -- yes --> F[Proceed to fetch]
  C -- no --> G[Drop URL]

5.4 Sitemap index expansion & scheduling14

Index sitemaps generate child sitemaps added to the queue and parsed for URLs which in turn flow into the frontier.14

sequenceDiagram
  participant SP as Sitemap Parser
  participant RC as Robots Cache
  participant HC as HttpClient
  participant SF as Smart Frontier
  SP->>RC: domains to monitor
  RC-->>SP: sitemaps list
  SP->>HC: download index
  HC-->>SP: XML index
  SP->>SP: extract child sitemaps
  SP->>HC: download child sitemaps
  HC-->>SP: XML urlsets
  SP->>SF: enqueue UrlInfo with priority

5.5 RSS polling loop7

The RSS worker periodically polls feeds, filters recency, and emits new entries.7

flowchart LR
  A[Start Poller] --> B[Load feeds JSON]
  B --> C[Wait poll_interval]
  C --> D[Download feed]
  D --> E[Parse RSS/Atom]
  E --> F{Entries recent?}
  F -- yes --> G[Emit UrlInfo batch]
  F -- no --> H[Skip]
  G --> C
  H --> C

5.6 Conditional GET cache1

Conditional GET reduces redundant downloads by caching ETag/Last-Modified.3

sequenceDiagram
  participant W as Worker
  participant CG as ConditionalGet
  participant HC as HttpClient
  W->>CG: get_cache_info(url)
  CG-->>W: {etag, last-modified}
  W->>HC: GET with If-None-Match/If-Modified-Since
  HC-->>W: 304 or 200
  W->>CG: update_cache(headers)

5.7 Queue sharding & work stealing2

Balanced scheduling via partitioned heaps and per-worker deques with stealing.13

flowchart TD
  A[Batch of UrlInfo] --> B[Partition by hash]
  B --> C[Lock partition once]
  C --> D[Push into PQ]
  D --> E[Mark ready partition]
  E --> F[Worker pop from ready partitions]
  F --> G{Empty?}
  G -- yes --> H[Try steal from victim]
  G -- no --> I[Process URL]

5.8 Proxy rotation & health scoring1

When enabled, the client cycles proxies with health scores decreasing upon failures.3

sequenceDiagram
  participant P as ProxyPool
  participant W as Worker
  participant HC as HttpClient
  W->>P: next proxy()
  P-->>W: proxy A
  W->>HC: GET via proxy A
  HC-->>W: response or error
  W->>P: record_success/failure

5.9 Backpressure & flow control8

Monitor queue depths and crawl rate to apply refill from disk or inject seeds.19

flowchart TD
  A[Monitor tick] --> B{SmartQ < threshold?}
  B -- yes --> C[Load from disk shards]
  B -- no --> D[No refill]
  C --> E{Total URLs < critical?}
  E -- yes --> F[Emergency seeds]
  E -- no --> G[Continue]

5.10 Graceful shutdown2

Signal-safe handler flips the stop flag and the coordinator drains queues and joins threads.8

sequenceDiagram
  participant OS as OS Signal
  participant SH as Signal Handler
  participant MON as Coordinator
  participant W as Workers
  OS->>SH: SIGINT/SIGTERM
  SH->>MON: notify_all()
  MON->>W: interrupt_waits()
  W-->>MON: finished
  MON->>Storage: flush()
  MON->>All: teardown components

6. Concurrency & Scheduling Model4

  • Threading: Separate network worker threads using curl multi with per-worker active request sets, and dedicated threads for RSS and Sitemap parsing guarded by condition variables for responsive shutdown.7
  • CPU pinning: Optional pinning can reduce context switches; the design minimizes cross-core chatter via per-partition locks and per-worker deques.2
  • Async I/O: libcurl multi provides event-driven parallelism; HTTP/2 multiplexing allows multiple streams per connection, which the system configures and prioritizes with PIPEWAIT to bias reuse.16
  • Lock contention: Batching enqueues and ready-partition queues reduce contention on partition heaps, while per-worker queues localize push/pop without global locks.2
  • False sharing: Atomic counters and queue size fields are localized per partition and per worker to avoid cache line contention.2

Priority scheduling formula:

  • Let $d$ be the crawl depth and $p_d = \max(p_\text{min}, 1 - d \cdot \alpha)$ the depth-penalized base priority, where $\alpha$ is the depth penalty; with a domain multiplier $m_\text{domain}$ from config and an overdue factor $o$ applied once a URL is past its ready time, the final priority is $P = \min(p_\text{max}, p_d \cdot m_\text{domain} + o)$.2
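
A direct transcription of that formula with illustrative constants; the project's actual $\alpha$, bounds, and multipliers come from configuration:

#include <algorithm>

double compute_priority(int depth, double domain_multiplier, double overdue_factor) {
    constexpr double kAlpha = 0.15, kPMin = 0.1, kPMax = 10.0;     // illustrative values
    const double base = std::max(kPMin, 1.0 - depth * kAlpha);     // depth-penalized base priority
    return std::min(kPMax, base * domain_multiplier + overdue_factor);
}
// e.g. depth 2 on a 1.3x domain that is slightly overdue (o = 0.2):
// min(10, max(0.1, 1 - 0.3) * 1.3 + 0.2) = 0.7 * 1.3 + 0.2 = 1.11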

Throughput model:

  • Pages per second: $pps \approx \frac{N_\text{workers} \cdot C_\text{streams}}{L_\text{eff}}$, with effective latency $L_\text{eff}$ including DNS, connect, TLS, and payload time, and stream count influenced by HTTP/2 multiplexing and backpressure.4
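
As a worked example under this model, 8 network workers each keeping about 30 HTTP/2 streams in flight against an effective latency of roughly 2 s per page would yield on the order of $8 \cdot 30 / 2 = 120$ pages per second, assuming parsing and persistence keep pace; these figures are illustrative rather than measured.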

7. Data Structures & Storage5

  • Frontier: Partitioned priority queues keyed by expected crawl time and priority with seen sets per partition to avoid duplicates.2
  • Work-stealing deques: Per-worker deques with capped sizes and front-steal/back-pop semantics to maximize locality.8
  • Disk sharding: N shards each with file + mutex + size for spillover and restart warm-up, with periodic cleanup of empty shards.20
  • Metadata: RocksDB store of UrlMetadata keyed by URL with crawl times, hash, backoff, failures, and change frequency, persisted asynchronously from a concurrent queue worker.21

ER/Class overview:

  • UrlMetadata(url, last_crawl_time, previous_change_time, expected_next_crawl, content_hash, backoff_multiplier, crawl_count, change_frequency, temporary_failures).21
  • RobotsInfo(domain, content, timestamp, is_valid, last_http_status, crawl_delay, sitemaps[], sitemaps_parsed).4
  • HttpHeaders(etag, last_modified, response_time).18
  • EnrichedPageData(url, content, domain, depth, content_hash, times, status, length).11
classDiagram
  class UrlMetadata {
    +time last_crawl_time
    +time previous_change_time
    +time expected_next_crawl
    +string content_hash
    +int backoff_multiplier
    +int crawl_count
    +float change_frequency
    +int temporary_failures
  }
  class RobotsInfo {
    +string content
    +time timestamp
    +bool is_valid
    +int last_http_status
    +int crawl_delay
    +SitemapInfo[] sitemaps
    +bool sitemaps_parsed
  }
  class EnrichedPageData {
    +string url
    +string content
    +string domain
    +int depth
    +string content_hash
    +time last_crawl_time
    +int http_status_code
    +size_t content_length
  }
  UrlMetadata <.. EnrichedPageData
  RobotsInfo <.. SitemapInfo

Canonicalization rules:

  • Normalize scheme and domain case, remove fragments, collapse slashes, and strip tracking params such as utm_*, gclid, fbclid from queries to improve dedup and politeness.12
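
A simplified sketch of the fragment and tracking-parameter part of those rules; scheme/host lowercasing and slash collapsing are omitted here, and the helper names are illustrative:

#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

bool is_tracking_param(const std::string& kv) {
    return kv.rfind("utm_", 0) == 0 || kv.rfind("gclid=", 0) == 0 || kv.rfind("fbclid=", 0) == 0;
}

std::string strip_fragment_and_tracking(const std::string& url) {
    std::string base = url.substr(0, url.find('#'));     // drop fragment
    const auto q = base.find('?');
    if (q == std::string::npos) return base;
    std::string result = base.substr(0, q);
    std::istringstream query(base.substr(q + 1));
    std::vector<std::string> kept;
    for (std::string kv; std::getline(query, kv, '&');)
        if (!kv.empty() && !is_tracking_param(kv)) kept.push_back(kv);
    for (std::size_t i = 0; i < kept.size(); ++i)
        result += (i == 0 ? "?" : "&") + kept[i];
    return result;
}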

Checksum/dedup:

  • Use fast, deterministic content hashing on key content segments for equality and consider simhash/shingles for near-dup in future work.22

8. Networking Subsystem3

  • HTTP client: Unified component that sets timeouts, redirects, HTTP/2, accept-encoding, TCP_NODELAY, TCP_KEEPALIVE, and per-request conditional headers, reusing curl easy handles from a connection pool.3
  • TLS config: Certificate verification MUST be enabled (VERIFYPEER=1, VERIFYHOST=2) and CA bundles configured; disabling verification allows MITM and is not acceptable in production.23
  • Multiplexing: HTTP/2 multiplexing is enabled with curl multi and PIPEWAIT to prefer reusing existing connections and reduce FD churn.16
  • Concurrency knobs: CURLMOPT_MAX_TOTAL_CONNECTIONS, CURLMOPT_MAX_HOST_CONNECTIONS, and worker-local MAX_CONCURRENT_REQUESTS tune parallelism and fairness.4

Bandwidth & throughput:

  • Let the average payload be $S$ bytes and the compressed transfer rate per stream be $R$ bytes/sec; with $C$ streams per worker and $N$ workers, the ideal aggregate download rate is $bps \approx N \cdot C \cdot R$ (roughly $N \cdot C \cdot R / S$ pages/sec), gated in practice by latency and server throttling.4

9. Robots.txt & Politeness6

Parsing algorithm:

  • Read robots into memory and RocksDB with timestamp, parse per-agent allow/disallow rules, compute longest-prefix match precedence, enforce crawl-delay, and extract sitemap URLs with priorities.6

Cache invalidation:

  • Use age thresholds (e.g., 30 days) to refresh robots, and invalidate per-domain on observed 403/404 or significant policy changes.4
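
A sketch of longest-prefix precedence between allow and disallow rules; wildcards are omitted, and ties are resolved in favor of allow, a common convention that the actual parser may or may not share:

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct RobotsRule { std::string prefix; bool allow; };

bool path_allowed(const std::vector<RobotsRule>& rules, const std::string& path) {
    std::size_t best_len = 0;
    bool allowed = true;                                   // no matching rule => allowed
    for (const auto& rule : rules) {
        if (path.compare(0, rule.prefix.size(), rule.prefix) != 0) continue;   // prefix mismatch
        if (rule.prefix.size() > best_len || (rule.prefix.size() == best_len && rule.allow))
            allowed = rule.allow;                          // longer (more specific) rule wins
        best_len = std::max(best_len, rule.prefix.size());
    }
    return allowed;
}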

Flowchart:

flowchart TD
  A[Domain, Path] --> B{Memory cache hit?}
  B -- yes --> C{Allowed?}
  B -- no --> D[Load from RocksDB]
  D --> E{Loaded?}
  E -- yes --> C
  E -- no --> F[Insert placeholder and defer]
  C -- yes --> G[Fetch now]
  C -- no --> H[Drop]

TLS note:

  • The robots fetch mechanism SHOULD use the same TLS verification policy as page fetches to avoid inconsistent trust.4

10. Sitemaps & Feeds14

Index vs leaf:

  • Index sitemaps list child sitemaps, which are scheduled at priority-based intervals, while leaf sitemaps list URLs whose optional lastmod/changefreq values map to crawl priority.24

Parsing:

  • TinyXML2 is used to parse XML documents robustly without regex, and response headers are captured for conditional GET on subsequent polls.7

Sequence:

sequenceDiagram
  participant RC as Robots Cache
  participant SP as Sitemap Parser
  participant HC as HttpClient
  participant SF as Frontier
  RC-->>SP: candidate sitemap URLs
  SP->>HC: GET sitemap index
  HC-->>SP: XML
  SP->>SP: Expand child sitemaps
  SP->>HC: GET leaf sitemaps
  HC-->>SP: urlset xml
  SP->>SF: enqueue UrlInfo with priorities

11. Caching: HTTP & Application-level1

HTTP cache:

  • Conditional GET via ETag/Last-Modified reduces bandwidth and CPU; 304 responses skip content processing while updating metadata timestamps.3
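
In libcurl terms, attaching the cached validators looks roughly like this; CURLOPT_TIMECONDITION/CURLOPT_TIMEVALUE and the custom header list are real libcurl mechanisms, while the wrapper and cache plumbing are illustrative:

#include <ctime>
#include <string>
#include <curl/curl.h>

curl_slist* apply_conditional_headers(CURL* easy, const std::string& cached_etag,
                                      time_t cached_last_modified) {
    curl_slist* headers = nullptr;
    if (!cached_etag.empty()) {
        headers = curl_slist_append(headers, ("If-None-Match: " + cached_etag).c_str());
        curl_easy_setopt(easy, CURLOPT_HTTPHEADER, headers);
    }
    if (cached_last_modified > 0) {
        curl_easy_setopt(easy, CURLOPT_TIMECONDITION, (long)CURL_TIMECOND_IFMODSINCE);
        curl_easy_setopt(easy, CURLOPT_TIMEVALUE, static_cast<long>(cached_last_modified));
    }
    return headers;  // caller frees with curl_slist_free_all() after the transfer
}
// A 304 response then carries no body; only metadata timestamps are refreshed.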

Application-level:

  • Content-addressable metadata and enriched storage dedup repeated content, while frontiers avoid re-enqueueing seen URLs.2

Disk layout:

  • Sharded disk queue uses per-shard files to spill URLs when memory frontiers are saturated, with batch reads to refill when in-memory queue dips below thresholds.20

Sequence:

sequenceDiagram
  participant W as Worker
  participant CG as CondGet
  participant S as Server
  W->>CG: lookup headers
  W->>S: GET with If-None-Match/If-Modified-Since
  S-->>W: 304 or 200
  W->>CG: update headers from response

12. Fault Tolerance & Recovery5

Crash safety:

  • Metadata store and rate limiter persist through background writers, and the disk queue provides a durable backlog to resume after restart.13

Replay & idempotency:

  • Frontier seen-sets and metadata hashes prevent duplicate processing across restarts; worker shutdown drains active transfers before cleanup.2
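
The asynchronous writer pattern can be sketched as a background loop that drains queued key/value updates into a single RocksDB WriteBatch; the queue type, batch cadence, and (omitted) error handling are illustrative:

#include <atomic>
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

void metadata_writer_loop(rocksdb::DB* db,
                          std::vector<std::pair<std::string, std::string>>& pending,
                          std::mutex& mu, std::atomic<bool>& stop) {
    while (!stop.load()) {
        std::vector<std::pair<std::string, std::string>> local;
        {
            std::lock_guard<std::mutex> lock(mu);
            local.swap(pending);                          // take the whole backlog at once
        }
        if (!local.empty()) {
            rocksdb::WriteBatch batch;
            for (const auto& kv : local) batch.Put(kv.first, kv.second);
            rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);   // one batched write, off the crawl path
            (void)s;   // a real implementation would surface write errors
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}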

Flowchart:

flowchart TD
  A[Startup] --> B[Load configs]
  B --> C[Scan disk shards]
  C --> D[Warm frontiers]
  D --> E[Start workers]
  E --> F{Crash?}
  F -- yes --> G[Restart]
  G --> C
  F -- no --> H[Graceful shutdown]
  H --> I[Flush stores]
  I --> J[Stop]

13. Observability8

Metrics:

  • p/s crawl rate, total pages, queue depths (smart, work-stealing, HTML), HTTP status distribution, retry counts, robots hits, and backoff indicators should be emitted periodically in logs or exported via metrics collectors.8

Tracing/logging:

  • Tracy zones bracket hot path operations in workers, robots checks, metadata updates, and storage flushes for sampling and analysis.4
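
A minimal stats emitter of this shape; the counter names, interval, and log format are illustrative, and a real deployment would export to Prometheus/OpenTelemetry rather than stdout:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

void stats_loop(const std::atomic<std::uint64_t>& pages_fetched,
                const std::atomic<std::uint64_t>& queue_depth,
                const std::atomic<bool>& stop) {
    std::uint64_t last = 0;
    while (!stop.load()) {
        std::this_thread::sleep_for(std::chrono::seconds(10));
        const std::uint64_t now = pages_fetched.load();
        std::printf("[stats] %.1f pages/sec, queue=%llu\n", (now - last) / 10.0,
                    static_cast<unsigned long long>(queue_depth.load()));
        last = now;
    }
}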

Dataflow:

graph LR
  W[Workers] --> M[Metrics Emitter]
  F[Frontier] --> M
  N[HTTP Client] --> M
  S[Storage] --> M
  M --> G[Monitoring Stack]

14. Performance Engineering9

Bottlenecks:

  • CPU in parsing/regex, DNS/TLS handshake latency, server-side throttling, lock contention under high enqueue rates, and disk I/O for shards and RocksDB compaction.4

Benchmark methodology:

  • Use canned seeds and test endpoints (e.g., httpbin) to calibrate connection pools and concurrency; measure with p/s, bytes/sec, 95p latency, and FD counts under varied payload sizes and network emulation.2

Optimization playbook:

  • Prefer HTTP/2 multiplexing with PIPEWAIT to reduce new connections, batch enqueues per partition, enforce per-worker deque caps, reserve buffers, sanitize UTF-8 once, and use chrono types for time computations to avoid unit errors.16

Before/after tables can track improvements in p/s and CPU after enabling PIPEWAIT, batching enqueues, and introducing SIMD prefilters.9


15. Security, Legal, and Ethics6

  • Robots.txt compliance is mandatory and the user agent should clearly identify the crawler with a contact URL or email for incident response.18
  • TLS verification MUST be enabled (VERIFYPEER=1, VERIFYHOST=2) to prevent MITM and poisoned inputs; do not disable in production.10
  • Rate limiting prevents DoS and reduces collateral damage; blocklists and allowlists can further constrain scope during trials.15
  • PII handling: do not store sensitive data beyond necessary metadata; enriched storage should be secured at rest and access-controlled.11

16. Configuration & DevOps1

Example config files are provided for domains, seeds, feeds, emergency seeds, excluded patterns, and high-priority domains, and SHOULD be tailored to the deployment’s policy and scope.6

  • Domain policies JSON:
{
  "domains": {
    "reuters.com": {
      "crawl_frequency_limit": "2h",
      "language_whitelist": ["en"],
      "enabled": true,
      "priority_multiplier": 1.3
    }
  }
}
  • Feeds:
[
  { "url": "https://news.ycombinator.com/rss", "priority": 9, "poll_interval": 5 }
]
  • Seeds:
[
  "https://en.wikipedia.org/wiki/Special:Random",
  "https://github.com/trending"
]
  • Exclusion lists:
[".jpg", ".png", ".pdf", ".zip"]

Deployment topologies:

  • Single node: all components in one process using local RocksDB and disk shards for simplicity and quick iteration.1
  • Sharded cluster: multiple crawler instances partitioned by domain hash ranges, optionally with a distributed frontier broker and shared object storage.2
graph TD
  subgraph Node A
    A1[Workers] --> A2[Frontier]
    A2 --> A3[RocksDB]
    A1 --> A4[Storage JSON]
  end
  subgraph Node B
    B1[Workers] --> B2[Frontier]
    B2 --> B3[RocksDB]
    B1 --> B4[Storage JSON]
  end
  A4 --> S3[(Object Storage)]
  B4 --> S3
  A3 --> M[Monitoring]
  B3 --> M

Operational runbook:

  • Scale up by increasing worker threads and per-worker concurrency, monitor p/s and queue depths, expand to more nodes by hashing domains, and drain by flipping stop flag and letting coordinator flush before shutdown.2

17. Testing Strategy2

  • Unit tests for URL normalization, robots parsing, rate limiter arithmetic, and metadata serialization.12
  • Integration tests with canned HTTP fixtures for 200/304/429/503 flows, robots allow/disallow, sitemap/atom XML parsing, and conditional GET caching.3
  • Fuzzing for HTML parsers and robots tokenization to harden against malformed inputs.9
  • Chaos testing by injecting network timeouts, delayed responses, and forced crashes to validate persistence and recovery.5
  • Performance regression tests using fixed seed sets, concurrency knobs, and synthetic content sizes to track p/s trends over releases.4

18. Roadmap & Future Work9

  • JS rendering: optional headless browser pool for JS-heavy sites with tight budgets and strict robots policies.2
  • ML-based URL scoring: train freshness and quality predictors to enhance priority beyond heuristics.11
  • Distributed frontier: brokered by Kafka/NATS with idempotent tokens and cross-node politeness coordination.2
  • WARC export and CommonCrawl compatibility for interoperability with existing analysis tools.12

19. Glossary4

  • Frontier: the scheduler’s queue of URLs prioritized by readiness and importance for fetching.2
  • Politeness window: minimum interval between requests to the same domain to avoid overload.13
  • Conditional GET: HTTP mechanism using ETag/Last-Modified to fetch only when content changed.1
  • Multiplexing: HTTP/2 feature allowing multiple concurrent streams over a single TCP/TLS connection.25
  • Crawl-delay: robots.txt directive specifying minimum seconds to wait between requests.6
  • DEFERRED_FETCH_STARTED: sentinel indicating robots must be fetched before policy decision.6

20. References3

  • libcurl CURLOPT_SSL_VERIFYHOST: host certificate verification and warnings against disabling.10
  • libcurl CURLOPT_SSL_VERIFYPEER: peer certificate verification and defaults.23
  • libcurl CURLOPT_PIPEWAIT: prefer waiting to multiplex on existing connections.16
  • HTTP/2 multiplexing: overview and behavior in libcurl’s multi interface.25
  • Project code: CMakeLists for dependencies and build flags.1
  • Networking workers: multi-handle setup, request options, and result processing.4
  • Policy components: robots cache, rate limiter, and blacklists.6

Installation & Build1

Requirements:

  • C++17 compiler with pthreads, libcurl built with HTTP/2 support, RocksDB, RE2, TinyXML2, and optionally Tracy for profiling.1

Build:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target crawler -j$(nproc)

Run:

# Regular mode with 8 threads, depth 4, queue size 50k, 60 min max runtime
./build/crawler 8 4 50000 --max-runtime 60 --mode regular
# Fresh mode (RSS-driven, shallow):
./build/crawler --mode fresh

Security:

  • Ensure CA bundle is present and HttpClient is configured with VERIFYPEER=1 and VERIFYHOST=2 for production deployments.26

Configuration Files6

  • Place JSON configs under config/ with domain configs, seeds, feeds, emergency seeds, excluded patterns, and high-priority domains as provided in the repository examples.27
  • Emergency seeds are injected only under critical low-queue conditions to revive the crawl and SHOULD be audited regularly.4

Operational Notes & SLOs8

  • Suggested SLOs: 99% of fetches completing within $3 \cdot L_\text{eff}$ under normal network conditions, sustained p/s above target thresholds, and an error rate below 2% on healthy domains.9
  • Monitor:
    • Crawl rate p/s and filtered page counts to ensure content quality and sustained throughput.24
    • Queue depths and disk shard sizes to detect backpressure and storage spillover.19
    • HTTP status distribution and domain error counters for blacklist triggers and anomaly detection.28

Security Hardening Checklist3

  • MUST enable TLS verification and provide CA bundle paths in environments with custom trust stores.26
  • SHOULD restrict scope via allowlists during initial deployments and enforce robots with detailed logs on policy violations.6
  • MUST bound concurrency per host and per worker to avoid unintentional DoS or rate-limit violations.4
  • SHOULD rotate logs and protect enriched JSON outputs with proper access controls and encryption at rest where required.11

Design Appendices3

A. CURL Multi Tuning4

  • Set CURLMOPT_MAX_TOTAL_CONNECTIONS and CURLMOPT_MAX_HOST_CONNECTIONS to cap global and per-host sockets, and use CURLOPT_PIPEWAIT to bias reuse.16
  • Prefer HTTP/2 via CURLOPT_HTTP_VERSION with accept-encoding to reduce payload sizes on text content.3
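
Putting those knobs together on the multi handle; the specific caps shown are placeholders to tune per deployment:

#include <curl/curl.h>

CURLM* make_multi_handle() {
    CURLM* multi = curl_multi_init();
    curl_multi_setopt(multi, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX);   // allow HTTP/2 multiplexing
    curl_multi_setopt(multi, CURLMOPT_MAX_TOTAL_CONNECTIONS, 200L);      // global socket cap
    curl_multi_setopt(multi, CURLMOPT_MAX_HOST_CONNECTIONS, 2L);         // per-host socket cap for fairness
    return multi;
}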

B. Timekeeping & Portability13

  • Use std::chrono durations consistently and avoid direct count() comparisons across clocks; persist system_clock timestamps to disk for stability across restarts.6
  • For HTTP date parsing, use a portable UTC conversion strategy for RFC 2822 and ISO 8601 formats in environments without timegm.3
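
One portable convention is to persist system_clock timestamps as integral epoch seconds and rebuild the time_point on load, keeping steady_clock strictly for in-process intervals:

#include <chrono>
#include <cstdint>

std::int64_t to_epoch_seconds(std::chrono::system_clock::time_point tp) {
    return std::chrono::duration_cast<std::chrono::seconds>(tp.time_since_epoch()).count();
}

std::chrono::system_clock::time_point from_epoch_seconds(std::int64_t secs) {
    return std::chrono::system_clock::time_point{std::chrono::seconds{secs}};
}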

C. SIMD & Parsing9

  • Gate SIMD paths on payload size thresholds and cap maximum HTML size to bound CPU per page, falling back to scalar code when necessary.18

By adhering to the architecture, policies, and operational guidance in this README, Hybrid Speed Crawler delivers sustained, compliant, and observable crawling performance suitable for production pipelines that demand both breadth and freshness.2 29 30 31 32 33 34 35

Footnotes

  1. html_processing_queue.cpp
  2. smart_frontier.h
  3. smart_frontier.cpp
  4. emergency_seeds.json
  5. url_info.h
  6. domain_configs.json
  7. rss_poller.cpp
  8. work_stealing_queue.h
  9. ultra_parser.cpp
  10. https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYHOST.html
  11. url_normalizer.h
  12. url_normalizer.cpp
  13. work_stealing_queue.cpp
  14. sitemap_parser.cpp
  15. high_priority_domains.json
  16. https://curl.se/libcurl/c/CURLOPT_PIPEWAIT.html
  17. excluded_extensions.json
  18. html_processing_queue.h
  19. sharded_disk_queue.cpp
  20. sharded_disk_queue.h
  21. time_utils.h
  22. ultra_parser.h
  23. https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html
  24. sitemap_parser.h
  25. https://everything.curl.dev/libcurl-http/multiplexing.html
  26. https://docs.oracle.com/cd/E88353_01/html/E37842/curlopt-ssl-verifypeer-3.html
  27. seeds.json
  28. rss_poller.h
  29. https://stackoverflow.com/questions/15552052/how-to-enable-curlopt-ssl-verifyhost-2-support-on-my-os-php
  30. https://core.trac.wordpress.org/ticket/34794
  31. https://www.php.net/manual/en/function.curl-setopt.php
  32. https://docs.oracle.com/cd/E88353_01/html/E37842/curlopt-ssl-verifyhost-3.html
  33. https://php.watch/codex/CURLOPT_SSL_VERIFYPEER
  34. https://curl.se/mail/lib-2019-03/0102.html
  35. feeds.json
