A modular, production-grade, and aggressively optimized web crawler built in modern C++ with a multi-stage pipeline that couples HTTP/2 multiplexed fetching, robots-aware scheduling, sharded frontiers, SIMD-assisted parsing, content hashing, and RocksDB-backed persistence for crash-safe, sustained, compliant throughput at scale.1
This repository contains an end-to-end system that continuously discovers, fetches, filters, parses, deduplicates, and persists web content while enforcing domain politeness, honoring robots.txt, ingesting sitemaps and feeds, and coordinating graceful startup/shutdown with metrics and tracing hooks for operations.2
- Stack: C++17, libcurl multi (HTTP/2), RocksDB (metadata, caches), RE2 (regex), TinyXML2 (XML), moodycamel concurrentqueue (lock-free queues), Tracy (profiling), plus modular configs in JSON for domains, seeds, patterns, and feeds.1
TL;DR3
Hybrid Speed Crawler is a high-performance, robots-aware web crawler that uses multi-threaded curl multi handles with HTTP/2 multiplexing, domain-aware rate limiting, sharded smart frontiers, work-stealing queues, conditional GET caching, and SIMD-assisted parsing to sustain hundreds of pages per second on commodity hardware, with crash-safe RocksDB persistence and coherent shutdown semantics for production reliability.4
Table of Contents3
- Title, TL;DR & Table of Contents3
- Problem Statement & Goals2
- High-Level Architecture3
- Core Features4
- Detailed Pipeline Walkthroughs2
- Concurrency & Scheduling Model4
- Data Structures & Storage5
- Networking Subsystem3
- Robots.txt & Politeness6
- Sitemaps & Feeds7
- Caching: HTTP & Application-level1
- Fault Tolerance & Recovery5
- Observability8
- Performance Engineering9
- Security, Legal, and Ethics6
- Configuration & DevOps1
- Testing Strategy2
- Roadmap & Future Work9
- Glossary4
- References10
2. Problem Statement & Goals2
Modern content discovery demands crawling systems that remain fresh, respectful, and scalable in the face of dynamic websites, changing robots policies, and varied content quality, all while maintaining operational stability and crash safety.8
This crawler solves the acquisition side of ML-ready content pipelines by enforcing robots.txt, prioritizing by per-domain rules, using conditional GET to avoid redundant downloads, deduplicating via robust hashing, and exporting enriched metadata with timing and cacheable headers.4
- Freshness goals: leverage sitemaps, RSS/Atom, and conditional GET to recrawl recent and changing content while minimizing redundant fetches.7
- Scale goals: sustain high QPS via HTTP/2 multiplexing, connection pooling, work stealing, sharded disk queues, and priority-aware smart frontiers.4
- Politeness goals: domain-aware rate limiting, robots.txt allow/disallow with crawl-delay, and shared per-domain queues for throttling.6
- ML-ready: enriched storage embeds content hash, timing, status, and headers in JSON for downstream indexing, deduplication, and training.11
Non-goals:
- This is not a browser engine and does not execute arbitrary JavaScript or render dynamic pages beyond HTTP fetch and parsing.4
- This is not a search UI or ranking system; it exports enriched content and metadata for downstream indexing and retrieval systems.12
3. High-Level Architecture3
The system is organized into modular subsystems: networking (HTTP client + connection pool), policy (robots cache, rate limiter, domain configs), scheduling (smart frontier, work-stealing, sharded disk queue), parsing (SIMD/RE2/TinyXML2 pipeline), persistence (RocksDB metadata + JSON enriched storage), and coordination (monitoring, signal-safe shutdown).1
CMake lists dependencies, organizes modular sources, and enables profiling with Tracy plus aggressive optimization flags while keeping debug pathways for sanitizer builds.1
3.1 System Context (C4 L1)2
This diagram shows the crawler in relation to external systems such as DNS, origin servers, and optional downstream indexers.3
graph TD
A[Operator / DevOps] -->|Configure & Run| B[Hybrid Speed Crawler]
B -->|HTTP/2 GET| C[Origin Web Servers]
B -->|DNS| D[DNS Resolvers]
B -->|Robots/Sitemaps| C
B -->|Feeds (RSS/Atom)| C
B -->|Enriched JSON| E[Object Storage / S3-like]
B -->|Metadata| F[RocksDB]
B -->|Metrics & Logs| G[Monitoring Stack]
E -->|Downstream| H[Indexers / ML Pipelines]
3.2 Container/Component (C4 L2)3
The container diagram breaks down core components and internal dataflows.2
graph LR
subgraph Scheduler
SF[Smart Frontier\n(partitioned PQ)]
WS[Work-Stealing\nPer-Worker Deques]
SD[Sharded Disk Queue]
end
subgraph Policy
RL[Rate Limiter]
RC[Robots Cache]
DC[Domain Configs]
end
subgraph Network
CP[Connection Pool]
HC[HTTP Client\n(libcurl multi)]
CG[Conditional GET Cache]
end
subgraph Parsing
UP[Ultra Parser\nSIMD+RE2]
CF[Content Filter]
LD[Language Detector]
HD[HTML Document Sanity]
end
subgraph Persist
MD[Metadata Store\n(RocksDB)]
ES[Enriched Storage\n(JSON batches)]
end
subgraph Control
MON[Monitoring & Shutdown]
end
SF --> WS
WS --> HC
HC --> UP
UP --> CF
CF --> ES
HC --> CG
HC --> RC
RL --> WS
RC --> Scheduler
MD --> SF
ES --> Persist
DC --> Scheduler
MON --> Scheduler
3.3 Dataflow Overview4
- URLs originate from seeds, sitemaps, feeds, and page parsing, then are normalized, prioritized, and queued for fetching under domain-aware rate limiting.2
- The fetch pipeline reuses HTTP/2 connections, applies conditional GET headers, decompresses as needed, and hands HTML to parsing, which extracts links and evaluates quality.4
- Persisted outputs include metadata updates in RocksDB and enriched JSON content for downstream processing in object storage or analytics.12
3.4 Phase 1 / Phase 2 Pipeline2
Phase 1 focuses on robust high-speed crawling with smart scheduling and minimal per-page CPU, while Phase 2 adds feed and sitemap integrations to continuously inject fresh content sources and expand discovery.7
4. Core Features4
Each feature section describes motivation, algorithmic approach, complexity, and operational trade-offs with explicit MUST/SHOULD guidance for production.2
4.1 Domain-aware rate limiting & per-host concurrency13
- Motivation: Prevent overload and comply with politeness by spacing requests per domain while maximizing overall throughput.13
- Approach: A sharded rate limiter persists last-request timestamps per shard in RocksDB and computes gaps using chrono-based windows; per-host connections are capped via libcurl multi options for fairness.13
- Complexity: O(1) per request/update; RocksDB writes are batched asynchronously in a background thread.13
- Requirements: The crawler MUST enforce per-host concurrency and per-domain pacing before issuing requests and SHOULD persist limiter state to survive restarts without bursts.13
Pseudocode:
if (!rateLimiter.can_request_now(domain)) {
    sharedDomainQueues.try_queue_for_domain(domain, urlInfo);
    continue; // pick another available domain
}
rateLimiter.record_request(domain);
4.2 robots.txt parser & policy engine6
- Motivation: Legal and ethical crawling requires strict adherence to robots.txt directives including allow/disallow and crawl-delay for the configured user agent.6
- Approach: Two-phase cache checks (memory then RocksDB) with wildcard patterns and per-agent rule selection, non-blocking with DEFERRED_FETCH_STARTED signals to trigger fetches on cache misses, and sitemap extraction from robots.6
- Complexity: O(L) per check where L is length of robots text, with in-memory lookup O(1) average; fetch costs amortized and cached.6
- Requirements: The crawler MUST disallow non-permitted paths per robots policy and SHOULD respect crawl-delay if provided, with cache invalidation heuristics for aged entries.6
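To make the longest-prefix precedence concrete, here is a minimal, self-contained C++ sketch of the allow/disallow decision (wildcard patterns and per-agent rule selection are omitted); `RobotsRules` and `is_allowed` are illustrative names, not the repository's actual API:

```cpp
#include <string>
#include <utility>
#include <vector>

// Illustrative rule set for one user agent: (path prefix, allow?) pairs.
struct RobotsRules {
    std::vector<std::pair<std::string, bool>> rules;  // e.g. {"/private/", false}
};

// Longest-prefix-match decision: the most specific matching rule wins;
// with no matching rule the path is allowed by default.
bool is_allowed(const RobotsRules& r, const std::string& path) {
    std::size_t best_len = 0;
    bool allowed = true;  // default when nothing matches
    for (const auto& [prefix, allow] : r.rules) {
        if (path.compare(0, prefix.size(), prefix) == 0 && prefix.size() >= best_len) {
            // Ties between equally specific rules go to Allow, mirroring
            // common robots.txt interpretations.
            if (prefix.size() > best_len || allow) {
                best_len = prefix.size();
                allowed = allow;
            }
        }
    }
    return allowed;
}
```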
4.3 Sitemap & RSS/Atom ingestion7
- Motivation: Rapid discovery of high-priority and frequently changing content via sitemaps and feeds reduces crawl latency and amplifies coverage.14
- Approach: TinyXML2-based parsing of sitemap index and URL sets, with domain-derived prioritization and recrawl intervals; RSS poller runs dedicated thread, filters recent entries, and emits URLs based on recency and priority.7
- Complexity: XML parsing is O(N) in elements; scheduling is O(1) per enqueue with partitioning.14
- Requirements: The crawler SHOULD enable conditional GET on feeds/sitemaps and MUST validate sitemap URLs to avoid poisoning the frontier.3
4.4 Smart frontier / priority queues / work stealing2
- Motivation: Prioritize valuable, ready URLs while minimizing lock contention and maximizing CPU utilization across workers.2
- Approach: Partitioned priority queues for ready-time and priority ordering, ready-partition index via lock-free queue, and per-worker deques for locality with stealing to balance load.3
- Complexity: Enqueue O(log n) in partition heap; dequeue is O(1) amortized when using ready partition fast path, with stealing O(1) average.2
- Requirements: The scheduler MUST enforce max depth and queue caps, SHOULD batch enqueues by partition to reduce lock traffic, and MAY fallback to disk sharding when memory thresholds are reached.3
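A simplified sketch of the per-worker deque semantics described above: the owner pushes and pops at the back for cache locality, while idle workers steal from the front. The real implementation uses lock-free structures; the mutex here keeps the sketch short, and the names are illustrative:

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <string>

struct UrlTask { std::string url; int depth = 0; double priority = 0.0; };

class WorkStealingDeque {
public:
    explicit WorkStealingDeque(std::size_t cap) : cap_(cap) {}

    bool push(UrlTask t) {                      // owner side
        std::lock_guard<std::mutex> g(m_);
        if (q_.size() >= cap_) return false;    // respect the per-worker cap
        q_.push_back(std::move(t));
        return true;
    }
    std::optional<UrlTask> pop() {              // owner side (LIFO, hot cache)
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        UrlTask t = std::move(q_.back()); q_.pop_back();
        return t;
    }
    std::optional<UrlTask> steal() {            // thief side (FIFO, oldest work)
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        UrlTask t = std::move(q_.front()); q_.pop_front();
        return t;
    }
private:
    std::mutex m_;
    std::deque<UrlTask> q_;
    std::size_t cap_;
};
```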
4.5 Proxy rotation & IP pool health monitoring1
- Motivation: Corporate environments or geo-distributed crawling may require rotating egress IPs and monitoring their health to avoid blocks and balance load.1
- Approach: Integrate an optional proxy list with per-proxy health scores, backoff on failures, and periodic re-tests; this can be wired into HttpClient options and per-request overrides.3
- Complexity: O(1) per selection; periodic health checks O(P), where P is proxies count.1
- Requirements: Proxy rotation MAY be configured; when enabled, the crawler SHOULD bound concurrent connections per proxy and MUST respect robots and rate limiting irrespective of proxy identity.4
4.6 Content hashing & dedup12
- Motivation: Reduce storage and downstream processing by identifying unchanged or duplicate content, enabling recrawl scheduling and stable identifiers.12
- Approach: Fast content hashing (e.g., stable hash over key parts of content), stored in metadata for conditional recrawl and enriched output; optional simhash/shingling can be added for near-dup detection.11
- Complexity: O(n) per page content length; RocksDB updates batched asynchronously.5
- Requirements: The crawler SHOULD compute content hashes on successful 200 responses and MUST update metadata to adapt recrawl intervals based on change patterns.5
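A minimal example of the kind of fast, deterministic hash this step relies on; FNV-1a is shown for illustration and is not necessarily the exact hash used in the codebase:

```cpp
#include <cstdint>
#include <string_view>

// FNV-1a 64-bit: fast, deterministic, non-cryptographic -- suitable for change
// detection and exact-duplicate checks, not for adversarial inputs.
std::uint64_t content_hash(std::string_view content) {
    std::uint64_t h = 14695981039346656037ull;        // FNV offset basis
    for (unsigned char c : content) {
        h ^= c;
        h *= 1099511628211ull;                        // FNV prime
    }
    return h;
}

// Typical use: compare against the hash stored in UrlMetadata; if unchanged,
// lengthen the recrawl interval, otherwise shorten it.
```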
4.7 Snapshot vs. continuous recrawl2
- Motivation: Different data consumers need one-time snapshots or continuous freshness; the crawler must support both strategies.2
- Approach: Regular mode deep crawl with disk queues versus fresh mode with RSS-driven ingestion and shallower depth, both using consistent policy and persistence layers.2
- Complexity: Equivalent per-request; scheduler parameters differ for depth and queue sizes.2
- Requirements: Fresh mode SHOULD run indefinitely with bounded queues and emergency seed injection; regular mode MAY use a max runtime window with graceful shutdown.2
4.8 Adaptive backoff & retry4
- Motivation: Transient network/server errors demand adaptive strategies to avoid hammering and to recover quickly.4
- Approach: Exponential backoff with jitter budgets via rate limiter failure counts, domain blacklisting for repeated timeouts, and retry fallback HTTP→HTTPS or vice versa for protocol errors.15
- Complexity: O(1) bookkeeping per failure; blacklist cleanup O(K) where K is number of temporary entries.15
- Requirements: The crawler SHOULD back off on 429/503 and timeouts, and MAY switch protocols on SSL connect failures with a one-time retry.4
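A compact sketch of exponential backoff with full jitter, assuming the consecutive-failure count is tracked by the rate limiter; the base delay, cap, and exponent limit are illustrative constants, not the project's configured values:

```cpp
#include <algorithm>
#include <chrono>
#include <random>

// Delay grows with consecutive failures, is capped, and is randomized over
// [0, cap] ("full jitter") to avoid synchronized retry storms.
std::chrono::milliseconds backoff_delay(
        int failures,
        std::chrono::milliseconds base = std::chrono::milliseconds(500),
        std::chrono::milliseconds cap  = std::chrono::minutes(5)) {
    const int exp = std::min(failures, 16);              // bound the shift
    const auto uncapped = base * (1LL << exp);
    const auto ceiling  = std::min<std::chrono::milliseconds>(uncapped, cap);

    static thread_local std::mt19937_64 rng{std::random_device{}()};
    std::uniform_int_distribution<long long> jitter(0, ceiling.count());
    return std::chrono::milliseconds(jitter(rng));
}
```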
4.9 Compression, HTTP/2, TLS reuse, connection pooling3
- Motivation: Maximize bandwidth efficiency and minimize handshake/connection overhead for high throughput.3
- Approach: libcurl multi with HTTP/2, accept-encoding for gzip/deflate, connection pooling of easy handles, and CURLOPT_PIPEWAIT to bias multiplex reuse instead of opening new connections.3
- Complexity: O(1) per transfer management; overall gains are multiplicative with concurrency.16
- Requirements: The crawler MUST enable certificate verification and SHOULD enable HTTP/2 multiplexing with pipe-wait to keep FD count low and reuse TLS sessions.10
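The sketch below shows how the per-handle options named above map onto libcurl calls. The options themselves are standard libcurl; the user-agent string, contact URL, timeouts, and redirect limits are placeholders rather than the project's configured values:

```cpp
#include <curl/curl.h>

// HTTP/2 with a preference for multiplexing onto existing connections,
// transparent decompression, strict TLS verification, and basic timeouts.
void configure_easy_handle(CURL* h, const char* url) {
    curl_easy_setopt(h, CURLOPT_URL, url);
    curl_easy_setopt(h, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_2TLS); // HTTP/2 over TLS
    curl_easy_setopt(h, CURLOPT_PIPEWAIT, 1L);          // wait to multiplex instead of opening a new connection
    curl_easy_setopt(h, CURLOPT_ACCEPT_ENCODING, "");   // advertise all built-in encodings (gzip, deflate, ...)
    curl_easy_setopt(h, CURLOPT_SSL_VERIFYPEER, 1L);    // verify the certificate chain
    curl_easy_setopt(h, CURLOPT_SSL_VERIFYHOST, 2L);    // verify the hostname against the certificate
    curl_easy_setopt(h, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(h, CURLOPT_MAXREDIRS, 5L);
    curl_easy_setopt(h, CURLOPT_CONNECTTIMEOUT, 10L);
    curl_easy_setopt(h, CURLOPT_TIMEOUT, 30L);
    curl_easy_setopt(h, CURLOPT_TCP_NODELAY, 1L);
    curl_easy_setopt(h, CURLOPT_TCP_KEEPALIVE, 1L);
    curl_easy_setopt(h, CURLOPT_USERAGENT,
                     "HybridSpeedCrawler/1.0 (+https://example.org/contact)"); // placeholder UA
}
```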
4.10 Content-type sniffing, encoding normalization, charset detection3
- Motivation: Many pages omit or mislabel headers; sensible defaults and normalization improve parsing correctness.3
- Approach: Header parsing captures content-type and heuristics can be applied to detect HTML versus binary, normalizing encoding and skipping unsuitable media.3
- Complexity: O(1) for header processing; optional content scan O(n).3
- Requirements: The crawler SHOULD skip non-HTML-like responses for the HTML pipeline and MAY store binary metadata for inventory use cases.3
4.11 Snippet extraction, boilerplate removal, readability scoring17
- Motivation: Produce cleaner content for downstream indexing and training by filtering scripts, navs, and ads while scoring textual content density.17
- Approach: RE2-based structural stripping with density metrics and sentence heuristics; SIMD prefilter assists scanning and reduces overhead before deeper extraction.9
- Complexity: O(n) per page with vectorized primitives; configurable thresholds to bound CPU per page.17
- Requirements: The crawler SHOULD drop low-quality pages early and MAY store a quality score alongside content for downstream filtering.11
4.12 Structured data extraction (Schema.org, OpenGraph, JSON-LD)12
- Motivation: Enrich downstream applications with structured hints like title, description, and canonical links.12
- Approach: Lightweight scanning for meta tags and JSON-LD blocks can be integrated into the parser or a dedicated pass feeding enriched storage.11
- Complexity: O(n) per page scan; negligible impact if batched.11
- Requirements: The crawler MAY include structured data fields in enriched JSON when available without blocking critical path.11
4.13 Politeness, legal & compliance6
- Motivation: Respect site owners and legal constraints while crawling at scale.6
- Approach: Strict robots adherence, rate limiting, blacklists/allowlists, and user-agent transparency via headers with clear contact info.4
- Requirements: The crawler MUST follow robots rules, MUST rate limit by domain, and SHOULD provide a clear User-Agent with operator contact endpoint.18
4.14 Fault tolerance, persistence & recovery5
- Motivation: Keep progress under crashes and shutdowns with minimal loss and fast recovery.5
- Approach: Asynchronous RocksDB writers for metadata and limiter states, disk-sharded queues for overflow and restarts, graceful shutdown that flushes storage and coordinates thread termination.19
- Requirements: The crawler MUST drain and flush queues on shutdown and SHOULD use disk sharding for large frontier sizes beyond RAM.2
4.15 Metrics, tracing, logging, and profiling8
- Motivation: Operability demands visibility into throughput, latency, and failures with low overhead.8
- Approach: Periodic stats logging with queue depths and p/s, error tracking per domain, Tracy profiling zones in hot paths, and hooks for metrics exporters.9
- Requirements: The crawler SHOULD emit queue size, p/s, error counts, and backpressure indicators at steady intervals and MAY integrate with Prometheus/OpenTelemetry.8
5. Detailed Pipeline Walkthroughs2
This section includes end-to-end flows and sequence diagrams for the most important processes to aid onboarding, debugging, and performance tuning.4
5.1 URL lifecycle (state machine)2
A URL transitions through states from discovery to scheduled recrawl or drop depending on policy, quality, and success.3
stateDiagram-v2
[*] --> DISCOVERED
DISCOVERED --> NORMALIZED
NORMALIZED --> DEDUP_CHECK
DEDUP_CHECK --> FETCH_QUEUED
FETCH_QUEUED --> FETCHING
FETCHING --> ROBOTS_BLOCKED: robots disallow
FETCHING --> RATE_LIMITED: not ready window
FETCHING --> FETCH_OK: 2xx
FETCHING --> FETCH_SOFT_FAIL: 3xx/429/503
FETCHING --> FETCH_HARD_FAIL: 4xx/5xx
FETCH_OK --> PARSED
PARSED --> QUALITY_OK
PARSED --> QUALITY_REJECT
QUALITY_OK --> INDEXED
INDEXED --> SCHEDULED_FOR_RECRAWL
QUALITY_REJECT --> DROPPED
ROBOTS_BLOCKED --> DROPPED
FETCH_SOFT_FAIL --> RETRY_BUDGET
RETRY_BUDGET --> FETCH_QUEUED
RETRY_BUDGET --> DROPPED: exhausted
SCHEDULED_FOR_RECRAWL --> FETCH_QUEUED
5.2 HTTP fetch pipeline (sequence)4
This sequence shows DNS, connect, TLS, request, response, decompress, and handoff.3
sequenceDiagram
participant F as Frontier
participant W as Worker
participant M as libcurl multi
participant S as Server
F->>W: dequeue UrlInfo
W->>M: add easy handle (HTTP/2, headers)
M->>S: DNS + TCP + TLS handshake
S-->>M: TLS cert (validated)
M-->>S: GET with If-None-Match/If-Modified-Since
S-->>M: 200/304 + gzip payload
M-->>W: response body + headers
W->>W: decompress + header parse
W->>Parser: HTML content
Parser-->>W: links + quality decision
W->>Storage: metadata update + enriched JSON
5.3 robots.txt fetch + decision engine6
Robots evaluation is always performed before issuing a content request to a path.4
flowchart TD
A[URL] --> B{Cache has robots?}
B -- yes --> C{Path allowed?}
B -- no --> D[Queue robots fetch]
D --> E[DEFERRED_FETCH_STARTED]
C -- yes --> F[Proceed to fetch]
C -- no --> G[Drop URL]
5.4 Sitemap index expansion & scheduling14
Index sitemaps generate child sitemaps added to the queue and parsed for URLs which in turn flow into the frontier.14
sequenceDiagram
participant SP as Sitemap Parser
participant RC as Robots Cache
participant HC as HttpClient
participant SF as Smart Frontier
SP->>RC: domains to monitor
RC-->>SP: sitemaps list
SP->>HC: download index
HC-->>SP: XML index
SP->>SP: extract child sitemaps
SP->>HC: download child sitemaps
HC-->>SP: XML urlsets
SP->>SF: enqueue UrlInfo with priority
5.5 RSS polling loop7
The RSS worker periodically polls feeds, filters recency, and emits new entries.7
flowchart LR
A[Start Poller] --> B[Load feeds JSON]
B --> C[Wait poll_interval]
C --> D[Download feed]
D --> E[Parse RSS/Atom]
E --> F{Entries recent?}
F -- yes --> G[Emit UrlInfo batch]
F -- no --> H[Skip]
G --> C
H --> C
5.6 Conditional GET cache1
Conditional GET reduces redundant downloads by caching ETag/Last-Modified.3
sequenceDiagram
participant W as Worker
participant CG as ConditionalGet
participant HC as HttpClient
W->>CG: get_cache_info(url)
CG-->>W: {etag, last-modified}
W->>HC: GET with If-None-Match/If-Modified-Since
HC-->>W: 304 or 200
W->>CG: update_cache(headers)
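A small sketch of how the cached validators translate into request headers with libcurl; the cache lookup (`get_cache_info` in the diagram) is assumed to have already produced the ETag and Last-Modified strings:

```cpp
#include <curl/curl.h>
#include <string>

// Attach If-None-Match / If-Modified-Since validators cached from an earlier
// response. Empty strings mean "not cached yet".
struct curl_slist* add_conditional_headers(CURL* h,
                                           struct curl_slist* headers,
                                           const std::string& etag,
                                           const std::string& last_modified) {
    if (!etag.empty())
        headers = curl_slist_append(headers, ("If-None-Match: " + etag).c_str());
    if (!last_modified.empty())
        headers = curl_slist_append(headers, ("If-Modified-Since: " + last_modified).c_str());
    if (headers)
        curl_easy_setopt(h, CURLOPT_HTTPHEADER, headers);
    return headers;  // caller frees with curl_slist_free_all() after the transfer
}

// On completion: a 304 means the cached copy is still valid -- update metadata
// timestamps and skip parsing; a 200 carries fresh validators to store.
```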
5.7 Queue sharding & work stealing2
Balanced scheduling via partitioned heaps and per-worker deques with stealing.13
flowchart TD
A[Batch of UrlInfo] --> B[Partition by hash]
B --> C[Lock partition once]
C --> D[Push into PQ]
D --> E[Mark ready partition]
E --> F[Worker pop from ready partitions]
F --> G{Empty?}
G -- yes --> H[Try steal from victim]
G -- no --> I[Process URL]
5.8 Proxy rotation & health scoring1
When enabled, the client cycles proxies with health scores decreasing upon failures.3
sequenceDiagram
participant P as ProxyPool
participant W as Worker
participant HC as HttpClient
W->>P: next proxy()
P-->>W: proxy A
W->>HC: GET via proxy A
HC-->>W: response or error
W->>P: record_success/failure
5.9 Backpressure & flow control8
Monitor queue depths and crawl rate to apply refill from disk or inject seeds.19
flowchart TD
A[Monitor tick] --> B{SmartQ < threshold?}
B -- yes --> C[Load from disk shards]
B -- no --> D[No refill]
C --> E{Total URLs < critical?}
E -- yes --> F[Emergency seeds]
E -- no --> G[Continue]
5.10 Graceful shutdown2
Signal-safe handler flips the stop flag and the coordinator drains queues and joins threads.8
sequenceDiagram
participant OS as OS Signal
participant SH as Signal Handler
participant MON as Coordinator
participant W as Workers
OS->>SH: SIGINT/SIGTERM
SH->>MON: notify_all()
MON->>W: interrupt_waits()
W-->>MON: finished
MON->>Storage: flush()
MON->>All: teardown components
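A minimal sketch of the signal-safe pattern in the diagram: the handler only flips an atomic flag (safe given a lock-free `std::atomic<bool>`), and the coordinator wakes waiting workers through a condition variable before draining and flushing. All names here are illustrative, not the repository's API:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <csignal>
#include <mutex>

std::atomic<bool> g_stop{false};
std::mutex g_mtx;
std::condition_variable g_cv;

extern "C" void handle_signal(int) { g_stop.store(true, std::memory_order_relaxed); }

void install_handlers() {
    std::signal(SIGINT, handle_signal);
    std::signal(SIGTERM, handle_signal);
}

// Worker-side wait: returns true if the worker should keep running.
bool wait_or_stop(std::chrono::milliseconds idle) {
    std::unique_lock<std::mutex> lk(g_mtx);
    g_cv.wait_for(lk, idle, [] { return g_stop.load(); });
    return !g_stop.load();
}

// Coordinator tick (sketch): once the flag is set, wake everyone so queues can
// drain, then flush storage and join threads.
void coordinator_tick() {
    if (g_stop.load()) g_cv.notify_all();
}
```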
6. Concurrency & Scheduling Model4
- Threading: Separate network worker threads using curl multi with per-worker active request sets, and dedicated threads for RSS and Sitemap parsing guarded by condition variables for responsive shutdown.7
- CPU pinning: Optional pinning can reduce context switches; the design minimizes cross-core chatter via per-partition locks and per-worker deques.2
- Async I/O: libcurl multi provides event-driven parallelism; HTTP/2 multiplexing allows multiple streams per connection, which the system configures and prioritizes with PIPEWAIT to bias reuse.16
- Lock contention: Batching enqueues and ready-partition queues reduce contention on partition heaps, while per-worker queues localize push/pop without global locks.2
- False sharing: Atomic counters and queue size fields are localized per partition and per worker to avoid cache line contention.2
Priority scheduling formula:
- Let $p_d$ be the base priority derived from depth $d$, with $p_d = \max(p_\text{min},\, 1 - d \cdot \alpha)$ where $\alpha$ is the depth penalty; let $m_\text{domain}$ be the domain multiplier from config and $o$ the overdue factor applied once a URL is past its ready time. The final priority is $P = \min(p_\text{max},\, p_d \cdot m_\text{domain} + o)$.2
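A direct transcription of that formula into C++; the constants ($\alpha$, $p_\text{min}$, $p_\text{max}$) and the shape of the overdue bonus are illustrative tuning choices, not values taken from the codebase:

```cpp
#include <algorithm>
#include <chrono>

double url_priority(int depth, double domain_multiplier,
                    std::chrono::seconds overdue_by,
                    double alpha = 0.15, double p_min = 0.05, double p_max = 1.0) {
    const double p_d = std::max(p_min, 1.0 - depth * alpha);        // depth-decayed base priority
    const double o   = overdue_by.count() > 0
                         ? std::min(0.2, overdue_by.count() / 3600.0 * 0.05)  // small boost per overdue hour
                         : 0.0;
    return std::min(p_max, p_d * domain_multiplier + o);
}
```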
Throughput model:
- Pages per second: $\text{pps} \approx \frac{N_\text{workers} \cdot C_\text{streams}}{L_\text{eff}}$, where the effective latency $L_\text{eff}$ includes DNS + connect + TLS and payload time, and the stream count is shaped by HTTP/2 multiplexing and backpressure.4
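For a concrete sense of scale (illustrative numbers, not a measurement): with $N_\text{workers} = 8$, $C_\text{streams} = 50$, and $L_\text{eff} = 2\,\text{s}$, the model gives $\text{pps} \approx \frac{8 \cdot 50}{2} = 200$ pages per second, before politeness pacing and backpressure are applied.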
7. Data Structures & Storage5
- Frontier: Partitioned priority queues keyed by expected crawl time and priority with seen sets per partition to avoid duplicates.2
- Work-stealing deques: Per-worker deques with capped sizes and front-steal/back-pop semantics to maximize locality.8
- Disk sharding: N shards each with file + mutex + size for spillover and restart warm-up, with periodic cleanup of empty shards.20
- Metadata: RocksDB store of UrlMetadata keyed by URL with crawl times, hash, backoff, failures, and change frequency, persisted asynchronously from a concurrent queue worker.21
ER/Class overview:
- UrlMetadata(url, last_crawl_time, previous_change_time, expected_next_crawl, content_hash, backoff_multiplier, crawl_count, change_frequency, temporary_failures).21
- RobotsInfo(domain, content, timestamp, is_valid, last_http_status, crawl_delay, sitemaps[], sitemaps_parsed).4
- HttpHeaders(etag, last_modified, response_time).18
- EnrichedPageData(url, content, domain, depth, content_hash, times, status, length).11
classDiagram
class UrlMetadata {
+time last_crawl_time
+time previous_change_time
+time expected_next_crawl
+string content_hash
+int backoff_multiplier
+int crawl_count
+float change_frequency
+int temporary_failures
}
class RobotsInfo {
+string content
+time timestamp
+bool is_valid
+int last_http_status
+int crawl_delay
+SitemapInfo[] sitemaps
+bool sitemaps_parsed
}
class EnrichedPageData {
+string url
+string content
+string domain
+int depth
+string content_hash
+time last_crawl_time
+int http_status_code
+size_t content_length
}
UrlMetadata <.. EnrichedPageData
RobotsInfo <.. SitemapInfo
Canonicalization rules:
- Normalize scheme and domain case, remove fragments, collapse slashes, and strip tracking params such as utm_*, gclid, fbclid from queries to improve dedup and politeness.12
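A minimal sketch of the tracking-parameter stripping and case normalization described above, operating on an already-split URL; full URL parsing, percent-encoding normalization, and IDN handling are out of scope for this fragment:

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Drop utm_*, gclid, and fbclid parameters from a raw query string.
std::string canonicalize_query(const std::string& query) {
    static const std::vector<std::string> drop_prefixes = {"utm_", "gclid", "fbclid"};
    std::vector<std::string> kept;
    std::stringstream ss(query);
    std::string param;
    while (std::getline(ss, param, '&')) {
        const std::string key = param.substr(0, param.find('='));
        bool drop = false;
        for (const auto& p : drop_prefixes)
            if (key.compare(0, p.size(), p) == 0) { drop = true; break; }
        if (!drop && !param.empty()) kept.push_back(param);
    }
    std::string out;
    for (std::size_t i = 0; i < kept.size(); ++i) out += (i ? "&" : "") + kept[i];
    return out;
}

// Lowercase helper: apply to scheme and host only, never to path or query values.
std::string lowercase(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}
```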
Checksum/dedup:
- Use fast, deterministic content hashing on key content segments for equality and consider simhash/shingles for near-dup in future work.22
8. Networking Subsystem3
- HTTP client: Unified component that sets timeouts, redirects, HTTP/2, accept-encoding, TCP_NODELAY, TCP_KEEPALIVE, and per-request conditional headers, reusing curl easy handles from a connection pool.3
- TLS config: Certificate verification MUST be enabled (VERIFYPEER=1, VERIFYHOST=2) and CA bundles configured; disabling verification allows MITM and is not acceptable in production.23
- Multiplexing: HTTP/2 multiplexing is enabled with curl multi and PIPEWAIT to prefer reusing existing connections and reduce FD churn.16
- Concurrency knobs: CURLMOPT_MAX_TOTAL_CONNECTIONS, CURLMOPT_MAX_HOST_CONNECTIONS, and worker-local MAX_CONCURRENT_REQUESTS tune parallelism and fairness.4
Bandwidth & throughput:
- Let the average payload size be $S$ bytes and the compressed transfer rate per stream be $R$ bytes/sec; with $C$ streams per worker and $N$ workers, the ideal download rate is $\text{bps} \approx N \cdot C \cdot R$, gated by latency and server-side throttling.4
9. Robots.txt & Politeness6
Parsing algorithm:
- Read robots into memory and RocksDB with timestamp, parse per-agent allow/disallow rules, compute longest-prefix match precedence, enforce crawl-delay, and extract sitemap URLs with priorities.6
Cache invalidation:
- Use age thresholds (e.g., 30 days) to refresh robots, and invalidate per-domain on observed 403/404 or significant policy changes.4
Flowchart:
flowchart TD
A[Domain, Path] --> B{Memory cache hit?}
B -- yes --> C{Allowed?}
B -- no --> D[Load from RocksDB]
D --> E{Loaded?}
E -- yes --> C
E -- no --> F[Insert placeholder and defer]
C -- yes --> G[Fetch now]
C -- no --> H[Drop]
TLS note:
- The robots fetch mechanism SHOULD use the same TLS verification policy as page fetches to avoid inconsistent trust.4
10. Sitemaps & Feeds14
Index vs leaf:
- Index sitemaps list child sitemaps which are scheduled at priority-based intervals, while leaf sitemaps list URLs with optional lastmod/changefreq mapping to crawl priority.24
Parsing:
- TinyXML2 is used to parse XML documents robustly without regex, and response headers are captured for conditional GET on subsequent polls.7
Sequence:
sequenceDiagram
participant RC as Robots Cache
participant SP as Sitemap Parser
participant HC as HttpClient
participant SF as Frontier
RC-->>SP: candidate sitemap URLs
SP->>HC: GET sitemap index
HC-->>SP: XML
SP->>SP: Expand child sitemaps
SP->>HC: GET leaf sitemaps
HC-->>SP: urlset xml
SP->>SF: enqueue UrlInfo with priorities
11. Caching: HTTP & Application-level1
HTTP cache:
- Conditional GET via ETag/Last-Modified reduces bandwidth and CPU; 304 responses skip content processing while updating metadata timestamps.3
Application-level:
- Content-addressable metadata and enriched storage dedup repeated content, while frontiers avoid re-enqueueing seen URLs.2
Disk layout:
- Sharded disk queue uses per-shard files to spill URLs when memory frontiers are saturated, with batch reads to refill when in-memory queue dips below thresholds.20
Sequence:
sequenceDiagram
participant W as Worker
participant CG as CondGet
participant S as Server
W->>CG: lookup headers
W->>S: GET with If-None-Match/If-Modified-Since
S-->>W: 304 or 200
W->>CG: update headers from response
12. Fault Tolerance & Recovery5
Crash safety:
- Metadata store and rate limiter persist through background writers, and the disk queue provides a durable backlog to resume after restart.13
Replay & idempotency:
- Frontier seen-sets and metadata hashes prevent duplicate processing across restarts; worker shutdown drains active transfers before cleanup.2
Flowchart:
flowchart TD
A[Startup] --> B[Load configs]
B --> C[Scan disk shards]
C --> D[Warm frontiers]
D --> E[Start workers]
E --> F{Crash?}
F -- yes --> G[Restart]
G --> C
F -- no --> H[Graceful shutdown]
H --> I[Flush stores]
I --> J[Stop]
13. Observability8
Metrics:
- p/s crawl rate, total pages, queue depths (smart, work-stealing, HTML), HTTP status distribution, retry counts, robots hits, and backoff indicators should be emitted periodically in logs or exported via metrics collectors.8
Tracing/logging:
- Tracy zones bracket hot path operations in workers, robots checks, metadata updates, and storage flushes for sampling and analysis.4
Dataflow:
graph LR
W[Workers] --> M[Metrics Emitter]
F[Frontier] --> M
N[HTTP Client] --> M
S[Storage] --> M
M --> G[Monitoring Stack]
14. Performance Engineering9
Bottlenecks:
- CPU in parsing/regex, DNS/TLS handshake latency, server-side throttling, lock contention under high enqueue rates, and disk I/O for shards and RocksDB compaction.4
Benchmark methodology:
- Use canned seeds and test endpoints (e.g., httpbin) to calibrate connection pools and concurrency; measure p/s, bytes/sec, p95 latency, and FD counts under varied payload sizes and network emulation.2
Optimization playbook:
- Prefer HTTP/2 multiplexing with PIPEWAIT to reduce new connections, batch enqueues per partition, enforce per-worker deque caps, reserve buffers, sanitize UTF-8 once, and use chrono types for time computations to avoid unit errors.16
Before/after tables can track improvements in p/s and CPU after enabling PIPEWAIT, batching enqueues, and introducing SIMD prefilters.9
15. Security, Legal, and Ethics6
- Robots.txt compliance is mandatory and the user agent should clearly identify the crawler with a contact URL or email for incident response.18
- TLS verification MUST be enabled (VERIFYPEER=1, VERIFYHOST=2) to prevent MITM and poisoned inputs; do not disable in production.10
- Rate limiting prevents DoS and reduces collateral damage; blocklists and allowlists can further constrain scope during trials.15
- PII handling: do not store sensitive data beyond necessary metadata; enriched storage should be secured at rest and access-controlled.11
16. Configuration & DevOps1
Example config files are provided for domains, seeds, feeds, emergency seeds, excluded patterns, and high-priority domains, and SHOULD be tailored to the deployment’s policy and scope.6
- Domain policies JSON:
{
  "domains": {
    "reuters.com": {
      "crawl_frequency_limit": "2h",
      "language_whitelist": ["en"],
      "enabled": true,
      "priority_multiplier": 1.3
    }
  }
}
- Feeds:
[
  { "url": "https://news.ycombinator.com/rss", "priority": 9, "poll_interval": 5 }
]
- Seeds:
[
  "https://en.wikipedia.org/wiki/Special:Random",
  "https://github.com/trending"
]
- Exclusion lists:
[".jpg", ".png", ".pdf", ".zip"]Deployment topologies:
- Single node: all components in one process using local RocksDB and disk shards for simplicity and quick iteration.1
- Sharded cluster: multiple crawler instances partitioned by domain hash ranges, optionally with a distributed frontier broker and shared object storage.2
graph TD
subgraph Node A
A1[Workers] --> A2[Frontier]
A2 --> A3[RocksDB]
A1 --> A4[Storage JSON]
end
subgraph Node B
B1[Workers] --> B2[Frontier]
B2 --> B3[RocksDB]
B1 --> B4[Storage JSON]
end
A4 --> S3[(Object Storage)]
B4 --> S3
A3 --> M[Monitoring]
B3 --> M
Operational runbook:
- Scale up by increasing worker threads and per-worker concurrency, monitor p/s and queue depths, expand to more nodes by hashing domains, and drain by flipping stop flag and letting coordinator flush before shutdown.2
17. Testing Strategy2
- Unit tests for URL normalization, robots parsing, rate limiter arithmetic, and metadata serialization.12
- Integration tests with canned HTTP fixtures for 200/304/429/503 flows, robots allow/disallow, sitemap/atom XML parsing, and conditional GET caching.3
- Fuzzing for HTML parsers and robots tokenization to harden against malformed inputs.9
- Chaos testing by injecting network timeouts, delayed responses, and forced crashes to validate persistence and recovery.5
- Performance regression tests using fixed seed sets, concurrency knobs, and synthetic content sizes to track p/s trends over releases.4
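As a flavor of the framework-free arithmetic checks worth keeping in the unit suite, here is a tiny self-contained example for rate-limiter timing; it is illustrative and not taken from the repository's tests:

```cpp
#include <cassert>
#include <chrono>

// Given a last-request time and a per-domain delay, the next allowed time must
// be exactly last + delay, independent of wall-clock drift.
int main() {
    using namespace std::chrono;
    const auto last_request = steady_clock::time_point(seconds(1'000));
    const auto crawl_delay  = seconds(2);
    const auto next_allowed = last_request + crawl_delay;

    assert(next_allowed - last_request == crawl_delay);
    assert(steady_clock::time_point(seconds(1'001)) <  next_allowed);  // still blocked
    assert(steady_clock::time_point(seconds(1'002)) >= next_allowed);  // now allowed
    return 0;
}
```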
18. Roadmap & Future Work9
- JS rendering: optional headless browser pool for JS-heavy sites with tight budgets and strict robots policies.2
- ML-based URL scoring: train freshness and quality predictors to enhance priority beyond heuristics.11
- Distributed frontier: brokered by Kafka/NATS with idempotent tokens and cross-node politeness coordination.2
- WARC export and CommonCrawl compatibility for interoperability with existing analysis tools.12
19. Glossary4
- Frontier: the scheduler’s queue of URLs prioritized by readiness and importance for fetching.2
- Politeness window: minimum interval between requests to the same domain to avoid overload.13
- Conditional GET: HTTP mechanism using ETag/Last-Modified to fetch only when content changed.1
- Multiplexing: HTTP/2 feature allowing multiple concurrent streams over a single TCP/TLS connection.25
- Crawl-delay: robots.txt directive specifying minimum seconds to wait between requests.6
- DEFERRED_FETCH_STARTED: sentinel indicating robots must be fetched before policy decision.6
20. References3
- libcurl CURLOPT_SSL_VERIFYHOST: host certificate verification and warnings against disabling.10
- libcurl CURLOPT_SSL_VERIFYPEER: peer certificate verification and defaults.23
- libcurl CURLOPT_PIPEWAIT: prefer waiting to multiplex on existing connections.16
- HTTP/2 multiplexing: overview and behavior in libcurl’s multi interface.25
- Project code: CMakeLists for dependencies and build flags.1
- Networking workers: multi-handle setup, request options, and result processing.4
- Policy components: robots cache, rate limiter, and blacklists.6
Installation & Build1
Requirements:
- C++17 compiler with pthreads, libcurl with HTTP/2, RocksDB, RE2, TinyXML2, and Tracy optional for profiling.1
Build:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target crawler -j$(nproc)
Run:
# Regular mode with 8 threads, depth 4, queue size 50k, 60 min max runtime
./build/crawler 8 4 50000 --max-runtime 60 --mode regular
# Fresh mode (RSS-driven, shallow):
./build/crawler --mode fresh
Security:
- Ensure CA bundle is present and HttpClient is configured with VERIFYPEER=1 and VERIFYHOST=2 for production deployments.26
Configuration Files6
- Place JSON configs under config/ with domain configs, seeds, feeds, emergency seeds, excluded patterns, and high-priority domains as provided in the repository examples.27
- Emergency seeds are injected only under critical low-queue conditions to revive the crawl and SHOULD be audited regularly.4
Operational Notes & SLOs8
- Suggested SLOs: 99% of fetches complete within $\le 3 \cdot L_\text{eff}$ under normal network conditions, sustained p/s stays above target thresholds, and the error rate stays below 2% on healthy domains.9
- Monitor: crawl rate (p/s), queue depths, HTTP status distribution, retry counts, and backpressure indicators at steady intervals (see Observability).
Security Hardening Checklist3
- MUST enable TLS verification and provide CA bundle paths in environments with custom trust stores.26
- SHOULD restrict scope via allowlists during initial deployments and enforce robots with detailed logs on policy violations.6
- MUST bound concurrency per host and per worker to avoid unintentional DoS or rate-limit violations.4
- SHOULD rotate logs and protect enriched JSON outputs with proper access controls and encryption at rest where required.11
Design Appendices3
A. CURL Multi Tuning4
- Set CURLMOPT_MAX_TOTAL_CONNECTIONS and CURLMOPT_MAX_HOST_CONNECTIONS to cap global and per-host sockets, and use CURLOPT_PIPEWAIT to bias reuse.16
- Prefer HTTP/2 via CURLOPT_HTTP_VERSION with accept-encoding to reduce payload sizes on text content.3
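A sketch of the multi-handle caps in libcurl terms (CURLMOPT_MAX_CONCURRENT_STREAMS requires a reasonably recent libcurl); the specific numbers are illustrative starting points, not tuned values from this project:

```cpp
#include <curl/curl.h>

// Bound total sockets and per-host sockets, and let libcurl multiplex many
// HTTP/2 streams over each connection.
void tune_multi_handle(CURLM* multi) {
    curl_multi_setopt(multi, CURLMOPT_MAX_TOTAL_CONNECTIONS, 200L); // global socket cap
    curl_multi_setopt(multi, CURLMOPT_MAX_HOST_CONNECTIONS, 4L);    // per-origin fairness
    curl_multi_setopt(multi, CURLMOPT_MAX_CONCURRENT_STREAMS, 64L); // HTTP/2 streams per connection
}
```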
B. Timekeeping & Portability13
- Use std::chrono durations consistently and avoid direct count() comparisons across clocks; persist system_clock timestamps to disk for stability across restarts.6
- For HTTP date parsing, use a portable UTC conversion strategy for RFC 2822 and ISO 8601 formats in environments without timegm.3
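One portable option, sketched below, is libcurl's own `curl_getdate()` for the HTTP-style (RFC 7231 / RFC 2822) date formats; ISO 8601 timestamps from sitemaps may still need a dedicated parser:

```cpp
#include <chrono>
#include <ctime>
#include <optional>
#include <string>
#include <curl/curl.h>

// Parse a Last-Modified style header into a system_clock time_point without
// relying on timegm. curl_getdate() returns a UTC epoch value or -1 on failure.
std::optional<std::chrono::system_clock::time_point>
parse_http_date(const std::string& header_value) {
    const time_t t = curl_getdate(header_value.c_str(), nullptr);
    if (t == static_cast<time_t>(-1)) return std::nullopt;   // unparsable date
    return std::chrono::system_clock::from_time_t(t);
}
```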
C. SIMD & Parsing9
- Gate SIMD paths on payload size thresholds and cap maximum HTML size to bound CPU per page, falling back to scalar code when necessary.18
By adhering to the architecture, policies, and operational guidance in this README, Hybrid Speed Crawler delivers sustained, compliant, and observable crawling performance suitable for production pipelines that demand both breadth and freshness.2
Footnotes
- html_processing_queue.cpp
- smart_frontier.h
- smart_frontier.cpp
- emergency_seeds.json
- domain_configs.json
- https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYHOST.html
- https://curl.se/libcurl/c/CURLOPT_PIPEWAIT.html
- ultra_parser.h
- https://everything.curl.dev/libcurl-http/multiplexing.html
- https://docs.oracle.com/cd/E88353_01/html/E37842/curlopt-ssl-verifypeer-3.html
- seeds.json
- rss_poller.h
- https://stackoverflow.com/questions/15552052/how-to-enable-curlopt-ssl-verifyhost-2-support-on-my-os-php
- https://docs.oracle.com/cd/E88353_01/html/E37842/curlopt-ssl-verifyhost-3.html
- feeds.json