superdoc-dev/docx-corpus

Building the largest open corpus of .docx files for document processing and rendering research.

Vision

Document rendering is hard. Microsoft Word has decades of edge cases, quirks, and undocumented behaviors. To build reliable document processing tools, you need to test against real-world documents - not just synthetic test cases.

docx-corpus scrapes the entire public web (via Common Crawl) to collect .docx files, creating a massive test corpus for:

  • Document parsing and rendering engines

  • Visual regression testing

  • Feature coverage analysis

  • Edge case discovery

  • Machine learning training data

How It Works

┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│   Common Crawl   │      │   cdx-filter     │      │     scraper      │
│   (S3 bucket)    │ ───► │   (Lambda)       │ ───► │     (Bun)        │
│                  │      │                  │      │                  │
│  CDX indexes     │      │  Filters .docx   │      │  Downloads WARC  │
│  WARC archives   │      │  Writes to R2    │      │  Validates docx  │
└──────────────────┘      └──────────────────┘      └──────────────────┘
                                   │                         │
                                   ▼                         ▼
                          ┌──────────────────┐      ┌──────────────────┐
                          │   Cloudflare R2  │      │     Storage      │
                          │                  │      │                  │
                          │  cdx-filtered/   │      │  Local or R2     │
                          │  *.jsonl         │      │  documents/      │
                          └──────────────────┘      └──────────────────┘
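
Conceptually, the cdx-filter step is a predicate over Common Crawl's CDXJ index lines. A minimal TypeScript sketch, assuming the public CDXJ field names (url, mime, mime-detected, filename, offset, length); the real Lambda in apps/cdx-filter may apply different or additional criteria:

// Sketch: decide whether a CDXJ index line points at a .docx capture.
const DOCX_MIME =
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

interface CdxRecord {
  url: string;
  mime?: string;
  "mime-detected"?: string;
  filename: string; // WARC file path within the crawl
  offset: string;   // byte offset of the record inside that WARC
  length: string;   // compressed length of the record
}

function parseCdxjLine(line: string): CdxRecord {
  // CDXJ lines look like: "<urlkey> <timestamp> { ...json... }"
  return JSON.parse(line.slice(line.indexOf("{"))) as CdxRecord;
}

function looksLikeDocx(rec: CdxRecord): boolean {
  return (
    rec.mime === DOCX_MIME ||
    rec["mime-detected"] === DOCX_MIME ||
    rec.url.toLowerCase().split("?")[0].endsWith(".docx")
  );
}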

Why Common Crawl?

Common Crawl is a nonprofit that crawls the web monthly and makes the resulting archives freely available:

  • 3+ billion URLs per monthly crawl
  • Petabytes of data going back to 2008
  • Free to access - no API keys needed
  • Reproducible - archived crawls never change

This gives us access to .docx files from across the entire public web.

Installation

# Clone the repository
git clone https://github.com/superdoc-dev/docx-corpus.git
cd docx-corpus

# Install dependencies
bun install

Project Structure

apps/
  cdx-filter/     # AWS Lambda - filters CDX indexes for .docx URLs
  scraper/        # Main CLI - downloads WARC records and validates .docx files

App          Purpose                              Runtime
cdx-filter   Filter Common Crawl CDX indexes      AWS Lambda (Node.js)
scraper      Download and validate .docx files    Local (Bun)

Usage

1. Run Lambda to filter CDX indexes

First, deploy and run the Lambda function to filter Common Crawl CDX indexes for .docx files. See apps/cdx-filter/README.md for detailed setup instructions.

cd apps/cdx-filter
./invoke-all.sh CC-MAIN-2025-51

This reads CDX index files directly from Common Crawl's S3 bucket (no rate limits) and stores the filtered JSONL in your R2 bucket.
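
The CDX shards for a crawl live under a predictable prefix in the public commoncrawl S3 bucket. A rough sketch of how those keys can be enumerated; the ~300-shard count is typical for recent crawls but should be confirmed against the crawl's cc-index.paths.gz manifest, and the invoke script may enumerate shards differently:

// Sketch: S3 keys for one crawl's CDX shards in the public "commoncrawl" bucket.
function cdxShardKeys(crawlId: string, shardCount = 300): string[] {
  return Array.from({ length: shardCount }, (_, i) => {
    const n = String(i).padStart(5, "0");
    return `cc-index/collections/${crawlId}/indexes/cdx-${n}.gz`;
  });
}

// cdxShardKeys("CC-MAIN-2025-51")[0]
//   -> "cc-index/collections/CC-MAIN-2025-51/indexes/cdx-00000.gz"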

2. Run the scraper

# Scrape all documents from a crawl
bun run scrape --crawl CC-MAIN-2025-51

# Limit to 500 documents
bun run scrape --crawl CC-MAIN-2025-51 --batch 500

# Re-process URLs already in database
bun run scrape --crawl CC-MAIN-2025-51 --force

# Check progress
bun run status
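
Each filtered record carries enough information (filename, offset, length) for the scraper to pull a single WARC record over HTTP with a byte-range request and then sanity-check the payload. A rough sketch, assuming the field names from the filter step; the scraper's actual validation (ZIP structure + Word XML, per Corpus Statistics below) is more thorough:

// Sketch: fetch one WARC record from Common Crawl by byte range. The slice is a
// gzip-compressed WARC record; extracting the HTTP response body from it is
// omitted here.
async function fetchWarcSlice(rec: {
  filename: string;
  offset: string;
  length: string;
}): Promise<Uint8Array> {
  const start = Number(rec.offset);
  const end = start + Number(rec.length) - 1;
  const res = await fetch(`https://data.commoncrawl.org/${rec.filename}`, {
    headers: { Range: `bytes=${start}-${end}` },
  });
  if (res.status !== 206 && res.status !== 200) {
    throw new Error(`range fetch failed: ${res.status}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}

// Cheap sanity check on the extracted document bytes: every .docx is a ZIP
// archive, and ZIP local file headers begin with "PK\x03\x04".
function looksLikeZip(docBytes: Uint8Array): boolean {
  return (
    docBytes[0] === 0x50 &&
    docBytes[1] === 0x4b &&
    docBytes[2] === 0x03 &&
    docBytes[3] === 0x04
  );
}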

Docker

Run the CLI in a container:

cd apps/scraper

# Build and start the container
docker-compose up -d --build

# Run CLI commands
docker exec docx-corpus-scraper bun run scrape --crawl CC-MAIN-2025-51
docker exec docx-corpus-scraper bun run status

# Stop the container
docker-compose down

Pass environment variables via docker run -e or add them to your .env file in apps/scraper/.

Storage Options

R2 credentials are required to read pre-filtered CDX records from the Lambda output.

Local document storage (default): Downloaded .docx files are saved to ./corpus/documents/

Cloud document storage (Cloudflare R2): Documents can also be uploaded to R2 alongside the CDX records:

export CLOUDFLARE_ACCOUNT_ID=xxx
export R2_ACCESS_KEY_ID=xxx
export R2_SECRET_ACCESS_KEY=xxx
bun run scrape --crawl CC-MAIN-2025-51 --batch 1000
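
As noted under Corpus Statistics, stored documents are deduplicated by SHA-256 and saved content-addressed (the hash doubles as the filename). A minimal sketch of a local save, assuming a documents/<hash>.docx layout; the scraper's exact path scheme may differ:

import { createHash } from "node:crypto";
import { mkdir } from "node:fs/promises";

// Sketch: content-addressed save. The SHA-256 hash is both the dedup key and
// the filename.
async function saveDocument(bytes: Uint8Array, storagePath = "./corpus"): Promise<string> {
  const hash = createHash("sha256").update(bytes).digest("hex");
  const dir = `${storagePath}/documents`;
  await mkdir(dir, { recursive: true });
  const path = `${dir}/${hash}.docx`;
  await Bun.write(path, bytes);
  return path;
}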

Configuration

All configuration via environment variables (.env):

# Cloudflare R2 (required for both Lambda and scraper)
CLOUDFLARE_ACCOUNT_ID=
R2_ACCESS_KEY_ID=
R2_SECRET_ACCESS_KEY=
R2_BUCKET_NAME=docx-corpus

# Scraping
STORAGE_PATH=./corpus
CRAWL_ID=CC-MAIN-2025-51

# Performance tuning
CONCURRENCY=50              # Parallel downloads
RATE_LIMIT_RPS=50           # Requests per second (initial)
MAX_RPS=100                 # Max requests per second
MIN_RPS=10                  # Min requests per second
TIMEOUT_MS=45000            # Request timeout in ms
MAX_RETRIES=10              # Max retry attempts
MAX_BACKOFF_MS=60000        # Max backoff delay (ms)
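
For reference, a minimal sketch of reading the tuning variables above in Bun, using the listed values as fallbacks; how the scraper actually loads its configuration is not shown here:

// Sketch: read the performance-tuning variables with the documented defaults.
const env = (name: string, fallback: number): number =>
  Number(process.env[name] ?? fallback);

const config = {
  concurrency: env("CONCURRENCY", 50),
  rateLimitRps: env("RATE_LIMIT_RPS", 50),
  maxRps: env("MAX_RPS", 100),
  minRps: env("MIN_RPS", 10),
  timeoutMs: env("TIMEOUT_MS", 45_000),
  maxRetries: env("MAX_RETRIES", 10),
  maxBackoffMs: env("MAX_BACKOFF_MS", 60_000),
};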

Rate Limiting

  • WARC requests: Adaptive rate limiting that adjusts to server load
  • On 503/429 errors: Retries with exponential backoff + jitter (up to 60s) - see the sketch after this list
  • On 403 errors: Fails immediately (indicates 24h IP block from Common Crawl)
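
A rough sketch of the retry behavior described above (exponential backoff with full jitter, capped at MAX_BACKOFF_MS, failing fast on 403); the scraper's real implementation may differ in detail:

// Sketch: retry on 429/503 with exponential backoff + jitter; abort on 403.
async function fetchWithBackoff(
  url: string,
  maxRetries = 10,
  maxBackoffMs = 60_000,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.status === 403) throw new Error("403: likely a 24h Common Crawl IP block");
    if (res.status !== 429 && res.status !== 503) return res;
    if (attempt >= maxRetries) throw new Error(`gave up after ${maxRetries} retries`);
    const base = Math.min(maxBackoffMs, 1000 * 2 ** attempt); // exponential growth, capped
    await new Promise((r) => setTimeout(r, Math.random() * base)); // full jitter
  }
}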

Corpus Statistics

Metric          Description
Sources         Entire public web via Common Crawl
Deduplication   SHA-256 content hash
Validation      ZIP structure + Word XML verification
Storage         Content-addressed (hash as filename)

Development

# Run linter
bun run lint

# Format code
bun run format

# Type check
bun run typecheck

# Run tests
bun run test

# Build
bun run build

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Takedown Requests

If you find a document in this corpus that you own and would like removed, please email help@docxcorp.us with:

  • The document hash or URL
  • Proof of ownership

We will process requests within 7 days.

License

MIT


Built by 🦋SuperDoc