superdoc-dev/docx-corpus

Building the largest open corpus of .docx files for document processing and rendering research.

Vision

Document rendering is hard. Microsoft Word has decades of edge cases, quirks, and undocumented behaviors. To build reliable document processing tools, you need to test against real-world documents - not just synthetic test cases.

docx-corpus scrapes the entire public web (via Common Crawl) to collect .docx files, creating a massive test corpus for:

  • Document parsing and rendering engines

  • Visual regression testing

  • Feature coverage analysis

  • Edge case discovery

  • Machine learning training data

How It Works

┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│   Common Crawl   │      │   cdx-filter     │      │     scraper      │
│   (S3 bucket)    │ ───► │   (Lambda)       │ ───► │     (Bun)        │
│                  │      │                  │      │                  │
│  CDX indexes     │      │  Filters .docx   │      │  Downloads WARC  │
│  WARC archives   │      │  Writes to R2    │      │  Validates docx  │
└──────────────────┘      └──────────────────┘      └──────────────────┘
                                   │                         │
                                   ▼                         ▼
                          ┌──────────────────┐      ┌──────────────────┐
                          │   Cloudflare R2  │      │     Storage      │
                          │                  │      │                  │
                          │  cdx-filtered/   │      │  Local or R2     │
                          │  *.jsonl         │      │  documents/      │
                          └──────────────────┘      └──────────────────┘
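
Conceptually, the cdx-filter step is a predicate over Common Crawl's CDXJ index lines. A minimal TypeScript sketch, assuming the public CDXJ field names (url, mime, mime-detected, filename, offset, length); the real Lambda in apps/cdx-filter may apply different or additional criteria:

// Sketch: decide whether a CDXJ index line points at a .docx capture.
const DOCX_MIME =
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

interface CdxRecord {
  url: string;
  mime?: string;
  "mime-detected"?: string;
  filename: string; // WARC file path within the crawl
  offset: string;   // byte offset of the record inside that WARC
  length: string;   // compressed length of the record
}

function parseCdxjLine(line: string): CdxRecord {
  // CDXJ lines look like: "<urlkey> <timestamp> { ...json... }"
  return JSON.parse(line.slice(line.indexOf("{"))) as CdxRecord;
}

function looksLikeDocx(rec: CdxRecord): boolean {
  return (
    rec.mime === DOCX_MIME ||
    rec["mime-detected"] === DOCX_MIME ||
    rec.url.toLowerCase().split("?")[0].endsWith(".docx")
  );
}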

Why Common Crawl?

Common Crawl is a nonprofit that crawls the web monthly and makes the resulting archives freely available:

  • 3+ billion URLs per monthly crawl
  • Petabytes of data going back to 2008
  • Free to access - no API keys needed
  • Reproducible - archived crawls never change

This gives us access to .docx files from across the entire public web.

Installation

# Clone the repository
git clone https://github.com/superdoc-dev/docx-corpus.git
cd docx-corpus

# Install dependencies
bun install

Project Structure

apps/
  cdx-filter/     # AWS Lambda - filters CDX indexes for .docx URLs
  scraper/        # Main CLI - downloads WARC records and validates .docx files

App          Purpose                              Runtime
cdx-filter   Filter Common Crawl CDX indexes      AWS Lambda (Node.js)
scraper      Download and validate .docx files    Local (Bun)

Usage

1. Run Lambda to filter CDX indexes

First, deploy and run the Lambda function to filter Common Crawl CDX indexes for .docx files. See apps/cdx-filter/README.md for detailed setup instructions.

cd apps/cdx-filter
./invoke-all.sh CC-MAIN-2025-51

This reads CDX index files directly from Common Crawl's S3 bucket (no rate limits) and stores the filtered JSONL in your R2 bucket.
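
The CDX shards for a crawl live under a predictable prefix in the public commoncrawl S3 bucket. A rough sketch of how those keys can be enumerated; the ~300-shard count is typical for recent crawls but should be confirmed against the crawl's cc-index.paths.gz manifest, and the invoke script may enumerate shards differently:

// Sketch: S3 keys for one crawl's CDX shards in the public "commoncrawl" bucket.
function cdxShardKeys(crawlId: string, shardCount = 300): string[] {
  return Array.from({ length: shardCount }, (_, i) => {
    const n = String(i).padStart(5, "0");
    return `cc-index/collections/${crawlId}/indexes/cdx-${n}.gz`;
  });
}

// cdxShardKeys("CC-MAIN-2025-51")[0]
//   -> "cc-index/collections/CC-MAIN-2025-51/indexes/cdx-00000.gz"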

2. Run the scraper

# Scrape all documents from a crawl
bun run scrape --crawl CC-MAIN-2025-51

# Limit to 500 documents
bun run scrape --crawl CC-MAIN-2025-51 --batch 500

# Re-process URLs already in database
bun run scrape --crawl CC-MAIN-2025-51 --force

# Check progress
bun run status
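
Each filtered record carries enough information (filename, offset, length) for the scraper to pull a single WARC record over HTTP with a byte-range request and then sanity-check the payload. A rough sketch, assuming the field names from the filter step; the scraper's actual validation (ZIP structure + Word XML, per Corpus Statistics below) is more thorough:

// Sketch: fetch one WARC record from Common Crawl by byte range. The slice is a
// gzip-compressed WARC record; extracting the HTTP response body from it is
// omitted here.
async function fetchWarcSlice(rec: {
  filename: string;
  offset: string;
  length: string;
}): Promise<Uint8Array> {
  const start = Number(rec.offset);
  const end = start + Number(rec.length) - 1;
  const res = await fetch(`https://data.commoncrawl.org/${rec.filename}`, {
    headers: { Range: `bytes=${start}-${end}` },
  });
  if (res.status !== 206 && res.status !== 200) {
    throw new Error(`range fetch failed: ${res.status}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}

// Cheap sanity check on the extracted document bytes: every .docx is a ZIP
// archive, and ZIP local file headers begin with "PK\x03\x04".
function looksLikeZip(docBytes: Uint8Array): boolean {
  return (
    docBytes[0] === 0x50 &&
    docBytes[1] === 0x4b &&
    docBytes[2] === 0x03 &&
    docBytes[3] === 0x04
  );
}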

Docker

Run the CLI in a container:

cd apps/scraper

# Build and start the container
docker-compose up -d --build

# Run CLI commands
docker exec docx-corpus-scraper bun run scrape --crawl CC-MAIN-2025-51
docker exec docx-corpus-scraper bun run status

# Stop the container
docker-compose down

Pass environment variables via docker run -e or add them to your .env file in apps/scraper/.

Storage Options

R2 credentials are required to read pre-filtered CDX records from the Lambda output.

Local document storage (default): Downloaded .docx files are saved to ./corpus/documents/

Cloud document storage (Cloudflare R2): Documents can also be uploaded to R2 alongside the CDX records:

export CLOUDFLARE_ACCOUNT_ID=xxx
export R2_ACCESS_KEY_ID=xxx
export R2_SECRET_ACCESS_KEY=xxx
bun run scrape --crawl CC-MAIN-2025-51 --batch 1000
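
As noted under Corpus Statistics, stored documents are deduplicated by SHA-256 and saved content-addressed (the hash doubles as the filename). A minimal sketch of a local save, assuming a documents/<hash>.docx layout; the scraper's exact path scheme may differ:

import { createHash } from "node:crypto";
import { mkdir } from "node:fs/promises";

// Sketch: content-addressed save. The SHA-256 hash is both the dedup key and
// the filename.
async function saveDocument(bytes: Uint8Array, storagePath = "./corpus"): Promise<string> {
  const hash = createHash("sha256").update(bytes).digest("hex");
  const dir = `${storagePath}/documents`;
  await mkdir(dir, { recursive: true });
  const path = `${dir}/${hash}.docx`;
  await Bun.write(path, bytes);
  return path;
}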

Configuration

All configuration via environment variables (.env):

# Cloudflare R2 (required for both Lambda and scraper)
CLOUDFLARE_ACCOUNT_ID=
R2_ACCESS_KEY_ID=
R2_SECRET_ACCESS_KEY=
R2_BUCKET_NAME=docx-corpus

# Scraping
STORAGE_PATH=./corpus
CRAWL_ID=CC-MAIN-2025-51

# Performance tuning
CONCURRENCY=50              # Parallel downloads
RATE_LIMIT_RPS=50           # Requests per second (initial)
MAX_RPS=100                 # Max requests per second
MIN_RPS=10                  # Min requests per second
TIMEOUT_MS=45000            # Request timeout in ms
MAX_RETRIES=10              # Max retry attempts
MAX_BACKOFF_MS=60000        # Max backoff delay (ms)
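
For reference, a minimal sketch of reading the tuning variables above in Bun, using the listed values as fallbacks; how the scraper actually loads its configuration is not shown here:

// Sketch: read the performance-tuning variables with the documented defaults.
const env = (name: string, fallback: number): number =>
  Number(process.env[name] ?? fallback);

const config = {
  concurrency: env("CONCURRENCY", 50),
  rateLimitRps: env("RATE_LIMIT_RPS", 50),
  maxRps: env("MAX_RPS", 100),
  minRps: env("MIN_RPS", 10),
  timeoutMs: env("TIMEOUT_MS", 45_000),
  maxRetries: env("MAX_RETRIES", 10),
  maxBackoffMs: env("MAX_BACKOFF_MS", 60_000),
};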

Rate Limiting

  • WARC requests: Adaptive rate limiting that adjusts to server load
  • On 503/429 errors: Retries with exponential backoff + jitter (up to 60s) - see the sketch after this list
  • On 403 errors: Fails immediately (indicates 24h IP block from Common Crawl)
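
A rough sketch of the retry behavior described above (exponential backoff with full jitter, capped at MAX_BACKOFF_MS, failing fast on 403); the scraper's real implementation may differ in detail:

// Sketch: retry on 429/503 with exponential backoff + jitter; abort on 403.
async function fetchWithBackoff(
  url: string,
  maxRetries = 10,
  maxBackoffMs = 60_000,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.status === 403) throw new Error("403: likely a 24h Common Crawl IP block");
    if (res.status !== 429 && res.status !== 503) return res;
    if (attempt >= maxRetries) throw new Error(`gave up after ${maxRetries} retries`);
    const base = Math.min(maxBackoffMs, 1000 * 2 ** attempt); // exponential growth, capped
    await new Promise((r) => setTimeout(r, Math.random() * base)); // full jitter
  }
}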

Corpus Statistics

Metric          Description
Sources         Entire public web via Common Crawl
Deduplication   SHA-256 content hash
Validation      ZIP structure + Word XML verification
Storage         Content-addressed (hash as filename)

Development

# Run linter
bun run lint

# Format code
bun run format

# Type check
bun run typecheck

# Run tests
bun run test

# Build
bun run build

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Takedown Requests

If you find a document in this corpus that you own and would like removed, please email help@docxcorp.us with:

  • The document hash or URL
  • Proof of ownership

We will process requests within 7 days.

License

MIT


Built by 🦋SuperDoc