
πŸ“§ Email Scraping Toolkit

A comprehensive, educational toolkit for discovering, extracting, enriching, and validating business contact information from web sources.


⚠️ Educational Purpose Only - This project is intended for learning and educational purposes. Users must comply with all applicable laws, respect robots.txt and website terms of service, and obtain proper consent before collecting personal information.


πŸ“– About

Email Scraping is a unified, production-ready toolkit that combines multiple proven approaches to business contact discovery and validation. This educational project demonstrates modern web scraping techniques, email extraction methods, deliverability verification, and data enrichment workflows.

What This Project Does

This toolkit provides a complete pipeline for:

  • πŸ” Business Discovery: Find businesses using Google Maps Places API or manual targeting
  • πŸ•·οΈ Web Crawling: Recursively crawl websites with controlled depth and domain scoping
  • πŸ“§ Email Extraction: Multi-layer extraction using regex, DOM parsing, attribute scanning, and obfuscation decoding
  • βœ… Deliverability Verification: SMTP handshake testing with MX record resolution
  • 🌐 Domain Validation: HTTP reachability checks to ensure websites are active
  • πŸ’Ύ Data Export: Flexible output formats (JSON, CSV, Excel) for downstream integration

Why This Project Exists

Lead generation workflows often require stitching together multiple utilities: map prospecting, HTML scraping, enrichment, and deliverability checks. This project merges proven ideas from several community-built tools into a single, modern TypeScript pipeline with performance optimizations, caching, and database storage.

Key Features

| Feature | Description |
| --- | --- |
| πŸ—ΊοΈ Google Maps Discovery | Text search via Places API with pagination and rate limiting |
| πŸ”„ Recursive Web Crawling | Controlled depth, domain scoping, and external link filtering |
| πŸ“§ Multi-Layer Extraction | Regex, DOM parsing, attribute scans, base64/unicode decoding |
| βœ… SMTP Deliverability | MX resolution with progressive failover and SMTP handshake |
| 🌐 HTTP Validation | Probes production and dev hostnames for reachability |
| πŸ’Ύ Flexible Export | JSON, CSV, and Excel output for CRM integration |
| πŸš€ Performance Optimized | LRU caching, connection pooling, parallel processing |
| πŸ—„οΈ Database Storage | SQLite integration for large-scale datasets |

πŸ—οΈ Architecture Overview

The project follows a modular architecture with clear separation of concerns:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Pipeline Orchestrator                  β”‚
β”‚              (EmailScrapingPipeline)                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚          β”‚          β”‚
β”Œβ”€β”€β”€β–Όβ”€β”€β”€β”  β”Œβ”€β”€β–Όβ”€β”€β”€β”  β”Œβ”€β”€β”€β–Όβ”€β”€β”€β”€β”
β”‚Scraperβ”‚  β”‚Extractβ”‚  β”‚Verify  β”‚
β”‚       β”‚  β”‚       β”‚  β”‚        β”‚
β”‚β€’ Maps β”‚  β”‚β€’ Regexβ”‚  β”‚β€’ SMTP  β”‚
β”‚β€’ Web  β”‚  β”‚β€’ DOM  β”‚  β”‚β€’ HTTP  β”‚
β””β”€β”€β”€β”¬β”€β”€β”€β”˜  β””β”€β”€β”€β”¬β”€β”€β”€β”˜  β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
    β”‚          β”‚          β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”
        β”‚ ResultStore β”‚
        β”‚             β”‚
        β”‚β€’ JSON       β”‚
        β”‚β€’ CSV        β”‚
        β”‚β€’ Excel      β”‚
        β”‚β€’ Database   β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Modules

| Module | Responsibility | Key Technologies |
| --- | --- | --- |
| GoogleMapsScraper | Text search via Places API, pagination, rate limiting | Google Places API |
| WebCrawler | Bounded-depth recursive crawling with external link guard | Playwright, Cheerio |
| EmailExtractor | Multi-pass extraction (mailto, text, obfuscation decoding) | Regex, DOM parsing |
| DeliverabilityChecker | MX lookup + SMTP handshake with progressive fallback | DNS, SMTP client |
| HttpDomainValidator | HEAD/GET probes to confirm prod/dev web hosts respond | HTTP/HTTPS |
| ResultStore | JSON/CSV/Excel export with optional database storage | SQLite, CSV/Excel writers |
| CacheManager | LRU caching for HTTP, DNS, and HTML responses | In-memory LRU cache |
| EmailScrapingPipeline | Orchestrates data flow, deduplication, and persistence | Promise pooling |
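
The DeliverabilityChecker row describes MX resolution followed by an SMTP handshake. The sketch below shows the general shape of such a probe using Node's built-in node:dns and node:net modules; it is an illustrative assumption rather than the project's code, the HELO/MAIL FROM values stand in for SMTP_PROBE_HELLO and SMTP_PROBE_FROM from .env, and timeouts, multi-line SMTP replies, and error classification are omitted.

import { resolveMx } from "node:dns/promises";
import net from "node:net";

// Send one SMTP command and wait for a single reply chunk (simplified).
function command(socket: net.Socket, line: string): Promise<string> {
  return new Promise((resolve, reject) => {
    socket.once("data", (chunk) => resolve(chunk.toString()));
    socket.once("error", reject);
    if (line) socket.write(line + "\r\n");
  });
}

// Hypothetical check: resolve MX records, connect to the highest-priority host,
// and probe RCPT TO for the address without ever sending a message.
export async function isDeliverable(email: string): Promise<boolean> {
  const domain = email.split("@")[1];
  const mx = (await resolveMx(domain)).sort((a, b) => a.priority - b.priority);
  if (mx.length === 0) return false;

  const socket = net.connect(25, mx[0].exchange);
  try {
    await command(socket, "");                              // wait for the 220 greeting
    await command(socket, "HELO example.com");              // SMTP_PROBE_HELLO
    await command(socket, "MAIL FROM:<probe@example.com>"); // SMTP_PROBE_FROM
    const reply = await command(socket, `RCPT TO:<${email}>`);
    await command(socket, "QUIT").catch(() => "");
    return reply.startsWith("250");                         // 250 = recipient accepted
  } finally {
    socket.destroy();
  }
}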

πŸš€ Getting Started

Prerequisites

  • Node.js 18.18.0 or higher
  • npm, yarn, or pnpm
  • (Optional) Python 3.10+ for Python helper scripts
  • (Optional) Google Maps API Key for Places discovery

Installation

# Clone the repository
git clone https://github.com/nabilW/EmailScraping.git
cd EmailScraping

# Install dependencies
npm install
# or
pnpm install
# or
yarn install

Configuration

  1. Copy the example environment file:

    cp config/example.env .env
  2. Edit .env with your configuration:

    # Optional: Google Maps Places API (for business discovery)
    GOOGLE_MAPS_API_KEY=your_api_key_here
    
    # Required: SMTP deliverability probing
    SMTP_PROBE_FROM=your-email@example.com
    SMTP_PROBE_HELLO=your-domain.com
    
    # Optional: Database storage
    USE_DATABASE=true
    DATABASE_PATH=./output/emails.db
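
To show how these variables might map onto code, the snippet below reads them from process.env. It assumes a dotenv-style loader and is illustrative only, not the project's configuration module.

import "dotenv/config"; // assumes the dotenv package loads .env into process.env

// Hypothetical typed view over the environment variables shown above.
export const config = {
  googleMapsApiKey: process.env.GOOGLE_MAPS_API_KEY,           // optional
  smtpProbeFrom: process.env.SMTP_PROBE_FROM ?? "",             // required for SMTP probing
  smtpProbeHello: process.env.SMTP_PROBE_HELLO ?? "",           // required for SMTP probing
  useDatabase: process.env.USE_DATABASE === "true",
  databasePath: process.env.DATABASE_PATH ?? "./output/emails.db",
};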

Quick Start Examples

Basic Usage - Search by Term and Location

npm run dev -- --term "business services" --location "New York" --country us

Require Website Reachability

Only keep emails from websites that respond with 2xx/3xx status codes:

npm run dev -- --term "consulting" --require-reachable
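
Conceptually, the reachability check boils down to probing each site and treating a successful response as reachable. The helper below is a hypothetical sketch built on Node 18's global fetch, not the project's HttpDomainValidator; with redirects followed (the default), a 2xx/3xx chain ends in res.ok.

// Probe a URL with HEAD and report whether it answered successfully.
export async function isReachable(url: string): Promise<boolean> {
  try {
    const res = await fetch(url, { method: "HEAD", signal: AbortSignal.timeout(10_000) });
    return res.ok; // redirects are followed, so a 3xx chain counts if it ends in 2xx
  } catch {
    // Some servers reject HEAD; a fuller validator would retry with GET here.
    return false;
  }
}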

Skip Dev Subdomain Probing

Avoid false positives from internal development hosts:

npm run dev -- --term "business" --skip-dev-subdomain

Manual Website Targeting

When you already know the domains you want to crawl:

npm run dev -- --website https://example.com --require-reachable

Batch Processing with Query File

Create a JSON file with multiple queries:

[
  { "term": "business services", "location": "New York", "countryCode": "us" },
  { "term": "consulting", "location": "London", "countryCode": "gb" },
  { "term": "technology", "location": "San Francisco", "countryCode": "us" }
]
npm run dev -- --queries ./queries.json

Using Seed Files

For more control, use seed files with additional configuration:

[
  {
    "name": "Example Company",
    "website": "https://www.example.com",
    "extraPaths": [
      "/contact",
      "/about",
      "/team"
    ],
    "seedEmails": [
      "contact@example.com",
      "info@example.com"
    ],
    "allowedDomains": [
      "example.com"
    ]
  }
]
npm run dev -- --seed config/seeds/example.json --require-reachable

Seed File Options:

  • extraPaths or extraUrls – Additional landing pages to crawl (contact, media, etc.)
  • seedEmails – Known good addresses to bootstrap the dataset
  • allowedDomains – Restrict scraped emails to specific domains
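
For reference, a seed entry can be described with a type along the following lines. This interface is inferred from the fields listed above and is illustrative, not the project's actual type definition.

// Hypothetical shape of one entry in a seed file (see the fields above).
export interface SeedEntry {
  name?: string;              // Display name for the business
  website: string;            // Root URL to crawl
  extraPaths?: string[];      // Extra landing pages relative to the website
  extraUrls?: string[];       // Or absolute URLs to crawl in addition
  seedEmails?: string[];      // Known good addresses to bootstrap the dataset
  allowedDomains?: string[];  // Restrict extracted emails to these domains
}

// A seed file is simply an array of entries.
export type SeedFile = SeedEntry[];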

Output Files

The pipeline generates several output files in the output/ directory:

  • results.json – Complete structured data with metadata
  • results.csv – Spreadsheet-friendly format
  • results.xlsx – Excel workbook with formatted data
  • emails.txt – Simple one-email-per-line format, deduplicated
  • emails.db – SQLite database (if database storage is enabled)

πŸ“ Repository Structure

EmailScraping/
β”œβ”€β”€ πŸ“‚ config/                     # Configuration templates
β”‚   └── example.env                # Environment variables template
β”‚
β”œβ”€β”€ πŸ“‚ docs/                        # Additional documentation
β”‚   └── WORKFLOWS.md               # Detailed workflow documentation
β”‚
β”œβ”€β”€ πŸ“‚ output/                      # Generated outputs
β”‚   β”œβ”€β”€ results.json               # Main JSON output
β”‚   β”œβ”€β”€ results.csv                # CSV export
β”‚   β”œβ”€β”€ results.xlsx               # Excel export
β”‚   β”œβ”€β”€ emails.txt                 # Simple email list
β”‚   β”œβ”€β”€ emails.db                  # SQLite database (optional)
β”‚   β”œβ”€β”€ logs/                      # Application logs
β”‚   └── extract-emails/            # Python extract-emails outputs
β”‚
β”œβ”€β”€ πŸ“‚ queries/                     # Query files for batch processing
β”‚   β”œβ”€β”€ africa_countries.txt
β”‚   β”œβ”€β”€ middle_east_africa.txt
β”‚   └── ...
β”‚
β”œβ”€β”€ πŸ“‚ scripts/                     # Helper scripts
β”‚   β”œβ”€β”€ πŸ“‚ python/                 # Python utilities
β”‚   β”‚   β”œβ”€β”€ extract_emails_helper.py
β”‚   β”‚   β”œβ”€β”€ verify_emails.py
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ πŸ“‚ typescript/             # TypeScript automation
β”‚   β”‚   β”œβ”€β”€ googleSearch.ts
β”‚   β”‚   β”œβ”€β”€ mergeScrapyEmails.ts
β”‚   β”‚   └── ...
β”‚   └── README.md                  # Script documentation
β”‚
β”œβ”€β”€ πŸ“‚ scrapy/                      # Scrapy spiders for bulk crawling
β”‚   └── uae_airlines_crawler/
β”‚       └── uae_airlines_crawler/
β”‚           └── spiders/
β”‚               └── multi_contacts.py
β”‚
β”œβ”€β”€ πŸ“‚ src/                         # Main TypeScript application
β”‚   β”œβ”€β”€ πŸ“‚ extractors/             # Email extraction logic
β”‚   β”œβ”€β”€ πŸ“‚ pipeline/               # Main pipeline orchestration
β”‚   β”œβ”€β”€ πŸ“‚ scrapers/                # Web scraping modules
β”‚   β”œβ”€β”€ πŸ“‚ storage/                 # Data persistence
β”‚   β”œβ”€β”€ πŸ“‚ utils/                   # Utilities (cache, logger)
β”‚   β”œβ”€β”€ πŸ“‚ verifiers/               # Validation modules
β”‚   └── index.ts                    # Entry point
β”‚
β”œβ”€β”€ πŸ“‚ tests/                       # Unit and integration tests
β”‚   └── EmailExtractor.test.ts
β”‚
β”œβ”€β”€ πŸ“„ README.md                    # This file
β”œβ”€β”€ πŸ“„ LICENSE                      # MIT License
β”œβ”€β”€ πŸ“„ CONTRIBUTING.md              # Contribution guidelines
β”œβ”€β”€ πŸ“„ package.json                 # Node.js dependencies
└── πŸ“„ tsconfig.json                # TypeScript configuration

πŸ’‘ Tip: Check scripts/README.md for detailed documentation on all helper scripts and example invocations.


πŸ”§ Advanced Usage

Google Search Discovery

Discover new domains before running the pipeline:

# Inline queries (comma separated)
npm run search:google -- --queries "business services,consulting companies" --limit 20

# Or use a configuration file
cat > config/google-search.json <<'JSON'
{
  "queries": [
    "business services",
    "consulting companies"
  ],
  "resultsPerQuery": 25,
  "output": "output/google-search-results.json",
  "domainsOutput": "output/google-search-domains.txt"
}
JSON
npm run search:google -- --input config/google-search.json

Output Files:

  • output/google-search-results.json – Full Google SERP entries
  • output/google-search-domains.{txt,json} – Deduplicated hostnames
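
Deduplicating hostnames from SERP URLs amounts to parsing each URL and collecting unique hosts, roughly as in this hypothetical helper (not the script's actual code):

// Reduce a list of result URLs to a sorted, deduplicated list of hostnames.
export function uniqueHostnames(urls: string[]): string[] {
  const hosts = new Set<string>();
  for (const u of urls) {
    try {
      hosts.add(new URL(u).hostname.replace(/^www\./, ""));
    } catch {
      // Skip malformed URLs instead of failing the whole batch.
    }
  }
  return [...hosts].sort();
}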

πŸ’‘ Tip: For best results, provide Google Programmable Search credentials:

export GOOGLE_API_KEY="your-api-key"
export GOOGLE_CSE_ID="your-search-engine-id"

Python extract-emails Integration

Enrich your dataset with the Python extract-emails scraper:

  1. Set up Python environment:

    python3 -m venv .venv-extract
    source .venv-extract/bin/activate
    pip install "extract-emails>=5.3.3"
  2. Run the helper:

    python -m extract_emails.console.application \
      --url https://www.example.com/contact \
      --browser-name requests \
      --depth 1 \
      --output-file output/extract-emails/example.csv
  3. Merge into main archive:

    npm run update:emails

The merger automatically handles duplicates, enforces allowedDomains, and maintains alphabetical order.
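
That merge logic, normalizing, deduplicating, enforcing allowedDomains, and sorting, can be sketched as follows. This is an illustrative helper under those assumptions, not the update:emails script itself.

// Merge newly scraped addresses into an existing list:
// lowercase, deduplicate, enforce allowed domains, and sort alphabetically.
export function mergeEmails(
  existing: string[],
  incoming: string[],
  allowedDomains?: string[]
): string[] {
  const merged = new Set(existing.map((e) => e.trim().toLowerCase()));
  for (const raw of incoming) {
    const email = raw.trim().toLowerCase();
    const domain = email.split("@")[1] ?? "";
    if (allowedDomains && !allowedDomains.some((d) => domain === d || domain.endsWith("." + d))) {
      continue; // outside the allowed domains for this seed
    }
    merged.add(email);
  }
  return [...merged].sort();
}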

Scrapy-Based Crawling

For deeper harvesting on JavaScript-heavy sites, use the Scrapy project:

  1. Activate virtual environment:

    cd scrapy
    source .venv/bin/activate
  2. Run a spider:

    cd uae_airlines_crawler
    scrapy crawl multi_contacts -O ../../output/contacts.jsonl
  3. Merge with TypeScript pipeline:

    npm run update:emails

The spider respects depth limits, filters placeholder emails, and follows configured domains.


🎯 Performance Optimizations

This project includes several performance optimizations:

  • LRU Caching: HTTP responses, DNS lookups, and HTML content are cached
  • Connection Pooling: Persistent HTTP/HTTPS connections for faster requests
  • Parallel Processing: Controlled concurrency using Promise pools
  • Database Storage: SQLite for efficient large-scale data management
  • Memory Management: Smart in-memory limits with database fallback
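
The "Promise pools" mentioned above refer to running a bounded number of asynchronous tasks at once. The following is a generic sketch of that pattern, not the project's implementation:

// Run tasks over items with at most `limit` promises in flight at once.
export async function promisePool<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;          // claim the next item before awaiting
      results[index] = await task(items[index]);
    }
  }

  // Start `limit` workers that pull items until the queue is drained.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

For example, page fetches could be throttled with promisePool(urls, 5, fetchPage), keeping at most five requests in flight at a time (fetchPage here is a hypothetical task function).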

πŸ§ͺ Testing

Run the test suite:

# Run all tests
npm test

# Run tests in watch mode
npm run test:watch

πŸ“ Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests and linting (npm test && npm run lint)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

πŸ—ΊοΈ Roadmap

Future enhancements under consideration:

  • πŸ”„ Google My Business session scraping for UI parity
  • 🎭 Enhanced Playwright-based JavaScript rendering for SPAs
  • πŸ”Œ Pluggable deliverability providers (ZeroBounce, NeverBounce, etc.)
  • 🌐 REST API & dashboard frontend
  • πŸ“Š Advanced analytics and reporting
  • πŸ” Enhanced security and rate limiting

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

Educational Purpose Disclaimer

This project is intended for educational purposes only. Users are responsible for:

  • βœ… Complying with all applicable laws and regulations regarding web scraping
  • βœ… Respecting robots.txt and website terms of service
  • βœ… Obtaining proper consent before collecting personal information
  • βœ… Using collected data ethically and responsibly

The authors and contributors are not responsible for any misuse of this software.


πŸ™ Acknowledgments

This project draws inspiration from several open-source projects in the web scraping and contact-discovery space.


πŸ“ž Support

For questions, issues, or contributions, please open an issue or pull request on the repository.


Made with ❀️ for educational purposes

⭐ Star this repo if you find it helpful!
