A comprehensive, educational toolkit for discovering, extracting, enriching, and validating business contact information from web sources.
⚠️ Educational Purpose Only - This project is intended for learning and educational purposes. Users must comply with all applicable laws, respect robots.txt and website terms of service, and obtain proper consent before collecting personal information.
Email Scraping is a unified, production-ready toolkit that combines multiple proven approaches to business contact discovery and validation. This educational project demonstrates modern web scraping techniques, email extraction methods, deliverability verification, and data enrichment workflows.
This toolkit provides a complete pipeline for:
- Business Discovery: Find businesses using the Google Maps Places API or manual targeting
- Web Crawling: Recursively crawl websites with controlled depth and domain scoping
- Email Extraction: Multi-layer extraction using regex, DOM parsing, attribute scanning, and obfuscation decoding (a simplified sketch follows this list)
- Deliverability Verification: SMTP handshake testing with MX record resolution
- Domain Validation: HTTP reachability checks to ensure websites are active
- Data Export: Flexible output formats (JSON, CSV, Excel) for downstream integration
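To make the multi-layer extraction idea concrete, here is a simplified sketch using Cheerio. The function name and the specific obfuscation rules are illustrative, not the project's actual `EmailExtractor` API:

```typescript
// A simplified sketch of layered extraction: mailto: hrefs, visible text,
// and common [at]/[dot] obfuscations. Names are illustrative.
import * as cheerio from "cheerio";

const EMAIL_RE = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi;

export function extractEmails(html: string): string[] {
  const found = new Set<string>();
  const $ = cheerio.load(html);

  // Layer 1: mailto: links in the DOM.
  $('a[href^="mailto:"]').each((_, el) => {
    const addr = ($(el).attr("href") ?? "")
      .replace(/^mailto:/i, "")
      .split("?")[0]
      .trim();
    if (addr.includes("@")) found.add(addr.toLowerCase());
  });

  // Layer 2: regex over the visible page text.
  for (const match of $("body").text().matchAll(EMAIL_RE)) {
    found.add(match[0].toLowerCase());
  }

  // Layer 3: decode simple obfuscations such as "name [at] example [dot] com".
  const deobfuscated = html
    .replace(/\s*[\[(]at[\])]\s*/gi, "@")
    .replace(/\s*[\[(]dot[\])]\s*/gi, ".");
  for (const match of deobfuscated.matchAll(EMAIL_RE)) {
    found.add(match[0].toLowerCase());
  }

  return [...found];
}
```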
Lead generation workflows often require stitching together multiple utilities: map prospecting, HTML scraping, enrichment, and deliverability checks. This project merges proven ideas from community-built tools like:
- Map Email Scraper - Google Maps prospecting
- MailScout - Deliverability probing patterns
- e-scraper - Web crawling utilities
- crawler-python - Recursive crawling strategies
All integrated into a modern TypeScript pipeline with performance optimizations, caching, and database storage.
| Feature | Description |
|---|---|
| Google Maps Discovery | Text search via the Places API with pagination and rate limiting |
| Recursive Web Crawling | Controlled depth, domain scoping, and external-link filtering |
| Multi-Layer Extraction | Regex, DOM parsing, attribute scans, base64/Unicode decoding |
| SMTP Deliverability | MX resolution with progressive failover and SMTP handshake |
| HTTP Validation | Probes production and dev hostnames for reachability |
| Flexible Export | JSON, CSV, and Excel output for CRM integration |
| Performance Optimized | LRU caching, connection pooling, parallel processing |
| Database Storage | SQLite integration for large-scale datasets |
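The SMTP deliverability row above follows the common MX-plus-handshake pattern. Below is a minimal sketch of that idea for Node 18+; it is not the project's `DeliverabilityChecker`, and the single-line reply handling is deliberately simplified:

```typescript
// A minimal sketch of MX resolution plus an SMTP RCPT probe (Node 18+).
// Simplified: no multi-line reply handling, no catch-all detection.
import { resolveMx } from "node:dns/promises";
import net from "node:net";

// Send an optional command, then wait for the next reply chunk.
function exchange(socket: net.Socket, command?: string): Promise<string> {
  return new Promise((resolve, reject) => {
    socket.once("data", (chunk) => resolve(chunk.toString("utf8")));
    socket.once("error", reject);
    if (command) socket.write(command + "\r\n");
  });
}

export async function probeDeliverability(
  email: string,
  probeFrom: string,   // e.g. SMTP_PROBE_FROM
  helloDomain: string  // e.g. SMTP_PROBE_HELLO
): Promise<boolean> {
  const domain = email.split("@")[1];
  // Try MX hosts in priority order (progressive failover).
  const mxHosts = (await resolveMx(domain)).sort((a, b) => a.priority - b.priority);

  for (const { exchange: host } of mxHosts) {
    const socket = net.createConnection({ host, port: 25, timeout: 10_000 });
    socket.on("timeout", () => socket.destroy(new Error("SMTP timeout")));
    try {
      await exchange(socket);                              // 220 banner
      await exchange(socket, `HELO ${helloDomain}`);
      await exchange(socket, `MAIL FROM:<${probeFrom}>`);
      const reply = await exchange(socket, `RCPT TO:<${email}>`);
      socket.write("QUIT\r\n");
      return reply.startsWith("250");                      // 250 = recipient accepted
    } catch {
      // Connection or protocol failure: fall through to the next MX host.
    } finally {
      socket.destroy();
    }
  }
  return false;
}
```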
The project follows a modular architecture with clear separation of concerns:
```
┌──────────────────────────────────────────────────────────┐
│                   Pipeline Orchestrator                   │
│                  (EmailScrapingPipeline)                  │
└──────────────┬───────────────────────────────────────────┘
               │
    ┌──────────┼──────────┐
    │          │          │
┌───▼───┐  ┌───▼───┐  ┌───▼────┐
│Scraper│  │Extract│  │Verify  │
│       │  │       │  │        │
│• Maps │  │• Regex│  │• SMTP  │
│• Web  │  │• DOM  │  │• HTTP  │
└───┬───┘  └───┬───┘  └───┬────┘
    │          │          │
    └──────────┼──────────┘
               │
        ┌──────▼──────┐
        │ ResultStore │
        │             │
        │ • JSON      │
        │ • CSV       │
        │ • Excel     │
        │ • Database  │
        └─────────────┘
```
| Module | Responsibility | Key Technologies |
|---|---|---|
| `GoogleMapsScraper` | Text search via Places API, pagination, rate limiting | Google Places API |
| `WebCrawler` | Bounded-depth recursive crawling with external link guard | Playwright, Cheerio |
| `EmailExtractor` | Multi-pass extraction (mailto, text, obfuscation decoding) | Regex, DOM parsing |
| `DeliverabilityChecker` | MX lookup + SMTP handshake with progressive fallback | DNS, SMTP client |
| `HttpDomainValidator` | HEAD/GET probes to confirm prod/dev web hosts respond | HTTP/HTTPS |
| `ResultStore` | JSON/CSV/Excel export with optional database storage | SQLite, CSV/Excel writers |
| `CacheManager` | LRU caching for HTTP, DNS, and HTML responses | In-memory LRU cache |
| `EmailScrapingPipeline` | Orchestrates data flow, deduplication, and persistence | Promise pooling |
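Read together, the modules compose roughly as in the outline below. The stub declarations stand in for the real classes, and the sequential flow is a simplification: the actual pipeline adds promise pooling, caching, and database persistence.

```typescript
// An illustrative outline of how the modules above fit together.
interface Business { name: string; website: string; }
interface ContactResult { business: Business; emails: string[]; }

declare function crawlSite(url: string): Promise<string[]>;            // WebCrawler: returns page HTML
declare function extractEmails(html: string): string[];                // EmailExtractor
declare function probeDeliverability(email: string): Promise<boolean>; // DeliverabilityChecker
declare function saveResults(results: ContactResult[]): Promise<void>; // ResultStore

export async function runPipeline(businesses: Business[]): Promise<void> {
  const results: ContactResult[] = [];

  for (const business of businesses) {
    const pages = await crawlSite(business.website);
    const candidates = [...new Set(pages.flatMap(extractEmails))];     // dedupe per business

    const deliverable: string[] = [];
    for (const email of candidates) {
      if (await probeDeliverability(email)) deliverable.push(email);
    }
    results.push({ business, emails: deliverable });
  }

  await saveResults(results);                                          // JSON / CSV / Excel / SQLite
}
```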
- Node.js 18.18.0 or higher
- npm, yarn, or pnpm
- (Optional) Python 3.10+ for Python helper scripts
- (Optional) Google Maps API Key for Places discovery
```bash
# Clone the repository
git clone https://github.com/nabilW/EmailScraping.git
cd EmailScraping

# Install dependencies
npm install
# or
pnpm install
# or
yarn install
```

- Copy the example environment file:

  ```bash
  cp config/example.env .env
  ```

- Edit `.env` with your configuration:

  ```env
  # Optional: Google Maps Places API (for business discovery)
  GOOGLE_MAPS_API_KEY=your_api_key_here

  # Required: SMTP deliverability probing
  SMTP_PROBE_FROM=your-email@example.com
  SMTP_PROBE_HELLO=your-domain.com

  # Optional: Database storage
  USE_DATABASE=true
  DATABASE_PATH=./output/emails.db
  ```
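As a quick illustration, these settings might be surfaced to the application along the following lines; this assumes the `dotenv` package and is not necessarily how the project's own loader works:

```typescript
// One possible way to read the .env values above at startup (assumes dotenv).
import "dotenv/config";

export const config = {
  googleMapsApiKey: process.env.GOOGLE_MAPS_API_KEY ?? "",        // optional
  smtpProbeFrom: process.env.SMTP_PROBE_FROM ?? "",               // required for SMTP probing
  smtpProbeHello: process.env.SMTP_PROBE_HELLO ?? "",
  useDatabase: process.env.USE_DATABASE === "true",
  databasePath: process.env.DATABASE_PATH ?? "./output/emails.db",
};
```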
Run a basic Google Maps discovery query:

```bash
npm run dev -- --term "business services" --location "New York" --country us
```

Only keep emails from websites that respond with 2xx/3xx status codes:

```bash
npm run dev -- --term "consulting" --require-reachable
```

Avoid false positives from internal development hosts:

```bash
npm run dev -- --term "business" --skip-dev-subdomain
```

When you already know the domains you want to crawl:

```bash
npm run dev -- --website https://example.com --require-reachable
```

Create a JSON file with multiple queries:
```json
[
  { "term": "business services", "location": "New York", "countryCode": "us" },
  { "term": "consulting", "location": "London", "countryCode": "gb" },
  { "term": "technology", "location": "San Francisco", "countryCode": "us" }
]
```

```bash
npm run dev -- --queries ./queries.json
```

For more control, use seed files with additional configuration:
```json
[
  {
    "name": "Example Company",
    "website": "https://www.example.com",
    "extraPaths": [
      "/contact",
      "/about",
      "/team"
    ],
    "seedEmails": [
      "contact@example.com",
      "info@example.com"
    ],
    "allowedDomains": [
      "example.com"
    ]
  }
]
```

```bash
npm run dev -- --seed config/seeds/example.json --require-reachable
```

Seed File Options:
- `extraPaths` or `extraUrls` - Additional landing pages to crawl (contact, media, etc.)
- `seedEmails` - Known good addresses to bootstrap the dataset
- `allowedDomains` - Restrict scraped emails to specific domains
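For reference, the seed entries map naturally onto a TypeScript shape like the one below. The field names mirror the JSON example above; which fields are optional is an inference, not the project's published schema:

```typescript
// A TypeScript shape for seed entries, inferred from the example above.
export interface SeedEntry {
  name: string;
  website: string;
  /** Additional landing pages (relative paths) to crawl, e.g. "/contact". */
  extraPaths?: string[];
  /** Full URLs to crawl in addition to the site root. */
  extraUrls?: string[];
  /** Known-good addresses used to bootstrap the dataset. */
  seedEmails?: string[];
  /** Restrict scraped emails to these domains. */
  allowedDomains?: string[];
}
```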
The pipeline generates several output files in the output/ directory:
- `results.json` - Complete structured data with metadata
- `results.csv` - Spreadsheet-friendly format
- `results.xlsx` - Excel workbook with formatted data
- `emails.txt` - Simple one-email-per-line format, deduplicated
- `emails.db` - SQLite database (if database storage is enabled)
```
EmailScraping/
├── config/                    # Configuration templates
│   └── example.env            # Environment variables template
│
├── docs/                      # Additional documentation
│   └── WORKFLOWS.md           # Detailed workflow documentation
│
├── output/                    # Generated outputs
│   ├── results.json           # Main JSON output
│   ├── results.csv            # CSV export
│   ├── results.xlsx           # Excel export
│   ├── emails.txt             # Simple email list
│   ├── emails.db              # SQLite database (optional)
│   ├── logs/                  # Application logs
│   └── extract-emails/        # Python extract-emails outputs
│
├── queries/                   # Query files for batch processing
│   ├── africa_countries.txt
│   ├── middle_east_africa.txt
│   └── ...
│
├── scripts/                   # Helper scripts
│   ├── python/                # Python utilities
│   │   ├── extract_emails_helper.py
│   │   ├── verify_emails.py
│   │   └── ...
│   ├── typescript/            # TypeScript automation
│   │   ├── googleSearch.ts
│   │   ├── mergeScrapyEmails.ts
│   │   └── ...
│   └── README.md              # Script documentation
│
├── scrapy/                    # Scrapy spiders for bulk crawling
│   └── uae_airlines_crawler/
│       └── uae_airlines_crawler/
│           └── spiders/
│               └── multi_contacts.py
│
├── src/                       # Main TypeScript application
│   ├── extractors/            # Email extraction logic
│   ├── pipeline/              # Main pipeline orchestration
│   ├── scrapers/              # Web scraping modules
│   ├── storage/               # Data persistence
│   ├── utils/                 # Utilities (cache, logger)
│   ├── verifiers/             # Validation modules
│   └── index.ts               # Entry point
│
├── tests/                     # Unit and integration tests
│   └── EmailExtractor.test.ts
│
├── README.md                  # This file
├── LICENSE                    # MIT License
├── CONTRIBUTING.md            # Contribution guidelines
├── package.json               # Node.js dependencies
└── tsconfig.json              # TypeScript configuration
```
> 💡 Tip: Check `scripts/README.md` for detailed documentation on all helper scripts and example invocations.
Discover new domains before running the pipeline:
```bash
# Inline queries (comma separated)
npm run search:google -- --queries "business services,consulting companies" --limit 20
```

```bash
# Or use a configuration file
cat > config/google-search.json <<'JSON'
{
  "queries": [
    "business services",
    "consulting companies"
  ],
  "resultsPerQuery": 25,
  "output": "output/google-search-results.json",
  "domainsOutput": "output/google-search-domains.txt"
}
JSON

npm run search:google -- --input config/google-search.json
```

Output Files:
- `output/google-search-results.json` - Full Google SERP entries
- `output/google-search-domains.{txt,json}` - Deduplicated hostnames
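Reducing SERP URLs to that deduplicated hostname list amounts to a small transformation along these lines; the helper below is illustrative, not the actual `search:google` script:

```typescript
// An illustrative helper: SERP result URLs -> deduplicated, sorted hostnames.
export function uniqueHostnames(urls: string[]): string[] {
  const hosts = new Set<string>();
  for (const raw of urls) {
    try {
      hosts.add(new URL(raw).hostname.replace(/^www\./, ""));
    } catch {
      // Skip malformed URLs instead of failing the whole batch.
    }
  }
  return [...hosts].sort();
}
```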
> 💡 Tip: For best results, provide Google Programmable Search credentials:

```bash
export GOOGLE_API_KEY="your-api-key"
export GOOGLE_CSE_ID="your-search-engine-id"
```
Enrich your dataset with the Python extract-emails scraper:
- Set up the Python environment:

  ```bash
  python3 -m venv .venv-extract
  source .venv-extract/bin/activate
  pip install "extract-emails>=5.3.3"
  ```

- Run the helper:

  ```bash
  python -m extract_emails.console.application \
    --url https://www.example.com/contact \
    --browser-name requests \
    --depth 1 \
    --output-file output/extract-emails/example.csv
  ```

- Merge into the main archive:

  ```bash
  npm run update:emails
  ```
The merger automatically handles duplicates, enforces `allowedDomains`, and maintains alphabetical order.
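That merge behaviour (dedupe, domain filter, alphabetical sort) boils down to something like the sketch below; the function and parameter names are illustrative, not the real `update:emails` implementation:

```typescript
// A sketch of the merge behaviour: lower-case and dedupe, drop addresses
// outside allowedDomains, and keep the final list alphabetical.
export function mergeEmails(
  existing: string[],
  incoming: string[],
  allowedDomains: string[] = []
): string[] {
  const merged = new Set([...existing, ...incoming].map((e) => e.trim().toLowerCase()));

  return [...merged]
    .filter((email) => {
      if (allowedDomains.length === 0) return true;
      const domain = email.split("@")[1] ?? "";
      return allowedDomains.some((d) => domain === d || domain.endsWith("." + d));
    })
    .sort();
}
```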
For deeper harvesting on JavaScript-heavy sites, use the Scrapy project:
- Activate the virtual environment:

  ```bash
  cd scrapy
  source .venv/bin/activate
  ```

- Run a spider:

  ```bash
  cd uae_airlines_crawler
  scrapy crawl multi_contacts -O ../../output/contacts.jsonl
  ```

- Merge with the TypeScript pipeline:

  ```bash
  npm run update:emails
  ```
The spider respects depth limits, filters placeholder emails, and follows configured domains.
This project includes several performance optimizations:
- LRU Caching: HTTP responses, DNS lookups, and HTML content are cached
- Connection Pooling: Persistent HTTP/HTTPS connections for faster requests
- Parallel Processing: Controlled concurrency using Promise pools
- Database Storage: SQLite for efficient large-scale data management
- Memory Management: Smart in-memory limits with database fallback
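To illustrate the controlled-concurrency point, here is a minimal promise-pool helper of the kind such pipelines typically rely on; this is a generic sketch, not the project's own utility. A pool like this bounds how many sites are crawled or probed at once without serializing the whole run.

```typescript
// A minimal promise pool: run `worker` over `items` with bounded concurrency.
export async function promisePool<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency = 5
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let cursor = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (cursor < items.length) {
      const index = cursor++;
      results[index] = await worker(items[index]);
    }
  }

  const runners = Array.from({ length: Math.min(concurrency, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```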
Run the test suite:
```bash
# Run all tests
npm test

# Run tests in watch mode
npm run test:watch
```

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Run tests and linting (`npm test && npm run lint`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Future enhancements under consideration:
- Google My Business session scraping for UI parity
- Enhanced Playwright-based JavaScript rendering for SPAs
- Pluggable deliverability providers (ZeroBounce, NeverBounce, etc.)
- REST API & dashboard frontend
- Advanced analytics and reporting
- Enhanced security and rate limiting
This project is licensed under the MIT License - see the LICENSE file for details.
This project is intended for educational purposes only. Users are responsible for:
- Complying with all applicable laws and regulations regarding web scraping
- Respecting `robots.txt` and website terms of service
- Obtaining proper consent before collecting personal information
- Using collected data ethically and responsibly
The authors and contributors are not responsible for any misuse of this software.
This project draws inspiration from several open-source projects:
- Map Email Scraper - Google Maps prospecting patterns
- MailScout - Deliverability verification approaches
- e-scraper - Web crawling utilities
- crawler-python - Recursive crawling strategies
- TS-email-scraper - TypeScript email extraction patterns
For questions, issues, or contributions:
- Open an Issue
- Submit a Pull Request
- Check the Documentation
Made with ❤️ for educational purposes

⭐ Star this repo if you find it helpful!