AI-powered job prospect automation: Discover companies from ProductHunt → Extract team data → Generate personalized emails using LLMs. Features parallel processing, Notion integration, and comprehensive campaign management.

Unisami/ProspectAI


🚀 ProspectAI

Intelligent Job Prospecting Automation

Transform your job search with AI-powered company discovery, team extraction, and personalized outreach

Python 3.13+ · License: MIT · AI Powered · Zero Truncation · Cross Platform · Tests Passing


🎯 From ProductHunt Discovery to Personalized Outreach in 3-5 Minutes

Complete automation: Discover companies → Extract teams → Find contacts → Generate emails

🚀 Quick Start • 📖 Documentation • 🔧 Troubleshooting • 💡 Examples

✨ What is ProspectAI?

ProspectAI is an intelligent automation system that streamlines job prospecting: it discovers new companies on ProductHunt, extracts team member information with AI, finds verified contact details, and generates highly personalized outreach emails designed to get responses.

🔥 Why Choose ProspectAI?

| 🚀 Speed | 🧠 Intelligence | 📧 Success | 🔧 Reliability |
|---|---|---|---|
| 10x faster | Multi-AI support | High response rate | Zero data loss |
| 3-5 minutes vs hours | 5 AI providers | Personalized emails | No truncation ever |
| Parallel processing | 47K tokens per company | Business context | Complete pipeline |

🎯 Complete Automation Pipeline:

flowchart LR
    A[🔍 Discover Companies] --> B[🧠 Extract Teams]
    B --> C[💼 Find LinkedIn Profiles]
    C --> D[📧 Discover Emails]
    D --> E[🤖 Generate Personalized Emails]
    E --> F[📊 Store in Notion]
    F --> G[🚀 Send & Track]
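The flow above can be sketched as a chain of stage functions. The bodies below are stubs with illustrative names and sample data, not the project's actual modules; in the real system, scrapers, AI parsers, and the Notion client fill these slots.

```python
# Hedged sketch of the pipeline shape: each stage enriches a company record.
def discover_companies(limit: int) -> list[dict]:
    # Stand-in for ProductHunt scraping
    return [{"name": f"Company {i}"} for i in range(limit)]

def extract_team(company: dict) -> dict:
    # Stand-in for AI team extraction
    return {**company, "team": [{"name": "Jane Founder", "role": "CEO"}]}

def find_contacts(company: dict) -> dict:
    # Stand-in for LinkedIn/email discovery
    team = [{**m, "email": "jane@example.com"} for m in company["team"]]
    return {**company, "team": team}

def generate_emails(company: dict) -> dict:
    # Stand-in for LLM-based email generation
    drafts = [f"Hi {m['name'].split()[0]}, ..." for m in company["team"]]
    return {**company, "drafts": drafts}

def run_pipeline(limit: int = 5) -> list[dict]:
    results = []
    for company in discover_companies(limit):
        results.append(generate_emails(find_contacts(extract_team(company))))
    return results
```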

🌟 Core Features

📊 Feature Comparison Matrix

| Feature Category | Traditional Manual | Basic Tools | ProspectAI |
|---|---|---|---|
| 🔍 Company Discovery | 2-3 hours | 30-60 min | 3-5 min |
| 🧠 Team Extraction | Manual research | Basic scraping | AI-powered |
| 💼 LinkedIn Finding | Click-by-click | Slow scraping | 20x faster |
| 📧 Email Discovery | Manual search | Basic tools | Hunter.io + AI |
| ✍️ Email Writing | Generic templates | Basic personalization | AI personalized |
| 📊 Data Storage | Spreadsheets | Basic CRM | Notion + no limits |
| 🔄 Automation Level | 0% | 30-40% | 95%+ automated |

🚀 Discovery & Intelligence

  • 🎯 Multi-Strategy ProductHunt Scraping: Apollo GraphQL + Selenium + Requests
  • 🧠 AI-Enhanced Team Extraction: 4-strategy identification with advanced AI models
  • ⚡ Ultra-Fast LinkedIn Discovery: 10-30s per profile (vs 6-7 minutes previously, roughly 20x faster)
  • 💼 Smart Profile Caching: Failed searches cached to prevent repeats
  • 🎭 Website Intelligence: Dual-method URL extraction and validation

🤖 AI-Powered Processing

  • 🌐 Multi-Provider Architecture: OpenAI, Azure, Anthropic, Google, DeepSeek
  • 💡 Unified AI Service: Centralized processing with provider switching
  • 📈 47K Tokens Per Company: Comprehensive analysis without limits
  • 🎨 Zero Data Truncation: Complete preservation via Notion blocks
  • 📋 Rich Text Storage: Unlimited content with intelligent splitting
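Notion's API caps each rich-text element at roughly 2,000 characters, so "unlimited content" means chunking long text before storage. A minimal lossless sketch; the project's "intelligent splitting" presumably also prefers paragraph boundaries, which this hard-cut version does not:

```python
# Minimal lossless splitter for Notion's ~2,000-character rich-text limit.
# Illustrative only, not the project's actual implementation.
def split_for_rich_text(text: str, limit: int = 2000) -> list[str]:
    """Split text into chunks of at most `limit` characters, losing nothing."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]
```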

📧 Email Generation & Outreach

  • Emotionally Resonant Writing: Authentic, builder-focused language
  • 🎯 High-Converting Templates: Under 150 words with "tl;dr" sections
  • 📊 Business Intelligence Context: Market analysis + competitive insights
  • 🕰️ Sender Profile Integration: Professional matching for authenticity
  • 📬 Multiple Email Types: Cold outreach, referral, product interest, networking

📊 Performance & Monitoring

  • Parallel Processing: 3-5x faster with configurable worker pools
  • 🏃 Optimized Rate Limiting: 0.5s delays vs 2.0s default (4x faster)
  • 📈 Real-Time Analytics: Token usage, success rates, performance tracking
  • 🔄 Multi-Tier Caching: Memory + persistent with intelligent invalidation
  • 📊 Campaign Dashboard: Live progress tracking in Notion
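A minimal sketch of the multi-tier caching idea, assuming a JSON file as the persistent tier and a TTL for invalidation; the file path and TTL here are arbitrary, and the project's real cache layer is more elaborate:

```python
import json
import os
import time

# Two-tier cache sketch: in-memory dict backed by a JSON file,
# with time-based invalidation on read.
class TwoTierCache:
    def __init__(self, path: str = "cache.json", ttl: float = 3600.0):
        self.path, self.ttl = path, ttl
        self.memory: dict[str, tuple[float, object]] = {}
        if os.path.exists(path):
            with open(path) as f:
                # Persistent tier warms the memory tier on startup
                self.memory = {k: tuple(v) for k, v in json.load(f).items()}

    def get(self, key: str):
        entry = self.memory.get(key)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]                # fresh hit
        self.memory.pop(key, None)         # expired or missing: invalidate
        return None

    def set(self, key: str, value) -> None:
        self.memory[key] = (time.time(), value)
        with open(self.path, "w") as f:    # persist the whole tier
            json.dump(self.memory, f)
```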

🔧 Developer Experience

  • 🎨 Rich CLI Interface: Beautiful progress bars and status updates
  • 🔍 Comprehensive Testing: 16+ test scripts for all components
  • 📈 Debug Utilities: Verbose logging and component isolation
  • ⚙️ Configuration Validation: Built-in API key and settings validation
  • 📊 Performance Benchmarking: Automated speed and accuracy testing

📋 Navigation & Quick Links


📝 Project Status: This project has been reorganized for better maintainability and performance. All utility scripts are now in scripts/, configuration templates in config/, comprehensive guides in docs/, and detailed reports in reports/.

🚀 Performance: Recent optimizations have made the system 4-6x faster overall, with LinkedIn discovery now 20x faster (10-30s vs 10+ minutes).

📖 Need Help? Check the 🔧 Troubleshooting Guide or 📋 Usage Examples

⚡ Quick Start

🎯 Get your first personalized outreach emails in under 10 minutes

🚀 One-Command Setup

# Clone, install, and run your first campaign
git clone <repository-url> && cd job-prospect-automation
pip install -r requirements.txt
cp .env.example .env
# Add your API keys to .env (see Configuration section)
python cli.py run-campaign --limit 5 --generate-emails

📋 Step-by-Step Setup

🔧 Detailed Installation Steps

1️⃣ Clone and Install

git clone <repository-url>
cd job-prospect-automation
pip install -r requirements.txt

2️⃣ Configure API Keys

cp .env.example .env
# Edit .env with your API keys (see Configuration section below)

3️⃣ Validate Setup

python cli.py validate-config

This command will:

  • ✅ Validate all configuration settings
  • ✅ Test connections to all APIs (Notion, Hunter.io, AI Provider, Resend)
  • ✅ Verify sender profile completeness

4️⃣ Test Pipeline

python scripts/test_full_pipeline.py

5️⃣ Create Dashboard

python scripts/setup_dashboard.py

6️⃣ Run First Campaign

python cli.py run-campaign --limit 10 --generate-emails

7️⃣ Test Email Generation

python scripts/test_email_pipeline.py

🎯 First Run

After setup, your first campaign will:

  • ✅ Discover 5-10 companies from ProductHunt
  • ✅ Extract team members with AI precision
  • ✅ Find LinkedIn profiles and email addresses
  • ✅ Generate personalized outreach emails
  • ✅ Store everything in Notion with zero truncation

Expected Results:

  • 📊 ~10-15 prospects discovered and processed
  • 💰 ~$0.15-0.25 in AI processing costs (actual: $0.015 per prospect)
  • ⏱️ 3-5 minutes total processing time (4-6x faster with performance optimizations)
  • 📧 High-quality personalized emails ready to send

⚡ Performance Optimization

For maximum speed, run the comprehensive performance optimization script:

# Apply all performance optimizations (recommended)
python scripts/fix_all_performance_issues.py

This script provides:

  • 20x faster LinkedIn finding (6-7 minutes → 10-30 seconds)
  • 4-6x faster overall pipeline (15-20 minutes → 3-5 minutes)
  • 2-3x faster WebDriver operations
  • 3-5x faster HTTP requests
  • Optimized rate limiting across all services

🛠 Installation

Prerequisites

  • Python 3.13 or higher
  • pip package manager
  • Internet connection for API access

Step-by-Step Installation

  1. Clone the Repository

    git clone <repository-url>
    cd job-prospect-automation
  2. Create Virtual Environment (Recommended)

    python -m venv venv
    
    # On Windows
    venv\Scripts\activate
    
    # On macOS/Linux
    source venv/bin/activate
  3. Install Dependencies

    pip install -r requirements.txt
  4. Verify Installation

    python cli.py --help

Docker Installation (Optional)

# Build Docker image
docker build -t job-prospect-automation .

# Run with environment file
docker run --env-file .env job-prospect-automation discover --limit 10

⚙️ Configuration

🔑 Required API Keys

🎯 4 API keys unlock the complete automation pipeline

The system requires the following API keys for full functionality:

  1. Notion Integration Token (Required)

    • Go to Notion Developers
    • Create a new integration with read/write permissions
    • Copy the Internal Integration Token
    • Usage: Stores unlimited prospect data with rich text blocks (no truncation)
  2. Hunter.io API Key (Required)

    • Sign up at Hunter.io
    • Go to API section in your dashboard
    • Copy your API key (free tier: 25 requests/month)
    • Usage: Email discovery with verification and confidence scoring
  3. AI Provider API Key (Choose one or more)

    • 5 AI Providers Supported: Choose the best fit for your needs and budget
    • OpenAI: Most popular, proven performance, extensive features
    • Azure OpenAI: Enterprise-grade with custom deployments and Microsoft integration
    • Anthropic: Constitutional AI with Claude models, safety-focused, long context
    • Google Gemini: Multimodal capabilities with extremely long context windows
    • DeepSeek: Cost-effective with specialized models for coding and reasoning
    • Setup Guide: Complete AI Provider Setup
    • Usage: 47K tokens per company ($0.08), optimized with 2 consolidated AI calls for parsing, analysis, and email generation
  4. Resend API Key (Optional - for email sending)

    • Sign up at Resend
    • Create API key for email delivery
    • Configure domain for better deliverability
    • Usage: Automated email delivery with tracking and analytics

💰 AI Processing Costs

🎯 Real-World Performance Data

| Metric | Actual Results | Cost Analysis |
|---|---|---|
| Total tokens used | 112.15K tokens | $0.196 total cost |
| Prospects processed | 13 prospects | Complete pipeline |
| Tokens per prospect | ~8,627 tokens | $0.015 per prospect |
| Processing stages | Discovery → Email sent | End-to-end automation |

📊 Detailed Breakdown:

  • Company Discovery: AI-powered ProductHunt parsing and team extraction
  • Profile Intelligence: LinkedIn scraping with AI structuring
  • Business Analysis: Market insights, funding data, competitive intelligence
  • Email Generation: Personalized outreach with business context
  • Complete Pipeline: From discovery to delivered emails

💡 Cost Efficiency:

  • Per Prospect: $0.015 (vs $15-25 manual research time)
  • Daily Budget (50 prospects): ~$0.75
  • Monthly Investment: ~$22.50 (vs $18,750 manual equivalent)
  • ROI: 1,250x cost savings compared to manual research
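For reference, the figures above compose as follows; the 1,250x ROI matches the midpoint (~$18.75) of the $15-25 manual-research estimate, a per-prospect comparison:

```python
# Reproduce the cost figures quoted above from the per-prospect rate.
cost_per_prospect = 0.015                  # USD, from the usage data above
daily = 50 * cost_per_prospect             # 50 prospects/day -> $0.75
monthly = 30 * daily                       # 30-day month     -> $22.50
manual_per_prospect = 18.75                # midpoint of the $15-25 estimate
roi = manual_per_prospect / cost_per_prospect
assert round(daily, 2) == 0.75
assert round(monthly, 2) == 22.50
assert round(roi) == 1250
```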

Configuration Methods

Method 1: Environment Variables

Create a .env file:

cp .env.example .env

Edit .env with your API keys:

# Required API Keys
NOTION_TOKEN=your_notion_integration_token_here
HUNTER_API_KEY=your_hunter_io_api_key_here
OPENAI_API_KEY=your_openai_api_key_here

# Optional: Email Sending (Resend)
RESEND_API_KEY=your_resend_api_key_here
SENDER_EMAIL=your-name@yourdomain.com
SENDER_NAME=Your Full Name

# Optional: Notification Settings
ENABLE_NOTIFICATIONS=true
NOTIFICATION_METHODS=['notion']  # Available: notion, email, webhook

# Optional: User mention settings for enhanced notifications (future feature)
NOTION_USER_ID=your-notion-user-id  # For @mentions in notifications
USER_EMAIL=your-email@domain.com    # For @remind notifications

# Enhanced AI Features
ENABLE_AI_PARSING=true
ENABLE_PRODUCT_ANALYSIS=true
ENHANCED_PERSONALIZATION=true
AI_PARSING_MODEL=gpt-4
EMAIL_GENERATION_MODEL=gpt-4
ENABLE_LINKEDIN_DISCOVERY=true
ENABLE_DATA_QUALITY_FIXES=true

# Processing Settings
SCRAPING_DELAY=0.3
HUNTER_REQUESTS_PER_MINUTE=10
MAX_PRODUCTS_PER_RUN=50
MAX_PROSPECTS_PER_COMPANY=3
EMAIL_TEMPLATE_TYPE=professional
PERSONALIZATION_LEVEL=medium

# Data Quality Settings
PREVENT_DATA_TRUNCATION=true
ENABLE_RICH_TEXT_STORAGE=true
LINKEDIN_SEARCH_STRATEGIES=4
MAX_LINKEDIN_SEARCH_ATTEMPTS=3

# Workflow Settings
AUTO_SEND_EMAILS=false
EMAIL_REVIEW_REQUIRED=true
ENABLE_ENHANCED_WORKFLOW=true
ENABLE_BATCH_PROCESSING=true

Method 2: Configuration File

Create a configuration file:

python cli.py init-config config.yaml

Edit the generated file with your settings:

NOTION_TOKEN: "your_notion_token_here"
HUNTER_API_KEY: "your_hunter_api_key_here"
OPENAI_API_KEY: "your_openai_api_key_here"

# Rate limiting
SCRAPING_DELAY: 0.3
HUNTER_REQUESTS_PER_MINUTE: 10

# Processing limits
MAX_PRODUCTS_PER_RUN: 50
MAX_PROSPECTS_PER_COMPANY: 10

# Email settings
EMAIL_TEMPLATE_TYPE: "professional"
PERSONALIZATION_LEVEL: "medium"

Use with CLI:

python cli.py --config config.yaml discover

Configuration Validation

Test your configuration:

# Validate all configuration settings and test API connections
python cli.py validate-config

# Test with dry-run mode
python cli.py --dry-run discover --limit 1

# Run comprehensive pipeline test
python scripts/test_full_pipeline.py

📖 Usage

🎯 Powerful CLI and Python API for complete workflow control

🖥️ Graphical User Interface (GUI)

For users who prefer a visual interface over command-line tools, we provide a simple GUI application:

# Run the GUI application
python run_gui.py

# Or on Windows:
run_gui.bat

# Or on Linux/macOS:
./run_gui.sh

The GUI provides:

  • 🎨 User-Friendly Interface: Simple point-and-click operation
  • 📋 All Main Commands: Access to discover, run-campaign, process-company, and generate-emails
  • ⚙️ Configuration Management: Easy setup of environment and config files
  • 📊 Real-time Output: View command output as it runs
  • 🚫 Cancel Operations: Stop long-running processes when needed

GUI Features:

  1. Dashboard Tab: Quick overview and system status
  2. Discover Tab: Run company discovery with customizable parameters
  3. Run Campaign Tab: Execute complete campaigns with email generation
  4. Process Company Tab: Process specific companies
  5. Generate Emails Tab: Create personalized emails for prospects
  6. Settings Tab: Configure environment and default settings

For detailed information about the GUI, see GUI Runner Documentation.

🖥️ Command Line Interface

The system provides a comprehensive CLI with intuitive commands for every workflow:

Global Options

  • --config, -c: Path to configuration file
  • --dry-run: Test mode without API calls
  • --verbose, -v: Enable detailed logging
  • --help: Show help information

Main Commands

1. Configuration and Testing

# Validate configuration
python cli.py validate-config

# Test full pipeline
python scripts/test_full_pipeline.py

# Test email generation only
python scripts/test_email_pipeline.py

2. Discovery Pipeline

# Run complete campaign workflow (recommended)
python cli.py run-campaign --limit 10 --generate-emails

# Alternative: Discovery only
python cli.py discover --limit 10

# Test without API calls
python cli.py --dry-run discover --limit 5

# Run with sender profile
python cli.py discover --limit 10 --sender-profile profiles/my_profile.md

3. Process Specific Company

# Process by company name
python cli.py process-company "Acme Corp"

# With known domain
python cli.py process-company "Acme Corp" --domain acme.com

4. Email Generation and Sending

# Generate emails for specific prospects
python cli.py generate-emails --prospect-ids "id1,id2,id3"

# Generate emails for recent prospects (convenience command)
python cli.py generate-emails-recent --limit 5

# Alternative: Send recently generated emails separately
python cli.py send-emails-recent --limit 5

# Use sender profile
python cli.py generate-emails --prospect-ids "id1" --sender-profile profiles/my_profile.md

5. Data Quality Management

# Fix truncated data in existing prospects
python scripts/fix_all_truncation_issues.py

# Find missing LinkedIn URLs
python scripts/find_missing_linkedin_urls.py

# Test Notion storage limits
python scripts/test_notion_storage_limits.py

# Analyze data quality
python scripts/fix_all_truncation_issues.py analyze

6. AI Provider Management

# List available AI providers
python cli.py list-ai-providers

# Configure a specific provider
python cli.py configure-ai --provider anthropic

# Switch active provider
python cli.py set-ai-provider anthropic

# Validate AI provider configuration
python cli.py validate-ai-config

# Test provider connection
python cli.py test-ai-provider anthropic

7. System Status and Monitoring

# Check system status
python cli.py status

# View batch processing history
python cli.py batch-history

# Monitor AI token usage
python -c "from AI_TOKEN_CONSUMPTION_ANALYSIS import analyze_usage; analyze_usage()"

# Check LinkedIn coverage stats
python scripts/find_missing_linkedin_urls.py --stats

Python API

Use the system programmatically:

from controllers.prospect_automation_controller import ProspectAutomationController
from utils.config import Config
from services.email_generator import EmailTemplate

# Initialize
config = Config.from_file("config.yaml")  # or Config.from_env()
controller = ProspectAutomationController(config)

# Run discovery pipeline
results = controller.run_discovery_pipeline(limit=10)
print(f"Found {results['summary']['prospects_found']} prospects")

# Process specific company
from models.data_models import CompanyData
company = CompanyData(
    name="Acme Corp", 
    domain="acme.com", 
    product_url="https://acme.com",
    description="AI-powered analytics platform"
)
prospects = controller.process_company(company)

# Generate emails
prospect_ids = [p.id for p in prospects if p.id]
email_results = controller.generate_outreach_emails(
    prospect_ids=prospect_ids,
    template_type=EmailTemplate.COLD_OUTREACH
)

# Generate and send emails
send_results = controller.generate_and_send_outreach_emails(
    prospect_ids=prospect_ids,
    template_type=EmailTemplate.COLD_OUTREACH,
    send_immediately=True
)

📚 API Documentation

🔧 Comprehensive Python API for custom integrations

🏗️ Core Components

ProspectAutomationController

Main orchestrator for the entire workflow.

Key Methods:

  • run_discovery_pipeline(limit=50): Run complete discovery workflow with AI enhancement
  • process_company(company_data): Process single company with AI parsing and analysis
  • generate_outreach_emails(prospect_ids, template_type): Generate personalized emails
  • generate_and_send_outreach_emails(prospect_ids, template_type, send_immediately): Generate and send emails
  • send_prospect_emails(prospect_ids, batch_size=5, delay=30): Send already generated emails with batch processing
  • get_workflow_status(): Get system status and statistics
  • set_sender_profile(profile_path): Set sender profile for personalization

AIProviderManager

Central manager for all AI providers with thread-safe operations.

Key Methods:

  • get_provider_manager(): Get singleton instance
  • configure_provider_manager(config): Configure with system settings
  • list_providers(): List all registered providers
  • get_active_provider_name(): Get currently active provider
  • set_active_provider(name): Switch active provider
  • validate_provider(name): Validate provider configuration
  • make_completion(request, provider_name): Make AI completion request
  • get_provider_status(): Get comprehensive provider status

Data Models

CompanyData

@dataclass
class CompanyData:
    name: str
    domain: str
    product_url: str
    description: str
    launch_date: datetime

Prospect

@dataclass
class Prospect:
    id: str
    name: str
    role: str
    company: str
    linkedin_url: Optional[str]
    email: Optional[str]
    contacted: bool
    notes: str
    created_at: datetime
    
    # Email tracking fields
    email_generation_status: str
    email_delivery_status: str
    email_subject: str
    email_content: str
    email_generated_date: Optional[str]
    email_sent_date: Optional[str]

Service Components

ProductHuntScraper

  • get_latest_products(limit): Multi-strategy ProductHunt discovery
  • extract_team_info(product_url): 4-strategy team extraction (LinkedIn URLs extracted when available from ProductHunt)
  • extract_company_domain(product_data): Website URL extraction with validation

AIParser

  • structure_team_data(raw_html, company): AI-powered team data structuring
  • parse_linkedin_profile(raw_html): LinkedIn profile parsing with confidence scoring
  • parse_product_info(raw_content): Product analysis with market intelligence
  • extract_business_metrics(company_data): Business insights and funding analysis

LinkedInFinderOptimized

  • find_linkedin_urls_for_team(team_members): Ultra-fast LinkedIn URL discovery with smart caching
  • _fast_linkedin_search(member): Single fast search strategy with 3 methods
  • _direct_linkedin_search(member): Direct URL pattern matching with HEAD requests
  • _quick_google_search(member): Fast Google/DuckDuckGo search with 3s timeout
  • _generate_likely_linkedin_url(member): Intelligent URL generation from name patterns
  • _quick_url_check(url): Fast URL validation with 2s timeout
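As an illustration of name-pattern URL generation, candidate public-profile slugs can be derived from a person's name, most specific first. This is a hedged sketch; the real `_generate_likely_linkedin_url` may order or filter candidates differently:

```python
import re

# Build candidate LinkedIn public-profile URLs from name patterns.
# Illustrative only; candidates still need validation (e.g. HEAD requests).
def likely_linkedin_urls(name: str) -> list[str]:
    parts = [re.sub(r"[^a-z]", "", p.lower()) for p in name.split()]
    parts = [p for p in parts if p]
    if len(parts) < 2:
        return []  # need at least first and last name
    first, last = parts[0], parts[-1]
    slugs = [f"{first}-{last}", f"{first}{last}", f"{first[0]}{last}"]
    return [f"https://www.linkedin.com/in/{s}" for s in slugs]
```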

EmailFinder

  • find_company_emails(domain): Hunter.io integration with pattern generation
  • find_person_email(name, domain): Specific person email discovery
  • verify_email(email): Email validation with confidence scoring
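When Hunter.io returns no verified pattern, common-pattern guessing is the usual fallback. A hedged sketch of that idea, not the actual `EmailFinder` internals:

```python
# Generate candidate addresses from common corporate email patterns.
# Illustrative fallback only; real results come from Hunter.io verification.
def candidate_emails(first: str, last: str, domain: str) -> list[str]:
    first, last = first.lower(), last.lower()
    patterns = [
        f"{first}.{last}",   # jane.doe@
        f"{first}{last}",    # janedoe@
        f"{first[0]}{last}", # jdoe@
        f"{first}",          # jane@
        f"{first}_{last}",   # jane_doe@
    ]
    return [f"{p}@{domain}" for p in patterns]
```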

NotionDataManager

  • store_ai_structured_data(prospect_id, **data): Zero-truncation data storage
  • _create_rich_text_blocks(text): Smart content splitting for unlimited length
  • get_prospect_data_for_email(prospect_id): Complete data retrieval for personalization
  • create_campaign_dashboard(): Create campaign management databases
  • create_campaign(campaign_data, campaigns_db_id): Track campaign progress
  • log_processing_step(logs_db_id, ...): Log detailed processing steps
  • update_system_status(status_db_id, ...): Monitor component health

EmailGenerator

  • generate_enhanced_outreach_email(prospect_id, notion_manager): AI-powered personalization
  • generate_and_send_bulk_emails(prospect_ids): Batch processing with rate limiting
  • _prepare_personalization_data(prospect, ai_data): Rich context preparation

💡 Examples

🎯 Real-world examples for every use case

🚀 Complete Workflow Example

#!/usr/bin/env python3
"""Complete workflow example with AI enhancement and data quality fixes."""

from controllers.prospect_automation_controller import ProspectAutomationController
from utils.config import Config
from utils.logging_config import setup_logging
from services.email_generator import EmailTemplate

def main():
    # Setup
    setup_logging(log_level="INFO")
    config = Config.from_file("config.yaml")
    controller = ProspectAutomationController(config)
    
    # Run enhanced discovery pipeline with LinkedIn discovery
    print("Starting AI-enhanced discovery with LinkedIn URL finding...")
    results = controller.run_discovery_pipeline(limit=10)
    
    # Display comprehensive results
    summary = results['summary']
    print(f"Companies processed: {summary['companies_processed']}")
    print(f"Prospects found: {summary['prospects_found']}")
    print(f"Emails found: {summary['emails_found']}")
    print(f"LinkedIn profiles: {summary.get('linkedin_profiles_extracted', 0)}")
    print(f"AI structured data: {summary.get('ai_structured_data_created', 0)}")
    print(f"Success rate: {summary['success_rate']:.1f}%")
    print(f"Token usage: ~{summary.get('total_tokens', 0)} tokens")
    
    # Get prospects from Notion with complete data
    prospects = controller.notion_manager.get_prospects()
    prospect_ids = [p.id for p in prospects[:5] if p.id and p.email]
    
    if prospect_ids:
        print(f"Generating enhanced emails for {len(prospect_ids)} prospects...")
        
        # Generate emails using AI-structured data (no truncation)
        email_results = controller.generate_outreach_emails(
            prospect_ids=prospect_ids,
            template_type=EmailTemplate.COLD_OUTREACH
        )
        
        successful = len(email_results.get('successful', []))
        failed = len(email_results.get('failed', []))
        print(f"Email generation: {successful} successful, {failed} failed")
        
        # Show personalization quality
        for result in email_results.get('successful', [])[:2]:
            print(f"Generated email for {result['prospect_name']}:")
            print(f"  Subject: {result['email_content']['subject']}")
            print(f"  Personalization score: {result['email_content']['personalization_score']:.2f}")
            print(f"  Body preview: {result['email_content']['body'][:150]}...")
        
        # Send emails with rate limiting
        if successful > 0:
            send_results = controller.generate_and_send_outreach_emails(
                prospect_ids=prospect_ids[:2],
                template_type=EmailTemplate.COLD_OUTREACH,
                send_immediately=False,  # Set to True to actually send
                delay_between_emails=2.0
            )
            
            print(f"Email sending: {send_results.get('emails_generated', 0)} generated")
            print(f"Sender profile used: {send_results.get('sender_profile_used', False)}")
            
        # Alternative: Send already generated emails in batches
        # This is useful when you want to review emails before sending
        prospect_ids_to_send = [p.id for p in prospects if p.email][:3]
        if prospect_ids_to_send:
            batch_results = controller.send_prospect_emails(
                prospect_ids=prospect_ids_to_send,
                batch_size=2,  # Send 2 emails per batch
                delay=10       # Wait 10 seconds between batches
            )
            
            print(f"Batch email sending: {batch_results['total_sent']} sent, {batch_results['total_failed']} failed")
    else:
        print("No prospects with emails found for email generation")

if __name__ == "__main__":
    main()

Data Quality Management Example

#!/usr/bin/env python3
"""Data quality management and LinkedIn discovery example."""

from services.notion_manager import NotionDataManager
from services.linkedin_finder import LinkedInFinder
from utils.config import Config

def main():
    config = Config.from_env()
    notion_manager = NotionDataManager(config)
    linkedin_finder = LinkedInFinder(config)
    
    # Analyze current data quality
    print("Analyzing data quality...")
    prospects = notion_manager.get_prospects()
    
    # Check for truncated data
    truncated_count = 0
    missing_linkedin_count = 0
    
    for prospect in prospects:
        prospect_data = notion_manager.get_prospect_data_for_email(prospect.id)
        
        # Check for truncation indicators
        for field, value in prospect_data.items():
            if isinstance(value, str) and (
                len(value) in [200, 300, 400, 500] or  # Old limits
                value.endswith('...')  # Truncation indicator
            ):
                truncated_count += 1
                break
        
        # Check for missing LinkedIn URLs
        if not prospect_data.get('linkedin_url'):
            missing_linkedin_count += 1
    
    print(f"Data quality analysis:")
    print(f"  Total prospects: {len(prospects)}")
    print(f"  Prospects with truncated data: {truncated_count}")
    print(f"  Prospects without LinkedIn URLs: {missing_linkedin_count}")
    
    # Fix missing LinkedIn URLs
    if missing_linkedin_count > 0:
        print(f"Finding LinkedIn URLs for {missing_linkedin_count} prospects...")
        
        # Process in batches
        from models.data_models import TeamMember
        prospects_without_linkedin = [p for p in prospects if not p.linkedin_url]
        
        for prospect in prospects_without_linkedin[:5]:  # Process first 5
            team_member = TeamMember(
                name=prospect.name,
                role=prospect.role,
                company=prospect.company,
                linkedin_url=None
            )
            
            updated_members = linkedin_finder.find_linkedin_urls_for_team([team_member])
            
            if updated_members and updated_members[0].linkedin_url:
                # Update in Notion
                properties = {"LinkedIn": {"url": updated_members[0].linkedin_url}}
                notion_manager.client.pages.update(
                    page_id=prospect.id,
                    properties=properties
                )
                print(f"✓ Found LinkedIn URL for {prospect.name}")
            else:
                print(f"✗ No LinkedIn URL found for {prospect.name}")
    
    # Test data storage without truncation
    print("Testing unlimited data storage...")
    test_prospect = prospects[0] if prospects else None
    
    if test_prospect:
        # Store very long content
        long_content = "This is a comprehensive business analysis. " * 200  # 8000+ chars
        
        success = notion_manager.store_ai_structured_data(
            prospect_id=test_prospect.id,
            business_insights=long_content
        )
        
        if success:
            # Retrieve and verify
            retrieved_data = notion_manager.get_prospect_data_for_email(test_prospect.id)
            retrieved_length = len(retrieved_data.get('business_insights', ''))
            
            print(f"✓ Stored {len(long_content)} characters")
            print(f"✓ Retrieved {retrieved_length} characters")
            print(f"✓ No truncation: {retrieved_length == len(long_content)}")

if __name__ == "__main__":
    main()

Batch Processing Example

#!/usr/bin/env python3
"""Batch processing with progress tracking."""

from controllers.prospect_automation_controller import ProspectAutomationController
from models.data_models import CompanyData
from utils.config import Config

def progress_callback(progress):
    """Handle progress updates."""
    print(f"Progress: {progress.processed_companies}/{progress.total_companies}")
    print(f"Current: {progress.current_company}")
    print(f"Success rate: {progress.success_rate:.1f}%")

def main():
    config = Config.from_env()
    controller = ProspectAutomationController(config)
    
    # Define companies to process
    companies = [
        CompanyData(name="Company 1", domain="company1.com",
                    product_url="https://company1.com",
                    description="Example company one"),
        CompanyData(name="Company 2", domain="company2.com",
                    product_url="https://company2.com",
                    description="Example company two"),
        # ... more companies
    ]
    
    # Run batch processing
    results = controller.run_batch_processing(
        companies=companies,
        batch_size=3,
        progress_callback=progress_callback
    )
    
    print(f"Batch completed: {results['status']}")
    print(f"Total prospects: {results['summary']['total_prospects']}")

if __name__ == "__main__":
    main()

Error Handling Example

#!/usr/bin/env python3
"""Error handling and monitoring example."""

from utils.error_handling import get_error_handler, retry_with_backoff
from utils.api_monitor import get_api_monitor

@retry_with_backoff(max_retries=3)
def unreliable_api_call():
    """Example of API call with retry logic."""
    # Your API call here
    pass

def main():
    error_handler = get_error_handler()
    api_monitor = get_api_monitor()
    
    try:
        result = unreliable_api_call()
    except Exception as e:
        # Handle error with context
        error_info = error_handler.handle_error(
            error=e,
            service="my_service",
            operation="api_call",
            context={"param": "value"}
        )
        print(f"Error handled: {error_info.error_id}")
    
    # Check API health
    health = api_monitor.get_service_health()
    for service, status in health.items():
        print(f"{service}: {status.status.value}")

if __name__ == "__main__":
    main()

🔧 Troubleshooting

🛠️ Quick solutions for common issues

⚠️ Common Issues

1. Configuration Issues

Problem: Error: Configuration validation failed

Solutions:

# Validate your configuration
python cli.py validate-config

# Check specific issues
python cli.py validate-config --check-profile profiles/my_profile.md

# Test with dry-run
python cli.py --dry-run discover --limit 1

2. API Key Issues

Problem: Error: Invalid API key or Error: NOTION_TOKEN environment variable is required

Solutions:

  • Verify .env file exists and contains correct keys
  • Check environment variable names match exactly
  • Ensure no extra spaces or quotes in .env file
  • Test individual APIs and connections: python cli.py validate-config

3. Email Generation Issues

Problem: Error: Failed to generate email or 'dict' object has no attribute 'basic_info'

Solutions:

# Test email pipeline specifically
python scripts/test_email_pipeline.py

# Debug email content character issues
python scripts/debug_email_content.py

# Check sender profile completeness
python cli.py validate-config --check-profile profiles/my_profile.md

# Verify prospect data quality
python cli.py status

4. Rate Limiting

Problem: Error: Rate limit exceeded

Solutions:

  • Increase SCRAPING_DELAY in configuration (try 1.0 or higher)
  • Reduce HUNTER_REQUESTS_PER_MINUTE (try 5 or lower)
  • Use smaller batch sizes: --limit 5
  • Wait before retrying (system has automatic backoff)
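For example, a more conservative `.env` might look like this (the values below are suggestions for stability, not the project's defaults):

```shell
# .env — conservative rate-limit settings (example values)
SCRAPING_DELAY=1.0              # seconds between scraping requests
HUNTER_REQUESTS_PER_MINUTE=5    # lower Hunter.io request rate
```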

5. Notion Database Issues

Problem: Error: Cannot access Notion database

Solutions:

  • Ensure Notion integration has proper permissions
  • Let system create new database automatically
  • Verify integration is added to the workspace

Note: The system automatically handles duplicate prospects by returning the existing prospect's ID instead of creating a new entry. This ensures data consistency and prevents duplicate records in your database.
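The duplicate-handling idea can be sketched as follows; the class and method names here are illustrative placeholders, not the project's actual Notion API (a minimal sketch of the behavior, assuming prospects are keyed by email):

```python
# Minimal sketch of duplicate handling: return the existing prospect's ID
# when the same email has already been stored, and remember processed
# domains so a company is skipped in subsequent runs.
class ProspectStore:
    def __init__(self):
        self._by_email = {}            # email -> prospect_id
        self.processed_domains = set() # cache of already-processed companies
        self._next_id = 1

    def store_prospect(self, email: str, domain: str) -> int:
        if email in self._by_email:    # duplicate: reuse the existing ID
            return self._by_email[email]
        prospect_id = self._next_id
        self._next_id += 1
        self._by_email[email] = prospect_id
        self.processed_domains.add(domain)
        return prospect_id

store = ProspectStore()
first = store.store_prospect("a@x.com", "x.com")
again = store.store_prospect("a@x.com", "x.com")
print(first == again)  # True — no duplicate record created
```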


🏗️ Architecture Overview

graph TD
    A[ProductHunt Discovery] --> B[AI Team Extraction]
    B --> C[LinkedIn Profile Discovery]
    C --> D[Email Finding]
    D --> E[AI Business Analysis]
    E --> F[Personalized Email Generation]
    F --> G[Notion Storage]
    G --> H[Email Delivery]
    
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
    style F fill:#f1f8e9
    style G fill:#e3f2fd
    style H fill:#fff8e1


🔄 Data Flow

  1. Discovery: Multi-strategy ProductHunt scraping with Apollo GraphQL
  2. Extraction: 4-strategy team member identification with AI structuring
  3. Enrichment: LinkedIn discovery + email finding + business analysis
  4. Storage: Zero-truncation Notion storage with rich text blocks
  5. Generation: AI-powered personalized emails with business context
  6. Delivery: Automated sending with tracking and analytics
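The six stages above can be sketched as a sequential pipeline; every function below is a simplified stand-in for illustration, not the project's real module API (the actual system runs these stages in parallel):

```python
# Illustrative skeleton of the six-stage data flow with stub stages.
def discover_companies(limit):
    # 1. Discovery: stand-in for ProductHunt scraping
    return [{"name": f"Company {i}", "domain": f"co{i}.com"} for i in range(limit)]

def extract_team(company):
    # 2. Extraction: stand-in for AI team identification
    return [{"name": "Jane Doe", "role": "CTO", "company": company["name"]}]

def enrich(member, company):
    # 3. Enrichment: stand-in for LinkedIn + email discovery
    member["email"] = f"jane@{company['domain']}"
    return member

def run_pipeline(limit=2):
    prospects = []
    for company in discover_companies(limit):
        for member in extract_team(company):
            member = enrich(member, company)
            # 4-6. Storage, generation, and delivery would happen here.
            prospects.append(member)
    return prospects

print(len(run_pipeline(2)))  # 2
```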

🎯 Use Cases

| Job Seekers | Sales Teams | Recruiters | Business Development |
|---|---|---|---|
| Find hiring companies | Prospect new clients | Discover talent pools | Identify partnerships |
| Connect with hiring managers | Generate warm leads | Build candidate pipelines | Research market opportunities |
| Personalized outreach | AI-powered sales emails | Talent acquisition | Strategic relationship building |

🚀 Performance Metrics

| Metric | Manual Process | ProspectAI | Improvement |
|---|---|---|---|
| Time per Company | 45-60 minutes | 3-5 minutes | 10-15x faster |
| Data Accuracy | 60-70% | 85-95% | 25-35% better |
| Email Personalization | Basic | AI-enhanced | Professional grade |
| Scalability | 5-10 companies/day | 50-100 companies/day | 10x scale |
| Cost per Prospect | $15-25 (time) | $0.015 | 1,000-1,600x cheaper |

🛡️ Security & Privacy

  • 🔐 API Key Security: Environment-based configuration with validation
  • 🛡️ Rate Limiting: Respectful API usage with exponential backoff
  • 🔒 Data Privacy: No sensitive data stored in logs or temporary files
  • ✅ Compliance: GDPR-compliant data handling practices
  • 🚫 Anti-Spam: Built-in email validation and deliverability checks
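The exponential backoff mentioned above can be approximated with a small retry decorator. This is a minimal illustration of the pattern, not the project's actual `retry_with_backoff` implementation from `utils.error_handling`:

```python
import random
import time
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1.0):
    """Retry a function with exponential backoff plus jitter (sketch)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise
                    # base_delay, 2x, 4x, ... plus a little random jitter
                    time.sleep(base_delay * (2 ** attempt + random.random()))
        return wrapper
    return decorator

calls = {"n": 0}

@retry_with_backoff(max_retries=3, base_delay=0.01)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(flaky())  # ok (succeeds on the third attempt)
```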

🤝 Contributing

We welcome contributions! Here's how to get started:

🔧 Development Setup

# Clone and setup development environment
git clone <repository-url>
cd job-prospect-automation
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -r requirements-dev.txt  # Development dependencies

🧪 Running Tests

# Run all tests
pytest tests/

# Run specific test categories
python scripts/test_full_pipeline.py
python scripts/test_email_pipeline.py
python scripts/test_linkedin_optimization.py

# Run performance tests
python scripts/test_performance_comparison.py

📝 Code Style

  • Black: Code formatting
  • Flake8: Linting
  • MyPy: Type checking
  • Pytest: Testing framework
  • Import Analyzer: Unused import detection and circular dependency analysis

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • AI Provider for intelligent processing capabilities
  • Notion for unlimited data storage
  • Hunter.io for email discovery
  • ProductHunt for company discovery
  • Resend for email delivery
  • Python Community for excellent libraries

🌟 Star History

Star History Chart


💬 Questions? Issues? Ideas?


Made with ❤️ for job seekers and sales professionals worldwide

Transform your outreach. Scale your success. Automate your future.

⭐ Star this repo if ProspectAI helped you land your dream job or close more deals!

Note: The system also maintains a cache of processed companies and domains, so the same companies are not reprocessed in subsequent runs.

6. No Prospects Found

Problem: No prospects found or No team members found

Solutions:

# Test with different companies
python cli.py process-company "TechStartup" --domain techstartup.com

# Check ProductHunt scraping
python scripts/test_full_pipeline.py

# Verify team extraction
python scripts/test_team_extraction.py

# Test AI team extraction specifically
python -c "from services.ai_team_extractor import AITeamExtractor; from utils.config import Config; extractor = AITeamExtractor(Config.from_env()); print('AI extractor ready')"

7. Data Truncation Issues

Problem: Business insights truncated or Incomplete personalization data

Solutions:

# Check for truncated data
python scripts/fix_all_truncation_issues.py analyze

# Fix existing truncated data
python scripts/fix_all_truncation_issues.py

# Test Notion storage limits
python scripts/test_notion_storage_limits.py

# Verify data completeness
python -c "from services.notion_manager import NotionDataManager; from utils.config import Config; nm = NotionDataManager(Config.from_env()); prospects = nm.get_prospects(); print(f'Found {len(prospects)} prospects')"

8. Missing LinkedIn URLs

Problem: LinkedIn URL not found or Low LinkedIn coverage

Solutions:

# Check LinkedIn coverage statistics
python scripts/find_missing_linkedin_urls.py --stats

# Find missing LinkedIn URLs
python scripts/find_missing_linkedin_urls.py

# Test LinkedIn finder
python scripts/find_missing_linkedin_urls.py --test

# Manual test for specific person
python -c "from services.linkedin_finder import LinkedInFinder; from models.data_models import TeamMember; from utils.config import Config; finder = LinkedInFinder(Config.from_env()); member = TeamMember(name='John Smith', role='CEO', company='TestCorp', linkedin_url=None); result = finder.find_linkedin_urls_for_team([member]); print(f'Found: {result[0].linkedin_url if result and result[0].linkedin_url else \"Not found\"}')"

Debugging Steps

  1. Run Comprehensive Tests

    # Test full pipeline
    python scripts/test_full_pipeline.py
    
    # Test email functionality
    python scripts/test_email_pipeline.py
    
    # Test AI personalization data quality
    python scripts/test_personalization_fix.py
    
    # Validate configuration
    python cli.py validate-config
  2. Debug Specific Issues

    # Debug company discovery process step-by-step
    python scripts/debug_discovery.py
    
    # Debug email generation process step-by-step
    python scripts/debug_email_generation.py
    
    # Test dashboard creation components
    python scripts/test_dashboard_creation.py
    
    # Debug daily analytics creation issues
    python scripts/debug_daily_analytics.py
    
    # Debug email content character issues
    python scripts/debug_email_content.py
    
    # Debug Notion storage during parallel processing
    python scripts/debug_notion_storage.py
    
    # Debug individual components
    python cli.py --dry-run discover --limit 1
  3. Test Individual Components

    # Test discovery only
    python cli.py --dry-run discover --limit 1
    
    # Test specific company
    python cli.py --dry-run process-company "TechStartup" --domain techstartup.com
    
    # Test email generation
    python scripts/test_simple_email.py
    
    # Debug email content issues
    python scripts/debug_email_content.py
  4. Check System Status

    # Check overall status
    python cli.py status
    
    # Check with verbose logging
    python cli.py --verbose --dry-run discover --limit 1
  5. Monitor API Usage

    # Check API quotas
    python cli.py validate-config
    
    # Test API connections
    python -c "from utils.config import Config; Config.from_file('config.yaml').validate()"

Getting Help

  1. Check Documentation

    • Read this README thoroughly
    • Check docs/CLI_USAGE.md for detailed CLI help
    • Review example files in examples/ directory
  2. Enable Verbose Logging

    python cli.py --verbose [command]
  3. Use Dry-Run Mode

    python cli.py --dry-run [command]
  4. Check System Status

    python cli.py status

Performance Optimization

  1. Adjust Rate Limits

    • Increase delays for stability
    • Decrease for faster processing (risk of blocks)
  2. Optimize Batch Sizes

    • Smaller batches: More stable, slower
    • Larger batches: Faster, higher memory usage
  3. Monitor Resource Usage

    • Check memory consumption during large operations
    • Monitor API quota usage
    • Use progress callbacks for long operations
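A progress callback for long operations can be as simple as the function below; the `(completed, total, company_name)` signature matches the batch-processing example earlier in this README but is an assumption — check `run_batch_processing` for the exact contract:

```python
def progress_callback(completed, total, company_name):
    """Print a one-line progress update (assumed signature; adapt as needed)."""
    pct = 100.0 * completed / total
    print(f"[{pct:5.1f}%] {completed}/{total} done ({company_name})")

progress_callback(3, 10, "TechStartup")  # [ 30.0%] 3/10 done (TechStartup)
```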

✅ Current Status

The system is fully functional with recent major improvements. Here's what's working:

✅ Core Features (Fully Operational)

  • Multi-Strategy Discovery: ProductHunt scraping with Apollo GraphQL parsing
  • AI-Enhanced Team Extraction: 4-strategy team identification with 95%+ success rate
  • LinkedIn URL Discovery: Multi-search approach finding 60-80% of missing LinkedIn URLs
  • Zero-Truncation Data Storage: Complete preservation of business insights and personalization data
  • AI-Powered Email Generation: Advanced AI personalization with rich business context
  • Comprehensive Notion Integration: Unlimited content storage with rich text blocks

🚀 Recent Major Improvements

  • Data Quality Fixes: Eliminated all data truncation issues (was losing 70%+ of content)
  • LinkedIn Discovery: Added intelligent LinkedIn URL finding for missing profiles
  • AI Parser Enhancement: Increased token limits (2.5x more complete data)
  • Rich Text Storage: Notion integration now handles unlimited content length
  • Token Optimization: Comprehensive AI usage analysis and cost optimization

📊 Performance Metrics

Recent test run results:

✅ Configuration validation: PASSED
✅ Discovery pipeline: PASSED (10 companies, 23 prospects, 18 with emails)
✅ LinkedIn URL discovery: PASSED (found 14/23 missing LinkedIn URLs)
✅ Data storage: PASSED (6,195 char product summaries stored without truncation)
✅ Email generation: PASSED (AI personalization with complete business context)
✅ Token usage: ~47K tokens per company (~$0.082 cost)
✅ All services initialized successfully

🧪 Comprehensive Test Suite

  • python scripts/test_full_pipeline.py - Complete end-to-end workflow test
  • python scripts/test_email_pipeline.py - Email generation and sending validation
  • python scripts/test_email_send.py - Debug Resend API email sending issues
  • python scripts/debug_email_content.py - Analyze email content for problematic characters
  • python scripts/debug_notion_storage.py - Debug Notion storage during parallel processing
  • python scripts/test_personalization_fix.py - Verify AI personalization data quality and completeness
  • python scripts/test_company_deduplication.py - Test company deduplication functionality and performance
  • python scripts/test_notion_storage_limits.py - Data storage without truncation test
  • python scripts/test_data_fixes.py - LinkedIn finder and AI parser validation
  • python scripts/find_missing_linkedin_urls.py --stats - LinkedIn coverage analysis
  • python scripts/fix_all_truncation_issues.py analyze - Data quality assessment
  • python cli.py validate-config - Configuration validation and API connection testing
  • python cli.py --dry-run discover --limit 1 - Safe discovery test

💰 Cost Analysis (Transparent AI Usage)

  • Per Company: 46,900 tokens ($0.082)
  • Daily (50 companies): ~$4.11
  • Monthly: ~$123 for comprehensive prospect intelligence
  • ROI: High-quality personalized emails with complete business context
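The daily and monthly figures follow directly from the per-company cost (using ~$0.0822 per company and a 30-day month):

```python
cost_per_company = 0.0822     # ~46,900 tokens per company (≈ $0.082)
daily = 50 * cost_per_company # 50 companies per day
monthly = 30 * daily          # 30-day month
print(f"daily ≈ ${daily:.2f}, monthly ≈ ${monthly:.0f}")  # daily ≈ $4.11, monthly ≈ $123
```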

📚 Documentation

Comprehensive documentation is available in the docs/ directory and root-level guides:

Core Documentation

Advanced Features

Technical Deep Dives

Quick Reference Guides

🤝 Contributing

We welcome contributions! Please see our contributing guidelines for details.

Development Setup

  1. Fork and Clone

    git clone <your-fork-url>
    cd job-prospect-automation
  2. Install Development Dependencies

    pip install -r requirements.txt
    pip install -e .
  3. Run Tests

    pytest tests/
  4. Code Quality

    black .
    flake8 .
    mypy .
    
    # Import analysis and optimization
    python utils/import_analyzer.py
    python utils/validate_imports.py

👨‍💻 Author

Created by Minhal Abdul Sami

ProspectAI - Revolutionizing job prospecting with intelligent automation

Connect on LinkedIn to stay updated on the latest features and improvements!

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • ProductHunt for providing the data source
  • Hunter.io for email discovery services
  • Notion for database and organization capabilities
  • AI Provider for intelligent email generation
  • All contributors and users of this project

Need Help? Check the Troubleshooting section or open an issue on GitHub.
