- Overview
- Features
- Screenshots
- Tech Stack
- Prerequisites
- Local Installation
- Environment Variables
- Database Setup
- Running the Application
- Project Structure
- Contributors
- Important Notes
- Disclaimer
- License
## Overview

ScrapeBun is a powerful, visual workflow automation platform designed for building complex web scraping pipelines. Create drag-and-drop workflows with AI-powered data extraction, automatic CAPTCHA detection, conditional logic, and scheduled execution - all through an intuitive interface.
- Visual Workflow Builder: Design complex scraping workflows with a node-based editor
- AI-Powered Extraction: Leverage OpenAI and Google's Gemini AI for intelligent data extraction
- Smart CAPTCHA Handling: Automatic detection and user-guided resolution
- Scheduled Execution: Cron-based scheduling with timezone support
- Credit-Based System: Integrated Stripe billing for scalable usage
- Real-time Monitoring: Track execution logs, status, and resource consumption
## Features

- Visual Workflow Editor - Drag-and-drop node-based workflow creation
- AI Data Extraction - Extract structured data using natural language prompts
- Advanced Control Flow - Support for loops, conditions, merge, and parallel execution
- Snapshot-Based Scraping - Reuse cached HTML for efficient workflows
- Cron Scheduling - Schedule workflows with flexible timing options
- Credential Management - Secure storage of API keys and authentication tokens
- Stripe Integration - Credit-based billing system with multiple pricing tiers
- Analytics Dashboard - Monitor workflow performance and credit usage
- Execution History - Detailed logs and debugging information
| Category | Nodes |
|---|---|
| Data Collection | Navigate URL, Page to HTML, Extract with AI |
| Control Flow | Condition, Loop, Wait for User Input, Merge |
| Data Processing | Extract Text, Read Property, Add Property, Scroll to Element |
| Interaction | Click Element, Fill Input, Deliver via Webhook |
| AI Integration | OpenAI & Gemini AI extraction |
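The nodes above chain into directed graphs. Purely as an illustration of the idea (the real workflow shape is defined by the Zod schemas in `schema/` and will differ in its details), a four-node pipeline might look like:

```ts
// Illustrative only — node type names and input fields here are hypothetical.
const exampleWorkflow = {
  nodes: [
    { id: "n1", type: "NAVIGATE_URL", inputs: { url: "https://example.com/products" } },
    { id: "n2", type: "PAGE_TO_HTML", inputs: {} },
    { id: "n3", type: "EXTRACT_WITH_AI", inputs: { prompt: "Return each product's name and price as JSON" } },
    { id: "n4", type: "DELIVER_VIA_WEBHOOK", inputs: { targetUrl: "https://hooks.example.com/ingest" } },
  ],
  // Edges define which node's output feeds which node's input
  edges: [
    { source: "n1", target: "n2" },
    { source: "n2", target: "n3" },
    { source: "n3", target: "n4" },
  ],
};
```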
## Screenshots

Click to view application screenshots
## Tech Stack

- Framework: Next.js 14 (App Router)
- Language: TypeScript 5.6
- Styling: Tailwind CSS + shadcn/ui
- State Management: React Query (TanStack)
- Workflow Editor: React Flow (@xyflow/react)
- Animations: Framer Motion
- Icons: Lucide React + Tabler Icons
- Runtime: Node.js 18+
- Database: PostgreSQL with Prisma ORM
- Authentication: Clerk
- Web Scraping: Puppeteer + Chromium
- HTML Parsing: Cheerio
- AI Integration: OpenAI (GPT models) & Google Gemini
- Payment Processing: Stripe
- Scheduling: cron-parser + cronstrue (see the sketch after this list)
- Version Control: Git
- Package Manager: npm
- Linting: ESLint
- Deployment: Vercel (optimized)
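As referenced in the scheduling entry above, ScrapeBun pairs cron-parser (computing run times) with cronstrue (human-readable descriptions). A minimal sketch, assuming cron-parser v4's `parseExpression` API:

```ts
import parser from "cron-parser";
import cronstrue from "cronstrue";

const expression = "0 9 * * 1-5"; // every weekday at 09:00

// cronstrue renders a description suitable for the UI
console.log(cronstrue.toString(expression)); // "At 09:00 AM, Monday through Friday"

// cron-parser computes concrete run times, with timezone support
const interval = parser.parseExpression(expression, { tz: "America/New_York" });
console.log(interval.next().toDate()); // next scheduled run as a Date
```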
## Prerequisites

Before you begin, ensure you have the following installed or set up:
- Node.js: v18.0.0 or higher (Download)
- npm: v9.0.0 or higher (comes with Node.js)
- PostgreSQL: v14.0 or higher (Download)
- Git: Latest version (Download)
- A Clerk account for authentication (Sign up)
- An OpenAI API key (Get one)
- A Google Gemini API key (Get one)
- A Stripe account for payments (optional, for billing features)
## Local Installation

```bash
git clone https://github.com/10Pratik01/ScrapeBun.git
cd ScrapeBun
npm install
```

This will automatically:
- Install all Node.js dependencies
- Generate Prisma client
- Download Chromium for Puppeteer
## Environment Variables

Create a `.env` file in the project root:

```bash
touch .env
```

Copy and paste the following into your `.env` file and replace the placeholder values:

```env
# ============================================
# DATABASE
# ============================================
DATABASE_URL="postgresql://username:password@localhost:5432/scrapebun?schema=public"
# ============================================
# CLERK AUTHENTICATION
# ============================================
# Get these from: https://dashboard.clerk.com/
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY="pk_test_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CLERK_SECRET_KEY="sk_test_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Clerk redirect URLs
NEXT_PUBLIC_CLERK_SIGN_IN_URL="/sign-in"
NEXT_PUBLIC_CLERK_SIGN_UP_URL="/sign-up"
# ============================================
# AI PROVIDERS
# ============================================
# OpenAI API Key (https://platform.openai.com/api-keys)
OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Google Gemini API Key (https://ai.google.dev/)
GEMINI_API_KEY="AIzaSyxxxxxxxxxxxxxxxxxxxxxxxxx"
# ============================================
# STRIPE PAYMENT (Optional)
# ============================================
# Get these from: https://dashboard.stripe.com/apikeys
STRIPE_SECRET_KEY="sk_test_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY="pk_test_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Stripe Price IDs (create products in Stripe Dashboard)
STRIPE_SMALL_PACK_PRICE_ID="price_xxxxxxxxxxxxxxxxxxxxx"
STRIPE_MEDIUM_PACK_PRICE_ID="price_xxxxxxxxxxxxxxxxxxxxx"
STRIPE_LARGE_PACK_PRICE_ID="price_xxxxxxxxxxxxxxxxxxxxx"
# ============================================
# APPLICATION SETTINGS
# ============================================
# Your application URL (use http://localhost:3000 for local development)
NEXT_PUBLIC_APP_URL="http://localhost:3000"
APP_URL="http://localhost:3000"
# Encryption key for credentials (generate a random 32-character string)
ENCRYPTION_KEY="your-32-character-encryption-key-here-change-this"
# API Secret for webhook authentication (generate a random string)
API_SECRET="your-api-secret-key-for-webhooks-change-this"
# ============================================
# CHROME/PUPPETEER (Optional)
# ============================================
# Custom Chrome path (auto-detected in most cases)
# CHROME_PATH="/usr/bin/google-chrome"
# Node environment
NODE_ENV="development"
```

Click to see detailed explanations
- `DATABASE_URL`: PostgreSQL connection string
  - Format: `postgresql://[user]:[password]@[host]:[port]/[database]?schema=public`
  - Example: `postgresql://postgres:mysecretpassword@localhost:5432/scrapebun?schema=public`
- `NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY`: Public key for Clerk (safe to expose)
- `CLERK_SECRET_KEY`: Secret key for Clerk (keep private)
  - Get both from your Clerk Dashboard
- `OPENAI_API_KEY`: Required for AI-powered data extraction with GPT models
- `GEMINI_API_KEY`: Required for AI-powered data extraction with Gemini models
- Stripe keys (optional):
  - Only required if you want to enable the billing/credits system
  - Create a Stripe account and get keys from the Dashboard
  - Create three products (Small/Medium/Large credit packs) and use their Price IDs
- `ENCRYPTION_KEY`: Used to encrypt stored credentials - MUST be 32 characters
  - Generate: `openssl rand -hex 16`
- `API_SECRET`: Used for webhook authentication
  - Generate: `openssl rand -hex 32`
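The 32-character requirement exists because AES-256 needs a 32-byte key. Below is a minimal sketch of how such a key can be used with Node's built-in crypto module - the actual logic lives in `lib/credential.ts` and may differ:

```ts
// Sketch only — ScrapeBun's real implementation is in lib/credential.ts.
import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

const key = Buffer.from(process.env.ENCRYPTION_KEY!, "utf-8"); // must be exactly 32 bytes

export function encrypt(plaintext: string): string {
  const iv = randomBytes(12); // standard GCM nonce size
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext, "utf-8"), cipher.final()]);
  // Persist the iv and auth tag alongside the ciphertext; both are needed to decrypt
  return [iv, cipher.getAuthTag(), data].map((b) => b.toString("hex")).join(":");
}

export function decrypt(payload: string): string {
  const [iv, tag, data] = payload.split(":").map((h) => Buffer.from(h, "hex"));
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(data), decipher.final()]).toString("utf-8");
}
```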
## Database Setup

```bash
# Using psql
psql -U postgres
CREATE DATABASE scrapebun;
\q
```

Or use a GUI tool like pgAdmin or TablePlus.

```bash
# Generate the Prisma client
npx prisma generate

# Create database tables
npx prisma db push
```

If you want to add initial credits to a user:

```bash
# Edit add-credits.sql with your Clerk user ID
# Then run:
psql -U postgres -d scrapebun -f add-credits.sql
```

```bash
# Open Prisma Studio to view/edit data
npx prisma studio
```

This opens a web interface at http://localhost:5555 where you can browse your database.
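If you prefer to seed credits from code rather than via add-credits.sql, something like the following works with the Prisma client. The model and field names below are hypothetical - check `prisma/schema.prisma` for the real ones:

```ts
// Hypothetical model/field names — consult prisma/schema.prisma for the actual schema.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function main() {
  await prisma.userBalance.upsert({
    where: { userId: "user_xxxxxxxxxxxx" }, // your Clerk user ID
    update: { credits: { increment: 1000 } },
    create: { userId: "user_xxxxxxxxxxxx", credits: 1000 },
  });
}

main().finally(() => prisma.$disconnect());
```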
## Running the Application

```bash
npm run dev
```

Open http://localhost:3000 in your browser.
```bash
# Build the application
npm run build

# Start the production server
npm start
```

ScrapeBun uses Puppeteer for web scraping, which requires Chrome/Chromium:
```bash
# List installed browsers
npx puppeteer browsers list

# If Chrome is missing, install it
npx puppeteer browsers install chrome
```

For production deployment issues, see DEPLOYMENT.md.
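Relatedly, the optional CHROME_PATH variable from the `.env` maps onto Puppeteer's `executablePath` launch option. A minimal sketch using standard Puppeteer APIs:

```ts
import puppeteer from "puppeteer";

async function main() {
  const browser = await puppeteer.launch({
    // When CHROME_PATH is unset, Puppeteer falls back to the browser it downloaded at install time
    executablePath: process.env.CHROME_PATH,
    headless: true,
  });
  const page = await browser.newPage();
  await page.goto("https://example.com");
  console.log(await page.title());
  await browser.close();
}

main();
```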
## Project Structure

```
ScrapeBun/
├── app/                       # Next.js app directory
│   ├── (auth)/                # Authentication pages (sign-in/sign-up)
│   ├── (dashboard)/           # Dashboard pages (home, workflows, credentials, billing)
│   ├── api/                   # API routes
│   │   └── workflows/         # Workflow execution & cron endpoints
│   ├── workflow/              # Workflow editor
│   └── layout.tsx             # Root layout with Clerk provider
│
├── actions/                   # Server actions
│   ├── workflows.ts           # Workflow CRUD operations
│   ├── runWorkflow.ts         # Workflow execution
│   ├── credentials.ts         # Credential management
│   ├── billings.ts            # Stripe billing
│   └── analytics.ts           # Usage analytics
│
├── components/                # React components
│   ├── ui/                    # shadcn/ui components
│   └── [feature-components]/  # Feature-specific components
│
├── lib/                       # Core library code
│   ├── workflow/              # Workflow engine
│   │   ├── engine/            # Execution engine (V2)
│   │   │   ├── executors/     # Node type executors
│   │   │   └── registry.ts    # Executor registry
│   │   ├── executor/          # Legacy executors (V1)
│   │   └── task/              # Task definitions
│   ├── prisma.ts              # Prisma client
│   ├── billing.ts             # Stripe integration
│   ├── credential.ts          # Encryption utilities
│   └── helper.ts              # Utility functions
│
├── prisma/                    # Database schema & migrations
│   └── schema.prisma          # Prisma schema
│
├── public/                    # Static assets
│   ├── website/               # Screenshot images
│   └── logo.png               # Application logo
│
├── hooks/                     # Custom React hooks
├── schema/                    # Zod validation schemas
├── .env                       # Environment variables (DO NOT COMMIT)
├── package.json               # Dependencies
├── tsconfig.json              # TypeScript config
├── tailwind.config.ts         # Tailwind CSS config
└── next.config.mjs            # Next.js config
```
- `app/`: Next.js 14 App Router structure with route groups
- `actions/`: Server-side functions called from client components
- `lib/workflow/engine/`: Core workflow execution engine with node executors
- `components/`: Reusable UI components built with shadcn/ui
- `prisma/`: Database schema and ORM configuration
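To illustrate how `lib/workflow/engine/executors/` and `registry.ts` fit together: each node type gets an executor function that the engine looks up by type at run time. The sketch below is hypothetical - the actual interface in the repo will differ:

```ts
// Hypothetical sketch — the real executor interface lives in lib/workflow/engine/.
import * as cheerio from "cheerio";

type ExecutionEnvironment = {
  getInput(name: string): string;
  setOutput(name: string, value: string): void;
};

async function extractTextExecutor(env: ExecutionEnvironment): Promise<boolean> {
  const $ = cheerio.load(env.getInput("Html"));
  const text = $(env.getInput("Selector")).text().trim();
  if (!text) return false; // fail the node when nothing matches
  env.setOutput("Extracted text", text);
  return true;
}

// The registry maps node types to their executors, roughly like so:
const registry: Record<string, (env: ExecutionEnvironment) => Promise<boolean>> = {
  EXTRACT_TEXT_FROM_ELEMENT: extractTextExecutor,
};
```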
## Contributors

- Pratik Patil
- Viraj Pawar
- Om Thanage
Contributions are welcome! Here's how you can help:

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Commit your changes: `git commit -m 'Add amazing feature'`
4. Push to the branch: `git push origin feature/amazing-feature`
5. Open a Pull Request
Please ensure:
- Code follows the existing style (TypeScript + ESLint)
- All tests pass (if applicable)
- Commits are descriptive
## Important Notes

Important

- Never commit `.env` - it contains sensitive API keys
- Change default encryption keys - always use unique, random keys in production
- Rotate API secrets regularly - especially for production deployments
- Use environment-specific keys - separate keys for dev/staging/production
Tip
- Database: Use connection pooling for production PostgreSQL
- Chrome: Increase memory limits for Vercel/serverless (3GB minimum)
- Caching: Enable Redis for caching scraped pages (future enhancement)
- Rate Limiting: Implement rate limiting for API routes
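On the rate-limiting tip: a minimal fixed-window sketch for a single-instance deployment - production setups on Vercel would want a shared store such as Redis instead of in-process memory:

```ts
// In-memory fixed-window limiter — only suitable for a single server instance.
const hits = new Map<string, { count: number; windowStart: number }>();
const WINDOW_MS = 60_000; // 1-minute window
const MAX_REQUESTS = 30;

export function isRateLimited(clientId: string): boolean {
  const now = Date.now();
  const entry = hits.get(clientId);
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(clientId, { count: 1, windowStart: now });
    return false;
  }
  entry.count += 1;
  return entry.count > MAX_REQUESTS;
}
```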
Warning
- CAPTCHAs: Automatic detection only; manual solving required
- JavaScript-Heavy Sites: May require custom wait conditions
- Large Datasets: Paginated scraping recommended for 1000+ items
- Serverless Timeouts: Vercel has 300s max execution time
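On custom wait conditions for JavaScript-heavy sites: Puppeteer's built-in helpers usually suffice. A sketch with standard APIs (the selector is a placeholder):

```ts
import puppeteer from "puppeteer";

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wait for network activity to settle before treating the page as loaded
  await page.goto("https://example.com/app", { waitUntil: "networkidle2" });

  // Wait for an element rendered by client-side JavaScript
  await page.waitForSelector(".product-card", { timeout: 30_000 });

  // Or wait for an arbitrary condition evaluated inside the page
  await page.waitForFunction(
    () => document.querySelectorAll(".product-card").length >= 10
  );

  await browser.close();
}

scrape();
```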
Caution
- OpenAI API: Can be expensive for large-scale extraction
- Stripe: Test mode recommended for development
- Puppeteer: Memory-intensive; monitor serverless costs
- Database: Use managed PostgreSQL for auto-scaling
## Disclaimer

Caution

This project is provided strictly for EDUCATIONAL PURPOSES ONLY.
- ✅ Learn: Use to understand web scraping, workflow automation, and AI integration
- ✅ Experiment: Build personal projects and proof-of-concepts
- ✅ Study: Analyze the codebase for educational research
- ❌ Commercial Use: Do NOT use for unauthorized commercial scraping
- ❌ TOS Violations: Always respect website Terms of Service
- ❌ Data Theft: Never extract data without permission
- ❌ Rate Abuse: Do NOT overwhelm servers with requests
- Check robots.txt: Always respect website crawling policies
- Rate Limiting: Add delays between requests (use WAIT nodes)
- User-Agent: Identify your scraper properly
- Permission: Get explicit permission for commercial data collection
- Privacy: Handle personal data according to GDPR/CCPA regulations
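Two of these guidelines translate directly into code: identifying your scraper and pacing requests. A sketch with standard Puppeteer calls (the bot-info URL is a placeholder):

```ts
import puppeteer from "puppeteer";

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeScrape(urls: string[]) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Identify the scraper honestly instead of impersonating a regular browser
  await page.setUserAgent("ScrapeBunBot/1.0 (+https://example.com/bot-info)");

  for (const url of urls) {
    await page.goto(url, { waitUntil: "domcontentloaded" });
    // ... extract data here ...
    await sleep(2_000); // pause between requests so the target server isn't hammered
  }

  await browser.close();
}
```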
By using ScrapeBun, you agree:
- To use it responsibly and ethically
- To comply with all applicable laws and regulations
- That the developers are NOT liable for misuse
- To respect the intellectual property of scraped websites
Remember: Just because you can scrape something doesn't mean you should.
## License

This project is released under the MIT License for educational purposes.
```
MIT License

Copyright (c) 2024 Pratik Patil

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software for EDUCATIONAL PURPOSES ONLY...
```
See LICENSE file for full details.
- Documentation: DEPLOYMENT.md - Production deployment guide
- GitHub Repo: 10Pratik01/ScrapeBun
- Report Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with amazing open-source tools:
- Next.js - React framework
- Prisma - Database ORM
- Clerk - Authentication
- shadcn/ui - UI components
- Puppeteer - Web scraping
- OpenAI & Google - AI integration
Made with ❤️ for the developer community
If this project helped you learn something new, please consider giving it a ⭐ on GitHub!












