
ScrapeBun Logo

🐰 ScrapeBun

Advanced Web Scraping Workflow Automation Platform

Next.js TypeScript Prisma License




🌟 Overview

ScrapeBun is a powerful, visual workflow automation platform designed for building complex web scraping pipelines. Create drag-and-drop workflows with AI-powered data extraction, automatic CAPTCHA detection, conditional logic, and scheduled execution - all through an intuitive interface.

Why ScrapeBun?

  • Visual Workflow Builder: Design complex scraping workflows with a node-based editor
  • AI-Powered Extraction: Leverage OpenAI and Google's Gemini AI for intelligent data extraction
  • Smart CAPTCHA Handling: Automatic detection and user-guided resolution
  • Scheduled Execution: Cron-based scheduling with timezone support
  • Credit-Based System: Integrated Stripe billing for scalable usage
  • Real-time Monitoring: Track execution logs, status, and resource consumption

✨ Features

Core Capabilities

  • 🎨 Visual Workflow Editor - Drag-and-drop node-based workflow creation
  • 🤖 AI Data Extraction - Extract structured data using natural language prompts
  • 🔄 Advanced Control Flow - Support for loops, conditions, merge, and parallel execution
  • 📸 Snapshot-Based Scraping - Reuse cached HTML for efficient workflows
  • 🕐 Cron Scheduling - Schedule workflows with flexible timing options
  • 🔐 Credential Management - Secure storage of API keys and authentication tokens
  • 💳 Stripe Integration - Credit-based billing system with multiple pricing tiers
  • 📊 Analytics Dashboard - Monitor workflow performance and credit usage
  • 🎯 Execution History - Detailed logs and debugging information

Supported Node Types

| Category        | Nodes                                                        |
|-----------------|--------------------------------------------------------------|
| Data Collection | Navigate URL, Page to HTML, Extract with AI                  |
| Control Flow    | Condition, Loop, Wait for User Input, Merge                  |
| Data Processing | Extract Text, Read Property, Add Property, Scroll to Element |
| Interaction     | Click Element, Fill Input, Deliver via Webhook               |
| AI Integration  | OpenAI & Gemini AI extraction                                |
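These node types compose into a directed workflow graph. As an illustration only (the repository's actual schema may differ), a minimal workflow shape and a topological ordering of its nodes could be sketched like this:

```typescript
interface WorkflowNode {
  id: string;
  type: string; // e.g. "NAVIGATE_URL", "PAGE_TO_HTML", "EXTRACT_TEXT"
  inputs: Record<string, string>;
}

interface WorkflowEdge {
  source: string; // upstream node id
  target: string; // downstream node id
}

// A three-node pipeline: navigate, snapshot the page, extract text.
const workflow: { nodes: WorkflowNode[]; edges: WorkflowEdge[] } = {
  nodes: [
    { id: "n1", type: "NAVIGATE_URL", inputs: { url: "https://example.com" } },
    { id: "n2", type: "PAGE_TO_HTML", inputs: {} },
    { id: "n3", type: "EXTRACT_TEXT", inputs: { selector: "h1" } },
  ],
  edges: [
    { source: "n1", target: "n2" },
    { source: "n2", target: "n3" },
  ],
};

// Kahn's algorithm: run nodes whose upstream dependencies are all done.
function executionOrder(nodes: WorkflowNode[], edges: WorkflowEdge[]): string[] {
  const indegree = new Map<string, number>();
  for (const n of nodes) indegree.set(n.id, 0);
  for (const e of edges) indegree.set(e.target, (indegree.get(e.target) ?? 0) + 1);

  const queue = nodes.filter((n) => indegree.get(n.id) === 0).map((n) => n.id);
  const order: string[] = [];
  while (queue.length > 0) {
    const id = queue.shift() as string;
    order.push(id);
    for (const e of edges) {
      if (e.source !== id) continue;
      const remaining = (indegree.get(e.target) ?? 0) - 1;
      indegree.set(e.target, remaining);
      if (remaining === 0) queue.push(e.target);
    }
  }
  return order;
}
```

Branching nodes (Condition, Loop, Merge) fit the same graph model; they simply route execution along different outgoing edges.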

📸 Screenshots

Click to view application screenshots

Home Page

Workflow Dashboard

Workflow Editor

Execution Monitor

Scheduling Dialog

Optimization Features

Credentials Management

Credits & Billing


🛠 Tech Stack

Frontend

  • Framework: Next.js (App Router)
  • Language: TypeScript
  • Styling: Tailwind CSS
  • UI Components: shadcn/ui

Backend

  • Runtime: Next.js server actions & API routes
  • Database: PostgreSQL with Prisma ORM
  • Authentication: Clerk
  • Browser Automation: Puppeteer
  • Payments: Stripe
  • Validation: Zod

DevOps & Tooling

  • Version Control: Git
  • Package Manager: npm
  • Linting: ESLint
  • Deployment: Vercel (optimized)

📋 Prerequisites

Before you begin, ensure you have the following installed:

  • Node.js: v18.0.0 or higher (Download)
  • npm: v9.0.0 or higher (comes with Node.js)
  • PostgreSQL: v14.0 or higher (Download)
  • Git: Latest version (Download)

Additional Requirements

  • A Clerk account for authentication (Sign up)
  • An OpenAI API key (Get one)
  • A Google Gemini API key (Get one)
  • A Stripe account for payments (optional, for billing features)

🚀 Local Installation

Step 1: Clone the Repository

git clone https://github.com/10Pratik01/ScrapeBun.git
cd ScrapeBun

Step 2: Install Dependencies

npm install

This will automatically:

  • Install all Node.js dependencies
  • Generate Prisma client
  • Download Chromium for Puppeteer

🔐 Environment Variables

Step 1: Create .env File

Create a .env file in the project root:

touch .env

Step 2: Configure Environment Variables

Copy and paste the following into your .env file and replace the placeholder values:

# ============================================
# DATABASE
# ============================================
DATABASE_URL="postgresql://username:password@localhost:5432/scrapebun?schema=public"

# ============================================
# CLERK AUTHENTICATION
# ============================================
# Get these from: https://dashboard.clerk.com/
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY="pk_test_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CLERK_SECRET_KEY="sk_test_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Clerk redirect URLs
NEXT_PUBLIC_CLERK_SIGN_IN_URL="/sign-in"
NEXT_PUBLIC_CLERK_SIGN_UP_URL="/sign-up"

# ============================================
# AI PROVIDERS
# ============================================
# OpenAI API Key (https://platform.openai.com/api-keys)
OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Google Gemini API Key (https://ai.google.dev/)
GEMINI_API_KEY="AIzaSyxxxxxxxxxxxxxxxxxxxxxxxxx"

# ============================================
# STRIPE PAYMENT (Optional)
# ============================================
# Get these from: https://dashboard.stripe.com/apikeys
STRIPE_SECRET_KEY="sk_test_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY="pk_test_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Stripe Price IDs (create products in Stripe Dashboard)
STRIPE_SMALL_PACK_PRICE_ID="price_xxxxxxxxxxxxxxxxxxxxx"
STRIPE_MEDIUM_PACK_PRICE_ID="price_xxxxxxxxxxxxxxxxxxxxx"
STRIPE_LARGE_PACK_PRICE_ID="price_xxxxxxxxxxxxxxxxxxxxx"

# ============================================
# APPLICATION SETTINGS
# ============================================
# Your application URL (use http://localhost:3000 for local development)
NEXT_PUBLIC_APP_URL="http://localhost:3000"
APP_URL="http://localhost:3000"

# Encryption key for credentials (generate a random 32-character string)
ENCRYPTION_KEY="your-32-character-encryption-key-here-change-this"

# API Secret for webhook authentication (generate a random string)
API_SECRET="your-api-secret-key-for-webhooks-change-this"

# ============================================
# CHROME/PUPPETEER (Optional)
# ============================================
# Custom Chrome path (auto-detected in most cases)
# CHROME_PATH="/usr/bin/google-chrome"

# Node environment
NODE_ENV="development"

Environment Variable Details

🔍 Click to see detailed explanations

Database

  • DATABASE_URL: PostgreSQL connection string
    • Format: postgresql://[user]:[password]@[host]:[port]/[database]?schema=public
    • Example: postgresql://postgres:mysecretpassword@localhost:5432/scrapebun?schema=public

Clerk Authentication

  • NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY: Public key for Clerk (safe to expose)
  • CLERK_SECRET_KEY: Secret key for Clerk (keep private)
  • Get both from your Clerk Dashboard

AI Providers

  • OPENAI_API_KEY: Required for AI-powered data extraction with GPT models
  • GEMINI_API_KEY: Required for AI-powered data extraction with Gemini models

Stripe (Optional)

  • Only required if you want to enable the billing/credits system
  • Create a Stripe account and get keys from the Dashboard
  • Create three products (Small/Medium/Large credit packs) and use their Price IDs

Security Keys

  • ENCRYPTION_KEY: Used to encrypt stored credentials - MUST be 32 characters
    • Generate: openssl rand -hex 16
  • API_SECRET: Used for webhook authentication
    • Generate: openssl rand -hex 32
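For context on why ENCRYPTION_KEY must be exactly 32 characters: used as raw bytes, a 32-character string is the 256-bit key that AES-256-GCM expects. The sketch below is illustrative only — the function names are hypothetical and may not match what lib/credential.ts actually does:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const ALGO = "aes-256-gcm";

// Hypothetical helpers: the 32-character key is used directly as the
// 32-byte AES-256 key, which is why its length is enforced.
function encryptCredential(plaintext: string, key: string): string {
  if (key.length !== 32) throw new Error("ENCRYPTION_KEY must be 32 characters");
  const iv = randomBytes(12); // GCM-recommended 96-bit IV, unique per message
  const cipher = createCipheriv(ALGO, Buffer.from(key, "utf8"), iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  // Pack iv + auth tag + ciphertext so decryption needs only the key.
  return [iv, cipher.getAuthTag(), ciphertext].map((b) => b.toString("hex")).join(":");
}

function decryptCredential(payload: string, key: string): string {
  const [iv, tag, ciphertext] = payload.split(":").map((h) => Buffer.from(h, "hex"));
  const decipher = createDecipheriv(ALGO, Buffer.from(key, "utf8"), iv);
  decipher.setAuthTag(tag); // GCM verifies integrity as well as confidentiality
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

GCM is a reasonable choice here because a tampered ciphertext fails authentication at decrypt time instead of silently producing garbage.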

🗄 Database Setup

Step 1: Create PostgreSQL Database

# Using psql
psql -U postgres
CREATE DATABASE scrapebun;
\q

Or use a GUI tool like pgAdmin or TablePlus.

Step 2: Run Prisma Migrations

# Generate Prisma Client
npx prisma generate

# Create database tables
npx prisma db push

Step 3: (Optional) Seed Initial Data

If you want to add initial credits to a user:

# Edit add-credits.sql with your Clerk user ID
# Then run:
psql -U postgres -d scrapebun -f add-credits.sql

Step 4: View Database (Optional)

# Open Prisma Studio to view/edit data
npx prisma studio

This opens a web interface at http://localhost:5555 to browse your database.


▶️ Running the Application

Development Mode

npm run dev

Open http://localhost:3000 in your browser.

Production Build

# Build the application
npm run build

# Start production server
npm start

Verify Chrome Installation (Important!)

ScrapeBun uses Puppeteer for web scraping, which requires Chrome/Chromium:

# List installed browsers
npx puppeteer browsers list

# If Chrome is missing, install it
npx puppeteer browsers install chrome

For production deployment issues, see DEPLOYMENT.md.


📁 Project Structure

ScrapeBun/
├── app/                         # Next.js app directory
│   ├── (auth)/                  # Authentication pages (sign-in/sign-up)
│   ├── (dashboard)/             # Dashboard pages (home, workflows, credentials, billing)
│   ├── api/                     # API routes
│   │   └── workflows/           # Workflow execution & cron endpoints
│   ├── workflow/                # Workflow editor
│   └── layout.tsx               # Root layout with Clerk provider
│
├── actions/                     # Server actions
│   ├── workflows.ts             # Workflow CRUD operations
│   ├── runWorkflow.ts           # Workflow execution
│   ├── credentials.ts           # Credential management
│   ├── billings.ts              # Stripe billing
│   └── analytics.ts             # Usage analytics
│
├── components/                  # React components
│   ├── ui/                      # shadcn/ui components
│   └── [feature-components]/    # Feature-specific components
│
├── lib/                         # Core library code
│   ├── workflow/                # Workflow engine
│   │   ├── engine/              # Execution engine (V2)
│   │   │   ├── executors/       # Node type executors
│   │   │   └── registry.ts      # Executor registry
│   │   ├── executor/            # Legacy executors (V1)
│   │   └── task/                # Task definitions
│   ├── prisma.ts                # Prisma client
│   ├── billing.ts               # Stripe integration
│   ├── credential.ts            # Encryption utilities
│   └── helper.ts                # Utility functions
│
├── prisma/                      # Database schema & migrations
│   └── schema.prisma            # Prisma schema
│
├── public/                      # Static assets
│   ├── website/                 # Screenshot images
│   └── logo.png                 # Application logo
│
├── hooks/                       # Custom React hooks
├── schema/                      # Zod validation schemas
├── .env                         # Environment variables (DO NOT COMMIT)
├── package.json                 # Dependencies
├── tsconfig.json                # TypeScript config
├── tailwind.config.ts           # Tailwind CSS config
└── next.config.mjs              # Next.js config

Key Directories Explained

  • app/: Next.js 14 App Router structure with route groups
  • actions/: Server-side functions called from client components
  • lib/workflow/engine/: Core workflow execution engine with node executors
  • components/: Reusable UI components built with shadcn/ui
  • prisma/: Database schema and ORM configuration
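The registry.ts file above suggests a dispatch-table design: each node type maps to one executor function, so adding a node type means adding one registry entry rather than touching the engine loop. A self-contained sketch of that pattern (node types and names here are invented for illustration, not taken from the repository):

```typescript
type NodeType = "NAVIGATE_URL" | "EXTRACT_TEXT";

interface Environment {
  inputs: Record<string, string>;
  outputs: Record<string, string>;
}

// An executor runs one node against the shared environment and reports success.
type Executor = (env: Environment) => Promise<boolean>;

const registry: Record<NodeType, Executor> = {
  // A real executor would drive Puppeteer; this stub just records the URL.
  NAVIGATE_URL: async (env) => {
    env.outputs["url"] = env.inputs["url"];
    return true;
  },
  // Crude tag-stripping stand-in for real text extraction.
  EXTRACT_TEXT: async (env) => {
    env.outputs["text"] = (env.inputs["html"] ?? "").replace(/<[^>]*>/g, "");
    return true;
  },
};

async function runNode(type: NodeType, env: Environment): Promise<boolean> {
  const executor = registry[type];
  if (!executor) throw new Error(`No executor registered for ${type}`);
  return executor(env);
}
```

The engine loop then only needs to walk the workflow graph and call `runNode` for each node it reaches.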

👥 Contributors

  • Pratik Patil
  • Viraj Pawar
  • Om Thanage

Want to Contribute?

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit your changes: git commit -m 'Add amazing feature'
  4. Push to the branch: git push origin feature/amazing-feature
  5. Open a Pull Request

Please ensure:

  • Code follows the existing style (TypeScript + ESLint)
  • All tests pass (if applicable)
  • Commits are descriptive

📝 Important Notes

Security Considerations

Important

  • Never commit .env - It contains sensitive API keys
  • Change default encryption keys - Always use unique, random keys in production
  • Rotate API secrets regularly - Especially for production deployments
  • Use environment-specific keys - Separate keys for dev/staging/production

Performance Tips

Tip

  • Database: Use connection pooling for production PostgreSQL
  • Chrome: Increase memory limits for Vercel/serverless (3GB minimum)
  • Caching: Enable Redis for caching scraped pages (future enhancement)
  • Rate Limiting: Implement rate limiting for API routes
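To illustrate the rate-limiting tip, here is a minimal in-memory token-bucket limiter — a sketch only, not part of the repository; a production deployment would typically back this with Redis or API-route middleware instead:

```typescript
// Each caller gets `capacity` tokens; tokens refill continuously at
// `refillPerSec`. A request is allowed only if a token is available.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  tryRemove(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Keeping one bucket per user (or per target host, for polite scraping) bounds burst traffic without blocking steady use.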

Known Limitations

Warning

  • CAPTCHAs: Automatic detection only; manual solving required
  • JavaScript-Heavy Sites: May require custom wait conditions
  • Large Datasets: Paginated scraping recommended for 1000+ items
  • Serverless Timeouts: Vercel has 300s max execution time

Cost Management

Caution

  • OpenAI API: Can be expensive for large-scale extraction
  • Stripe: Test mode recommended for development
  • Puppeteer: Memory-intensive; monitor serverless costs
  • Database: Use managed PostgreSQL for auto-scaling

⚠️ IMPORTANT DISCLAIMER

Caution

This project is provided strictly for EDUCATIONAL PURPOSES ONLY.

Legal & Ethical Notice

  • ✅ Learn: Use to understand web scraping, workflow automation, and AI integration

  • ✅ Experiment: Build personal projects and proof-of-concepts

  • ✅ Study: Analyze the codebase for educational research

  • ❌ Commercial Use: Do NOT use for unauthorized commercial scraping

  • ❌ TOS Violations: Always respect website Terms of Service

  • ❌ Data Theft: Never extract data without permission

  • ❌ Rate Abuse: Do NOT overwhelm servers with requests

Best Practices

  1. Check robots.txt: Always respect website crawling policies
  2. Rate Limiting: Add delays between requests (use WAIT nodes)
  3. User-Agent: Identify your scraper properly
  4. Permission: Get explicit permission for commercial data collection
  5. Privacy: Handle personal data according to GDPR/CCPA regulations
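As a starting point for the robots.txt rule above, here is a deliberately minimal check that honors only simple `User-agent: *` / `Disallow:` groups (real crawlers should use a full robots.txt parser; this is an illustrative sketch, not a compliant implementation):

```typescript
// Returns false if a `Disallow:` rule under `User-agent: *` prefixes `path`.
function isPathAllowed(robotsTxt: string, path: string): boolean {
  let groupApplies = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    if (/^user-agent$/i.test(field)) {
      groupApplies = value === "*";
    } else if (groupApplies && /^disallow$/i.test(field) && value !== "" && path.startsWith(value)) {
      return false;
    }
  }
  return true; // no matching Disallow rule
}
```

Fetch `https://example.com/robots.txt` once, cache the result, and call this check before each Navigate URL step.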

Developer Responsibility

By using ScrapeBun, you agree:

  • To use it responsibly and ethically
  • To comply with all applicable laws and regulations
  • That the developers are NOT liable for misuse
  • To respect the intellectual property of scraped websites

Remember: Just because you can scrape something doesn't mean you should.


📜 License

This project is released under the MIT License for educational purposes.

MIT License

Copyright (c) 2024 Pratik Patil

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software for EDUCATIONAL PURPOSES ONLY...

See LICENSE file for full details.




🙏 Acknowledgments

Built with amazing open-source tools, including Next.js, TypeScript, Prisma, Puppeteer, Tailwind CSS, shadcn/ui, Clerk, and Stripe.


Made with ❤️ for the developer community

If this project helped you learn something new, please consider giving it a ⭐ on GitHub!

⬆ Back to Top

About

Hackathon project
