Skip to content

DragonDiffusionbyBoyo/DragonSnorting

Repository files navigation

🐉 Dragon Image Scraper

An ethical, high-performance Node.js-based image scraper designed to responsibly download high-resolution images from Google Image searches. This project is optimized for creating clean, high-quality datasets for AI training and research.

Features

  • Production-Ready: This scraper has been battle-tested and is ready for immediate deployment for real-world tasks like creating LoRA datasets.
  • Intelligent Scraping: The scraper uses advanced techniques to bypass common pitfalls and extract full-size, high-quality images directly from their source websites.
  • Configurable: Customize your scraping experience through an interactive command-line interface and the dragon_config.json file.
  • Robust & Stable: Includes comprehensive error handling, logging, and anti-detection measures like user-agent rotation to ensure a stable and persistent scraping experience.
  • Ethical by Design: Built with a focus on respectful and responsible scraping, with clear documentation on how to maintain ethical practices, such as the quality vs. content filtering trade-off when using SafeSearch.

Project Status

The project is considered Mission Accomplished! and is in a production-ready state. The scraper has successfully evolved from a simple thumbnail downloader to a "full-resolution treasure hunter" capable of capturing high-quality images with a proven 94% success rate.

The next phase of development involves integrating Vision-Language Models (VLMs) to enable automated quality assessment and captioning, but this functionality is not included in the current release.


Getting Started

These instructions will get a copy of the project up and running on your local machine.

Prerequisites

You will need to have Node.js and npm installed on your system.

Installation

  1. Clone the repository:
    git clone https://github.com/YourUsername/dragon-image-scraper.git
    cd dragon-image-scraper
  2. Install the dependencies:
    npm install
  3. Set up your environment variables: If your scraper uses any environment variables, you should create a .env file from a provided example (.env.example if you have one). Do not commit sensitive information to GitHub.

Configuration

The dragon-launcher.js script will guide you through an interactive setup. However, the core configuration is managed by the dragon_config.json file.

  • "version" and "author" are for metadata.
  • "lastRun" tracks the last time the scraper was executed.

This is the current minimal configuration. Future versions may expose more options here.


Usage

To begin a hunt, simply run the main launcher from your terminal:

node dragon-launcher.js

The scraper offers both a Quick Hunt mode for immediate deployment with optimized defaults and an Advanced Hunt mode for precise control over all parameters.

The scraper will save all successfully validated images to the dragon_downloads directory, organized by search term.


Proven Performance

Based on recent tests, the scraper demonstrates strong performance:

  • Capture Rate: 94% success rate for capturing images.
  • Speed: Processes over 15 images per minute.
  • Resolution: Achieves a 100% rate of real, full-size image extraction, avoiding low-quality thumbnails.
  • Quality: Successfully filters for high-quality images, with best captures ranging from 800x800 to 960x604 pixels.

Core Discoveries

During development, several key findings were made to improve the scraper's performance and ethical compliance:

  • SafeSearch Impact: Disabling SafeSearch significantly improves image resolution and quality, providing access to professional and commercial content. This highlights a clear quality-versus-content-filtering trade-off for users.
  • Persistent Hunting Logic: The scraper will continue to hunt for images until it reaches its target quota, only counting successfully validated images towards the final total. This ensures a more reliable and consistent output.

How It Works

This scraper is built on a modular architecture:

  • Core Scraping (google-images-scraper.js and enhanced-google-images-scraper.js): These modules handle the actual navigation and parsing of Google Images to locate and extract the URLs of the full-size images.
  • Launcher (dragon-launcher.js): This is the main interface that ties everything together, providing the user with an interactive experience and orchestrating the entire scraping process based on user input and configuration.

License

MIT


Acknowledgments

  • Developed with the assistance of a local Large Language Model.

  • This project would not be possible without the open-source Node.js community and related libraries.

Releases

No releases published

Packages

No packages published