This Chrome extension is a specialized tool developed as part of a research project focused on creating a highly contextual Question & Answer (Q&A) dataset for Bangladeshi students planning to study in India. The initial research targets four key universities: Sharda University, Noida International University (NIU), Amity University, and Galgotias University.
The primary function of this tool is to extract and preprocess web content into clean .txt files, which will serve as the foundation for generating a Q&A dataset to train or fine-tune Large Language Models (LLMs).
This project is open-source under the MIT License. The repository is available at: https://github.com/codermillat/WebScrape
This extension served as the primary data collection tool for the SetForge research project, whose sophisticated, multi-stage pipeline transforms raw web data into a high-quality, instruction-formatted dataset.
The data gathered using WebScrape was instrumental in generating the Indian University Guidance for Bangladeshi Students dataset, now publicly available on the Hugging Face Hub.
- Extracts visible and dynamically-loaded content from any webpage
- NEW: In-page "Sider" UI for multi-capture sessions and persistent data management
- NEW: Extracts text from embedded and directly-viewed PDF files via `pdf.js`
- Filters out navigation elements, ads, and other boilerplate content
- Preserves meaningful content structure through DOM-ordered extraction
- Remove Duplicates: Eliminates repetitive content using line- and sentence-level analysis
- URL/Email Cleaning: Strips web artifacts and contact information
- Stop Word Removal: Filters common English and web-specific stop words
- Content Categorization: Automatically identifies and structures:
- Institution/Organization information
- Academic programs, courses, and fee structures
- Faculty and staff details (heuristic)
- Student testimonials (heuristic)
- Contact information
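The deduplication and URL/email cleaning steps above can be sketched as plain JavaScript. This is an illustrative sketch only; the function names and exact regexes are assumptions, not the extension's actual `text-processor.js` API.

```javascript
// Line-level deduplication: keep the first occurrence of each line,
// comparing case-insensitively after trimming whitespace.
function removeDuplicateLines(text) {
  const seen = new Set();
  return text
    .split("\n")
    .filter((line) => {
      const key = line.trim().toLowerCase();
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    })
    .join("\n");
}

// Strip URLs and email addresses, then collapse the leftover spaces.
function stripUrlsAndEmails(text) {
  return text
    .replace(/https?:\/\/\S+/g, "")               // web links
    .replace(/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, "")  // email addresses
    .replace(/[ \t]{2,}/g, " ")                   // collapse double spaces
    .trim();
}
```

For example, `removeDuplicateLines("Contact us\ncontact us\nFees")` keeps only the first "Contact us" line, and `stripUrlsAndEmails("Visit https://example.com now")` returns `"Visit now"`.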
- Raw Text: Original extracted content, minimally processed
- Clean Text: Preprocessed and deduplicated for general use
- JSON Format: Machine-readable structured data including metadata and categorized sections
- Full-Page Structured Extract: Human-readable `.txt` with labeled sections (Title, Metadata, Headings, Paragraphs, Lists, Tables, Links, Images)
- Toggleable preprocessing options for fine-grained control
- Real-time format switching in the popup UI
- NEW: Advanced options for excluding boilerplate, including hidden elements, and managing metadata
- Persistent Sessions: Captures are stored locally using IndexedDB for persistence across browser sessions
- Downloads Integration: Seamlessly saves extracted files via the Chrome Downloads API
- Smart Naming: Automatic filename generation based on page title and domain
- Visual Feedback: Success notifications and clear error handling
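"Smart naming" from page title and domain could look roughly like the sketch below. The function name and slug rules are illustrative assumptions; the extension's real logic may differ.

```javascript
// Build a filesystem-safe filename from a page title and domain.
function buildFilename(title, domain, ext = "txt") {
  const slug = (title || "untitled")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")  // replace unsafe characters with dashes
    .replace(/^-+|-+$/g, "")      // trim leading/trailing dashes
    .slice(0, 60);                // keep filenames reasonably short
  const host = domain.replace(/^www\./, "");
  return `${host}_${slug || "untitled"}.${ext}`;
}
```

For example, `buildFilename("Fee Structure 2024!", "www.sharda.ac.in")` yields `"sharda.ac.in_fee-structure-2024.txt"`.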
- Download the extension files
- Open Chrome and go to `chrome://extensions/`
- Enable "Developer mode" (top right toggle)
- Click "Load unpacked" and select the extension folder
- Pin the extension to your toolbar for easy access
- Navigate to any webpage
- Click the extension icon
- Click "Extract Text"
- Copy or download the extracted content
- After extracting text, click the Settings icon in the popup or open the in-page Sider UI.
- Configure processing options:
- ✅ Remove Duplicates
- ✅ Remove URLs/Emails
- ✅ Remove Stop Words
- ✅ Extract Sections
- ✅ Extract Key Phrases
- Select output format:
- Raw Text: Unprocessed content
- Clean Text: Basic cleaning applied
- JSON Format: Structured data export
- Text files: `.txt` format for all text outputs
- JSON files: `.json` format for structured data
- Smart naming: Automatic filename generation based on page content and domain
This extension now includes a powerful in-page "Sider" UI for managing complex data collection tasks.
- Press Ctrl+Shift+E (or Cmd+Shift+E on Mac) to toggle the Sider UI on any webpage.
- Click "Add" to capture the current page's content. Assign a label for easy identification.
- The captured content is added to a persistent list, grouped by domain.
- Select multiple captures from different pages and download them as a single, cleaned `.txt` file.
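Merging selected captures into one download payload could be as simple as the sketch below. The capture shape (`{label, url, text}`) and the separator format are assumptions for illustration, not the extension's actual internals.

```javascript
// Join selected captures into one labeled .txt payload, with a header
// line identifying each capture's label and source URL.
function mergeCaptures(captures) {
  return captures
    .map((c) => `===== ${c.label} (${c.url}) =====\n${c.text.trim()}`)
    .join("\n\n");
}
```

The resulting string can then be handed to the Chrome Downloads API (or a Blob download) as a single file.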
- Persistent Captures: All captured data is saved locally using IndexedDB, so it persists even if you close the tab or browser.
- Session Management: Group captures by domain for organized data collection.
- Bulk Downloading: Select and download multiple captures at once.
- Duplicate Prevention: Automatically ignores duplicate content captures.
- Targeted Cleanup: Clear all captures for a specific site or remove individual items.
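Duplicate prevention can work by fingerprinting normalized capture text and skipping anything already seen. The sketch below uses a simple 32-bit FNV-1a hash as a stand-in; the extension's actual comparison method is not specified here.

```javascript
// Fingerprint normalized text with FNV-1a (32-bit). Whitespace runs and
// case differences are ignored so near-identical captures collide.
function fingerprint(text) {
  const normalized = text.replace(/\s+/g, " ").trim().toLowerCase();
  let hash = 0x811c9dc5;
  for (let i = 0; i < normalized.length; i++) {
    hash ^= normalized.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Returns true if this content was captured before; otherwise records it.
function isDuplicate(text, seen) {
  const fp = fingerprint(text);
  if (seen.has(fp)) return true;
  seen.add(fp);
  return false;
}
```

In the real extension the `seen` set would live in IndexedDB alongside the captures so it survives browser restarts.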
```text
WebScrape/
├── manifest.json       # Extension configuration (MV3)
├── popup.html          # Extension popup interface
├── popup.css           # Styling for popup
├── popup.js            # Main popup logic
├── content.js          # In-page Sider UI and content extraction
├── text-processor.js   # Advanced text preprocessing engine
├── background.js       # Service worker for downloads and PDF extraction
├── options.html        # Options page for extraction settings
├── options.js          # Logic for options page
├── lib/                # Contains pdf.js library
├── icons/              # Extension icons
├── README.md           # This file
└── privacy-policy.md   # Privacy policy
```
The JSON Format provides structured output that can be easily adapted for language model training:
```json
{
  "metadata": {
    "processed_at": "2023-10-27T10:00:00.00Z",
    "stats": {
      "originalLength": 5000,
      "processedLength": 3200,
      "compressionRatio": 0.64,
      "tokenCount": 450,
      "uniqueTokens": 350,
      "vocabularyDiversity": 0.77
    }
  },
  "content": {
    "sections": {
      "title": "University Name",
      "main_content": "Main content overview...",
      "programs": "List of courses...",
      "faculty": "Professor names...",
      "testimonials": "Student feedback...",
      "contact_info": "Address, phone, email...",
      "fee_tables": "Course fees..."
    },
    "key_phrases": ["keyword1", "keyword2"],
    "processed_text": "The full cleaned text..."
  }
}
```

The extension provides detailed analytics:
- Compression Ratio: How much content was cleaned vs original
- Vocabulary Diversity: Percentage of unique words
- Token Count: Number of meaningful words extracted
- Content Categories: Automatically identified sections
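These stats follow directly from the original and processed text, as in the sketch below. The field names mirror the JSON example above, but the tokenization regex and rounding are illustrative assumptions.

```javascript
// Compute the analytics reported in the JSON output's "stats" object.
function computeStats(original, processed) {
  const tokens = processed.toLowerCase().match(/[a-z0-9']+/g) || [];
  const unique = new Set(tokens);
  return {
    originalLength: original.length,
    processedLength: processed.length,
    // Fraction of the original that survived cleaning (e.g. 3200/5000 = 0.64).
    compressionRatio: +(processed.length / original.length).toFixed(2),
    tokenCount: tokens.length,
    uniqueTokens: unique.size,
    // Percentage of tokens that are unique, expressed as a fraction.
    vocabularyDiversity: tokens.length
      ? +(unique.size / tokens.length).toFixed(2)
      : 0,
  };
}
```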
| Option | Description | Default |
|---|---|---|
| Remove Duplicates | Eliminate repetitive sentences and lines | ✅ ON |
| Remove URLs | Strip web links and email addresses | ✅ ON |
| Remove Numbers | Filter out numeric content | ❌ OFF |
| Remove Stop Words | Filter common English and web-specific words | ✅ ON |
| Include Hidden Elements | Include non-visible elements in extraction | ❌ OFF |
| Auto-scroll Page | Scroll to load lazy content before extraction | ❌ OFF |
| Full-Page Structured | Extract full-page content into labeled sections | ❌ OFF |
| Exclude Boilerplate | Skip header/nav/footer/ads when extracting | ❌ OFF |
| Include Metadata | Include meta description and Open Graph tags | ✅ ON |
| Extract Sections | Heuristically categorize content into sections | ✅ ON |
| Extract Key Phrases | Identify important topics using n-grams | ✅ ON |
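N-gram key-phrase extraction, as mentioned for the "Extract Key Phrases" option, can be sketched as a simple bigram frequency count. This is a minimal illustration; the extension's real scoring (n-gram sizes, stop-word handling) may be more involved.

```javascript
// Return the most frequent word bigrams in the text as candidate key phrases.
function topBigrams(text, limit = 5) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  const counts = new Map();
  for (let i = 0; i < words.length - 1; i++) {
    const gram = `${words[i]} ${words[i + 1]}`;
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])   // most frequent first
    .slice(0, limit)
    .map(([gram]) => gram);
}
```

For example, on a page that repeats "fee structure" the phrase `"fee structure"` would rank first.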
| Format | Use Case | File Type |
|---|---|---|
| Raw | Original content preservation | .txt |
| Clean | General text processing | .txt |
| JSON | Data analysis & integration | .json |
For a non-technical guide to the extension's workflow, please see:
For a technical analysis of the codebase, including its current status and architectural observations, please see:
- Extract content from university websites, including course catalogs and fee structures
- Process academic papers and brochures in PDF format
- Generate clean datasets for educational AI models
- Create clean, structured training data from web content
- Generate prompt engineering datasets from curated web text
- Preprocess data for fine-tuning domain-specific language models
- Analyze website content quality
- Extract structured data from unstructured pages
- Monitor content changes over time
- Research competitor content
- Gather industry information
- Build knowledge bases from web sources
- Chrome/Chromium browser
- Basic understanding of Chrome extension development
- Optional: Python 3.7+ for advanced processing
- Make changes to extension files
- Go to `chrome://extensions/`
- Click the reload button for the extension
- Test changes on target websites
Not applicable in this repository. Previous references to Python helper scripts have been removed for accuracy.
- Local Processing: All text processing happens locally in your browser.
- No Data Collection: The extension does not send your data to any external servers.
- Minimal Permissions: Only requests access rights necessary for its core functionality.
- Content Security: Follows modern Chrome extension security best practices (Manifest V3).
See: privacy-policy.md
Extension not working on some pages
- Chrome system pages (`chrome://`) are not accessible
- Some sites block content script injection
- Try refreshing the page and re-extracting
Settings not saving
- Settings are session-based for privacy
- Configure options each time you open the extension
Poor extraction quality
- Adjust processing options in the popup or Sider UI
- Try different output formats
- Some complex, JavaScript-heavy sites may not be fully captured
- Use "Clean Text" format for general use
- Use "JSON Format" for structured data analysis
- Disable computationally intensive options (like key phrase extraction) for faster results
- Persistent settings storage ✅ COMPLETED
- Batch processing multiple pages from a list of URLs
- Custom preprocessing rules (e.g., regex-based filters)
- Export to cloud storage services (e.g., Google Drive)
- Advanced content filtering options (e.g., by word count)
- NEW: In-page annotation and labeling for supervised learning datasets
This project is open source and available under the MIT License.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project would not be possible without the incredible work of the open-source community. We extend our sincere gratitude to:
- The Mozilla `pdf.js` Team: For creating and maintaining the powerful `pdf.js` library, which enables robust text extraction from PDF documents directly in the browser. Their work is fundamental to this extension's ability to process academic documents and brochures.
This extension also incorporates text preprocessing techniques inspired by established research in Natural Language Processing (NLP) and web content extraction best practices.
This tool is dedicated to the goal of making educational information more accessible through technology.