A Python tool for detecting meaningful differences between two web pages, with special support for cleaning Wayback Machine artifacts. It is aimed at developers who are migrating websites or verifying changes.
- Intelligent Diff Engine: Focuses on meaningful changes (content, structure, scripts) while ignoring noise
- Wayback Machine Support: Automatically detects and removes Wayback Machine banners, scripts, and URL rewrites
- Significance Scoring: Categorizes changes as high, medium, or low significance
- Multiple Output Formats: Text, JSON, and unified diff formats
- Visual Comparison: Take screenshots in multiple browsers and generate side-by-side comparison images
- Developer-Focused: Highlights changes that matter for migrations and development
# Clone the repository
git clone https://github.com/sergio/Website-Diff.git
cd Website-Diff
# Create virtual environment (recommended)
python3 -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install basic dependencies
pip install -r requirements.txt
# For visual comparison (optional but recommended)
pip install selenium Pillow webdriver-manager
# Install in development mode
pip install -e .
Compare two URLs:
website-diff https://example.com/page1 https://example.com/page2
The tool automatically detects Wayback Machine URLs and cleans artifacts:
# Compare a Wayback archive with current page
website-diff https://web.archive.org/web/20230101/https://example.com/ https://example.com/
# Compare two Wayback archives
website-diff https://web.archive.org/web/20230101/https://example.com/ \
https://web.archive.org/web/20230201/https://example.com/
Save output to a file:
website-diff url1 url2 -o diff.txt
Output as JSON:
website-diff url1 url2 --format json
Unified diff format:
website-diff url1 url2 --format unified
Generate markdown report (includes visual comparison images):
website-diff url1 url2 --markdown
The markdown report includes:
- Executive summary with change statistics
- Visual comparison screenshots (if --visual is used)
- High/medium/low significance changes
- Site-wide comparison results (if --traverse is used)
- Recommendations based on findings
Reports are saved to ./reports/ by default (configurable with --report-dir).
Take screenshots and compare them visually:
# Enable visual comparison (screenshots)
website-diff url1 url2 --visual
# Auto-detect and use all available browsers (default)
website-diff url1 url2 --visual --markdown
# Compare in specific browsers
website-diff url1 url2 --visual --browsers chrome firefox edge
# Generate markdown report with images
website-diff url1 url2 --visual --markdown
# Custom screenshot directory
website-diff url1 url2 --visual --screenshot-dir ./my-screenshots
# Custom viewport size
website-diff url1 url2 --visual --viewport-width 1280 --viewport-height 720
# Run browser in visible mode (for debugging)
website-diff url1 url2 --visual --no-headless
Visual comparison generates:
- Screenshots of both pages in each browser
- Side-by-side comparison images
- Difference highlighting (red pixels show differences)
- Markdown report with embedded image references (when using --markdown)
# Don't clean Wayback Machine artifacts
website-diff url1 url2 --no-clean-wayback
# Don't ignore whitespace differences
website-diff url1 url2 --no-ignore-whitespace
# Set custom timeout
website-diff url1 url2 --timeout 60
# Verbose output
website-diff url1 url2 --verbose
When a Wayback Machine URL is detected, the tool automatically:
- Removes Header Artifacts: Strips analytics scripts, playback scripts, and banner CSS
- Removes Footer Comments: Removes archival metadata and copyright notices
- Restores URLs: Converts Wayback-prefixed URLs back to original URLs
- Normalizes Content: Handles whitespace and formatting differences
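The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function names (`is_wayback_url`, `strip_toolbar`, `restore_urls`, `clean_wayback`) are hypothetical, and the sketch assumes that the injected banner is wrapped in `BEGIN/END WAYBACK TOOLBAR INSERT` comments and that rewritten links carry a `/web/<timestamp>/` prefix.

```python
import re
from urllib.parse import urlparse

# Assumption: the injected banner sits between these two HTML comments.
TOOLBAR = re.compile(
    r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->",
    re.DOTALL,
)

# Assumption: rewritten links look like
#   https://web.archive.org/web/20230101/https://example.com/
#   /web/20230101000000im_/https://example.com/logo.png
WAYBACK_PREFIX = re.compile(
    r"(?:https?:)?(?://web\.archive\.org)?/web/\d{4,14}(?:[a-z]{2}_)?/(?=https?://)"
)


def is_wayback_url(url: str) -> bool:
    """Return True if the URL points at a Wayback Machine capture."""
    parsed = urlparse(url)
    return parsed.netloc == "web.archive.org" and parsed.path.startswith("/web/")


def strip_toolbar(html: str) -> str:
    """Remove the banner block, including its scripts and CSS."""
    return TOOLBAR.sub("", html)


def restore_urls(html: str) -> str:
    """Drop the Wayback prefix so links point at the original site again."""
    return WAYBACK_PREFIX.sub("", html)


def clean_wayback(html: str) -> str:
    """Apply both cleaning steps in order."""
    return restore_urls(strip_toolbar(html))
```

Whitespace normalization is left out here; a real cleaner would also need to handle footer comments and playback scripts that fall outside the toolbar block.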
Changes are categorized by significance:
- High Significance: Structural changes, content changes, meta tags, scripts, stylesheets
- Medium Significance: Attribute changes, styling, div/span modifications
- Low Significance: Whitespace, comments, minor formatting
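One way to picture these three levels is a small classifier over a pair of HTML fragments. The sketch below is illustrative only: `classify_change` and the exact rules (which tags count as high significance, how text is extracted) are assumptions made for this example and may differ from how the tool scores changes.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")
# Assumption: these tags are always treated as high significance.
HIGH_TAG = re.compile(r"<\s*/?\s*(meta|script|link|title)\b", re.IGNORECASE)


def _normalize(fragment: str) -> str:
    """Drop comments and collapse whitespace."""
    return " ".join(COMMENT.sub("", fragment).split())


def _text(fragment: str) -> str:
    """Return only the visible text, without tags or comments."""
    return " ".join(TAG.sub(" ", COMMENT.sub("", fragment)).split())


def classify_change(old: str, new: str) -> str:
    """Return 'high', 'medium', or 'low' for a changed fragment."""
    if _normalize(old) == _normalize(new):
        return "low"      # only whitespace or comments differ
    if HIGH_TAG.search(old) or HIGH_TAG.search(new):
        return "high"     # meta tags, scripts, stylesheets
    if _text(old) != _text(new):
        return "high"     # visible content changed
    return "medium"       # same text, different attributes or markup


print(classify_change('<div class="a">Hi</div>', '<div class="b">Hi</div>'))
```

The order of the checks matters: noise is filtered first so that a comment-only edit inside a script tag, for example, is not reported as high significance.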
The diff engine:
- Focuses on meaningful content changes
- Ignores noise like timestamps, auto-generated IDs
- Provides context around changes
- Groups changes by significance for easy review
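A rough sketch of the "ignore noise, keep context" idea, using the standard library's `difflib`. The patterns for timestamps and auto-generated IDs, and the names `normalize_line` and `meaningful_diff`, are assumptions chosen for illustration; the tool's own rules are likely more extensive.

```python
import difflib
import re

# Assumption: ISO-8601 style timestamps are noise.
TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"
)
# Assumption: auto-generated IDs look like id="prefix-<hex digits>".
GENERATED_ID = re.compile(r'\bid="[A-Za-z]+-[0-9a-f]{6,}"')


def normalize_line(line: str) -> str:
    """Replace noisy values with placeholders and collapse whitespace."""
    line = TIMESTAMP.sub("<timestamp>", line)
    line = GENERATED_ID.sub('id="<generated>"', line)
    return " ".join(line.split())


def meaningful_diff(old_html: str, new_html: str, context: int = 2) -> list:
    """Return unified diff lines computed on the normalized input."""
    old = [normalize_line(line) for line in old_html.splitlines()]
    new = [normalize_line(line) for line in new_html.splitlines()]
    return list(difflib.unified_diff(old, new, "old", "new", n=context, lineterm=""))
```

Two pages that differ only in a build timestamp or a generated ID produce an empty diff, while a changed heading still appears with `context` lines around it.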
After migrating a website from Wayback Machine archives, verify that the migration was successful:
website-diff https://web.archive.org/web/20230101/https://oldsite.com/ https://newsite.com/
Monitor a website for meaningful changes:
website-diff https://example.com/page1 https://example.com/page2 -o changes.txt
# With markdown report
website-diff https://example.com/page1 https://example.com/page2 --markdown
Compare development and production versions:
website-diff https://dev.example.com/page https://prod.example.com/page
The default text output includes:
- Summary statistics (total changes, added/removed/modified)
- Significance breakdown
- Detailed changes grouped by significance
- Context around each change
Structured JSON output for programmatic processing:
{
  "summary": {
    "total_changes": 15,
    "added": 5,
    "removed": 3,
    "modified": 7,
    "high_significance": 2,
    "medium_significance": 8,
    "low_significance": 5
  },
  "changes": [
    {
      "type": "modified",
      "old_text": "...",
      "new_text": "...",
      "significance": "high",
      ...
    }
  ]
}
Exit codes:
- 0: No changes detected
- 1: Low or medium significance changes
- 2: High significance changes detected
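For scripting, the JSON summary and the exit codes can be consumed together. The following is a hedged sketch: `exit_code_for` mirrors the documented codes from the summary fields shown above, and `run_diff` assumes that `--format json` writes only the JSON document to standard output. Both helper names are hypothetical.

```python
import json
import subprocess


def exit_code_for(summary: dict) -> int:
    """Map a summary object to the documented exit codes (0, 1, 2)."""
    if summary.get("high_significance", 0) > 0:
        return 2  # high significance changes detected
    if summary.get("total_changes", 0) > 0:
        return 1  # only low or medium significance changes
    return 0      # no changes detected


def run_diff(url_a: str, url_b: str):
    """Run the CLI and return (exit_code, parsed_report).

    Assumption: the JSON report is the only content on stdout.
    """
    proc = subprocess.run(
        ["website-diff", url_a, url_b, "--format", "json"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, json.loads(proc.stdout)


sample = json.loads('{"summary": {"total_changes": 15, "high_significance": 2}}')
print(exit_code_for(sample["summary"]))
```

In a CI job, a non-zero code from `run_diff` could be used to fail the build only when it equals 2, letting low and medium significance changes pass with a warning.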
- Python 3.8+
- requests library
- For visual comparison:
  - selenium
  - Pillow (PIL)
  - webdriver-manager (optional, for automatic driver management)
  - Chrome or Firefox browser installed
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
Note: This software is NOT for commercial use.
See the LICENSE file for details.
Run tests locally:
# Install test dependencies
pip install -r requirements-dev.txt
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=website_diff --cov-report=html
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new features
- Ensure all tests pass: pytest tests/ -v
- Submit a Pull Request
The project uses GitHub Actions for:
- CI: Runs tests on push/PR across Python 3.8-3.11
- Release: Automatically creates GitHub releases when version tags are pushed
To create a release:
git tag v1.0.0
git push origin v1.0.0
This will:
- Run all tests
- Build the package
- Create a GitHub release with distribution files
Inspired by discussions on software testing tools for comparing websites, particularly for migration scenarios.