Agrty is a sophisticated web scraping tool designed to aggregate product and price data from major online grocery retailers. It uses a hybrid approach of detailed Selenium-based crawling and efficient offline parsing.
The project is organized by data source in the sources/ directory.
- Target: `nakup.itesco.cz`
  - Crawler (`crawler.py`):
    - Architecture: Multi-threaded Selenium crawler.
    - Strategy: "Click-based" navigation (mimics user behavior) to traverse categories and pagination.
    - Features:
      - Resiliency: Automated recovery from connection failures and page load timeouts. If a category fails, it restarts from scratch in a fresh window.
      - State Management: Tracks processed products and the category hierarchy in `data/tesco_raw/tesco_state.json` to allow pausing and resuming.
      - Raw Data: Saves the full HTML source of product pages (gzipped) to `data/tesco_raw/` for offline parsing.
  - Parser (`parser.py`):
    - Input: Gzipped HTML files from `data/tesco_raw/`.
    - Extraction: Hybrid extraction using:
      - Apollo Cache: Extracts the hydrated React state (Apollo) directly from the HTML for structured data.
      - JSON-LD: Fallback to Schema.org structured metadata.
      - Exhaustive DOM: Final fallback to CSS selectors.
    - Output: `data/tesco.result.json`
- Target: `kupi.cz` (price aggregation and flyer site)
  - Crawler: Selenium-based crawler that archives deal pages.
    - Raw Data: Saves raw HTML content (gzipped) to `data/kupi_raw/`.
  - Parser (`parser.py`):
    - Features:
      - Dual-Path Parsing: Handles both detail view (`sleva_*.html`) and category grid view (`slevy_*.html`) files.
      - Smart Date Parsing: Converts Czech natural-language dates (e.g., "dnes končí", i.e. "ends today", or "čt 15. 1.", i.e. "Thu 15 Jan") into standard ISO ranges.
      - Deduplication: Merges offers from different files based on product name.
    - Output: `data/kupi.result.json`
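The Czech date normalization can be sketched as follows. `parse_czech_date` is a hypothetical helper handling only the two phrase shapes quoted above; the real `parser.py` certainly covers more formats and edge cases.

```python
from datetime import date

def parse_czech_date(text: str, today: date) -> date:
    """Resolve a Czech flyer date phrase to a concrete date.

    Illustrative sketch: handles "dnes končí" and "čt 15. 1." style
    inputs only.
    """
    text = text.strip().lower()
    if "dnes" in text:  # "dnes končí" -> the offer ends today
        return today
    # "čt 15. 1." -> day 15, month 1; the weekday prefix is ignored.
    parts = [p for p in text.replace(".", " ").split() if p.isdigit()]
    day, month = int(parts[0]), int(parts[1])
    candidate = date(today.year, month, day)
    # Flyers never point to the past, so roll over to next year if needed.
    if candidate < today:
        candidate = date(today.year + 1, month, day)
    return candidate
```

Calling `.isoformat()` on the result yields the ISO strings stored in `validity_start`/`validity_end`.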
- Target: `wolt.com` (delivery platform)
  - Unified Crawler (`sources/wolt/crawler.py`):
    - Architecture: Shared codebase for multiple stores.
    - Multi-Store Support: Configurable via the `--store` argument (maps to specific venue URLs).
    - Features/Strategy:
      - Dynamic Category Discovery: Automatically scans the store page to find all categories.
      - Browser Pooling: Reuses WebDriver instances to minimize overhead.
      - Resumable: Tracks progress using `CrawlerState`, allowing pause/resume.
  - Unified Parser (`sources/wolt/parser.py`):
    - Input: Raw HTML files from `data/<store>_raw/`.
    - Features: Extracts product details, prices, and images from Wolt's standardized layout.
    - Output: `data/<store>.result.json` (e.g., `albert.result.json`)
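The resumable-progress idea boils down to a small state object persisted to disk after every page. The interface below is an assumption for illustration; the actual `CrawlerState` in `sources/wolt/crawler.py` may look different.

```python
import json
from pathlib import Path

class CrawlerState:
    """Minimal sketch of resumable progress tracking."""

    def __init__(self, path: Path):
        self.path = path
        self.done: set[str] = set()
        if path.exists():
            # Resume: load the URLs already crawled in a previous run.
            self.done = set(json.loads(path.read_text()))

    def is_done(self, url: str) -> bool:
        return url in self.done

    def mark_done(self, url: str) -> None:
        self.done.add(url)
        # Persist after every page so a crash loses no progress.
        self.path.write_text(json.dumps(sorted(self.done)))
```

The crawl loop then simply skips any URL for which `is_done` returns true.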
The output files (`*.result.json`) conform to a unified JSON schema defined in `sources/schema.json`:

- `products`: Array of product objects.
  - `name`: Standardized product name.
  - `product_url`: Original URL of the product.
  - `prices`: Array of price offers.
    - `price`: Current price.
    - `original_price`: Price before discount.
    - `unit_price`: Price per unit (e.g., per kg).
    - `condition`: Offer conditions (e.g., "Clubcard").
    - `validity_start` / `validity_end`: Date range for the offer.
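For orientation, a single record conforming to this schema might look like the following (all values invented):

```python
# Illustrative record matching the sources/schema.json shape.
example_product = {
    "name": "Polotučné mléko 1 l",
    "product_url": "https://nakup.itesco.cz/...",  # invented placeholder URL
    "prices": [
        {
            "price": 19.90,             # current price
            "original_price": 24.90,    # price before discount
            "unit_price": 19.90,        # per litre here
            "condition": "Clubcard",    # offer condition, if any
            "validity_start": "2025-01-10",
            "validity_end": "2025-01-16",
        }
    ],
}
```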
This project relies heavily on Selenium WebDriver for crawling.
- Human-like Behavior: The Tesco crawler is specifically designed to avoid detection by behaving like a user (clicking menus, scrolling, waiting for elements) rather than just hitting APIs.
- Performance:
  - Parallelism: A configurable `ThreadExecutor` allows running multiple browser windows simultaneously (`--workers N`).
  - Headless: Supports running in `headless=new` mode for efficiency on servers.
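The `--workers N` fan-out can be sketched with the standard library's `ThreadPoolExecutor` (which the project's `ThreadExecutor` presumably wraps). A plain function stands in for the per-category browser session; real code would start each driver with `options.add_argument("--headless=new")`.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_category(category: str) -> str:
    # Placeholder for driving one Selenium WebDriver through a category.
    return f"crawled {category}"

def crawl_all(categories: list[str], workers: int = 4) -> list[str]:
    # Each worker thread would own its own browser window.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(crawl_category, categories))
```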
The project employs a Map-Reduce inspired architecture for parsing to handle large datasets efficiently and robustly.
- Map Phase (Crawling):
  - The crawlers treat each page as an independent unit of work.
  - Raw HTML of each page is saved individually as a compressed file (e.g., `item_123.html.gz`) in a `raw_data` directory.
  - This ensures that if the crawler crashes, no progress is lost, and individual pages can be re-crawled or inspected without affecting the rest.
- Reduce Phase (Parsing):
  - The parser runs as a separate offline process.
  - It iterates over all files in the `raw_data` directory using `glob`.
  - Parallel Processing: It utilizes a `ProcessPoolExecutor` to parse thousands of HTML files in parallel, maximizing CPU usage.
  - Each worker extracts data from a single file and returns a list of product objects.
  - Aggregation and Deduplication: The main process collects the lists from all workers, flattens them, and performs deduplication (e.g., merging duplicate products found on different pages) before writing the final JSON output.
This separation allows for rapid iteration on parsing logic without re-crawling the web pages.
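The plumbing described above can be sketched as follows; `parse_one`'s extraction logic is a placeholder that just returns the raw HTML, and deduplication is elided.

```python
import glob
import gzip
from concurrent.futures import ProcessPoolExecutor

def parse_one(path: str) -> list[dict]:
    """Map step: parse one gzipped page into a list of product dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        html = fh.read()
    return [{"source_file": path, "name": html.strip()}]  # placeholder

def parse_all(raw_dir: str) -> list[dict]:
    files = sorted(glob.glob(f"{raw_dir}/*.html.gz"))
    with ProcessPoolExecutor() as pool:
        batches = list(pool.map(parse_one, files))
    # Reduce step: flatten per-file lists (dedup would happen here).
    return [product for batch in batches for product in batch]
```

Because each file is parsed independently, a selector fix only requires re-running `parse_all`, never re-crawling.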
Raw data extracted by the parsers (`*.result.json`) undergoes a series of transformations before being consumed by the browser application. This pipeline is managed by `scripts/processing`.
Run the default pipeline:

```sh
./scripts/processing --default
```

This executes the following steps in order:
1. Filter for Food (`processing/filter_for_food.py`)
   - Input: `*.result.json`
   - Action: Filters products to keep only food-related items using a keyword whitelist (positive) and blacklist (negative).
   - Output: `*.001.filter_for_food.json`
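The whitelist/blacklist idea can be sketched like this; the keyword sets are invented stand-ins for the real, much larger lists in `processing/filter_for_food.py`.

```python
# Hypothetical keyword lists (Czech product vocabulary).
FOOD_KEYWORDS = {"mléko", "chléb", "sýr", "jogurt"}
NON_FOOD_KEYWORDS = {"šampon", "prací", "baterie"}

def is_food(name: str) -> bool:
    words = name.lower().split()
    # Blacklist wins: an explicit non-food keyword rejects the product.
    if any(w in NON_FOOD_KEYWORDS for w in words):
        return False
    # Otherwise require at least one whitelisted food keyword.
    return any(w in FOOD_KEYWORDS for w in words)
```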
2. Remove Expired Offers (`processing/remove_expired_offers.py`)
   - Input: `*.001.filter_for_food.json`
   - Action: Filters out price offers whose `validity_end` date is in the past.
   - Output: `*.002.remove_expired_offers.json`
3. Enrich Brands (`processing/enrich_brands.py`)
   - Input: `*.002.remove_expired_offers.json`
   - Action: Guesses missing brand names from product titles using heuristics (e.g., "Coca-Cola Zero" -> brand "Coca-Cola") when the source did not provide structured brand data.
   - Output: `*.003.enrich_brands.json`
4. Normalize Data (`processing/normalize_data.py`)
   - Input: `*.003.enrich_brands.json`
   - Action:
     - Converts units to base units (e.g., g -> kg).
     - Calculates missing package sizes from `price` / `unit_price`.
   - Output: `*.004.normalize_data.json`
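The unit conversion and package-size recovery can be sketched as follows; the conversion table and rounding behavior are assumptions.

```python
# Map each source unit to its base unit and conversion factor.
TO_BASE = {"g": ("kg", 0.001), "ml": ("l", 0.001), "kg": ("kg", 1), "l": ("l", 1)}

def normalize_unit(amount: float, unit: str) -> tuple[float, str]:
    base_unit, factor = TO_BASE[unit]
    return amount * factor, base_unit

def package_size(price: float, unit_price: float) -> float:
    # e.g. 19.90 CZK at a unit price of 39.80 CZK/kg -> 0.5 kg package
    return price / unit_price
```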
5. Assign AI Categories (`processing/assign_ai_categories.py`)
   - Input: `*.004.normalize_data.json`
   - Action: Uses mappings or heuristics to assign standardized category IDs.
   - Output: `*.005.assign_ai_categories.json`
6. Build Categories (`processing/build_categories.py`)
   - Input: `*.005.assign_ai_categories.json`
   - Action:
     - Constructs a hierarchical category tree from the flat product list.
     - Assigns stable IDs to categories.
     - Tags each product with the IDs of its category lineage.
   - Output: `*.006.build_categories.json` -> final `*.processed.json`
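The tree construction and lineage tagging can be sketched as below. Field names (`category_path`, `category_ids`) are assumptions, and the real script assigns stable IDs rather than reusing category names as shown here.

```python
def build_tree(products: list[dict]) -> dict:
    """Fold flat per-product category paths into a nested tree,
    tagging each product with its full lineage along the way."""
    tree: dict = {}
    for product in products:
        node = tree
        lineage = []
        for name in product["category_path"]:
            node = node.setdefault(name, {})  # create level if missing
            lineage.append(name)
        product["category_ids"] = lineage
    return tree
```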
The browser/ directory contains a modern web application for visualizing the aggregated data.
- Stack: React + Vite
- Name: Agravity Deals
- Functionality:
  - Data Source: Dynamically loads processed data (`*.processed.json`) via `index.json`.
  - Filtering:
    - Stores: Local filtering by specific retailer.
    - Categories: Hierarchical tree-based filtering with support for inclusion/exclusion logic.
    - Brands: Filter by manufacturer brand.
  - Sorting: Sort products by absolute price or unit price (ascending/descending).
  - Comparison: Displays aggregated offers per product, allowing easy comparison of prices across different stores and package sizes.
  - Unit Pricing: Automatically calculates and highlights unit prices (e.g., per kg/l) to reveal true value.
  - Deep Linking:
    - Product: `product://<source>::<url>` opens specific product details.
    - Category: `category://<source>::<categoryId>?store_name=<name>&product_url=<url>` opens the "Explore" view for a specific category and optionally opens the product detail modal.
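To make the deep-link format concrete, here is an illustrative parser for `category://` links (shown in Python for brevity; the browser itself implements this in TypeScript, presumably among the `utils/` helpers):

```python
from urllib.parse import parse_qs

def parse_category_link(link: str) -> dict:
    """Split category://<source>::<categoryId>?store_name=...&product_url=...
    into its components. Illustrative sketch only."""
    scheme, rest = link.split("://", 1)
    path, _, query = rest.partition("?")
    source, category_id = path.split("::", 1)
    params = parse_qs(query)
    return {
        "scheme": scheme,
        "source": source,
        "category_id": category_id,
        "store_name": params.get("store_name", [None])[0],
        "product_url": params.get("product_url", [None])[0],
    }
```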
To run the browser application locally:

1. Navigate to `browser/`.
2. Node.js Setup (CRITICAL):
   - This project relies on a specific Node.js version.
   - Always run `nvm use` in the `browser/` directory to load the correct version from `.nvmrc`.
   - If you don't have `nvm`, check `.nvmrc` for the required version and install it manually.
3. Install dependencies: `npm install`.
4. Data Setup: The build process automatically copies `data/*.processed.json` to `public/` and generates the index.
   - Ensure you have generated data in `../data/` (run `./scripts/processing --default`).
5. Start the dev server: `npm run dev`.
For running crawlers and scripts in `sources/`:

- Always use a virtual environment.
- Create/activate: `python3 -m venv .venv` and `source .venv/bin/activate`.
- Install requirements: `pip install -r requirements.txt`.
The project uses GitHub Actions for daily data updates and deployment.
- Workflow: `.github/workflows/deploy.yml`
- Schedule: Runs automatically at 2:00 AM daily.
- Pipeline:
  - Crawl Phase: Runs crawlers for all defined stores in parallel.
  - Parse Phase: Parses raw HTML into JSON (`*.result.json`).
  - Collect & Process Phase:
    - Aggregates results.
    - Runs the processing pipeline (`./scripts/processing --default`).
    - Uploads both raw (`*.result.json`) and processed (`*.processed.json`) data as artifacts.
  - Deploy Phase: Builds the React application with the processed data and deploys it to GitHub Pages.
When adding a new retailer (e.g., Penny), you MUST update the deployment workflow:
1. Open `.github/workflows/deploy.yml`.
2. Add the new store identifier to the `matrix.store` list in BOTH the `crawl` and `parse` jobs.
3. Add configuration to the `matrix.include` section for:
   - Crawl Args: worker count and flags (e.g., `--headless`).
   - Parse Args: worker count.

```yaml
# Example addition to matrix
- store: penny
  crawl_args: --headless --workers 4
```

Failure to do this will result in the new store being ignored by the daily automation.
Note: This README.md and `sources/schema.json` are maintained by the AI assistant.

- When to update:
  - If you add a new data source.
  - If you change the crawler architecture.
  - If you modify the output schema.
  - If you add or significantly change a step in the Data Processing Pipeline.
- How to update: Explicitly ask the assistant to "update the project README" after making your code changes.
All temporary, test, or intermediate scripts and files should be placed in the vibes/ directory. This keeps the main source tree clean and organized.
- Avoid Spaghetti Code: Keep component logic clean and declarative. Complex logic should be extracted to utilities or hooks.
- Utilities: Encapsulate reusable logic in thematic `.ts` files in the `utils/` directory (e.g., `links.ts`, `format.ts`).