Add parser for foodrepo.org#11

Open

alikhan-s wants to merge 6 commits into solutionaryme:main from alikhan-s:feature/foodrepo-parser

alikhan-s commented Aug 29, 2025

Add parser for foodrepo.org

This parser collects product data from foodrepo.org and outputs it in the JSON format required by Barbase:

{
"barcode": "string",
"name": "string",
"image_links": ["url_1", "url_2"]
}
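For illustration, one way to build a record in this shape is sketched below. `ProductRecord` and `make_record` are hypothetical names, not part of this PR or of Barbase; the only thing taken from the description is the three keys and their types. Keeping `barcode` as a string preserves any leading zeros.

```python
import json
from typing import List, TypedDict


class ProductRecord(TypedDict):
    """Hypothetical type mirroring the JSON shape in the PR description."""
    barcode: str
    name: str
    image_links: List[str]


def make_record(barcode: str, name: str, image_links: List[str]) -> ProductRecord:
    """Build one record, rejecting values that would not match the shape."""
    if not isinstance(barcode, str) or not isinstance(name, str):
        raise TypeError("barcode and name must be strings")
    if not all(isinstance(url, str) for url in image_links):
        raise TypeError("image_links must be a list of strings")
    return {"barcode": barcode, "name": name, "image_links": list(image_links)}


if __name__ == "__main__":
    # Placeholder values, not real FoodRepo data.
    record = make_record("0123456789012", "Example product", ["https://example.com/1.jpg"])
    print(json.dumps(record, ensure_ascii=False, indent=4))
```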

Uses Selenium to handle dynamic page loading
Collects all product links without duplicates
Includes barcode, name, and image URLs
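The "without duplicates" step could look roughly like the sketch below. `unique_product_urls` is a hypothetical helper, not the code in this PR: it takes the raw `href` values already pulled from the page, keeps only product links, makes them absolute, and drops repeats while preserving the order in which they appeared (a plain `set` would not keep that order).

```python
from typing import Iterable, List
from urllib.parse import urljoin

BASE_URL = "https://www.foodrepo.org"


def unique_product_urls(hrefs: Iterable[str], base_url: str = BASE_URL) -> List[str]:
    """Return absolute product URLs, de-duplicated in first-seen order."""
    absolute = (
        urljoin(base_url, href)  # relative paths become absolute URLs
        for href in hrefs
        if "/en/products/" in href  # same filter the parser applies
    )
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return list(dict.fromkeys(absolute))
```

Resolving the links before de-duplicating means a relative and an absolute link to the same product collapse into one entry.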


          Add parser for foodrepo.org

034f4c4

smvrnn requested a review from Copilot

August 30, 2025 18:48

This comment was marked as outdated.


alikhan-s added 2 commits

August 31, 2025 15:32


          Improved foodrepo parser

c34095f


          Update README with setup instructions for Windows/Mac/Linux

fc8287d

smvrnn requested a review from Copilot

September 2, 2025 16:40

Copilot AI reviewed


Copilot AI left a comment

Pull Request Overview

Adds a new web scraper for foodrepo.org that extracts product data including barcodes, names, and image URLs in the JSON format required by Barbase. The parser uses Selenium to handle dynamic page loading and collects product information from all unique product links.

Key changes:

New Selenium-based parser that scrapes product data from foodrepo.org
Extracts barcodes, product names, and image URLs in required JSON format
Comprehensive documentation with installation and usage instructions

Reviewed Changes

Copilot reviewed 2 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| domains/foodrepo.org/parser.py | Main scraper implementation using Selenium and BeautifulSoup |
| domains/foodrepo.org/README.md | Documentation covering installation, usage, and parser functionality |

Files not reviewed (6)

.idea/.gitignore: Language not supported
.idea/barbase-tools.iml: Language not supported
.idea/inspectionProfiles/profiles_settings.xml: Language not supported
.idea/misc.xml: Language not supported
.idea/modules.xml: Language not supported
.idea/vcs.xml: Language not supported


domains/foodrepo.org/parser.py Outdated

Comment on lines 21 to 77

+              products_data = []
+              with webdriver.Chrome(service=service) as driver:
+                  # URL of the page with the list of products
+                  base_url = "https://www.foodrepo.org/en/products"
+                  driver.get(base_url)
+                  # Wait until at least one link to the product appears on the page
+                  wait = WebDriverWait(driver, 10)
+                  wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='/en/products/']")))
+                  # Receive HTML
+                  soup = BeautifulSoup(driver.page_source, 'html.parser')
+                  # Find all links to product pages
+                  product_links = list(set([a['href'] for a in soup.find_all('a', href=True) if '/en/products/' in a['href']]))
+                  # Going through each link
+                  for link in product_links:
+                      product_url = f"https://www.foodrepo.org{link}"
+                      driver.get(product_url)
+                      # Wait of h1 to appear
+                      wait.until(EC.presence_of_element_located((By.TAG_NAME, "h1")))
+                      product_soup = BeautifulSoup(driver.page_source, 'html.parser')
+                      # Product name
+                      title_tag = product_soup.find('h1')
+                      if title_tag:
+                          text = title_tag.get_text(strip=True)
+                          title_text = text if not text.isdigit() else "Not found"
+                      else:
+                          title_text = "Not found"
+                      # Images
+                      img_tags = product_soup.find_all('img', alt=lambda x: x and x.startswith("Image #"))
+                      img_urls = [img['src'] for img in img_tags if img.get('src')]
+                      # Barcode (EAN)
+                      barcode_div = product_soup.find('span', class_='font-weight-bold', string='Barcode')
+                      if barcode_div and barcode_div.parent:
+                          barcode_text = barcode_div.parent.get_text(strip=True).replace('Barcode', '').strip()
+                          barcode = barcode_text if barcode_text.isdigit() else 'Not found'
+                      else:
+                          barcode = 'Not found'
+                      products_data.append({
+                          "barcode": barcode,
+                          "name": title_text,
+                          "image_links": img_urls
+                      })
+              # Save in JSON
+              with open("foodrepo.json", "w", encoding="utf-8") as f:
+                  json.dump(products_data, f, ensure_ascii=False, indent=4)
+              print("Data successfully saved in 'foodrepo.json'")

Copilot AI Sep 2, 2025

The global variable products_data should be defined inside the main execution block or within a function to improve code organization and avoid global state.

Suggested change

      
    def main():
        products_data = []
        with webdriver.Chrome(service=service) as driver:
            # URL of the page with the list of products
            base_url = "https://www.foodrepo.org/en/products"
            driver.get(base_url)
            # Wait until at least one link to the product appears on the page
            wait = WebDriverWait(driver, 10)
            wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='/en/products/']")))
            # Receive HTML
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            # Find all links to product pages
            product_links = list(set([a['href'] for a in soup.find_all('a', href=True) if '/en/products/' in a['href']]))
            # Going through each link
            for link in product_links:
                product_url = f"https://www.foodrepo.org{link}"
                driver.get(product_url)
                # Wait of h1 to appear
                wait.until(EC.presence_of_element_located((By.TAG_NAME, "h1")))
                product_soup = BeautifulSoup(driver.page_source, 'html.parser')
                # Product name
                title_tag = product_soup.find('h1')
                if title_tag:
                    text = title_tag.get_text(strip=True)
                    title_text = text if not text.isdigit() else "Not found"
                else:
                    title_text = "Not found"
                # Images
                img_tags = product_soup.find_all('img', alt=lambda x: x and x.startswith("Image #"))
                img_urls = [img['src'] for img in img_tags if img.get('src')]
                # Barcode (EAN)
                barcode_div = product_soup.find('span', class_='font-weight-bold', string='Barcode')
                if barcode_div and barcode_div.parent:
                    barcode_text = barcode_div.parent.get_text(strip=True).replace('Barcode', '').strip()
                    barcode = barcode_text if barcode_text.isdigit() else 'Not found'
                else:
                    barcode = 'Not found'
                products_data.append({
                    "barcode": barcode,
                    "name": title_text,
                    "image_links": img_urls
                })
        # Save in JSON
        with open("foodrepo.json", "w", encoding="utf-8") as f:
            json.dump(products_data, f, ensure_ascii=False, indent=4)
        print("Data successfully saved in 'foodrepo.json'")

    if __name__ == "__main__":
        main()

Copilot uses AI. Check for mistakes.

domains/foodrepo.org/parser.py Outdated

+                      product_url = f"https://www.foodrepo.org{link}"
+                      driver.get(product_url)
+                      # Wait of h1 to appear

Copilot AI Sep 2, 2025

Grammar error in comment. Should be 'Wait for h1 to appear' instead of 'Wait of h1 to appear'.

Suggested change

      
-                    # Wait of h1 to appear
+                    # Wait for h1 to appear


domains/foodrepo.org/parser.py Outdated

+                  wait = WebDriverWait(driver, 10)
+                  wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='/en/products/']")))
+                  # Receive HTML

Copilot AI Sep 2, 2025

Comment should use 'Get HTML' or 'Retrieve HTML' instead of 'Receive HTML' for clarity.

Suggested change

      
-                # Receive HTML
+                # Get HTML


domains/foodrepo.org/parser.py Outdated

+                  # Find all links to product pages
+                  product_links = list(set([a['href'] for a in soup.find_all('a', href=True) if '/en/products/' in a['href']]))
+                  # Going through each link

Copilot AI Sep 2, 2025

Comment should be 'Go through each link' or 'Iterate through each link' instead of 'Going through each link'.

Suggested change

      
-                # Going through each link
+                # Iterate through each link


domains/foodrepo.org/parser.py Outdated

+                          "image_links": img_urls
+                      })
+              # Save in JSON

Copilot AI Sep 2, 2025

Comment should be 'Save as JSON' or 'Save to JSON' instead of 'Save in JSON'.

Suggested change

      
-            # Save in JSON
+            # Save as JSON



          🧹 Update .gitignore, adding exclusions for venv

aa7c7f4

Contributor

smvrnn commented Sep 2, 2025

The whole site needs to be parsed, not just the main page.
It looks like you can get an API key on the site, which would probably make getting the data simpler and faster.
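To make the "whole site" point concrete, here is a minimal sketch of walking every page of a paginated listing. It is not the code from this PR, and it makes assumptions that should be checked against the FoodRepo API documentation: that each response is a JSON object with the items under `data` and the URL of the next page under `links.next`. `fetch_page` is a hypothetical callable; in practice it would wrap an authenticated HTTP GET using the API key.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_all_products(fetch_page: Callable[[str], Dict], first_url: str) -> Iterator[Dict]:
    """Yield every item from every page, following the 'next' link.

    Assumes (unverified) a response shape like:
        {"data": [...], "links": {"next": "<url or null>"}}
    """
    url: Optional[str] = first_url
    seen = set()  # guards against a 'next' link that loops back
    while url and url not in seen:
        seen.add(url)
        page = fetch_page(url)
        yield from page.get("data", [])
        url = (page.get("links") or {}).get("next")
```

Passing the fetcher in as a parameter keeps the pagination logic testable without network access or a real key.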

alikhan-s added 2 commits

September 4, 2025 12:37


          Refactor: replace manual parser with FoodRepo API scraper

be9a39a


          Merge branch 'feature/foodrepo-parser' of https://github.com/alikhan-…

85fed1d

…s/barbase-tools into feature/foodrepo-parser
