Add parser for foodrepo.org #11

Open
alikhan-s wants to merge 6 commits into solutionaryme:main from alikhan-s:feature/foodrepo-parser
Conversation

@alikhan-s

Add parser for foodrepo.org

This parser collects product data from foodrepo.org and outputs it in the JSON format required by Barbase:

{
"barcode": "string",
"name": "string",
"image_links": ["url_1", "url_2"]
}

  • Uses Selenium to handle dynamic page loading
  • Collects all product links without duplicates
  • Includes barcode, name, and image URLs
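As a quick sanity check (not part of the parser itself), a record from the output can be validated against the required shape. This is a minimal sketch; the sample barcode and URL below are made up for illustration:

```python
def is_valid_record(rec):
    """Check that a record has a string barcode, a string name,
    and a list of string image links, matching the Barbase format."""
    return (
        isinstance(rec, dict)
        and isinstance(rec.get("barcode"), str)
        and isinstance(rec.get("name"), str)
        and isinstance(rec.get("image_links"), list)
        and all(isinstance(u, str) for u in rec["image_links"])
    )

# Hypothetical sample record in the format above
sample = {
    "barcode": "7610200360104",
    "name": "Example product",
    "image_links": ["https://example.com/images/1.jpg"],
}
print(is_valid_record(sample))  # True
```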

@smvrnn smvrnn requested a review from Copilot August 30, 2025 18:48


@smvrnn smvrnn requested a review from Copilot September 2, 2025 16:40

Copilot AI left a comment


Pull Request Overview

Adds a new web scraper for foodrepo.org that extracts product data including barcodes, names, and image URLs in the JSON format required by Barbase. The parser uses Selenium to handle dynamic page loading and collects product information from all unique product links.

Key changes:

  • New Selenium-based parser that scrapes product data from foodrepo.org
  • Extracts barcodes, product names, and image URLs in required JSON format
  • Comprehensive documentation with installation and usage instructions

Reviewed Changes

Copilot reviewed 2 out of 9 changed files in this pull request and generated 5 comments.

File: Description
domains/foodrepo.org/parser.py: Main scraper implementation using Selenium and BeautifulSoup
domains/foodrepo.org/README.md: Documentation covering installation, usage, and parser functionality
Files not reviewed (6)
  • .idea/.gitignore: Language not supported
  • .idea/barbase-tools.iml: Language not supported
  • .idea/inspectionProfiles/profiles_settings.xml: Language not supported
  • .idea/misc.xml: Language not supported
  • .idea/modules.xml: Language not supported
  • .idea/vcs.xml: Language not supported


Comment on lines 21 to 77
products_data = []

with webdriver.Chrome(service=service) as driver:
    # URL of the page with the list of products
    base_url = "https://www.foodrepo.org/en/products"
    driver.get(base_url)

    # Wait until at least one link to the product appears on the page
    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='/en/products/']")))

    # Receive HTML
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Find all links to product pages
    product_links = list(set([a['href'] for a in soup.find_all('a', href=True) if '/en/products/' in a['href']]))

    # Going through each link
    for link in product_links:
        product_url = f"https://www.foodrepo.org{link}"
        driver.get(product_url)

        # Wait of h1 to appear
        wait.until(EC.presence_of_element_located((By.TAG_NAME, "h1")))
        product_soup = BeautifulSoup(driver.page_source, 'html.parser')

        # Product name
        title_tag = product_soup.find('h1')
        if title_tag:
            text = title_tag.get_text(strip=True)
            title_text = text if not text.isdigit() else "Not found"
        else:
            title_text = "Not found"

        # Images
        img_tags = product_soup.find_all('img', alt=lambda x: x and x.startswith("Image #"))
        img_urls = [img['src'] for img in img_tags if img.get('src')]

        # Barcode (EAN)
        barcode_div = product_soup.find('span', class_='font-weight-bold', string='Barcode')
        if barcode_div and barcode_div.parent:
            barcode_text = barcode_div.parent.get_text(strip=True).replace('Barcode', '').strip()
            barcode = barcode_text if barcode_text.isdigit() else 'Not found'
        else:
            barcode = 'Not found'

        products_data.append({
            "barcode": barcode,
            "name": title_text,
            "image_links": img_urls
        })

# Save in JSON
with open("foodrepo.json", "w", encoding="utf-8") as f:
    json.dump(products_data, f, ensure_ascii=False, indent=4)

print("Data successfully saved in 'foodrepo.json'")

Copilot AI Sep 2, 2025


The global variable products_data should be defined inside the main execution block or within a function to improve code organization and avoid global state.

Suggested change

def main():
    products_data = []

    with webdriver.Chrome(service=service) as driver:
        # URL of the page with the list of products
        base_url = "https://www.foodrepo.org/en/products"
        driver.get(base_url)

        # Wait until at least one link to the product appears on the page
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='/en/products/']")))

        # Receive HTML
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        # Find all links to product pages
        product_links = list(set([a['href'] for a in soup.find_all('a', href=True) if '/en/products/' in a['href']]))

        # Going through each link
        for link in product_links:
            product_url = f"https://www.foodrepo.org{link}"
            driver.get(product_url)

            # Wait of h1 to appear
            wait.until(EC.presence_of_element_located((By.TAG_NAME, "h1")))
            product_soup = BeautifulSoup(driver.page_source, 'html.parser')

            # Product name
            title_tag = product_soup.find('h1')
            if title_tag:
                text = title_tag.get_text(strip=True)
                title_text = text if not text.isdigit() else "Not found"
            else:
                title_text = "Not found"

            # Images
            img_tags = product_soup.find_all('img', alt=lambda x: x and x.startswith("Image #"))
            img_urls = [img['src'] for img in img_tags if img.get('src')]

            # Barcode (EAN)
            barcode_div = product_soup.find('span', class_='font-weight-bold', string='Barcode')
            if barcode_div and barcode_div.parent:
                barcode_text = barcode_div.parent.get_text(strip=True).replace('Barcode', '').strip()
                barcode = barcode_text if barcode_text.isdigit() else 'Not found'
            else:
                barcode = 'Not found'

            products_data.append({
                "barcode": barcode,
                "name": title_text,
                "image_links": img_urls
            })

    # Save in JSON
    with open("foodrepo.json", "w", encoding="utf-8") as f:
        json.dump(products_data, f, ensure_ascii=False, indent=4)

    print("Data successfully saved in 'foodrepo.json'")

if __name__ == "__main__":
    main()

product_url = f"https://www.foodrepo.org{link}"
driver.get(product_url)

# Wait of h1 to appear

Copilot AI Sep 2, 2025


Grammar error in comment. Should be 'Wait for h1 to appear' instead of 'Wait of h1 to appear'.

Suggested change
# Wait of h1 to appear
# Wait for h1 to appear

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='/en/products/']")))

# Receive HTML

Copilot AI Sep 2, 2025


Comment should use 'Get HTML' or 'Retrieve HTML' instead of 'Receive HTML' for clarity.

Suggested change
# Receive HTML
# Get HTML

# Find all links to product pages
product_links = list(set([a['href'] for a in soup.find_all('a', href=True) if '/en/products/' in a['href']]))

# Going through each link

Copilot AI Sep 2, 2025


Comment should be 'Go through each link' or 'Iterate through each link' instead of 'Going through each link'.

Suggested change
# Going through each link
# Iterate through each link

"image_links": img_urls
})

# Save in JSON

Copilot AI Sep 2, 2025


Comment should be 'Save as JSON' or 'Save to JSON' instead of 'Save in JSON'.

Suggested change
# Save in JSON
# Save as JSON

@smvrnn
Contributor

smvrnn commented Sep 2, 2025

  1. The whole site needs to be parsed, not just the main page
  2. It looks like you can get an API key on the site, and it may be possible to fetch the data more simply and quickly
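On the reviewer's second point: FoodRepo exposes a public REST API that returns product data as JSON, which avoids driving a browser at all. Below is a hedged sketch; the endpoint path, token header format, and response field names (`data`, `name_translations`, `images[].medium`) are assumptions from memory of the v3 API and should be checked against the site's current api-docs before use:

```python
import json
import urllib.request

# Assumed v3 products endpoint; verify against the FoodRepo api-docs.
API_URL = "https://www.foodrepo.org/api/v3/products"

def to_barbase(item):
    """Map one assumed API product item to the Barbase record shape."""
    names = item.get("name_translations") or {}
    return {
        "barcode": str(item.get("barcode", "Not found")),
        # Prefer the English name, fall back to any available translation
        "name": names.get("en") or next(iter(names.values()), "Not found"),
        "image_links": [img["medium"] for img in item.get("images", []) if img.get("medium")],
    }

def fetch_page(api_key, page=1):
    """Fetch one page of products using the assumed token auth header."""
    req = urllib.request.Request(
        f"{API_URL}?page={page}",
        headers={"Authorization": f"Token token={api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return [to_barbase(item) for item in payload.get("data", [])]
```

The mapping function can be exercised offline with a fabricated payload item, e.g. `to_barbase({"barcode": "761...", "name_translations": {"en": "Chocolate"}, "images": [{"medium": "https://example.com/1.jpg"}]})`.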

