
Feature OpenFoodFacts Parser #14 (Open)

alikhan-s wants to merge 4 commits into solutionaryme:main from alikhan-s:feature/openfoodfacts-parser

Conversation

@alikhan-s

No description provided.

Copilot AI left a comment

Pull request overview

This PR introduces a Python-based parser for the OpenFoodFacts API that incrementally fetches and stores product data. The implementation uses async/concurrent requests to efficiently download products and maintains state to track progress across runs.

Key changes:

  • Adds an asynchronous Python parser with concurrent API requests and state persistence
  • Implements incremental data fetching based on last update timestamps
  • Includes .gitignore updates for IDE files and large output files
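The concurrent-request approach described above generally follows the asyncio fan-out idiom. The sketch below uses a stub `fetch_page` in place of the parser's real httpx call, so all names here are illustrative rather than taken from the PR:

```python
import asyncio

async def fetch_page(page):
    # Stand-in for the parser's real httpx request; returns a dummy payload.
    await asyncio.sleep(0)
    return {"page": page, "products": []}

async def fetch_pages(pages):
    # Issue all page requests concurrently and collect the results in order.
    return await asyncio.gather(*(fetch_page(p) for p in pages))

results = asyncio.run(fetch_pages(range(1, 4)))
```

`asyncio.gather` preserves the order of its arguments, so `results` lines up with the requested page numbers even though the requests complete concurrently.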

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 12 comments.

  • domains/world.openfoodfacts.org/another_implementation/state.json: Initial state file with the last update timestamp for tracking parsing progress
  • domains/world.openfoodfacts.org/another_implementation/openfoodfacts_parser.py: Main parser implementation with async API fetching, state management, and JSON output
  • .gitignore: Adds PyCharm IDE files and the large products.json output to gitignore


"barcode": p.get("code"),
"name": p.get("product_name"),
"image_links": [p.get("image_url")],
"updated_at": datetime.utcfromtimestamp(p["last_updated_t"]).isoformat()
Copilot AI, Dec 19, 2025

The use of datetime.utcfromtimestamp() is deprecated in Python 3.12+. Consider using datetime.fromtimestamp(timestamp, tz=timezone.utc) instead to follow current best practices and avoid deprecation warnings.
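A drop-in, timezone-aware replacement might look like the following (assuming `last_updated_t` holds a Unix timestamp, as the snippet above suggests; the `iso_from_unix` name is illustrative):

```python
from datetime import datetime, timezone

def iso_from_unix(ts):
    # Timezone-aware equivalent of the deprecated datetime.utcfromtimestamp().
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

print(iso_from_unix(0))  # 1970-01-01T00:00:00+00:00
```

Note that the timezone-aware variant appends a "+00:00" offset to the ISO string, so downstream consumers of `updated_at` would see a slightly different format than before.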

Comment on lines +38 to +47
```python
with open(OUTPUT_FILE, "w") as f:
    f.write("[\n")


def append_product(p):
    with open(OUTPUT_FILE, "a", encoding="utf-8") as f:
        json.dump(p, f, ensure_ascii=False)
        f.write(",\n")
```

Copilot AI, Dec 19, 2025

The output file will have a trailing comma after the last product, which makes it invalid JSON. The file needs proper JSON array formatting with a closing bracket. Consider using a different approach such as writing complete JSON at once, or tracking whether to add a comma before each entry.

Suggested change:

```python
# before
with open(OUTPUT_FILE, "w") as f:
    f.write("[\n")


def append_product(p):
    with open(OUTPUT_FILE, "a", encoding="utf-8") as f:
        json.dump(p, f, ensure_ascii=False)
        f.write(",\n")
```

```python
# after
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    # Initialize as a valid empty JSON array
    json.dump([], f, ensure_ascii=False)


def append_product(p):
    # Ensure the output file exists and is initialized as a JSON array
    ensure_output_file()
    # Read the existing array, append the new product, and rewrite the file
    with open(OUTPUT_FILE, "r+", encoding="utf-8") as f:
        try:
            data = json.load(f)
        except json.JSONDecodeError:
            data = []
        if not isinstance(data, list):
            data = []
        data.append(p)
        f.seek(0)
        json.dump(data, f, ensure_ascii=False)
        f.truncate()
```

Comment on lines +80 to +83
```python
return {
    "barcode": p.get("code"),
    "name": p.get("product_name"),
    "image_links": [p.get("image_url")],
```

Copilot AI, Dec 19, 2025

The image_url is wrapped in a list but could be None if the product has no image, which would produce [None] in the output. Consider checking whether the value exists before wrapping it in a list, or filtering out None values.

Suggested change:

```python
# before
return {
    "barcode": p.get("code"),
    "name": p.get("product_name"),
    "image_links": [p.get("image_url")],
```

```python
# after
image_url = p.get("image_url")
return {
    "barcode": p.get("code"),
    "name": p.get("product_name"),
    "image_links": [image_url] if image_url else [],
```

Comment on lines +133 to +134
```python
if "last_updated_t" not in p:
    continue
```

Copilot AI, Dec 19, 2025

If a product is missing the 'last_updated_t' field, it's silently skipped but not counted. This could lead to confusion about the actual number of products processed versus downloaded. Consider logging when products are skipped.
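A minimal way to surface those skips, sketched here with hypothetical names rather than the PR's actual code:

```python
def partition_products(products):
    # Separate usable products from those missing "last_updated_t",
    # reporting the skip count instead of dropping them silently.
    usable, skipped = [], 0
    for p in products:
        if "last_updated_t" not in p:
            skipped += 1
            continue
        usable.append(p)
    if skipped:
        print(f"[WARN] skipped {skipped} products missing 'last_updated_t'")
    return usable, skipped

usable, skipped = partition_products([{"last_updated_t": 1}, {"code": "x"}])
```

Keeping an explicit skip count also lets the final summary reconcile "products downloaded" against "products written".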

```python
state["last_updated_t"] = max_ts
save_state(state)

print("Parsing finished. Don't forget to close array with ]")
```

Copilot AI, Dec 19, 2025

The error message instructs to manually close the JSON array, which is error-prone. This is related to the invalid JSON generation issue mentioned earlier. The script should properly close the JSON array automatically.
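One way to close the array automatically, assuming the `[\n`-then-`,\n` layout that `append_product()` writes in this PR (the `finalize_output` name is illustrative, not from the PR):

```python
import json
import os

OUTPUT_FILE = "products.json"  # assumed filename from the PR

def finalize_output():
    # Replace the trailing ",\n" left by append_product() with "\n]",
    # turning the file into a valid JSON array (sketch, not the PR's code).
    with open(OUTPUT_FILE, "rb+") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() > 2:  # more than just the opening "[\n"
            f.seek(-2, os.SEEK_END)
            f.truncate()
        f.write(b"\n]")

# demo: build a tiny file the way the parser does, then finalize it
with open(OUTPUT_FILE, "w") as f:
    f.write("[\n")
with open(OUTPUT_FILE, "a", encoding="utf-8") as f:
    json.dump({"barcode": "123"}, f)
    f.write(",\n")
finalize_output()
with open(OUTPUT_FILE, encoding="utf-8") as f:
    result = json.load(f)  # parses cleanly now
```

Calling such a finalizer just before the closing print would remove the need for the manual instruction.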

```python
try:
    r = await client.get(API_URL, params=params)
    if r.status_code == 200:
        return r.json()
```

Copilot AI, Dec 19, 2025

The function returns None for non-200 status codes without logging the error or status code, making it difficult to diagnose API issues. Consider logging the status code and response body to help with debugging.

Suggested change:

```python
# before
        return r.json()
```

```python
# after
        return r.json()
    print(f"[ERROR] Non-200 response for page {page}: status={r.status_code}, body={r.text}")
```

Comment on lines +68 to +69
```python
except (httpx.ReadTimeout, httpx.ConnectTimeout):
    print(f"[TIMEOUT] page {page}, attempt {attempt+1}/{retries}")
```

Copilot AI, Dec 19, 2025

The retry logic only handles timeout exceptions (ReadTimeout, ConnectTimeout) but not other common HTTP errors like ConnectionError, HTTPStatusError, or network failures. Consider catching a broader set of exceptions or using httpx.HTTPError as a base exception.

Suggested change:

```python
# before
except (httpx.ReadTimeout, httpx.ConnectTimeout):
    print(f"[TIMEOUT] page {page}, attempt {attempt+1}/{retries}")
```

```python
# after
except httpx.HTTPError as exc:
    print(f"[HTTP ERROR] page {page}, attempt {attempt+1}/{retries}: {exc}")
```

```python
empty_pages += 1
continue

products = data["products"]
```

Copilot AI, Dec 19, 2025

No validation is performed on the response JSON structure before accessing 'products'. If the API returns malformed JSON or the structure changes, this could cause a KeyError. Consider adding validation or using .get() with a default value.
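A defensive accessor along these lines would avoid the KeyError (the `get_products` name is illustrative):

```python
def get_products(data):
    # Tolerate a missing or malformed "products" key in the API response.
    products = data.get("products") if isinstance(data, dict) else None
    return products if isinstance(products, list) else []
```

This returns an empty list for None responses, non-dict payloads, and responses where "products" is absent or not a list, so the page is simply treated as empty.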

Comment on lines +148 to +149
```python
for pr in all_products:
    append_product(pr)
```

Copilot AI, Dec 19, 2025

File I/O operations are performed for each product individually within a loop (lines 148-149), which is inefficient. Consider batching the writes or accumulating products in memory and writing them in larger chunks to reduce I/O overhead.
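A batched variant could serialize everything first and open the file once per batch; this sketch keeps the PR's comma-terminated line format, and the function name is illustrative:

```python
import json

def append_products_batch(path, products):
    # One open() and one write() per batch instead of per product.
    lines = "".join(json.dumps(p, ensure_ascii=False) + ",\n" for p in products)
    with open(path, "a", encoding="utf-8") as f:
        f.write(lines)

append_products_batch("batch_demo.json", [{"a": 1}, {"b": 2}])
```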

Comment on lines +140 to +141
```python
all_products.append(extract_product(p))
```

Copilot AI, Dec 19, 2025

If extract_product() raises an exception (e.g., KeyError on p["last_updated_t"]), it will crash the entire script without proper error handling. Consider wrapping the extraction in a try-except block to handle malformed product data gracefully.

Suggested change:

```python
# before
all_products.append(extract_product(p))
```

```python
# after
try:
    pr = extract_product(p)
except Exception as e:
    print(f"Skipping malformed product due to error: {e}")
    continue
all_products.append(pr)
```
