Conversation
Pull request overview
This PR introduces a Python-based parser for the OpenFoodFacts API that incrementally fetches and stores product data. The implementation uses async/concurrent requests to efficiently download products and maintains state to track progress across runs.
Key changes:
- Adds an asynchronous Python parser with concurrent API requests and state persistence
- Implements incremental data fetching based on last update timestamps
- Includes .gitignore updates for IDE files and large output files
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| domains/world.openfoodfacts.org/another_implementation/state.json | Initial state file with last update timestamp for tracking parsing progress |
| domains/world.openfoodfacts.org/another_implementation/openfoodfacts_parser.py | Main parser implementation with async API fetching, state management, and JSON output |
| .gitignore | Adds PyCharm IDE files and large products.json output to gitignore |
```python
    "barcode": p.get("code"),
    "name": p.get("product_name"),
    "image_links": [p.get("image_url")],
    "updated_at": datetime.utcfromtimestamp(p["last_updated_t"]).isoformat()
```
The use of datetime.utcfromtimestamp() is deprecated in Python 3.12+. Consider using datetime.fromtimestamp(timestamp, tz=timezone.utc) instead to follow current best practices and avoid deprecation warnings.
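For illustration, the timezone-aware replacement looks like this (the timestamp value is illustrative; in the parser it would come from `p["last_updated_t"]`):

```python
from datetime import datetime, timezone

ts = 1700000000  # example Unix timestamp, standing in for p["last_updated_t"]

# datetime.utcfromtimestamp(ts) is deprecated since Python 3.12 and returns a naive datetime.
# The timezone-aware equivalent:
updated_at = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
print(updated_at)  # 2023-11-14T22:13:20+00:00
```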
```python
with open(OUTPUT_FILE, "w") as f:
    f.write("[\n")

def append_product(p):
    with open(OUTPUT_FILE, "a", encoding="utf-8") as f:
        json.dump(p, f, ensure_ascii=False)
        f.write(",\n")
```
The output file will have a trailing comma after the last product, which makes it invalid JSON. The file needs proper JSON array formatting with a closing bracket. Consider using a different approach such as writing complete JSON at once, or tracking whether to add a comma before each entry.
Suggested change (the replacement needs `import os` at the top of the module for the existence check):

```diff
-with open(OUTPUT_FILE, "w") as f:
-    f.write("[\n")
-
-def append_product(p):
-    with open(OUTPUT_FILE, "a", encoding="utf-8") as f:
-        json.dump(p, f, ensure_ascii=False)
-        f.write(",\n")
+def ensure_output_file():
+    # Initialize the output file as a valid empty JSON array if it does not exist yet
+    if not os.path.exists(OUTPUT_FILE):
+        with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
+            json.dump([], f, ensure_ascii=False)
+
+def append_product(p):
+    # Ensure the output file exists and is initialized as a JSON array
+    ensure_output_file()
+    # Read the existing array, append the new product, and rewrite the file
+    with open(OUTPUT_FILE, "r+", encoding="utf-8") as f:
+        try:
+            data = json.load(f)
+        except json.JSONDecodeError:
+            data = []
+        if not isinstance(data, list):
+            data = []
+        data.append(p)
+        f.seek(0)
+        json.dump(data, f, ensure_ascii=False)
+        f.truncate()
```
```python
return {
    "barcode": p.get("code"),
    "name": p.get("product_name"),
    "image_links": [p.get("image_url")],
```
The image_url is wrapped in a list but can be None if the product has no image, which would put [None] in the output. Consider filtering out None values, or checking that the value exists before wrapping it in a list.
Suggested change:

```diff
-return {
-    "barcode": p.get("code"),
-    "name": p.get("product_name"),
-    "image_links": [p.get("image_url")],
+image_url = p.get("image_url")
+return {
+    "barcode": p.get("code"),
+    "name": p.get("product_name"),
+    "image_links": [image_url] if image_url else [],
```
```python
if "last_updated_t" not in p:
    continue
```
If a product is missing the 'last_updated_t' field, it's silently skipped but not counted. This could lead to confusion about the actual number of products processed versus downloaded. Consider logging when products are skipped.
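A sketch of one way to surface the skips — the helper name and counters are illustrative, not from the PR:

```python
def split_by_timestamp(products):
    """Separate products that carry last_updated_t from ones that do not, logging the skips."""
    usable, skipped = [], []
    for p in products:
        if "last_updated_t" not in p:
            skipped.append(p)
            print(f"[SKIP] product {p.get('code', '<no code>')} has no last_updated_t")
        else:
            usable.append(p)
    if skipped:
        print(f"[SKIP] {len(skipped)} product(s) missing last_updated_t in this batch")
    return usable, skipped
```

The returned `skipped` list also makes it easy to report a final processed-versus-skipped count at the end of the run.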
```python
state["last_updated_t"] = max_ts
save_state(state)

print("Parsing finished. Don't forget to close array with ]")
```
The final message instructs the user to close the JSON array manually, which is error-prone. This is related to the invalid JSON generation issue mentioned earlier; the script should close the JSON array automatically.
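One possible shape for a writer that keeps the file valid without any manual step — a minimal sketch, not the PR's implementation (the class name is illustrative):

```python
import json

class JsonArrayWriter:
    """Stream objects into a JSON array, writing commas between entries and closing the bracket."""

    def __init__(self, path):
        self._f = open(path, "w", encoding="utf-8")
        self._f.write("[\n")
        self._first = True

    def append(self, obj):
        # Write the separator before each entry except the first, so there is no trailing comma
        if not self._first:
            self._f.write(",\n")
        json.dump(obj, self._f, ensure_ascii=False)
        self._first = False

    def close(self):
        self._f.write("\n]\n")
        self._f.close()
```

Calling `close()` at the end of the run replaces the manual "close array with ]" step.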
```python
try:
    r = await client.get(API_URL, params=params)
    if r.status_code == 200:
        return r.json()
```
The function returns None for non-200 status codes without logging the error or status code, making it difficult to diagnose API issues. Consider logging the status code and response body to help with debugging.
Suggested change:

```diff
     if r.status_code == 200:
         return r.json()
+    print(f"[ERROR] Non-200 response for page {page}: status={r.status_code}, body={r.text}")
```
```python
except (httpx.ReadTimeout, httpx.ConnectTimeout):
    print(f"[TIMEOUT] page {page}, attempt {attempt+1}/{retries}")
```
The retry logic only handles timeout exceptions (ReadTimeout, ConnectTimeout) but not other common HTTP errors like ConnectionError, HTTPStatusError, or network failures. Consider catching a broader set of exceptions or using httpx.HTTPError as a base exception.
Suggested change:

```diff
-except (httpx.ReadTimeout, httpx.ConnectTimeout):
-    print(f"[TIMEOUT] page {page}, attempt {attempt+1}/{retries}")
+except httpx.HTTPError as exc:
+    print(f"[HTTP ERROR] page {page}, attempt {attempt+1}/{retries}: {exc}")
```
```python
    empty_pages += 1
    continue

products = data["products"]
```
No validation is performed on the response JSON structure before accessing 'products'. If the API returns malformed JSON or the structure changes, this could cause a KeyError. Consider adding validation or using .get() with a default value.
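A defensive accessor could look like this (the function name and warning messages are illustrative):

```python
def get_products(data):
    """Return the product list from an API response, or [] when the structure is unexpected."""
    if not isinstance(data, dict):
        print(f"[WARN] unexpected response type: {type(data).__name__}")
        return []
    products = data.get("products")
    if not isinstance(products, list):
        print("[WARN] response has no 'products' list")
        return []
    return products
```

Returning an empty list lets the calling loop treat a malformed response the same as an empty page instead of crashing.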
```python
for pr in all_products:
    append_product(pr)
```
File I/O operations are performed for each product individually within a loop (lines 148-149), which is inefficient. Consider batching the writes or accumulating products in memory and writing them in larger chunks to reduce I/O overhead.
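A sketch of batched writing, assuming the per-product append is replaced wholesale (the function and parameter names are illustrative):

```python
import json

def write_products_batched(path, products, batch_size=1000):
    """Write all products as one JSON array, serializing batch_size entries per write call."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("[")
        for i in range(0, len(products), batch_size):
            # Serialize a whole batch into one string so each batch costs a single write
            chunk = ",".join(
                json.dumps(p, ensure_ascii=False) for p in products[i:i + batch_size]
            )
            if i > 0:
                f.write(",")
            f.write(chunk)
        f.write("]")
```

This also sidesteps the trailing-comma problem, since separators are only written between batches.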
```python
all_products.append(extract_product(p))
```
If extract_product() raises an exception (e.g., KeyError on p["last_updated_t"]), it will crash the entire script without proper error handling. Consider wrapping the extraction in a try-except block to handle malformed product data gracefully.
Suggested change:

```diff
-all_products.append(extract_product(p))
+try:
+    pr = extract_product(p)
+except Exception as e:
+    print(f"Skipping malformed product due to error: {e}")
+    continue
+all_products.append(pr)
```