Conversation
Pull request overview
This PR introduces a Python-based parser for the OpenFoodFacts API that incrementally fetches and stores product data. The implementation uses async/concurrent requests to efficiently download products and maintains state to track progress across runs.
Key changes:
- Adds an asynchronous Python parser with concurrent API requests and state persistence
- Implements incremental data fetching based on last update timestamps
- Includes .gitignore updates for IDE files and large output files
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| domains/world.openfoodfacts.org/another_implementation/state.json | Initial state file with last update timestamp for tracking parsing progress |
| domains/world.openfoodfacts.org/another_implementation/openfoodfacts_parser.py | Main parser implementation with async API fetching, state management, and JSON output |
| .gitignore | Adds PyCharm IDE files and large products.json output to gitignore |
```python
    "barcode": p.get("code"),
    "name": p.get("product_name"),
    "image_links": [p.get("image_url")],
    "updated_at": datetime.utcfromtimestamp(p["last_updated_t"]).isoformat()
```
The use of datetime.utcfromtimestamp() is deprecated in Python 3.12+. Consider using datetime.fromtimestamp(timestamp, tz=timezone.utc) instead to follow current best practices and avoid deprecation warnings.
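For illustration, the timezone-aware replacement looks like this (the timestamp value is illustrative; in the parser it would come from `p["last_updated_t"]`):

```python
from datetime import datetime, timezone

ts = 1700000000  # example Unix timestamp, standing in for p["last_updated_t"]

# datetime.utcfromtimestamp(ts) is deprecated since Python 3.12 and returns a naive datetime.
# The timezone-aware equivalent:
updated_at = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
print(updated_at)  # 2023-11-14T22:13:20+00:00
```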
```python
with open(OUTPUT_FILE, "w") as f:
    f.write("[\n")

def append_product(p):
    with open(OUTPUT_FILE, "a", encoding="utf-8") as f:
        json.dump(p, f, ensure_ascii=False)
        f.write(",\n")
```
The output file will have a trailing comma after the last product, which makes it invalid JSON. The file needs proper JSON array formatting with a closing bracket. Consider using a different approach such as writing complete JSON at once, or tracking whether to add a comma before each entry.
Suggested change (the replacement needs `import os` at the top of the module for the existence check):

```diff
-with open(OUTPUT_FILE, "w") as f:
-    f.write("[\n")
-
-def append_product(p):
-    with open(OUTPUT_FILE, "a", encoding="utf-8") as f:
-        json.dump(p, f, ensure_ascii=False)
-        f.write(",\n")
+def ensure_output_file():
+    # Initialize the output file as a valid empty JSON array if it does not exist yet
+    if not os.path.exists(OUTPUT_FILE):
+        with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
+            json.dump([], f, ensure_ascii=False)
+
+def append_product(p):
+    # Ensure the output file exists and is initialized as a JSON array
+    ensure_output_file()
+    # Read the existing array, append the new product, and rewrite the file
+    with open(OUTPUT_FILE, "r+", encoding="utf-8") as f:
+        try:
+            data = json.load(f)
+        except json.JSONDecodeError:
+            data = []
+        if not isinstance(data, list):
+            data = []
+        data.append(p)
+        f.seek(0)
+        json.dump(data, f, ensure_ascii=False)
+        f.truncate()
```
```python
return {
    "barcode": p.get("code"),
    "name": p.get("product_name"),
    "image_links": [p.get("image_url")],
```
The image_url is wrapped in a list but can be None if the product has no image, which would put [None] in the output. Consider filtering out None values, or checking that the value exists before wrapping it in a list.
Suggested change:

```diff
-return {
-    "barcode": p.get("code"),
-    "name": p.get("product_name"),
-    "image_links": [p.get("image_url")],
+image_url = p.get("image_url")
+return {
+    "barcode": p.get("code"),
+    "name": p.get("product_name"),
+    "image_links": [image_url] if image_url else [],
```
```python
if "last_updated_t" not in p:
    continue
```
If a product is missing the 'last_updated_t' field, it's silently skipped but not counted. This could lead to confusion about the actual number of products processed versus downloaded. Consider logging when products are skipped.
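A sketch of one way to surface the skips — the helper name and counters are illustrative, not from the PR:

```python
def split_by_timestamp(products):
    """Separate products that carry last_updated_t from ones that do not, logging the skips."""
    usable, skipped = [], []
    for p in products:
        if "last_updated_t" not in p:
            skipped.append(p)
            print(f"[SKIP] product {p.get('code', '<no code>')} has no last_updated_t")
        else:
            usable.append(p)
    if skipped:
        print(f"[SKIP] {len(skipped)} product(s) missing last_updated_t in this batch")
    return usable, skipped
```

The returned `skipped` list also makes it easy to report a final processed-versus-skipped count at the end of the run.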
```python
state["last_updated_t"] = max_ts
save_state(state)

print("Parsing finished. Don't forget to close array with ]")
```
The final message instructs the user to close the JSON array manually, which is error-prone. This is related to the invalid JSON generation issue mentioned earlier; the script should close the JSON array automatically.
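One possible shape for a writer that keeps the file valid without any manual step — a minimal sketch, not the PR's implementation (the class name is illustrative):

```python
import json

class JsonArrayWriter:
    """Stream objects into a JSON array, writing commas between entries and closing the bracket."""

    def __init__(self, path):
        self._f = open(path, "w", encoding="utf-8")
        self._f.write("[\n")
        self._first = True

    def append(self, obj):
        # Write the separator before each entry except the first, so there is no trailing comma
        if not self._first:
            self._f.write(",\n")
        json.dump(obj, self._f, ensure_ascii=False)
        self._first = False

    def close(self):
        self._f.write("\n]\n")
        self._f.close()
```

Calling `close()` at the end of the run replaces the manual "close array with ]" step.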
```python
try:
    r = await client.get(API_URL, params=params)
    if r.status_code == 200:
        return r.json()
```
The function returns None for non-200 status codes without logging the error or status code, making it difficult to diagnose API issues. Consider logging the status code and response body to help with debugging.
Suggested change:

```diff
     if r.status_code == 200:
         return r.json()
+    print(f"[ERROR] Non-200 response for page {page}: status={r.status_code}, body={r.text}")
```
```python
except (httpx.ReadTimeout, httpx.ConnectTimeout):
    print(f"[TIMEOUT] page {page}, attempt {attempt+1}/{retries}")
```
The retry logic only handles timeout exceptions (ReadTimeout, ConnectTimeout) but not other common HTTP errors like ConnectionError, HTTPStatusError, or network failures. Consider catching a broader set of exceptions or using httpx.HTTPError as a base exception.
Suggested change:

```diff
-except (httpx.ReadTimeout, httpx.ConnectTimeout):
-    print(f"[TIMEOUT] page {page}, attempt {attempt+1}/{retries}")
+except httpx.HTTPError as exc:
+    print(f"[HTTP ERROR] page {page}, attempt {attempt+1}/{retries}: {exc}")
```
```python
    empty_pages += 1
    continue

products = data["products"]
```
No validation is performed on the response JSON structure before accessing 'products'. If the API returns malformed JSON or the structure changes, this could cause a KeyError. Consider adding validation or using .get() with a default value.
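A defensive accessor could look like this (the function name and warning messages are illustrative):

```python
def get_products(data):
    """Return the product list from an API response, or [] when the structure is unexpected."""
    if not isinstance(data, dict):
        print(f"[WARN] unexpected response type: {type(data).__name__}")
        return []
    products = data.get("products")
    if not isinstance(products, list):
        print("[WARN] response has no 'products' list")
        return []
    return products
```

Returning an empty list lets the calling loop treat a malformed response the same as an empty page instead of crashing.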
```python
for pr in all_products:
    append_product(pr)
```
File I/O operations are performed for each product individually within a loop (lines 148-149), which is inefficient. Consider batching the writes or accumulating products in memory and writing them in larger chunks to reduce I/O overhead.
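A sketch of batched writing, assuming the per-product append is replaced wholesale (the function and parameter names are illustrative):

```python
import json

def write_products_batched(path, products, batch_size=1000):
    """Write all products as one JSON array, serializing batch_size entries per write call."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("[")
        for i in range(0, len(products), batch_size):
            # Serialize a whole batch into one string so each batch costs a single write
            chunk = ",".join(
                json.dumps(p, ensure_ascii=False) for p in products[i:i + batch_size]
            )
            if i > 0:
                f.write(",")
            f.write(chunk)
        f.write("]")
```

This also sidesteps the trailing-comma problem, since separators are only written between batches.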
```python
all_products.append(extract_product(p))
```
If extract_product() raises an exception (e.g., KeyError on p["last_updated_t"]), it will crash the entire script without proper error handling. Consider wrapping the extraction in a try-except block to handle malformed product data gracefully.
Suggested change:

```diff
-all_products.append(extract_product(p))
+try:
+    pr = extract_product(p)
+except Exception as e:
+    print(f"Skipping malformed product due to error: {e}")
+    continue
+all_products.append(pr)
```