feat: add csv feature to extract_tables by jdev01-del · Pull Request #79 · CambioML/any-parser

jdev01-del · 2025-01-09T17:45:43Z

User description

Description

extract_tables previously supported only the HTML return format. This commit adds support for a CSV return format as well.

To change the return format, find this line in extract_tables.ipynb:
file_path="./sample_data/test_1figure_1table.png", return_type="csv"

Set return_type to either "csv" or "html" as needed.

Input Table:

CSV output:
0,1,2
,latency,(ms)
participants,mean,99th percentile
1,17.0 +1.4,75.0 34.9
2,24.5 +2.5,87.6 35.9
5,31.5 +6.2,104.5 52.2
10,30.0 +3.7,95.6 25.4
25,35.5 +5.6,100.4 42.7
50,42.7 +4.1,93.7 22.9
100,71.4 +7.6,131.2 +17.6
200,150.5 +11.0,320.3 35.1
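
The CSV output above can be read back with Python's standard `csv` module. This is a minimal sketch, assuming the string returned for `return_type="csv"` has the shape shown above (abbreviated here). The first line, `0,1,2`, looks like pandas' default integer column labels rather than table content, so the sketch separates it from the table rows; the variable names are illustrative only.

```python
import csv
from io import StringIO

# Abbreviated copy of the CSV output shown above.
csv_output = """0,1,2
,latency,(ms)
participants,mean,99th percentile
1,17.0 +1.4,75.0 34.9
2,24.5 +2.5,87.6 35.9
"""

# Parse the text into a list of rows, each row a list of cell strings.
rows = list(csv.reader(StringIO(csv_output)))

# Assumption: the first row holds auto-generated column labels, not table data.
column_labels, table_rows = rows[0], rows[1:]

for row in table_rows:
    print(row)
```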

Related Issue

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code refactoring
Performance improvement

How Has This Been Tested?

Locally running extract_tables.ipynb

Screenshots (if applicable)

Checklist

My code follows the project's style guidelines
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Additional Notes

PR Type

Enhancement

Description

Added support for CSV output in extract_tables method.
Introduced a utility function flatten_to_string for nested list handling.
Updated example notebook to demonstrate CSV output functionality.
Improved code formatting and added error handling for missing dependencies.

Changes walkthrough 📝

Relevant files

Enhancement

any_parser.py (any_parser/any_parser.py, +44/-5): `Add CSV output functionality and utility methods`
Added `return_type` parameter to `extract_tables` method for CSV or HTML output.
Implemented `flatten_to_string` utility for handling nested lists.
Added logic to convert HTML tables to CSV using pandas.
Improved formatting and added error handling for missing pandas dependency.

extract_tables.ipynb (examples/extract_tables.ipynb, +54/-35): `Update example notebook for CSV output demonstration`
Updated notebook to demonstrate CSV output functionality.
Modified imports and added runtime warnings for deprecated pandas usage.
Adjusted example code to use `return_type="csv"` in `extract_tables`.
Enhanced output display logic for better readability.

github-actions · 2025-01-09T17:46:37Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue
The `flatten_to_string` method added in the PR may not handle edge cases effectively, such as deeply nested lists or non-stringifiable objects. This could lead to unexpected behavior or errors.

    @staticmethod
    def flatten_to_string(lst):
        result = []
        for item in lst:
            if isinstance(item, list):
                result.append(AnyParser.flatten_to_string(item))
            else:
                result.append(str(item))
        return "".join(result)

CSV Conversion Warning
The use of `pd.read_html` for converting HTML to CSV generates a `FutureWarning`. This indicates potential deprecation in future versions of pandas, which could break functionality.

    try:
        import pandas as pd
    except ImportError:
        raise ImportError(
            "Please install pandas to use CSV return_type"
        )

    df_list = pd.read_html(extracted_html)
    csv_list = []
    for df in df_list:
        csv_list.append(df.to_csv(index=False))
    csv_output = "\n\n".join(csv_list)
    return csv_output, time_elapsed

Example Code Clarity
The example notebook includes commented-out code and lacks clear documentation for the new `return_type` parameter. This could confuse users trying to understand or utilize the new feature.

       "cell_type": "code",
       "execution_count": 62,
       "metadata": {},
       "outputs": [],
       "source": [
        "from IPython.display import display, Markdown\n",
        "\n",
        "# from any_parser import AnyParser\n",
        "import sys\n",
        "import importlib\n",
        "\n",
        "\n",
        "sys.path.append(\"..\")\n",
        "import any_parser.any_parser\n",
        "\n",
        "importlib.reload(any_parser.any_parser)\n",
        "from any_parser.any_parser import AnyParser"
       ]
      },
      {
       "cell_type": "code",
       "execution_count": 63,
       "metadata": {},
       "outputs": [],
       "source": [
        "ap = AnyParser(api_key=\"key\")"
       ]
      },
      {
       "cell_type": "code",
       "execution_count": 64,
       "metadata": {},
       "outputs": [
        {
         "name": "stderr",
         "output_type": "stream",
         "text": [
          "/home/ubuntu/any-parser/examples/../any_parser/any_parser.py:232: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.\n",
          " \n"
         ]
        }
       ],
       "source": [
        "html_output, time_info = ap.extract_tables(\n",
        " file_path=\"./sample_data/test_1figure_1table.png\", return_type=\"csv\"\n",
        ")"
       ]
      },
      {
       "cell_type": "code",
       "execution_count": 65,
       "metadata": {},
       "outputs": [
        {
         "name": "stdout",
         "output_type": "stream",
         "text": [
          "CPU times: user 3 μs, sys: 0 ns, total: 3 μs\n",
          "Wall time: 5.01 μs\n"
         ]
        }
       ],
       "source": [
        "time"
       ]
      },
      {
       "cell_type": "code",
       "execution_count": 66,
       "metadata": {},
       "outputs": [
        {
         "data": {
          "text/markdown": [
           "0,1,2\n",
           ",latency,(ms)\n",
           "participants,mean,99th percentile\n",
           "1,17.0 +1.4,75.0 34.9\n",
           "2,24.5 +2.5,87.6 35.9\n",
           "5,31.5 +6.2,104.5 52.2\n",
           "10,30.0 +3.7,95.6 25.4\n",
           "25,35.5 +5.6,100.4 42.7\n",
           "50,42.7 +4.1,93.7 22.9\n",
           "100,71.4 +7.6,131.2 +17.6\n",
           "200,150.5 +11.0,320.3 35.1\n"
          ],
          "text/plain": [
           "<IPython.core.display.Markdown object>"
          ]
         },
         "metadata": {},
         "output_type": "display_data"
        }
       ],
       "source": [
        "if isinstance(html_output, list):\n",
        " html_output_str = \"\\n\".join(html_output)\n",
        "else:\n",
        " html_output_str = html_output\n",
        "\n",
        "display(Markdown(html_output_str))"
       ]
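
The missing-dependency check quoted above (a `try`/`except ImportError` around the pandas import) can be factored into a small helper. This is a sketch only: `require_optional` is a hypothetical name, not part of any_parser, and the standard-library `json` module stands in for pandas so the example runs without extra packages.

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import an optional dependency, or raise an ImportError that says what to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the underlying cause stays visible.
        raise ImportError(
            f"Please install {module_name} to use {feature}"
        ) from exc


# `json` is a stand-in for pandas here; it is always available.
json_module = require_optional("json", "CSV return_type")
print(json_module.__name__)
```

Chaining with `from exc` keeps the original traceback, which the bare `raise ImportError(...)` in the quoted code does implicitly but less explicitly.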

github-actions · 2025-01-09T17:46:50Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General

Suggestion: Replace direct HTML string usage with `StringIO` to ensure compatibility with future versions of pandas
Use `StringIO` when passing literal HTML strings to `pd.read_html` to avoid deprecation warnings and future compatibility issues.
any_parser/any_parser.py [232]

    -df_list = pd.read_html(extracted_html)
    +from io import StringIO
    +df_list = pd.read_html(StringIO(extracted_html))

Suggestion importance[1-10]: 10
Why: This suggestion resolves a deprecation warning and ensures future compatibility with pandas by using `StringIO` for literal HTML strings. It is a necessary change to maintain functionality in upcoming versions of pandas.

Suggestion: Add validation for the `return_type` parameter to prevent unexpected behavior from invalid inputs
Validate the `return_type` parameter in `extract_tables` to ensure it only accepts "html" or "csv" and raise a clear error for invalid values.
any_parser/any_parser.py [224]

    +if return_type.lower() not in ["html", "csv"]:
    +    raise ValueError("Invalid return_type. Expected 'html' or 'csv'.")
     if return_type.lower() == "csv":

Suggestion importance[1-10]: 8
Why: Adding validation for the `return_type` parameter improves code robustness by ensuring only valid inputs are processed. This prevents unexpected behavior and enhances error handling.

Suggestion: Add type checking for `html_output` in the notebook example to handle unexpected data types
Ensure the notebook example handles cases where `html_output` is not a valid list or string to prevent runtime errors.
examples/extract_tables.ipynb [110-113]

     if isinstance(html_output, list):
         html_output_str = "\n".join(html_output)
    +elif isinstance(html_output, str):
    +    html_output_str = html_output
     else:
    -    html_output_str = html_output
    +    raise TypeError("html_output must be a list or a string")

Suggestion importance[1-10]: 7
Why: The suggestion improves the notebook example by adding type checking for `html_output`, which prevents runtime errors when the data type is unexpected. This enhances the reliability of the example code.

Category: Possible issue

Suggestion: Add handling for circular references in nested lists to prevent infinite recursion
Ensure that the `flatten_to_string` method handles circular references in nested lists to prevent infinite recursion.
any_parser/any_parser.py [190-197]

    -def flatten_to_string(lst):
    +def flatten_to_string(lst, seen=None):
    +    if seen is None:
    +        seen = set()
    +    if id(lst) in seen:
    +        raise ValueError("Circular reference detected in list")
    +    seen.add(id(lst))
         result = []
         for item in lst:
             if isinstance(item, list):
    -            result.append(AnyParser.flatten_to_string(item))
    +            result.append(AnyParser.flatten_to_string(item, seen))
             else:
                 result.append(str(item))
         return "".join(result)

Suggestion importance[1-10]: 9
Why: The suggestion effectively addresses a potential issue of infinite recursion in the `flatten_to_string` method by adding handling for circular references. This is a critical improvement for robustness and prevents runtime errors in edge cases.
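
The `return_type` validation suggested above can be sketched as a standalone helper. `normalize_return_type` and `VALID_RETURN_TYPES` are hypothetical names introduced for illustration; the PR itself does not show whether validation was added or in what form.

```python
VALID_RETURN_TYPES = ("html", "csv")


def normalize_return_type(return_type: str) -> str:
    """Return the lower-cased return type, or raise if it is not supported."""
    if not isinstance(return_type, str):
        raise TypeError("return_type must be a string")
    normalized = return_type.lower()
    if normalized not in VALID_RETURN_TYPES:
        raise ValueError(
            f"Invalid return_type {return_type!r}. Expected 'html' or 'csv'."
        )
    return normalized


print(normalize_return_type("CSV"))
```

Normalizing once at the top of the method means later branches can compare against `"csv"` directly instead of repeating `.lower()`.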

lingjiekong · 2025-01-09T18:38:14Z

@jdev01-del Please fix the build issue due to black format.

Copilot

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

lingjiekong · 2025-01-09T18:47:54Z

any_parser/any_parser.py

+    @staticmethod
+    def flatten_to_string(lst):
+        result = []
+        for item in lst:
+            if isinstance(item, list):
+                result.append(AnyParser.flatten_to_string(item))
+            else:
+                result.append(str(item))
+        return "".join(result)


The flatten_to_string method has multiple critical flaws in handling nested lists:
It incorrectly flattens nested lists by converting them to string representations or appending list objects directly, which prevents true flattening.
The method fails to properly extend the result list with flattened items, causing type errors when attempting to join the result.
The implementation assumes all iterables are lists, which limits its flexibility with other iterable types like tuples or sets.
The method needs to be redesigned to recursively flatten all nested lists into a
single string, ensuring that each nested item is converted to a string and fully
expanded, while supporting various iterable types.

Also, why is this a staticmethod?

Addressed. I made it a static method because I think this function only needs to process its parameters, so a static method is simpler.
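
One way to support the other iterable types the review asks for is to treat any non-string iterable as a container. The sketch below is an assumption about how that could look, written as a module-level function; it is not the code from the follow-up commit, which this thread does not show.

```python
from collections.abc import Iterable


def flatten_to_string(value) -> str:
    """Recursively join nested iterables into one string.

    Strings and bytes are treated as single items, not as iterables of
    characters; any other non-iterable value is converted with str().
    """
    if isinstance(value, (str, bytes)) or not isinstance(value, Iterable):
        return str(value)
    return "".join(flatten_to_string(item) for item in value)


print(flatten_to_string(["a", ["b", ("c", 1)], 2]))
```

Sets are accepted too, but their iteration order is not guaranteed, so the output order for set members can vary between runs.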

lingjiekong · 2025-01-09T18:49:56Z

any_parser/any_parser.py

+            df_list = pd.read_html(extracted_html)
+            csv_list = []
+            for df in df_list:
+                csv_list.append(df.to_csv(index=False))
+            csv_output = "\n\n".join(csv_list)


The extract_tables method has a CSV conversion issue when handling multiple tables. When converting HTML tables to CSV, the method incorrectly joins multiple tables using "\n\n".join(csv_list), which breaks the CSV format by inserting unnecessary newlines. Additionally, the method does not properly handle cases where extracted_html is a list, potentially causing type conversion errors when using pd.read_html().
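
One way to address the multi-table concern is to return one CSV string per table instead of a single joined string, so each item stays a valid CSV document. The sketch below uses the standard `csv` module and plain lists of rows as a stand-in for the DataFrames that `pd.read_html` returns; `tables_to_csv_list` is a hypothetical helper, not part of any_parser.

```python
import csv
from io import StringIO


def tables_to_csv_list(tables):
    """Convert each table (a list of rows) into its own CSV string."""
    csv_list = []
    for rows in tables:
        buffer = StringIO()
        # lineterminator="\n" avoids the csv module's default "\r\n" endings.
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        csv_list.append(buffer.getvalue())
    return csv_list


tables = [
    [["a", "b"], ["1", "2"]],
    [["x"], ["3, 4"]],  # a cell containing a comma gets quoted
]
for csv_text in tables_to_csv_list(tables):
    print(csv_text)
```

Callers that still want a single string can join the list themselves, which makes the separator an explicit choice rather than a fixed part of the return value.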

lingjiekong

Make sure you add both HTML and CSV examples to the notebook.

lingjiekong

Make sure all your GitHub Actions checks are passing before you request a review, to save reviewer time.

lingjiekong · 2025-01-13T22:52:04Z

@jdev01-del Make sure you reply to all my comments.

lingjiekong

LGTM

lingjiekong · 2025-01-13T22:51:32Z

examples/extract_tables.ipynb

   "outputs": [],
   "source": [
-    "ap = AnyParser(api_key=\"...\")"
+    "ap = AnyParser(api_key=\"key\")"


nit: let's not change this.

Ubuntu added 2 commits January 9, 2025 02:09

adding csv to extract tables

1ce5b61

add csv feature

891ba1c

jdev01-del requested review from Sdddell, goldmermaid and lingjiekong as code owners January 9, 2025 17:45

github-actions bot added the Review effort [1-5]: 4 label Jan 9, 2025

lingjiekong requested a review from Copilot January 9, 2025 18:38

Copilot AI reviewed Jan 9, 2025

lingjiekong reviewed Jan 9, 2025

lingjiekong suggested changes Jan 9, 2025

addressed comments

3972e7a

lingjiekong approved these changes Jan 14, 2025

lingjiekong merged commit fc996f0 into main Jan 14, 2025
1 check passed
