feat: add csv feature to extract_tables#79
Conversation
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
|
@jdev01-del Please fix the build issue due to black format. |
any_parser/any_parser.py
Outdated
| @staticmethod | ||
| def flatten_to_string(lst): | ||
| result = [] | ||
| for item in lst: | ||
| if isinstance(item, list): | ||
| result.append(AnyParser.flatten_to_string(item)) | ||
| else: | ||
| result.append(str(item)) | ||
| return "".join(result) |
There was a problem hiding this comment.
The flatten_to_string method has multiple critical flaws in handling nested lists:
It incorrectly flattens nested lists by converting them to string representations or appending list objects directly, which prevents true flattening.
The method fails to properly extend the result list with flattened items, causing type errors when attempting to join the result.
The implementation assumes all iterables are lists, which limits its flexibility with other iterable types like tuples or sets.
The method needs to be redesigned to recursively flatten all nested lists into a
single string, ensuring that each nested item is converted to a string and fully
expanded, while supporting various iterable types.
Also, why this is a staticmethod?
There was a problem hiding this comment.
Addressed. I used static method because I think this function will only need to support parsing the parameters, so it's easier if I make it a static method.
any_parser/any_parser.py
Outdated
| df_list = pd.read_html(extracted_html) | ||
| csv_list = [] | ||
| for df in df_list: | ||
| csv_list.append(df.to_csv(index=False)) | ||
| csv_output = "\n\n".join(csv_list) |
There was a problem hiding this comment.
The extract_tables method has a CSV conversion issue when handling multiple tables. When converting HTML tables to CSV, the method incorrectly joins multiple tables using "\n\n".join(csv_list), which breaks the CSV format by inserting unnecessary newlines. Additionally, the method does not properly handle cases where extracted_html is a list, potentially causing type conversion errors when using pd.read_html().
lingjiekong
left a comment
There was a problem hiding this comment.
Make sure you add both html and csv example in the notebook
lingjiekong
left a comment
There was a problem hiding this comment.
Make sure all your github actions are passing before you request for review to save reviwer time.
|
@jdev01-del Make sure you reply to all my comments. |
examples/extract_tables.ipynb
Outdated
| "outputs": [], | ||
| "source": [ | ||
| "ap = AnyParser(api_key=\"...\")" | ||
| "ap = AnyParser(api_key=\"key\")" |
There was a problem hiding this comment.
nit: let's not change this.
User description
Description
extract_tablesused to only supporthtmlreturn format. This commit makes it also supportcsvreturn format.To change return format, find this line in
extract_tables.ipynb:file_path="./sample_data/test_1figure_1table.png", return_type="csv"change
return_typeto either csv or html based on needs.Input Table:

CSV output:
0,1,2
,latency,(ms)
participants,mean,99th percentile
1,17.0 +1.4,75.0 34.9
2,24.5 +2.5,87.6 35.9
5,31.5 +6.2,104.5 52.2
10,30.0 +3.7,95.6 25.4
25,35.5 +5.6,100.4 42.7
50,42.7 +4.1,93.7 22.9
100,71.4 +7.6,131.2 +17.6
200,150.5 +11.0,320.3 35.1
Related Issue
Type of Change
How Has This Been Tested?
Locally running extract_tables.ipynb
Screenshots (if applicable)
Checklist
Additional Notes
PR Type
Enhancement
Description
Added support for CSV output in
extract_tablesmethod.Introduced a utility function
flatten_to_stringfor nested list handling.Updated example notebook to demonstrate CSV output functionality.
Improved code formatting and added error handling for missing dependencies.
Changes walkthrough 📝
any_parser.py
Add CSV output functionality and utility methodsany_parser/any_parser.py
return_typeparameter toextract_tablesmethod for CSV or HTMLoutput.
flatten_to_stringutility for handling nested lists.dependency.
extract_tables.ipynb
Update example notebook for CSV output demonstrationexamples/extract_tables.ipynb
usage.
return_type="csv"inextract_tables.