Open
Description
Probably related to: #2 (comment)
~/downloads/MHTMLExtractor $ python MHTMLExtractor.py ../a.eml --output_dir test_extract
[04:48:29][MHTMLExtractor.py:345][ERROR]: Error during extraction: 'utf-8' codec can't decode byte 0xff in position 776: invalid start byte
~/downloads/MHTMLExtractor $
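For context, the error message itself can be reproduced without the attached file: the byte 0xff can never start a UTF-8 sequence, so any strict UTF-8 decode of data containing it fails in exactly this way. The sketch below uses made-up sample bytes (not taken from a.eml) and only assumes that the extractor decodes the file, or one of its parts, as UTF-8 text somewhere along the way.

```python
# Hypothetical sample bytes for illustration; not taken from the attached file.
data = b"Content-Type: text/html\r\n\r\n\xff\xfe<html>"

try:
    data.decode("utf-8")  # strict decoding, the default error mode
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)  # same kind of message as in the log above

# latin-1 maps every possible byte to a code point, so this never raises,
# although the resulting text may be wrong if the real charset is different.
text = data.decode("latin-1")
```

Reading the file in binary mode and decoding each MIME part separately avoids the failure at file-open time, which is the approach the reduced script further down takes.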
The offending file is attached: it is the result of a Droid Kiwi "save to MHTML" (aka .eml; renamed for the attachment, as GitHub does not accept .eml files).
Grok AI reduced the code to a minimal version that more or less works:
~/downloads/Test $ cat simple_mhtml.py
import email
import sys
import os

def extract_html_from_mhtml(file_path, output_dir):
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    with open(file_path, "rb") as f:  # Binary mode
        msg = email.message_from_bytes(f.read())
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            encoding = part.get("Content-Transfer-Encoding", "").lower()
            payload = part.get_payload(decode=True)  # Decode base64 or quoted-printable
            if encoding == "binary" or not encoding:
                try:
                    html_content = payload.decode("utf-8", errors="ignore")
                except UnicodeDecodeError:
                    html_content = payload.decode("latin-1", errors="ignore")
            else:
                html_content = payload.decode("utf-8", errors="ignore")
            output_path = os.path.join(output_dir, "extracted.html")
            with open(output_path, "w", encoding="utf-8") as out:
                out.write(html_content)
            print(f"HTML extracted to {output_path}")
            return
    print("No HTML part found")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python script.py <mhtml_file> <output_dir>")
        sys.exit(1)
    extract_html_from_mhtml(sys.argv[1], sys.argv[2])
~/downloads/Test $
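One caveat about the script above: with errors="ignore", decode() never raises UnicodeDecodeError, so the latin-1 branch is unreachable and undecodable bytes are silently dropped. A possible alternative is sketched below. It is only a sketch, not the project's code: the helper name decode_html_part is hypothetical, and it assumes that trying the charset declared on the part, then UTF-8, then latin-1 is an acceptable order.

```python
import email


def decode_html_part(part):
    """Decode one MIME part to str, trying strict decoders before a lossless fallback.

    Hypothetical helper for illustration; not part of MHTMLExtractor.
    """
    # get_payload(decode=True) undoes base64 / quoted-printable and returns bytes
    payload = part.get_payload(decode=True) or b""

    # Declared charset first (may be None), then UTF-8
    for charset in (part.get_content_charset(), "utf-8"):
        if not charset:
            continue
        try:
            return payload.decode(charset)  # strict: raises on invalid bytes
        except (UnicodeDecodeError, LookupError):
            # LookupError covers charset names Python does not know
            continue

    # latin-1 accepts every byte, so this last step cannot fail
    return payload.decode("latin-1")
```

In the script above, this would replace the if/else that computes html_content, for example html_content = decode_html_part(part).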