'utf-8' codec can't decode byte 0xff in position 776: invalid start byte #8

@Manamama

Probably related to: #2 (comment)

~/downloads/MHTMLExtractor $ python MHTMLExtractor.py ../a.eml --output_dir test_extract
[04:48:29][MHTMLExtractor.py:345][ERROR]: Error during extraction: 'utf-8' codec can't decode byte 0xff in position 776: invalid start byte
~/downloads/MHTMLExtractor $
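
For context, this is the message Python produces when bytes containing 0xff are decoded as strict UTF-8, since 0xff can never start a valid UTF-8 sequence. A minimal reproduction follows, assuming (this is not confirmed from the MHTMLExtractor source) that the file contents end up being decoded as UTF-8 somewhere along the way. The sample bytes and the `try_decode` helper are made up for illustration.

```python
# Hypothetical sample data: 776 ASCII bytes followed by 0xff,
# chosen only to mirror the position reported in the log above.
sample = b"a" * 776 + b"\xff\xfe"


def try_decode(data: bytes) -> str:
    """Return the decoded text, or the error message if strict UTF-8 decoding fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)


message = try_decode(sample)
print(message)
```

Reading the file in binary mode and decoding each MIME part separately avoids this failure at the file level.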

The offending file is attached: it is a page saved as MHTML from Kiwi on Android (i.e. an .eml file, renamed to .txt because GitHub does not accept .eml attachments).

Grok AI reduced the code to a minimal version, which more or less works:

~/downloads/Test $ cat simple_mhtml.py
import email
import sys
import os

def extract_html_from_mhtml(file_path, output_dir):
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)

    with open(file_path, "rb") as f:  # Binary mode
        msg = email.message_from_bytes(f.read())

    for part in msg.walk():
        if part.get_content_type() == "text/html":
            encoding = part.get("Content-Transfer-Encoding", "").lower()
            payload = part.get_payload(decode=True)  # Decode base64 or quoted-printable

            if encoding == "binary" or not encoding:
                try:
                    html_content = payload.decode("utf-8", errors="ignore")
                except UnicodeDecodeError:
                    html_content = payload.decode("latin-1", errors="ignore")
            else:
                html_content = payload.decode("utf-8", errors="ignore")

            output_path = os.path.join(output_dir, "extracted.html")
            with open(output_path, "w", encoding="utf-8") as out:
                out.write(html_content)
            print(f"HTML extracted to {output_path}")
            return

    print("No HTML part found")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python script.py <mhtml_file> <output_dir>")
        sys.exit(1)
    extract_html_from_mhtml(sys.argv[1], sys.argv[2])
~/downloads/Test $
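
One caveat about the script above: with `errors="ignore"`, `decode` never raises `UnicodeDecodeError`, so the latin-1 branch is unreachable and undecodable bytes are silently dropped. A possible variant is sketched below. It tries the charset declared on the part, then strict UTF-8, then latin-1, which accepts any byte. The helper name `decode_html_part` and the sample message are hypothetical and not part of MHTMLExtractor.

```python
import email
from email.message import Message


def decode_html_part(part: Message) -> str:
    """Decode a MIME part to text: declared charset, then UTF-8, then latin-1."""
    # get_payload(decode=True) undoes base64 / quoted-printable and returns bytes.
    payload = part.get_payload(decode=True) or b""
    declared = part.get_content_charset()  # None when no charset is declared
    for charset in (declared, "utf-8"):
        if not charset:
            continue
        try:
            return payload.decode(charset)  # strict, so failures are detected
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next candidate
    # latin-1 maps every byte to a code point, so this cannot fail.
    return payload.decode("latin-1")


# Made-up sample message: an HTML part with no charset and a raw 0xff byte.
raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"<p>\xff</p>\r\n"
)
text = decode_html_part(email.message_from_bytes(raw))
print(text)
```

The latin-1 fallback is only a guess at the real encoding, but it preserves every byte instead of discarding it.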

a.eml.renamed.txt
