'utf-8' codec can't decode byte 0xff in position 776: invalid start byte

Probably related to: https://github.com/AScriver/MHTMLExtractor/issues/2#issue-2159499664

```
~/downloads/MHTMLExtractor $ python MHTMLExtractor.py ../a.eml --output_dir test_extract
[04:48:29][MHTMLExtractor.py:345][ERROR]: Error during extraction: 'utf-8' codec can't decode byte 0xff in position 776: invalid start byte
~/downloads/MHTMLExtractor $
```
The offending file is attached: it is the result of saving a page to MHTML in Kiwi on Android (aka .eml; renamed because GitHub does not accept .eml attachments).
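
For reference, the exact message can be reproduced without the attachment. The sketch below uses synthetic bytes (an assumption, not the contents of `a.eml`) to show that a strict UTF-8 decode fails on `0xff`, a byte that never occurs in valid UTF-8. It does not show where MHTMLExtractor performs the decode; that has not been traced here.

```python
# Synthetic data, NOT the attached file: 776 ASCII bytes followed by 0xff.
data = b"a" * 776 + b"\xff"

try:
    data.decode("utf-8")  # strict error handling is the default
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# A lenient decode does not raise; the bad byte becomes U+FFFD instead.
lenient = data.decode("utf-8", errors="replace")
print(repr(lenient[-1]))
```

A `0xff` at that offset usually means the bytes are either binary data (a JPEG starts with `ff d8`) or text in another encoding (a UTF-16 LE byte order mark is `ff fe`). Which of these applies to the attached file has not been checked.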

Grok AI reduced the code to a minimal version, which kind of works:

```
~/downloads/Test $ cat simple_mhtml.py
import email
import sys
import os

def extract_html_from_mhtml(file_path, output_dir):
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)

    with open(file_path, "rb") as f:  # Binary mode
        msg = email.message_from_bytes(f.read())

    for part in msg.walk():
        if part.get_content_type() == "text/html":
            encoding = part.get("Content-Transfer-Encoding", "").lower()
            payload = part.get_payload(decode=True)  # Decode base64 or quoted-printable
            if encoding == "binary" or not encoding:
                try:
                    html_content = payload.decode("utf-8", errors="ignore")
                except UnicodeDecodeError:
                    html_content = payload.decode("latin-1", errors="ignore")
            else:
                html_content = payload.decode("utf-8", errors="ignore")

            output_path = os.path.join(output_dir, "extracted.html")
            with open(output_path, "w", encoding="utf-8") as out:
                out.write(html_content)
            print(f"HTML extracted to {output_path}")
            return
    print("No HTML part found")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python script.py <mhtml_file> <output_dir>")
        sys.exit(1)
    extract_html_from_mhtml(sys.argv[1], sys.argv[2])
~/downloads/Test $
```
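
One caveat about the script as posted: with `errors="ignore"`, the UTF-8 decode never raises, so the `latin-1` branch never runs and undecodable bytes are silently dropped. A possible variant, sketched below, tries the charset declared on the part first and then a list of fallbacks. The function name `decode_html_part` and the fallback order are assumptions for illustration, not part of MHTMLExtractor.

```python
import email


def decode_html_part(part, fallbacks=("utf-8", "cp1252", "latin-1")):
    """Decode a MIME part to str, preferring the charset it declares.

    Hypothetical helper: the name and the fallback order are assumptions.
    """
    # decode=True undoes base64 / quoted-printable and returns bytes.
    payload = part.get_payload(decode=True) or b""
    declared = part.get_content_charset()  # None if no charset parameter
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return payload.decode(name)  # strict, so failures are visible
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next one
    # Only reached if every candidate failed (e.g. custom fallbacks).
    return payload.decode("utf-8", errors="replace")
```

In the posted script this would replace the `if encoding == "binary" ...` block with `html_content = decode_html_part(part)`. Since `latin-1` maps every byte to a character, the default fallback list always produces some output, though not necessarily the right characters.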

[a.eml.renamed.txt](https://github.com/user-attachments/files/20260964/a.eml.renamed.txt)

'utf-8' codec can't decode byte 0xff in position 776: invalid start byte #8