UTF encoding issue? #32

@sdmccabe

This fails:

netconv.read('/home/main/Downloads/imdb.graphml', 'graphml')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-7aff295fd165> in <module>
----> 1 netconv.read('/home/main/Downloads/imdb.graphml', 'graphml')

~/git/netconv/netconv/__init__.py in read(fname, fmt, *args, **kwargs)
     29 def read(fname, fmt, *args, **kwargs):
     30     with open(fname) as file:
---> 31         text = file.read()
     32     return decode(text, fmt, *args, **kwargs)
     33 

~/.pyenv/versions/miniconda3-latest/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 7734: invalid continuation byte

As does this:

with open("/home/main/Downloads/imdb.graphml", 'r', encoding='utf-8') as fin:
    text = fin.read()
    G = netconv.decoders.decode_graphml(text)

This works:

with open("/home/main/Downloads/imdb.graphml", 'r', encoding='latin-1') as fin:
    text = fin.read()
    G = netconv.decoders.decode_graphml(text)

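However, the imdb dataset contains a great many non-ASCII characters, so the fact that latin-1 decodes without error does not by itself show that latin-1 is the file's actual encoding.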
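
To illustrate why the utf-8 attempts raise while latin-1 does not, here is a minimal sketch using a made-up byte string (an assumption for illustration, not bytes taken from imdb.graphml). In UTF-8, 0xe8 starts a three-byte sequence, so it must be followed by continuation bytes. latin-1 maps every byte to a character, so it never raises, even on input that was really UTF-8.

```python
# Hypothetical sample: "crème" as latin-1 bytes; 0xe8 is "è" in latin-1.
raw = b"cr\xe8me"

# UTF-8 treats 0xe8 as the lead byte of a 3-byte sequence, but the next
# byte ("m") is not a continuation byte, so strict decoding raises.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

# latin-1 maps each byte to one code point, so decoding cannot fail.
as_latin1 = raw.decode("latin-1")

# The flip side: genuine UTF-8 bytes also decode "successfully" as latin-1,
# but each multi-byte character comes out as several wrong characters.
mojibake = "è".encode("utf-8").decode("latin-1")

print(utf8_error)
print(as_latin1)
print(repr(mojibake))
```

If the file mixes encodings, neither choice will be fully correct, which would be consistent with a data quality problem.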

I'm inclined to dismiss this as a data quality issue, but it might be good to be able to pass the choice of encoding to netconv.read().
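
A rough sketch of what that could look like, based on the read() shown in the traceback. The keyword-only `encoding` parameter and its "utf-8" default are assumptions, not existing netconv API, and `decode` here is a stand-in so the example runs without netconv installed.

```python
import os
import tempfile


def decode(text, fmt, *args, **kwargs):
    # Stand-in for netconv's decode(); it only echoes its inputs.
    return fmt, text


def read(fname, fmt, *args, encoding="utf-8", **kwargs):
    # Hypothetical variant of netconv.read(): `encoding` is forwarded to
    # open() instead of relying on the platform default.
    with open(fname, encoding=encoding) as file:
        text = file.read()
    return decode(text, fmt, *args, **kwargs)


# Demo with a throwaway file holding latin-1 bytes (made-up content).
with tempfile.NamedTemporaryFile(suffix=".graphml", delete=False) as tmp:
    tmp.write(b"cr\xe8me")
    path = tmp.name

try:
    fmt, text = read(path, "graphml", encoding="latin-1")
finally:
    os.remove(path)

print(fmt, text)
```

Making `encoding` keyword-only keeps it from colliding with the positional `*args` that are passed through to the decoder.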

Labels: question