bytearray.decode("utf-8") yield unicode error at \xa0 #11588
-
I'm attempting to convert file POST data (Content-Type: text/plain) that was received from socket.recv(1024). Here is a snippet of code I created to replicate the error:
The above code yields a UnicodeError. Am I using the wrong decoding method for incoming HTML web browser POST data? Also, I am unsure how to strip that byte out and re-run the .decode() method. When I convert to a string first, that creates a whole mess of issues. Here is the HTML text of the POST file:
Replies: 4 comments 7 replies
-
"The above code yields a UnicodeError:" And rightfully so. ›0xA0‹ is not a valid utf-8 character. |
-
Through some more digging, I believe I found something that might work:
-
Even if this works, it can (in my opinion) only be a workaround, as a lonely 0xA0 is still not valid UTF-8. To do it correctly, the sender should add a character set option and the receiver should implement appropriate error handling (e.g. rejecting all malformed messages).
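A rough sketch of what that receiver-side handling could look like, assuming the Content-Type header value has already been pulled out of the request as a string. The function names `charset_from_content_type` and `decode_body` are hypothetical, and falling back to UTF-8 when no charset is declared is an assumption, not something the thread specifies.

```python
def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type value, if present."""
    for part in header_value.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip('"').lower()
    return default

def decode_body(body, header_value):
    """Decode strictly; return None so the caller can reject a malformed message."""
    encoding = charset_from_content_type(header_value)
    try:
        return bytes(body).decode(encoding)
    except (UnicodeError, LookupError):
        # UnicodeError: the bytes do not match the declared charset.
        # LookupError: the declared charset is unknown to this interpreter.
        return None
```

A caller would then answer with an error response whenever `decode_body` returns None, rather than trying to patch up the payload.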
-
The text file was riddled with chr(160), which I believe is a non-breaking space. At first I just handled that one case, but as @karfas mentioned, it would be better to deal with any byte values outside the ASCII range that might cause issues. So, for now, unless someone has something better in mind, I run it through this function. Limiting the text files to the ASCII character set is fine with me for now.
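For readers following along, an ASCII-only filter along these lines might look like the sketch below. This is not the function from the post (its body isn't shown here); `ascii_only_bytes` and `decode_ascii_only` are hypothetical names, and the approach simply drops every byte at or above 0x80 before decoding.

```python
def ascii_only_bytes(data):
    """Drop every byte outside the 7-bit ASCII range (0x00-0x7F)."""
    return bytes(b for b in data if b < 0x80)

def decode_ascii_only(data):
    """Filter first, then decode; pure-ASCII bytes are always valid UTF-8."""
    return ascii_only_bytes(data).decode("utf-8")
```

Note that this silently discards data: a non-breaking space between two words disappears instead of becoming a regular space, so the words end up joined.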
@OscamSatUser I think the confusion is that your original problem was the same as @davefes's: you have a bytes/bytearray containing non-UTF-8 data. And actually you're first using your `byte2string` function, then using `ascii_only` on the result, not just the `ascii_only` function.
However, the effect of using both functions is that any non-ASCII byte in the input is first turned into its UTF-8 representation (i.e. 0xa0 becomes 0xc2 0xa0), and then `ascii_only` implicitly converts this back to bytes to exclude the non-ASCII bytes.
As @karfas points out, the simpler solution is to just filter out the bytes, then decode the now-sanitised bytes. @karfas I think you meant to write "".join(chr(c) for c …