bytearray.decode("utf-8") yield unicode error at \xa0 #11588

OscamSatUser · 2023-05-22T16:14:05Z

OscamSatUser
May 22, 2023

I'm trying to decode uploaded file data (Content-Type: text/plain) from a browser POST, which was received via socket.recv(1024).

Here is a snippet of code I created to reproduce the error:

b = b'Content-Type: text/plain\r\n\r\n1-1\xa0\tBroken Bow'
sAscii = b.decode("utf-8")

The above code raises a UnicodeError.
My goal is to accept a POST upload of a text/plain ASCII file and then save it to a file on the device.

Am I using the wrong decoding method for incoming POST data from a web browser? Also, I'm unsure how to strip the offending byte out and re-run the .decode() method. When I convert to a string instead, that creates a whole mess of other issues ...

Here is the HTML of the upload form:

<form enctype="multipart/form-data" action="/upload" method="POST">
Choose a file to upload: <input name="filename" type="file" /><br />
<input type="submit" value="Upload File" />
</form>
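Since the stated goal is to save the uploaded file, one option is to never decode the body at all and write the raw bytes straight to the file. The following is a minimal sketch under stated assumptions: the names split_part and save_part_body are hypothetical, and the input is assumed to be a single part that has already been isolated (headers, blank line, body), as in the snippet above. A real multipart/form-data request also carries boundary lines, which this sketch does not handle.

```python
def split_part(part):
    # Headers and body are separated by the first blank line (CRLF CRLF).
    head, sep, body = part.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line between headers and body")
    return head, body


def save_part_body(part, path):
    # Write the body as raw bytes; no text decoding is involved,
    # so a stray 0xA0 byte cannot raise a UnicodeError here.
    head, body = split_part(part)
    with open(path, "wb") as f:
        f.write(body)
    return len(body)
```

Decoding is then only needed for the header lines, which are expected to be plain ASCII.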

GitHubsSilverBullet · 2023-05-22T16:27:43Z

GitHubsSilverBullet
May 22, 2023

"The above code yields a UnicodeError:"

And rightfully so. A lone 0xA0 byte is not a valid UTF-8 sequence.
The character at code point 0xA0 (a single byte in ISO 8859-1, not part of 7-bit ASCII) is encoded as the two bytes 0xC2 0xA0 in UTF-8.
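A short sketch of this point: encoding the no-break space gives two bytes, and a lone 0xA0 fails a strict UTF-8 decode. The helper name is_valid_utf8 is made up for this illustration.

```python
nbsp = "\u00a0"                 # NO-BREAK SPACE, code point 0xA0
encoded = nbsp.encode("utf-8")  # the two bytes 0xC2 0xA0


def is_valid_utf8(data):
    # True if the bytes decode cleanly as UTF-8, False otherwise.
    try:
        data.decode("utf-8")
        return True
    except UnicodeError:
        return False


print(encoded)
print(is_valid_utf8(b"1-1\xa0\tBroken Bow"))      # lone 0xA0
print(is_valid_utf8(b"1-1\xc2\xa0\tBroken Bow"))  # properly encoded
```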

0 replies

OscamSatUser · 2023-05-22T16:31:51Z

OscamSatUser
May 22, 2023
Author

After some more digging I believe I found something that might work:

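def byte2string(bData):
    try:
        # Normal path: the data is valid UTF-8.
        return bData.decode('utf-8')
    except UnicodeError:
        # Fallback: map each byte to the character with the same code point.
        return ''.join(map(chr, bData))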

0 replies

karfas · 2023-05-22T19:38:11Z

karfas
May 22, 2023

Even if this works, it can (in my opinion) only be a workaround, as a lone 0xa0 is neither UTF-8 nor the 7-bit ASCII implied by the text/plain content type.

To do it correctly, the sender should add a character set option and the receiver should implement appropriate error handling (e.g. rejecting all malformed messages).
Anyway, you should find out where exactly this 0xa0 comes from and which character set is used on the line.
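As a rough sketch of the receiver side of that suggestion, the function below looks for a charset parameter on a Content-Type header line. The name charset_from_header and the "us-ascii" default are assumptions made for this illustration, and the header block is assumed to be ASCII-only bytes with CRLF line endings.

```python
def charset_from_header(header_bytes, default="us-ascii"):
    # Scan each header line for "Content-Type: ...; charset=...".
    for line in header_bytes.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() != b"content-type":
            continue
        # Parameters follow the media type, separated by ';'.
        for param in value.split(b";")[1:]:
            key, _, val = param.partition(b"=")
            if key.strip().lower() == b"charset":
                return val.strip().strip(b'"').decode().lower()
    # No charset declared: fall back to the caller's default.
    return default
```

A receiver could then reject the message when the body does not decode under the declared (or default) character set, rather than guessing.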

1 reply

GitHubsSilverBullet May 22, 2023

He's probably receiving ISO8859-1 … -15 (which is an enhanced ASCII charset)
The encoding should be declared in the transmitted data.
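If the data really is ISO 8859-1 (an assumption that should be checked against the sender), every byte value equals the code point of the character it represents, so a decoder can be written without any codec table. The name latin1_to_str is hypothetical.

```python
def latin1_to_str(data):
    # ISO 8859-1 maps byte value N directly to code point N,
    # so chr() on each byte is a complete decoder for that charset.
    return "".join(chr(c) for c in data)


text = latin1_to_str(b"1-1\xa0\tBroken Bow")
print(repr(text))
```

This is the same mapping the fallback branch of byte2string performs. Other members of the ISO 8859 family differ from ISO 8859-1 at some byte positions, so they would need an explicit lookup for those bytes.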

OscamSatUser · 2023-05-22T22:30:03Z

OscamSatUser
May 22, 2023
Author

The text file was riddled with chr(160), which I believe is a non-breaking space. At first I just handled that one case, but as @karfas mentioned, it would be better to deal with any values outside the ASCII range that might cause issues.

So for now, unless someone has something better in mind, I run it through this function. Limiting the text files to the ASCII character set is fine with me for now.

def ascii_only(text):
    return "".join(c for c in text if ord(c)<128)

6 replies

davefes May 23, 2023

I was getting a Unicode error in a LoRa link so tried your function and got the error:

dave@davef:~$ python3 ./PV_error.py
Traceback (most recent call last):
  File "./PV_error.py", line 6, in <module>
    PV_msg = ascii_only(msg)
  File "./PV_error.py", line 2, in ascii_only
    return "".join(c for c in text if ord(c)<128)
  File "./PV_error.py", line 2, in <genexpr>
    return "".join(c for c in text if ord(c)<128)
TypeError: ord() expected string of length 1, but int found

Like you I would like to strip out the unwanted characters to see if the msg is trying to tell me something.

Any hints?

karfas May 23, 2023

@davefes: This depends on the context in which the function is called.

@OscamSatUser seems to call the function with a string, whereas you seem to call it with a bytearray (or something similar).
ord() doesn't work on integers, which is what iterating over bytes gives you (as the error message told you).

For this use case, just use:

return "".join(c for c in text if c<128)

OscamSatUser May 23, 2023
Author

I'm no expert on Python, but when I look at your error, it looks like you passed bytes (whose elements are ints) to the function rather than a string.
I call this function after the bytearray() has been converted to a string...
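The difference between the two call sites comes down to what iteration yields: a str gives one-character strings, while bytes and bytearray give ints. A sketch of a variant that accepts either is below; the name ascii_only_any is made up for this example.

```python
def ascii_only_any(data):
    if isinstance(data, str):
        # Iterating a str yields 1-character strings, so ord() is needed.
        return "".join(c for c in data if ord(c) < 128)
    # Iterating bytes/bytearray yields ints, so compare directly
    # and convert the survivors to characters with chr().
    return "".join(chr(c) for c in data if c < 128)


print(ascii_only_any("1-1\xa0\tBroken Bow"))
print(ascii_only_any(bytearray(b"1-1\xa0\tBroken Bow")))
```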

jimmo May 23, 2023
Maintainer

@OscamSatUser I think the confusion is that your original problem was the same as @davefes's -- you have a bytes/bytearray containing non-utf-8 data, and you're actually first using your byte2string function, then using ascii_only on the result, not just the ascii_only function.

However, the effect of using both functions is that any non-ascii byte in the input is first turned into a character whose utf-8 representation is two bytes (i.e. 0xa0 becomes 0xc2 0xa0), and then ascii_only converts each character back to its numeric value via ord() in order to exclude the non-ascii ones.

As @karfas points out, the simpler solution is to just filter out the bytes, then decode the now-sanitised bytes. @karfas I think you meant to write "".join(chr(c) for c in text if c<128) btw, but you can avoid creating so many intermediate strings by just using bytes.decode, e.g.:

b_in = b'Content-Type: text/plain\r\n\r\n1-1\xa0\tBroken Bow'
s_ascii = bytes(c for c in b_in if c < 128).decode()

This approach of filtering out bytes >= 128 will work because utf-8 sets the high bit on every byte of a multi-byte sequence (lead bytes as well as continuation bytes), so whatever remains is plain ASCII and always decodes. But in general the issue here is that you need to implement a decoder for the specific data you're receiving. In @OscamSatUser's case, others have already pointed out this sounds like enhanced ASCII (i.e. you should implement a mapping to the corresponding utf-8 characters), and in @davefes's case, it's not clear, but either you're being sent something that isn't utf-8 or you're getting message corruption.
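To illustrate the high-bit argument, here is a small sketch (the function name strip_non_ascii is hypothetical) applied to both a stray byte and a valid multi-byte utf-8 character. In both cases every byte of the non-ascii character is dropped, so the trade-off is that valid non-ascii text is lost as well.

```python
def strip_non_ascii(data):
    # Keep only bytes below 128; the result is pure ASCII,
    # which is always valid utf-8 and therefore safe to decode.
    return bytes(c for c in data if c < 128).decode()


# Stray 0xA0 byte (not valid utf-8 on its own): the byte is dropped.
print(repr(strip_non_ascii(b"1-1\xa0\tBroken Bow")))

# Valid utf-8 for "café": both bytes of the encoded 'é' (0xC3 0xA9) are dropped.
print(repr(strip_non_ascii("caf\u00e9".encode("utf-8"))))
```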

Answer selected by OscamSatUser

OscamSatUser May 23, 2023
Author

This was a great answer pointing out the differences between bytearray and string. Thank you @jimmo

davefes May 23, 2023

Thanks all for the hints. @jimmo it looks like message corruption to me:
https://github.com/orgs/micropython/discussions/11515
