Parser gives UnicodeDecodeError on what should be good code #139516

Description

@tom-pytel

Bug report

Bug description:

In 3.12 and below this parses fine. Note the special Unicode character in the inner string: it is '\u3001' (UTF-8 b'\xe3\x80\x81'). The source goes from good to bad by removing a space or by parenthesizing the whole expression, so the failure seems to be position-dependent?

>>> from io import BytesIO
>>> from tokenize import tokenize
>>> 
>>> src_good = '''f"{f(a=lambda: '、' \n)}"'''
>>> src_bad1 = '''f"{f(a=lambda: '、'\n)}"'''
>>> src_bad2 = '''(f"{f(a=lambda: '、' \n)}")'''
>>>
>>> for token in tokenize(BytesIO(src_good.encode()).readline): print(token)
... 
TokenInfo(type=65 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=59 (FSTRING_START), string='f"', start=(1, 0), end=(1, 2), line='f"{f(a=lambda: \'、\' \n')
TokenInfo(type=55 (OP), string='{', start=(1, 2), end=(1, 3), line='f"{f(a=lambda: \'、\' \n')
TokenInfo(type=1 (NAME), string='f', start=(1, 3), end=(1, 4), line='f"{f(a=lambda: \'、\' \n')
TokenInfo(type=55 (OP), string='(', start=(1, 4), end=(1, 5), line='f"{f(a=lambda: \'、\' \n')
TokenInfo(type=1 (NAME), string='a', start=(1, 5), end=(1, 6), line='f"{f(a=lambda: \'、\' \n')
TokenInfo(type=55 (OP), string='=', start=(1, 6), end=(1, 7), line='f"{f(a=lambda: \'、\' \n')
TokenInfo(type=1 (NAME), string='lambda', start=(1, 7), end=(1, 13), line='f"{f(a=lambda: \'、\' \n')
TokenInfo(type=55 (OP), string=':', start=(1, 13), end=(1, 14), line='f"{f(a=lambda: \'、\' \n')
TokenInfo(type=3 (STRING), string="'、'", start=(1, 15), end=(1, 18), line='f"{f(a=lambda: \'、\' \n')
TokenInfo(type=63 (NL), string='\n', start=(1, 19), end=(1, 20), line='f"{f(a=lambda: \'、\' \n')
TokenInfo(type=55 (OP), string=')', start=(2, 0), end=(2, 1), line=')}"')
TokenInfo(type=55 (OP), string='}', start=(2, 1), end=(2, 2), line=')}"')
TokenInfo(type=61 (FSTRING_END), string='"', start=(2, 2), end=(2, 3), line=')}"')
TokenInfo(type=4 (NEWLINE), string='', start=(2, 3), end=(2, 4), line=')}"')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')
>>> for token in tokenize(BytesIO(src_bad1.encode()).readline): print(token)
... 
TokenInfo(type=65 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=59 (FSTRING_START), string='f"', start=(1, 0), end=(1, 2), line='f"{f(a=lambda: \'、\'\n')
TokenInfo(type=55 (OP), string='{', start=(1, 2), end=(1, 3), line='f"{f(a=lambda: \'、\'\n')
TokenInfo(type=1 (NAME), string='f', start=(1, 3), end=(1, 4), line='f"{f(a=lambda: \'、\'\n')
TokenInfo(type=55 (OP), string='(', start=(1, 4), end=(1, 5), line='f"{f(a=lambda: \'、\'\n')
TokenInfo(type=1 (NAME), string='a', start=(1, 5), end=(1, 6), line='f"{f(a=lambda: \'、\'\n')
TokenInfo(type=55 (OP), string='=', start=(1, 6), end=(1, 7), line='f"{f(a=lambda: \'、\'\n')
TokenInfo(type=1 (NAME), string='lambda', start=(1, 7), end=(1, 13), line='f"{f(a=lambda: \'、\'\n')
TokenInfo(type=55 (OP), string=':', start=(1, 13), end=(1, 14), line='f"{f(a=lambda: \'、\'\n')
TokenInfo(type=3 (STRING), string="'、'", start=(1, 15), end=(1, 18), line='f"{f(a=lambda: \'、\'\n')
TokenInfo(type=63 (NL), string='\n', start=(1, 18), end=(1, 19), line='f"{f(a=lambda: \'、\'\n')
TokenInfo(type=55 (OP), string=')', start=(2, 0), end=(2, 1), line=')}"')
Traceback (most recent call last):
  File "<python-input-8>", line 1, in <module>
    for token in tokenize(BytesIO(src_bad1.encode()).readline): print(token)
                 ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/tokenize.py", line 492, in tokenize
    yield from _generate_tokens_from_c_tokenizer(rl_gen.__next__, encoding, extra_tokens=True)
  File "/usr/local/lib/python3.13/tokenize.py", line 582, in _generate_tokens_from_c_tokenizer
    for info in it:
                ^^
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 13-14: unexpected end of data
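The offsets in the error message line up with a buffer being cut in the middle of a multibyte character: counting UTF-8 bytes from the start of the braced expression, positions 13-14 hold the first two bytes of the three-byte sequence for '\u3001'. A minimal sketch of that arithmetic (the 15-byte cut is hypothetical, chosen only to match the reported offsets, not taken from the tokenizer internals):

>>> data = "f(a=lambda: '、'\n".encode()  # UTF-8 bytes of the braced expression
>>> data.index(b'\xe3')  # the character occupies byte positions 13-15
13
>>> data[:15].decode()  # a cut after byte 14 reproduces the exact message
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 13-14: unexpected end of data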

The other bad source and other permutations give the same error. You also get an immediate error when typing the bad source interactively.
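Since the interactive parser fails immediately, the bug is presumably reachable straight through compile(), without going through the tokenize module; a quick check along these lines should reproduce it (the '<repro>' filename is arbitrary):

>>> src_bad1 = '''f"{f(a=lambda: '、'\n)}"'''
>>> compile(src_bad1, '<repro>', 'exec')  # expected to raise UnicodeDecodeError on 3.13+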

CPython versions tested on:

3.13, 3.14, 3.15

Operating systems tested on:

Linux

Linked PRs
