
@tom-pytel (Contributor) commented Oct 6, 2025

An `=` followed by a `:` in an f-string expression could cause the tokenizer to erroneously think it was starting a format spec, leading to incorrect internal state and possible decode errors if this results in split unicode characters on copy. This PR fixes that by disallowing `=` from setting the `in_debug` state unless it is encountered at the top level of an f-string expression.

This problem exists back to Python 3.13, and this PR can probably be backported easily enough.
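For context, the distinction the fix relies on can be seen in ordinary f-string behavior (a minimal sketch; the variable `x` and the lambda are illustrative):

```python
x = 42
# '=' at the top level of a replacement field is the debug specifier:
# it makes the f-string include the expression text itself.
assert f"{x=}" == "x=42"

# '=' nested inside parentheses is an ordinary keyword/default argument,
# and the lambda's ':' must not be mistaken for a format-spec delimiter.
assert f"{(lambda a=1: a)()}" == "1"
```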

@tom-pytel (Contributor, Author) commented

Ping @pablogsal. I added the test to test_tokenize instead of test_fstring as it seems to fit there better.

@pablogsal (Member) commented Oct 6, 2025

Please add a test to the f-string test file as well, as this will be a semantic test that needs to hold true even if we change the tokenizer or some other implementation doesn't have the same tokenizer.

import tokenize
from io import BytesIO

# gh-139516
# The '\n' is explicit to ensure no trailing whitespace, which would invalidate the test.
# Must use tokenize instead of compile so that the source is parsed line by line, which exposes the bug.
list(tokenize.tokenize(BytesIO('''f"{f(a=lambda: 'à'\n)}"'''.encode()).readline))
@pablogsal (Member) commented:

I am confused. Isn't it possible to trigger this in an exec or eval call? Or perhaps a file with an encoding?

@tom-pytel (Contributor, Author) replied:

See below VVV

@tom-pytel (Contributor, Author) commented

> Please add a test to the f-string test file as well, as this will be a semantic test that needs to hold true even if we change the tokenizer or some other implementation doesn't have the same tokenizer.

Done. But I had to use tokenize() because of an interesting quirk. The bug shows up with tokenize(), when executing a Python script directly with the bad source, or when typing it into the REPL. It does not show up with compile(), ast.parse(), eval(), exec(), or import .... The difference seems to be whether the source is read line by line: when the full string is available at parse time, the tail end of the string past the NL is present to offset from on copy, and the bug doesn't present.
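The difference described above can be sketched as follows (a hedged sketch, not part of the PR; the version guard is there because a newline inside a single-quoted f-string requires Python 3.12+, and on an affected interpreter only the line-by-line path fails):

```python
import ast
import sys
import tokenize
from io import BytesIO

# The repro source: a nested '=' plus ':' with a newline inside the
# f-string, which splits the multi-byte 'à' on the buggy copy path.
src = '''f"{f(a=lambda: 'à'\n)}"'''

if sys.version_info >= (3, 12):  # PEP 701 allows the newline inside the f-string
    # Whole-buffer parsing: the full source is in memory, so the copy past
    # the NL has valid data to offset from and the bug never presented here.
    ast.parse(src)

    # Line-by-line tokenization: readline feeds one line at a time, which
    # is the path that exposed the bug on affected builds.
    try:
        tokens = list(tokenize.tokenize(BytesIO(src.encode()).readline))
        print("tokenize ok:", len(tokens) > 0)
    except (tokenize.TokenError, SyntaxError, UnicodeDecodeError) as exc:
        print("tokenize failed:", exc)
```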

Let me know if this test is good enough or if you want something else.

@pablogsal (Member) commented Oct 6, 2025

> Let me know if this test is good enough or if you want something else.

Yes, going via the tokenizer makes no sense here. The purpose of what I asked is that alternative implementations will still run these test files to check whether they are compliant, and we need to provide a way to run a file or exec some code and say "this is what we expect". You are triggering the bug via a specific aspect of CPython, but I would prefer if we could trigger it end-to-end via a file. There are more tests executing Python over files; check test_syntax, test_grammar, or test_compile.

@tom-pytel (Contributor, Author) commented

Done — now running the erroring source as a script.
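What running the erroring source as a script might look like (a hypothetical sketch, not the actual test added by the PR; the `f` stub is illustrative so the f-string can evaluate):

```python
# Hypothetical sketch of an end-to-end, file-based check: write the repro
# source to a .py file and execute it with a fresh interpreter, the way
# test_syntax/test_compile-style tests run Python over files.
import os
import subprocess
import sys
import tempfile

# Define f so the f-string actually evaluates; note the repro needs
# Python 3.12+ for the newline inside the single-quoted f-string.
script = "def f(**kw):\n    return 0\n" + '''f"{f(a=lambda: 'à'\n)}"''' + "\n"

with tempfile.NamedTemporaryFile(
    "w", suffix=".py", encoding="utf-8", delete=False
) as fh:
    fh.write(script)
    path = fh.name
try:
    proc = subprocess.run(
        [sys.executable, path], capture_output=True, text=True
    )
    # 0 on a fixed interpreter; nonzero on an affected or pre-3.12 one.
    print("exit code:", proc.returncode)
finally:
    os.unlink(path)
```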
