jirkamarsik commented Nov 7, 2024

When _compiler.py's optimize_charset compiles a character set, it can translate the RANGE opcode into the RANGE_UNI_IGNORE opcode. This should happen only in regexes that set the Unicode flag; otherwise, regexes that set the ASCII or Locale mode flags get Unicode case folding behavior.

The correct way to check for Unicode mode in optimize_charset is to test if fixes:, because the fixes argument is None in ASCII and Locale modes and a dict in Unicode mode. The code currently tests if fixup: instead, but fixup is None only in Locale mode; it is a function in both ASCII and Unicode modes. As a result, the replacement also happens in ASCII mode: for character sets that include characters outside the Basic Multilingual Plane (the case reached the second time an IndexError is raised in optimize_charset), the RANGE opcode is translated to a RANGE_UNI_IGNORE opcode.
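
To make the difference between the two conditions concrete, here is a minimal sketch, not the actual CPython code. The names mode_arguments and opcode_for_non_bmp_range are hypothetical, the opcodes are plain strings, and the dict passed in Unicode mode is a placeholder. The two-stage charmap growth (256 entries, then 65536) reflects my reading of optimize_charset and should be treated as an assumption.

```python
# Sketch only: these names mirror the PR description, not the real CPython source.
RANGE = "RANGE"
RANGE_UNI_IGNORE = "RANGE_UNI_IGNORE"


def mode_arguments(mode):
    """Return the (fixup, fixes) pair that the description attributes to each mode."""
    if mode == "unicode":
        # A function plus a non-empty dict; the dict content is a placeholder.
        return str.lower, {"placeholder": ()}
    if mode == "ascii":
        # A function, but no fixes table.
        return str.lower, None
    if mode == "locale":
        return None, None
    raise ValueError(f"unknown mode: {mode!r}")


def opcode_for_non_bmp_range(mode, check):
    """Pick the opcode for a range whose upper bound lies beyond the BMP.

    check is "fixup" (the current condition) or "fixes" (the proposed one).
    """
    fixup, fixes = mode_arguments(mode)
    charmap = bytearray(256)
    code_point = 0x10000  # first code point outside the BMP
    while True:
        try:
            charmap[code_point] = 1
        except IndexError:
            if len(charmap) == 256:
                # First IndexError: grow the map to cover the whole BMP and retry.
                charmap += b"\0" * 0xFF00
                continue
            # Second IndexError: the character lies outside the BMP.
            condition = fixup if check == "fixup" else fixes
            return RANGE_UNI_IGNORE if condition else RANGE
        return RANGE  # not reached for non-BMP code points


for mode in ("unicode", "ascii", "locale"):
    print(mode,
          opcode_for_non_bmp_range(mode, "fixup"),
          opcode_for_non_bmp_range(mode, "fixes"))
```

In this sketch, only ASCII mode changes between the two checks: if fixup: selects RANGE_UNI_IGNORE there, while if fixes: keeps RANGE. Unicode and Locale modes give the same result under either condition.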

Previously, when an ASCII regex used a character range that exceeded the
bounds of the Basic Multilingual Plane, it was compiled into an opcode
that performs Unicode case folding. Now, only Unicode regexes can use
the Unicode-specific case folding opcode.

ghost commented Nov 7, 2024

The following commit authors need to sign the Contributor License Agreement:

CLA not signed


bedevere-app bot commented Nov 7, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

ZeroIntensity (Member) left a comment

Thanks for contributing! This needs a NEWS entry, as it's a user-facing bug, and you'll also need to sign the CLA.

Comment on lines +2630 to +2631
# gh-126505
# should match in Unicode mode

Suggested change:
- # gh-126505
- # should match in Unicode mode
+ # GH-126505: should match in Unicode mode

ZeroIntensity added the topic-regex, needs backport to 3.12 (only security fixes), and needs backport to 3.13 (bugs and security fixes) labels Nov 7, 2024

vstinner (Member) commented Nov 7, 2024

cc @serhiy-storchaka

jirkamarsik (Author)

Closing this Pull Request in favor of @serhiy-storchaka's upcoming fix.
#126505 (comment)

jirkamarsik closed this Nov 7, 2024