gh-66428 Stop including all bidirectional "B" characters in line breakers #132369
This is not a direct fix to the issue (as it is not clear how/if it should be fixed at this point), but to this comment:
And the answer by @malemburg
It turns out that `LineBreak.txt` is already used to gather the line break categories, but that the bidirectional class `B` was still kept as indicating a line breaker. Looking at the current state of Unicode (16.0):

- The `B` characters that are non-tailorable line breakers are in the BK, CR, LF or NL categories, so they were already captured.
- The remaining `B` characters (U+001C, U+001D, U+001E) are all combining marks and shouldn't break lines.

So I just removed the `bidirectional == "B"` condition in the code generating the list of line breakers and that should be it.

I think from an API perspective, this only affects `str.splitlines()`, which at this point is only tested for behaviour against CR, LF and CR+LF and no other line breaker, so I didn't add any test, but I can if it seems useful.

In general, I don't expect this to be a huge compatibility-breaking change given the conversation in #66428, but I don't really know how to check for that apart from searching for the codepoints (in U+ and `\x` forms) on GitHub, which didn't return any Python code that would be broken.

This is my first PR, so it's very likely I missed something, please let me know!
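For reviewers who want to check the affected code points locally, here is a minimal sketch using only the standard library. The helper names `bidi_b_code_points` and `splits_line` are hypothetical and not part of this PR; the script lists every code point whose bidirectional class is `B` and reports whether `str.splitlines()` treats it as a line boundary on the interpreter running it. The reported values for U+001C, U+001D and U+001E depend on whether the interpreter includes this change, so no expected output is given for them.

```python
import sys
import unicodedata


def bidi_b_code_points():
    """Return every code point whose bidirectional class is "B"."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.bidirectional(chr(cp)) == "B"
    ]


def splits_line(ch):
    """Report whether str.splitlines() treats ch as a line boundary here."""
    return len(f"a{ch}b".splitlines()) == 2


if __name__ == "__main__":
    # Print one line per bidirectional "B" code point with the observed behaviour.
    for cp in bidi_b_code_points():
        print(f"U+{cp:04X} splits lines: {splits_line(chr(cp))}")
```

Running the same script before and after the change should make any difference in `str.splitlines()` visible without relying on the existing test suite.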
📚 Documentation preview 📚: https://cpython-previews--132369.org.readthedocs.build/