Skip to content

Conversation

@LoicGrobol
Copy link

@LoicGrobol LoicGrobol commented Apr 10, 2025

This is not a direct fix to the issue (as it is not clear how/if it should be fixed at this point), but to this comment:

I couldn't find a trace of bidirectional B characters being non-tailorable line breaks (classes BK, CR, LF, NL), neither in the line breaking not in the bidirectional algorithm specification. What I think the latter means is that B characters actually only “terminate paragraphs” in the sense of “terminating the effect of directional formatting characters”. Now it is true that \N{PARAGRAPH SEPARATOR} is both B and BK, but does it means that all B characters should be line breaks? In particular what about \N{FILE SEPARATOR}, \N{GROUP SEPARATOR} and \N{RECORD SEPARATOR}, who are all caught by str.splitlines, but have the Combining Mark property (see e.g. the list of line breaks), which should in fact forbid a line break?

And the answer by @malemburg

At the time this code was written (in 2000), Unicode 3.0 was the current version and in those days, bidirectional B code points were listed as "Paragraph Separator", in particular the important ones CR, LF. See https://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html and https://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt for details.

I don't remember, but it's well possible that I wasn't aware of the CM LineBreak property at the time.

I guess the _PyUnicode_IsLinebreak() function should really use the LineBreak.txt file as basis and only include code points which are listed as "Non-tailorable" in that file: https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt

Perhaps it's time to fix this after 25 years 😄

Patches are welcome !

It turns out that LineBreak.txt is already used to gather the line break categories, but that B was still kept as indicating a line breaker. Looking at the current state of Unicode (16.0)

  • All the characters in B that are non-tailorable line breakers are in the BK, CR, LF or NL categories, so they were already captured.
  • The other characters in B (U+001C, U+001D, U+001E) are all combining marks and shouldn't break lines.

So I just removed the bidirectional == "B" condition in the code generating the list of line breakers and that should be it.

I think from an API perspective, this only affects str.splitlines(), which at this point is only tested for behaviour against CR, LF and CR+LF and no other line breaker, so I didn't add any test, but I can if it seems useful.

In general, I don't expect this to be a huge compatility-breaking change given the conversation in #66428, but don't really know how to check for that apart from searching for the codepoints (in U+ and \x forms) on Github, which didn't return any Python code that would be broken.

This is my first PR, so it's very likely I missed something, please let me know!


📚 Documentation preview 📚: https://cpython-previews--132369.org.readthedocs.build/

L. Grobol added 3 commits April 10, 2025 15:17
…rectional B but don't have the line break property

This is exactly three characters: U+001C "FILE SEPARATOR", U+001D "GROUP SEPARATOR" and U+001E "RECORD SEPARATOR", all of which have the Combining Mark line breaking property, meaning that they should *not* be present at a line break
@bedevere-app
Copy link

bedevere-app bot commented Apr 10, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@python-cla-bot
Copy link

python-cla-bot bot commented Apr 10, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@bedevere-app
Copy link

bedevere-app bot commented Apr 10, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@bedevere-app
Copy link

bedevere-app bot commented Apr 10, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant