Support Unicode 16.0 #23205

khwilliamson · 2025-04-19T03:44:47Z

This branch adds support for Unicode 15.1, then 16.0.

The process for upgrading to new Unicode releases is mostly automatic and unproblematic, with the exception of the break position algorithms used by the regular expression constructs \X, \b{gcb}, \b{wb}, \b{lb}, and \b{sb}. \b{sb} hasn't changed in a long time, and this PR doesn't affect it. The others are affected.

The new releases are described in https://www.unicode.org/versions/Unicode15.1.0/
and https://www.unicode.org/versions/Unicode16.0.0/

I assert that no correctly functioning program should expect that an unassigned code point that is eligible for assignment will stay that way. And that is what these releases are mostly about: creating characters where before there were none.

The best case for saying that this pull request changes the behavior of correctly functioning programs is U+5146, which is a Han script character meaning trillion in Taiwan and Japan, and million in mainland China. Prior to 15.1 the trillion meaning was what Unicode took; starting with 15.1, it takes on the mainland meaning of million.

Other than this, the changes (as opposed to additions) involve the \X (\b{gcb}) and especially the \b{lb} constructs. The latter has significant improvements for Indic languages, which Unicode says bring its Standard into much better compliance with what native speakers expect. There are also changes to the line breaking algorithm for text enclosed in the «» quotation marks used in France.

This series of commits is lengthy, but all but a few are exclusively about the tables used by regexec.c to determine if a break of the given type is permissible at this position in the parse string. The first many are to improve our infrastructure to make it easier to update these properties in the future. As a result, the code actually doing the version updates is minimal. The tables affected by the improvement commits are under source control, and can be seen to be unaffected by them (hence no behavior change at all); where a table does change, the corresponding commit message explains it, mostly as a bug fix.

The crux of the first set of commits is to make the code in `mk_invlists` closely resemble the text in the Unicode documents that spell out the rules. In essence, those documents define pseudo-code. This makes it far easier to compare what we have with what they say to do. I have WIP to actually parse the text of the documents, but that is for later. The documents are https://www.unicode.org/reports/tr29 and https://www.unicode.org/reports/tr14/. The culmination of that is to change this file so that the rules are ordered the same way as in the documents. There are certain advantages to doing things in reverse order, so that the lowest priority rule is applied first, and then overwritten by higher priority rules as needed. But that made it hard to compare what we have with what Unicode has. So these commits use the Unicode ordering and do the extra bookkeeping that was avoided by the old ordering.
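The trade-off between the two orderings can be sketched with a toy model. This is Python rather than the Perl of `mk_invlists`, and the class names, the rules, and both helper functions are invented for illustration; they are not the actual Unicode rules. Under those assumptions, it shows that applying rules lowest-priority-first with overwriting gives the same table as applying them in document order while remembering which cells are already decided.

```python
from itertools import product

CLASSES = ["AL", "NU", "SP"]  # hypothetical break classes

# Rules in document order (highest priority first).  Each rule is
# (class before, class after, verdict); "*" stands for any class.
RULES = [
    ("*", "SP", "no_break"),   # never break before a space
    ("SP", "*", "break"),      # break after a space
    ("AL", "NU", "no_break"),  # keep a letter and a following digit together
    ("*", "*", "break"),       # default for everything else
]


def matches(pattern, cls):
    return pattern == "*" or pattern == cls


def build_reverse(rules):
    """Old style: lowest priority first; later rules overwrite earlier ones."""
    table = {}
    for before, after, verdict in reversed(rules):
        for b, a in product(CLASSES, CLASSES):
            if matches(before, b) and matches(after, a):
                table[(b, a)] = verdict
    return table


def build_document_order(rules):
    """New style: document order; skip cells a higher rule already decided."""
    table = {}
    for before, after, verdict in rules:
        for b, a in product(CLASSES, CLASSES):
            if (b, a) in table:
                continue  # the extra bookkeeping the old ordering avoided
            if matches(before, b) and matches(after, a):
                table[(b, a)] = verdict
    return table
```

Both builders produce identical tables for this rule set; the second one simply reads top to bottom in the same order as the rule list.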

Most of the rules for deciding if a break is possible don't require context. You don't break a word or a line between alphabetic characters (in Western languages anyway), and you don't need to look further than the two characters on either side of the current position. Other pairs require more context to decide. In "123." do the digits form a word? The answer is maybe. If the character after the dot is a space, yes, they are considered to form a word, and the dot probably indicates the end of a sentence. If the character after the dot is 4, the dot is part of the word "123.4".

For rules that require context, small DFAs are needed, and these are placed in regexec.c. Unicode writes its rules assuming that the DFAs are stackable: at any given position, you can look behind, and if you find a quote you do one thing; if instead you find a parenthesis, you do another thing, etc. Perl got away without having to implement this full generality until 15.1 really made it necessary. The crux of this branch is changing to use stacked DFAs. This removes the need for the existing workarounds that compensated for the lack of that generality, and makes adding new DFAs in the same spot much easier.
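A minimal sketch of the "123." case, in Python rather than the C of regexec.c. The `classify` function, the `PAIR_TABLE` contents, and the `LOOKAHEAD` marker are all hypothetical simplifications; the class names are borrowed loosely from the Word_Break property, and only the look-ahead half of the numeric rule is modelled. It is meant to show the shape of the idea (a pair table that usually answers directly, plus a small contextual check for the pairs that can't be decided locally), not the stacked DFAs this branch implements.

```python
def classify(ch):
    """Toy classification; the real code uses Unicode break properties."""
    if ch.isdigit():
        return "Numeric"
    if ch == ".":
        return "MidNumLet"
    if ch.isspace():
        return "Space"
    return "Other"


# Verdicts that need only the two neighbouring classes.  "LOOKAHEAD"
# (a made-up marker) defers the decision to a contextual check.
PAIR_TABLE = {
    ("Numeric", "Numeric"): "no_break",
    ("Numeric", "MidNumLet"): "LOOKAHEAD",
}


def is_break(text, pos):
    """True if a word break is allowed between text[pos-1] and text[pos].

    `pos` must satisfy 1 <= pos <= len(text) - 1.
    """
    before, after = classify(text[pos - 1]), classify(text[pos])
    verdict = PAIR_TABLE.get((before, after), "break")
    if verdict == "LOOKAHEAD":
        # Digit, dot, digit stays together, so peek one character further.
        nxt = text[pos + 1] if pos + 1 < len(text) else ""
        return not (nxt != "" and classify(nxt) == "Numeric")
    return verdict == "break"
```

With this sketch, `is_break("123.4", 3)` is false because a digit follows the dot, while `is_break("123. ", 3)` is true because a space follows it.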

This set of changes requires a perldelta entry, to be furnished.

This series of commits has dozens of commits that would otherwise require much more work to generate. This commit temporarily turns off generating EBCDIC tables, and the tables that only change when a new Unicode release happens. Bisecting on an ASCII machine is unaffected.

These tables are placed in charclass_invlists.h. They have a row and column for what happens when the position being checked for is at the start or end of the text. This commit reorders the tables so that the edge row and column are, well, at the edges. And it relabels the labels to be '^' and '$' respectively.

Previously, we would just set an individual element directly. This changes most of those to use function calls instead. This has two main benefits. The function can change what's being done without having to change many lines; and these sets had a lot of visual noise with sigils and hash references. The result is a lot easier to read. The next few commits will continue this process. Note that the generated tables are unchanged by this commit. It has no effect on runtime processing. That will be true of the next commits as well. It became obvious in doing this that the rule for Perl_Tailored_HSpace does not belong in the 3's, but comes immediately before them. Arbitrarily use '2z'.

These cells exist so that code is less likely to need to be changed when a new Unicode release comes along. Currently it doesn't matter at all what is in those cells, because they are never read. But future commits will want to make sure they don't refer to DFAs that are obsolete, references to which could be undefined symbols that would abort the compilation. The choice of 0 or 1 to put in the cells was arbitrary; I know of no reason to prefer one or the other.

Instead of having to loop through all the cells of a row or column, this commit uses '*' to represent the whole thing. This is more in keeping with the text of the Unicode rules, which just leave things blank to mean everything.
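The idea can be sketched as follows, in Python rather than the Perl of `mk_invlists`. The names `expand` and `set_cells`, the class list, and the verdict strings are hypothetical and chosen only for illustration. The list form shown alongside '*' corresponds to the later "Allow arbitrary list of cells" commit in this series.

```python
CLASSES = ["AL", "NU", "SP"]  # hypothetical break classes


def expand(spec):
    """Turn a cell spec into a list of class names.

    "*" means every class, a plain string names one class, and any other
    iterable is taken as an explicit list of classes.
    """
    if spec == "*":
        return list(CLASSES)
    if isinstance(spec, str):
        return [spec]
    return list(spec)


def set_cells(table, rows, cols, verdict):
    """Set every (row, col) cell selected by the two specs."""
    for r in expand(rows):
        for c in expand(cols):
            table[(r, c)] = verdict


table = {}
set_cells(table, "*", "SP", "no_break")            # a whole column
set_cells(table, ["AL", "NU"], "NU", "no_break")   # a list of rows
```

The caller no longer writes the loop; it states which rows and columns a rule covers, which is closer to how the Unicode text phrases its rules.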

If both branches of an if/else lead to the same result, skip the test and set the result unconditionally. That's what this commit does for DFAs that yield the same value whether they succeed or fail. There is currently one case where the DFA can return an anomalous result, so it can't be optimized out. Add a field to the hash entry defining that DFA, so it doesn't get optimized out.
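A minimal sketch of that optimization, in Python rather than Perl. The dictionary keys `on_success`, `on_failure`, and `keep` are hypothetical names, not the fields used in `mk_invlists`.

```python
def simplify(entry):
    """Collapse a DFA entry to a constant verdict when both outcomes agree.

    `entry` is a dict with the hypothetical keys "on_success" and
    "on_failure", plus an optional "keep" flag for a DFA that must never
    be collapsed because it can return some other, anomalous result.
    """
    if entry.get("keep"):
        return entry
    if entry["on_success"] == entry["on_failure"]:
        return entry["on_success"]  # unconditional result; no DFA needed
    return entry
```

An entry whose two outcomes are both "break" collapses to the plain verdict "break", while an entry flagged with `keep` is returned untouched even if its two outcomes match.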

A couple of commits ago, the last necessarily-hard-coded DFA enum besides 0 and 1 was removed. This allows for all the rest to be assigned by using the value of an incrementing variable. This makes it easy to add DFAs in the middle of existing ones, as will happen as future Unicode releases come our way.
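As a rough illustration in Python (the DFA names are made up, and starting the numbering at 2 is an assumption based on 0 and 1 being the reserved values):

```python
def assign_enums(dfa_names, first_free=2):
    """Number DFAs sequentially, leaving 0 and 1 reserved."""
    return {name: first_free + i for i, name in enumerate(dfa_names)}


before = assign_enums(["dfa_a", "dfa_b"])
# Adding a DFA in the middle just shifts the later numbers.
after = assign_enums(["dfa_a", "dfa_new", "dfa_b"])
```

Because no number is written by hand, inserting `dfa_new` requires no other edits; `dfa_b` simply moves from 3 to 4.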

haarg · 2025-04-19T19:03:18Z

I think it would be a shame to ship a new perl release that didn't have the latest Unicode standard. Something breaking based on a new Unicode version seems unlikely to be found in things like BBC reports. It seems more likely to only be found in a stable release. And it should only impact programs that were already doing something incorrect. So I'm inclined to go ahead with merging this PR before 5.42.

However, this is a rather large change to regexec.c which I'm not sure I'm the best person to evaluate. And this PR has a failing test.

Nit: The "Temporarily skip" commit and its revert should be removed before merging.

Leont · 2025-04-19T19:47:16Z

`Failed test ''10000000000000000' is listed as an alias for prop_value_aliases('-_ nv', '1_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0')'`

It is suspicious that a test involving a number that doesn't fit in a 32 bit integer fails on a 32 bit perl.

jkeenan · 2025-04-19T19:48:05Z

This p.r. is repeatedly failing one test in the 'linux i386/ubuntu' run on our GH CI.

#   Failed test ''10000000000000000' is listed as an alias for prop_value_aliases('-_ nv', '1_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0')'
#   at ../lib/Unicode/UCD.t line 1289.
# 
# Looks like you failed 1 test of 15362.
# 
lib/Unicode/UCD .................................................. FAILED at test 7518

I re-started that particular run, but am observing the same error as in the original run.

In Unicode 15.1, the ideograph U+4EAC now has a numeric value, and that value is 10 quadrillion (1e+16). This is the first instance in Unicode of an integer not fitting in a 32 bit word, as this requires 54 bits. One of the tests in UCD.t requires round-trip equality in converting from string to number and back; skip it for this case and any future similar ones. I find it interesting that U+4EAC is listed as having the meaning "capital city".

This program generates tables for the Break properties that are somewhat human readable. Before this commit, just the heading line for a column determined its width. This commit factors in the maximum width of any cell in the column as well. It used to be that this required a separate pass, and so wasn't done. But now that separate pass is required anyway for other reasons, and it is simple to add to it this check.
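A sketch of the width calculation, in Python rather than Perl; `column_widths` and `render` are hypothetical names, and the layout details (centered headings, right-justified cells) are illustrative assumptions only.

```python
def column_widths(headings, rows):
    """Width of each column: the widest of its heading and all its cells."""
    widths = [len(h) for h in headings]
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(str(cell)))
    return widths


def render(headings, rows):
    """Lay the table out using the widths computed in a separate pass."""
    widths = column_widths(headings, rows)
    lines = [" ".join(h.center(w) for h, w in zip(headings, widths))]
    for row in rows:
        lines.append(" ".join(str(c).rjust(w) for c, w in zip(row, widths)))
    return "\n".join(lines)
```

The extra pass over the cells is what lets a narrow heading sit above a wider cell without the columns drifting out of alignment.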

This had been turned off in this branch to speed up compilation, and hence development. The code mostly changed in this branch is the same as in ASCII anyway. It could have become an issue only if someone tries to bisect on an EBCDIC machine, which I don't believe has happened, if ever, in decades.

khwilliamson · 2025-04-20T15:06:45Z

The character '9' has the numeric value 9. We're used to characters only taking on numeric values of 0-9. But some scripts have characters whose numeric values are something else, typically when the script doesn't use decimal positional notation. A fairly common case is that the script will have a symbol for 10, one for 20, etc. Roman numerals are a case which many people in the West know about that have single characters that signify other than 0-9. (I recently learned that it wasn't until Shakespeare's time that England taught school children anything but Roman numerals.) Some scripts, typically East Asian, have characters, typically ideographs, that symbolize very large numbers. Unicode 15.1 introduced for the first time one that has a value that doesn't fit in a 32 bit word.

The build that was failing was a 32 bit one. And the test that was failing is just one that introduces underscores between digits and verifies that the underscores are ignored, as Unicode requires, in pattern matches. The test is simplistic, and expects that a round trip of string to number and back yields the same string. That won't be true for a value that doesn't fit in a word, because it is converted to an NV. So the test isn't valid on such a value, and the fix was to simply skip it.
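The failure mode can be modelled roughly like this. It is a Python sketch, not the code in UCD.t; the function names are made up, and it assumes that a value too large for the native unsigned integer becomes a floating-point NV that is stringified with about 15 significant digits (the equivalent of `%.15g`).

```python
def numify_then_stringify(digits, int_bits):
    """Model string -> number -> string on a perl with int_bits-wide integers.

    Assumption: values that fit in an unsigned integer of that width stay
    integers; larger ones become floats printed with 15 significant digits.
    """
    value = int(digits)
    if value < 2 ** int_bits:
        return str(value)
    return format(float(value), ".15g")


def round_trips(text, int_bits):
    """True if the digits survive the round trip once underscores are dropped."""
    digits = text.replace("_", "")
    return numify_then_stringify(digits, int_bits) == digits


ten_quadrillion = "_".join("1" + "0" * 16)  # '1_0_0_..._0', as in the test
```

Under these assumptions `round_trips(ten_quadrillion, 32)` is false, because the number comes back as `1e+16`, while `round_trips(ten_quadrillion, 64)` is true. A smaller value that still overflows 32 bits, such as 1000000000000, survives in this model because it has fewer than 15 significant digits.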

karenetheridge · 2025-04-20T17:09:59Z

Awaiting signoff from a PSC member for this to be in 5.41.11.

khwilliamson · 2025-04-20T19:29:24Z

With a bit of private encouragement, I merged this to get it in 5.42. @karenetheridge said she would issue a new release should problems with this arise.

khwilliamson added 30 commits April 18, 2025 15:58

Temporarily skip regen porting test in this branch

0d4db07

The digest numbers keep changing in this branch. Turn this test off until near its end.

mk_invlists: Add stack trace facility

e355e50

This is useful in debugging

regexec.c: Rename a couple of variables

d8532cf

The previous names erroneously implied these were associated with the parameters to these functions; instead rename to indicate they are associated with some local variables.

mk_invlists: Change doubled semicolon to single

3bf4f15

mk_invlists.pl: Use feature signatures

58110b6

mk_invlists: White-space comments

041c018

This includes outdenting and indenting where future commits will add or remove blocks

mk_invlists: Clarify output table headings

81c520a

Changes the wording for some table headings in the generated file to indicate where to find what the abbreviations mean

mk_invlists: Sort some lists

d4c449d

These lists are densely packed. It is easier to find something if they are sorted

mk_invlists: Fix rule LB11

dc75f2f

This rule is not affected by spaces, yet the code was saying it should be.

mk_invlists: Fix rule LB12

c7ee419

This rule is not affected by spaces, yet the code was saying it should be.

mk_invlists: Fix rule LB13

e7e0a38

This rule was written here to not include the actions when the character before the candidate break position is a number. This is just plain wrong. The Unicode rules have never said this.

mk_invlists: Add extensive comments

1d7d669

mk_invlists: Narrow some output tables

ee83f07

Future Unicode releases will greatly explode the size of certain tables. Prior to this commit, the minimum column size was two, but some table columns fit in a single window column. This commit changes to use the minimum required.

mk_invlists: Center row labels in output tables

9b8fb89

This improves readability

mk_invlists: Improve output table column headings

e4201f0

This uses a more complex algorithm to generate short labels to demarcate rows and columns in some output tables. This doesn't affect the current tables for Unicode 15.0, but will in future Unicode releases.

mk_invlists: Change two formal parameter names

e07142a

mk_invlists: Move some lines earlier in their functions

f8c3ee4

where the next commits will want them

mk_invlists: Change a word to be more accurate

ef84749

Everything is an action. Some are accomplished via DFAs. This commit uses the latter word in places where it is a DFA. It actually uses this new term where it doesn't apply. Future commits will remove those inaccuracies.

mk_invlists: Hoist calculation to sub callers

28d4709

And pass the result to the subroutine. This is in preparation for this value to be needed in additional places.

mk_invlists: Reorder two statements

3e04bc4

This now matches the order that Unicode gives; for easier checking that our code matches their demands.

mk_invlists: Allow arbitrary list of cells

2ffed1b

This follows up on the previous commit which allowed simply specifying an entire row or column. This adds the ability to specify a list.

mk_invlists: Add no_nobreak_override()

c7f991b

This new function allows removing loops from the main code

mk_invlists: Add ability to specify a complement of list

b3b5d59

And use it in one instance. Previous commits have added the ability to pass multiple items simply to the functions that work on rows and columns. This now gives the ability to complement the set of the multiple items passed.

mk_invlists: Handle Combining Mark: changes CMxZWJ

aa1373f

This is separated out from the previous commit because it is tricky XXX

mk_invlists: move decls comments around

5b830cf

khwilliamson added 3 commits April 18, 2025 16:03

mk_invlists: Remove no longer used function

478b09b

khwilliamson and others added 15 commits April 20, 2025 05:46

mk_invlists: Add a shorter form DFA

63a7d25

This is just for legibility of reading the rules

lib/Unicode/UCD.t: Prepare for Unicode 15.1

ff27cd8

The numeric value for U+5146 changed in 15.1

mk_invlists/regexec.c: Prepare for Unicode 15.1

cd9bf04

mktables: Prepare for Unicode 15.1

9765ad4

mk_invlists: Restore calculation of new keywords, etc

dd21401

Now we are ready to use a new Unicode version, we have to regenerate everything. This was turned off earlier in this branch temporarily until now so as to speed up the testing, as it was known these values wouldn't change until now.

mk_invlists/regexec.c: Prepare for Unicode 16.0

b7e342b

mktables: Prepare for Unicode 16.0

73cff8e

Add Unicode 16.0

8d98672

This includes updates to a few perl files that need to know the current Unicode version, and regenerating perl files that depend on the Unicode data.

mktables: Note break table code for Unicode 16.0 is updated

119b5ac

Revert "Temporarily skip regen porting test in this branch"

ede3743

This temporary commit has now served its purpose.

mk_invlists: Update comments

a80649e

perldelta for Unicode update

bba5581

khwilliamson force-pushed the 16.0 branch from 9105eb9 to bba5581 Compare April 20, 2025 14:33

karenetheridge requested review from ap, book and haarg April 20, 2025 17:10

khwilliamson merged commit 7c4efc4 into Perl:blead Apr 20, 2025
33 checks passed

khwilliamson deleted the 16.0 branch April 20, 2025 19:21
