@khwilliamson khwilliamson commented Apr 19, 2025

This branch adds support for Unicode 15.1, then 16.0.

The process for upgrading to new Unicode releases is mostly automatic and unproblematic, with the exception of the break position algorithms used in the regular expression constructs \X, \b{gcb}, \b{wb}, \b{lb}, and \b{sb}. \b{sb} hasn't been changed in a long time, and this PR doesn't affect it. The others are affected.

The new releases are described in https://www.unicode.org/versions/Unicode15.1.0/
and https://www.unicode.org/versions/Unicode16.0.0/

I assert that no correctly functioning program should expect that an unassigned code point that is eligible for assignment will stay that way. And that is what these releases are mostly about: creating characters where before there were none.

The best case for saying that this pull request changes the behavior of correctly functioning programs is U+5146, which is a Han script character meaning trillion in Taiwan and Japan, and million in mainland China. Prior to 15.1 the trillion meaning was what Unicode took; starting with 15.1, it takes on the mainland meaning of million.

Other than this, the changes (as opposed to additions) involve the \X (\b{gcb}) and especially the \b{lb} constructs. The latter has significant improvements for Indic languages, which Unicode says bring its Standard into much better compliance with what native speakers expect. There are also changes to the line breaking algorithm for text enclosed in the «» quotation marks used in France.

This series of commits is lengthy, but all but a few are exclusively about the tables used by regexec.c to determine if a break of the given type is permissible at this position in the parse string. The first many improve our infrastructure to make it easier to update these properties in the future. As a result, the code actually doing the version updates is minimal. The tables affected by the improvement commits are under source control, so they can be seen either to be unchanged by those commits (hence no behavior change at all), or to change in ways that the corresponding commit message explains, mostly as bug fixes.

The crux of the first set of commits is to make the code in `mk_invlists` closely resemble the text in the Unicode documents that spell out the rules. In essence, those documents define pseudo-code. This makes it far easier to compare what we have with what they say to do. I have WIP to actually parse the text of the documents, but that is for later. The documents are https://www.unicode.org/reports/tr29 and https://www.unicode.org/reports/tr14/. The culmination of that is to change this file so that the rules are ordered the same way as in the documents. There are certain advantages to doing things in reverse order, so that the lowest priority rule is applied first, and then overwritten by higher priority rules as needed. But that made it hard to compare what we have with what Unicode has. So these commits use the Unicode ordering and do the extra bookkeeping that was avoided by the old ordering.
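
The trade-off between the two orderings can be sketched roughly as follows. This is illustrative Python, not the Perl in `mk_invlists`; the rule list, the class names, and the function names are all hypothetical. It only shows why applying rules in document order needs a check that reverse order gets for free.

```python
# Hypothetical rules, listed highest priority first (the order the Unicode
# documents use).  Each rule is (class_before, class_after, allow_break).
# '*' stands for "any class".
CLASSES = ["AL", "NU", "SP"]
RULES = [
    ("AL", "AL", False),  # highest priority: no break between letters
    ("*", "SP", False),   # no break before a space
    ("*", "*", True),     # lowest priority: otherwise a break is allowed
]


def expand(before, after):
    """Yield every (before, after) table cell that a rule covers."""
    rows = CLASSES if before == "*" else [before]
    cols = CLASSES if after == "*" else [after]
    for r in rows:
        for c in cols:
            yield (r, c)


def build_reverse_order(rules):
    """Old approach: lowest priority first; later rules simply overwrite."""
    table = {}
    for before, after, allow in reversed(rules):
        for cell in expand(before, after):
            table[cell] = allow
    return table


def build_document_order(rules):
    """New approach: document order; a cell already decided by a higher
    priority rule must not be overwritten (the extra bookkeeping)."""
    table = {}
    for before, after, allow in rules:
        for cell in expand(before, after):
            if cell not in table:
                table[cell] = allow
    return table
```

Both builders produce the same table for the same rules; the second one just reads top to bottom in the same order as the specification text.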

Most of the rules for deciding if a break is possible don't require context. You don't break a word or a line between alphabetic characters (in Western languages anyway). You don't need to look further than the two characters on either side of the current position. Other pairs require more context to decide. In "123." do the digits form a word? The answer is maybe. If the character after the dot is a space, yes, they are considered to form a word and the dot is probably indicating the end of a sentence. If the character after the dot is 4, the dot is part of the word "123.4". For rules that require context, small DFAs are needed, and are placed in regexec.c. Unicode writes its rules assuming that the DFAs are stackable: at any given position, you can look behind, and if you find a quote you do one thing; if instead you find a parenthesis, you do another thing, etc. Perl got away without having to implement this full generality until 15.1 really made it necessary. The crux of this branch is changing to use stacked DFAs. This removes the need for the existing workarounds that compensated for the lack of that generality, and makes adding new DFAs in the same spot much easier.
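
To make the "123." example concrete, here is a small Python sketch of a pair table whose cells either give an answer directly or defer to a context check. It is not the regexec.c implementation and does not model stacked DFAs; the class names, the table contents, and `is_break` are hypothetical simplifications.

```python
BREAK, NO_BREAK, NEEDS_CONTEXT = "break", "no break", "needs context"


def classify(ch):
    """Toy classification: digits, the dot, whitespace, everything else."""
    if ch.isdigit():
        return "NU"
    if ch == ".":
        return "MID"
    if ch.isspace():
        return "SP"
    return "XX"


# Most cells are decided by the pair alone; two need a look at a neighbor.
PAIR_TABLE = {
    ("NU", "NU"): NO_BREAK,
    ("NU", "MID"): NEEDS_CONTEXT,  # depends on what follows the dot
    ("MID", "NU"): NEEDS_CONTEXT,  # depends on what precedes the dot
}


def is_break(text, pos):
    """True if a word break is allowed between text[pos-1] and text[pos]."""
    if pos <= 0 or pos >= len(text):
        return True  # edges of the text always break
    before, after = classify(text[pos - 1]), classify(text[pos])
    action = PAIR_TABLE.get((before, after), BREAK)
    if action != NEEDS_CONTEXT:
        return action == BREAK
    if (before, after) == ("NU", "MID"):
        # Look ahead: a digit after the dot keeps "123.4" together.
        nxt = text[pos + 1] if pos + 1 < len(text) else ""
        return not (nxt != "" and classify(nxt) == "NU")
    # ("MID", "NU"): look behind for a digit before the dot.
    prev = text[pos - 2] if pos >= 2 else ""
    return not (prev != "" and classify(prev) == "NU")
```

With this sketch, `is_break("123.4", 3)` is false (the dot stays inside the word) while `is_break("123. ", 3)` is true (the digits form a word and the dot stands alone).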

  • This set of changes requires a perldelta entry, to be furnished

The digest numbers keep changing in this branch.  Turn this test off
until near its end.
This series of commits has dozens of commits that would otherwise
require much more work to generate.  This commit temporarily turns off
generating EBCDIC tables, and the tables that only change when a new
Unicode release happens.  Bisecting on an ASCII machine is unaffected
This is useful in debugging
The previous names erroneously implied these were associated with the
parameters to these functions; instead rename to indicate they are
associated with some local variables.
This includes outdenting and indenting where future commits will add or
remove blocks
Changes the wording for some table headings in the generated file to
indicate where to find what the abbreviations mean
These lists are densely packed.  It is easier to find something if they
are sorted
This rule is not affected by spaces, yet the code was saying it should
be.
This rule is not affected by spaces, yet the code was saying it should
be.
This rule was written here to not include the actions when the character
before the candidate break position is a number.  This is just plain
wrong.  The Unicode rules have never said this.
Future Unicode releases will greatly explode the size of certain tables.
Prior to this commit, the minimum column size was two, but some table
columns fit in a single window column.  This commit changes to use the
minimum required.
These tables are placed in charclass_invlists.h.  They have a row and
column for what happens when the position being checked for is at the
start or end of the text.  This commit reorders the tables so that the
edge row and column are, well, at the edges.  And it relabels the labels
to be '^' and '$' respectively.
This uses a more complex algorithm to generate short labels to demarcate
rows and columns in some output tables.

This doesn't affect the current tables for Unicode 15.0, but will in
future Unicode releases.
where the next commits will want them
Everything is an action.  Some are accomplished via DFAs.  This commit
uses the latter word in places where it is a DFA.  It actually uses this
new term where it doesn't apply.  Future commits will remove those
inaccuracies.
Previously, we would just set an individual element directly.  This
changes most of those to use function calls instead.  This has two main
benefits.  The function can change what's being done without having to
change many lines; and these sets had a lot of visual noise with sigils
and hash references.  The result is a lot easier to read.

The next few commits will continue this process.

Note that the generated tables are unchanged by this commit.  It has no
effect on runtime processing.  That will be true of the next commits as
well.

It became obvious in doing this that the rule for Perl_Tailored_HSpace
does not belong in the 3's, but comes immediately before that.
Arbitrarily use '2z'
And pass the result to the subroutine.

This is in preparation for this value to be needed in additional places.
These cells exist so that code is less likely to need to be changed when
a new Unicode release comes along.  Currently it doesn't matter at all
what is in those cells, because they are never read.  But future commits
will want to make sure they don't refer to obsolete DFAs, references to
which could be undefined symbols that would abort the compilation.

The choice of 0 or 1 to put in the cells was arbitrary; I know of no
reason to prefer one or the other
This now matches the order that Unicode gives; for easier checking that
our code matches their demands.
Instead of having to loop through all the cells of a row or column, this
commit uses '*' to represent the whole thing.  This is more in keeping
with the text of the Unicode rules, which just leave things blank to
mean everything.
This follows up on the previous commit which allowed simply specifying
an entire row or column.  This adds the ability to specify a list.
This new function allows removing loops from the main code
And use it in one instance.

Previous commits have added the ability to pass multiple items simply to
the functions that work on rows and columns.  This now gives the ability
to complement the set of the multiple items passed.
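
A rough sketch of the row/column specifications described in the last few commits, written in Python rather than the Perl of the actual generator. The function names, the '^' complement marker, and the class names are all hypothetical; the commit messages only say that '*', lists, and complements of lists are supported.

```python
CLASSES = ["AL", "NU", "SP", "QU"]


def resolve(spec, names=CLASSES):
    """Turn a row or column spec into a list of class names.

    '*'                -> every class
    'AL'               -> just that class
    ['AL', 'NU']       -> those classes
    ['^', 'SP', 'QU']  -> every class except those (hypothetical syntax)
    """
    if spec == "*":
        return list(names)
    if isinstance(spec, str):
        return [spec]
    if spec and spec[0] == "^":
        excluded = set(spec[1:])
        return [n for n in names if n not in excluded]
    return list(spec)


def set_cells(table, rows, cols, value):
    """Set every cell in the cross product of the two specs."""
    for r in resolve(rows):
        for c in resolve(cols):
            table[(r, c)] = value


table = {}
set_cells(table, "*", "SP", "no_break")  # whole column
set_cells(table, ["AL", "NU"], ["^", "SP", "QU"], "no_break")  # list x complement
```

The point of the design is that the calling code contains no loops: each call reads like one line of the rule it implements.
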
This is separated out from the previous commit because it is tricky XXX
If both branches of an else lead to the same result, skip the else and
set the result unconditionally.  That's what this commit does for DFAs
that get the same value if they succeed as when they don't.

There is one current case where the DFA can return an anomalous result,
so it can't be optimized out.  Add a field to the hash entry defining
that entry, so it doesn't get optimized.
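
The optimization in this commit can be sketched as below, in Python for illustration only. The entry layout and the field names `on_success`, `on_failure`, and `keep` are hypothetical stand-ins for the hash entry the message mentions.

```python
def simplify(entry):
    """Replace a DFA entry by a plain result when running it is pointless.

    entry is a dict with 'on_success' and 'on_failure' results and an
    optional 'keep' flag that exempts it from the optimization.
    """
    if entry.get("keep"):
        return entry  # may return an anomalous result; must really run
    if entry["on_success"] == entry["on_failure"]:
        return entry["on_success"]  # same answer either way: skip the DFA
    return entry
```

For example, `simplify({"on_success": "break", "on_failure": "break"})` collapses to `"break"`, while the same entry with `"keep": True` is returned untouched.
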
A couple of commits ago, the last necessarily-hard-coded DFA enum
besides 0 and 1 was removed.  This allows for all the rest to be
assigned by using the value of an incrementing variable.

This makes it easy to add DFAs in the middle of existing ones, as will
happen as future Unicode releases come our way.
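
As a minimal illustration of why counter-assigned values make insertion easy (Python, with hypothetical DFA names; only the reservation of 0 and 1 comes from the commit message):

```python
def assign_dfa_numbers(dfa_names):
    """Number the DFAs in list order, starting after the reserved 0 and 1."""
    numbers = {}
    next_value = 2  # 0 and 1 remain hard-coded elsewhere
    for name in dfa_names:
        numbers[name] = next_value
        next_value += 1
    return numbers


before = assign_dfa_numbers(["dfa_a", "dfa_c"])
after = assign_dfa_numbers(["dfa_a", "dfa_b", "dfa_c"])  # one added in the middle
```

Adding `dfa_b` only requires adding it to the list; `dfa_c` is renumbered automatically instead of by hand.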

haarg commented Apr 19, 2025

I think it would be a shame to ship a new perl release that didn't have the latest Unicode standard. Something breaking based on a new Unicode version seems unlikely to be found in things like BBC reports. It seems more likely to only be found in a stable release. And it should only impact programs that were already doing something incorrect. So I'm inclined to go ahead with merging this PR before 5.42.

However, this is a rather large change to regexec.c which I'm not sure I'm the best person to evaluate. And this PR has a failing test.

Nit: The "Temporarily skip" commit and its revert should be removed before merging.


Leont commented Apr 19, 2025

Failed test ''10000000000000000' is listed as an alias for prop_value_aliases('-_ nv', '1_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0')'

It is suspicious that a test involving a number that doesn't fit in a 32 bit integer fails on a 32 bit perl.


jkeenan commented Apr 19, 2025

This p.r. is repeatedly failing one test in the 'linux i386/ubuntu' run on our GH CI.

#   Failed test ''10000000000000000' is listed as an alias for prop_value_aliases('-_ nv', '1_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0')'
#   at ../lib/Unicode/UCD.t line 1289.
# 
# Looks like you failed 1 test of 15362.
# 
lib/Unicode/UCD .................................................. FAILED at test 7518

I re-started that particular run, but am observing the same error as in the original run.

khwilliamson and others added 15 commits April 20, 2025 05:46
This is just for legibility of reading the rules
The numeric value for U+5146 changed in 15.1
In Unicode 15.1, the ideograph U+4EAC now has a numeric value, and that
value is 10 quadrillion (1e+16).  This is the first instance in Unicode
of an integer not fitting in a 32 bit word, as this requires 54 bits.
One of the tests in UCD.t requires round-trip equality in converting
from string to number and back; skip it for this case and any future
similar ones.

I find it interesting that U+4EAC is listed as having the meaning
"capital city".
Now we are ready to use a new Unicode version, we have to regenerate
everything.  This was turned off earlier in this branch temporarily
until now so as to speed up the testing, as it was known these values
wouldn't change until now.
This program generates tables for the Break properties that are somewhat
human readable.  Before this commit, just the heading line for a column
determined its width.  This commit factors in the maximum width of any
cell in the column as well.  It used to be that this required a separate
pass, and so wasn't done.  But now that separate pass is required anyway
for other reasons, and it is simple to add to it this check.
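
The width calculation described here amounts to the following, shown as a Python sketch rather than the generator's Perl. The function names and the right-justified layout are assumptions made for the illustration.

```python
def column_widths(headings, rows):
    """Width of each column: the widest of its heading and all its cells."""
    widths = [len(h) for h in headings]
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(str(cell)))
    return widths


def render(headings, rows):
    """Render a right-justified table using the computed widths."""
    widths = column_widths(headings, rows)
    lines = []
    for row in [headings] + rows:
        lines.append(" ".join(str(c).rjust(w) for c, w in zip(row, widths)))
    return "\n".join(lines)
```

Considering only the heading would give the first column of `column_widths(["A", "BB"], [[1, 2], [10, 3]])` a width of 1, too narrow for the cell `10`; taking the cells into account gives `[2, 2]`.
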
This includes updates to a few perl files that need to know the
current Unicode version, and regenerating perl files that depend on the
Unicode data
This had been turned off in this branch to speed up compilation, and
hence development.  The code mostly changed in this branch is the same
as in ASCII anyway.  It could have become an issue only if someone tried
to bisect on an EBCDIC machine, which I don't believe has happened in
decades, if ever.
This temporary commit has now served its purpose.

khwilliamson commented

The character '9' has the numeric value 9. We're used to characters only taking on numeric values of 0-9. But some scripts have characters whose numeric values are something else, typically when the script doesn't use decimal positional notation. A fairly common case is that the script will have a symbol for 10, one for 20, etc. Roman numerals are a case that many people in the West know about, with single characters that signify values other than 0-9. (I recently learned that it wasn't until Shakespeare's time that England taught school children anything but Roman numerals.) Some scripts, typically East Asian, have characters, typically ideographs, that symbolize very large numbers. Unicode 15.1 introduced for the first time one that has a value that doesn't fit in a 32 bit word. The build that was failing was a 32 bit one. And the test that was failing is just one that introduces underscores between digits and verifies that the underscores are ignored, as Unicode requires, in pattern matches. The test is simplistic, and expects that a round trip of string to number and back yields the same string. That won't be true for a value that doesn't fit in a word, because it is converted to an NV. So the test isn't valid on such a value, and the fix was to simply skip it.
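
The arithmetic behind the failure can be checked with a short Python sketch. It does not reproduce the UCD.t test; the helper names are hypothetical, and formatting the floating point value with C-style `%.15g` is an assumption about how such a value would be stringified on a 32 bit build.

```python
def strip_underscores(s):
    """Remove the underscores the test inserts between digits."""
    return s.replace("_", "")


def round_trip_via_float(s):
    """String -> floating point -> string, using %.15g formatting."""
    return "%.15g" % float(s)


digits = strip_underscores("1_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0")
value = int(digits)

fits_in_32_bits = value < 2**32      # False: far too large
bits_needed = value.bit_length()     # 54
as_float_string = round_trip_via_float(digits)  # '1e+16', not the digit string
```

A small value such as "9" survives the same round trip unchanged, which is why only this one property value tripped the test.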

karenetheridge commented

Awaiting signoff from a PSC member for this to be in 5.41.11.

@karenetheridge karenetheridge requested review from ap, book and haarg April 20, 2025 17:10
@khwilliamson khwilliamson merged commit 7c4efc4 into Perl:blead Apr 20, 2025
33 checks passed
@khwilliamson khwilliamson deleted the 16.0 branch April 20, 2025 19:21
khwilliamson commented

With a bit of private encouragement, I merged this to get it in 5.42. @karenetheridge said she would issue a new release should problems with this arise.
