Support Unicode 16.0 #23205
The digest numbers keep changing in this branch. Turn this test off until near its end.
This branch has dozens of commits that would otherwise require much more work to generate. This commit temporarily turns off generating the EBCDIC tables and the tables that only change when a new Unicode release happens. Bisecting on an ASCII machine is unaffected.
This is useful in debugging
The previous names erroneously implied these were associated with the parameters to these functions; instead rename to indicate they are associated with some local variables.
This includes outdenting and indenting where future commits will add or remove blocks
Changes the wording for some table headings in the generated file to indicate where to find what the abbreviations mean
These lists are densely packed. It is easier to find something if they are sorted
This rule is not affected by spaces, yet the code was saying it should be.
This rule is not affected by spaces, yet the code was saying it should be.
This rule was written here to not include the actions when the character before the candidate break position is a number. This is just plain wrong. The Unicode rules have never said this.
Future Unicode releases will greatly explode the size of certain tables. Prior to this commit, the minimum column size was two, but some table columns fit in a single window column. This commit changes to use the minimum required.
This improves readability
These tables are placed in charclass_invlists.h. They have a row and column for what happens when the position being checked for is at the start or end of the text. This commit reorders the tables so that the edge row and column are, well, at the edges. And it relabels the labels to be '^' and '$' respectively.
This uses a more complex algorithm to generate short labels to demarcate rows and columns in some output tables. This doesn't affect the current tables for Unicode 15.0, but will in future Unicode releases.
where the next commits will want them
Everything is an action. Some are accomplished via DFAs. This commit uses the latter word in places where it is a DFA. It also uses this new term in some places where it doesn't apply. Future commits will remove those inaccuracies.
Previously, we would just set an individual element directly. This changes most of those to use function calls instead. This has two main benefits. The function can change what's being done without having to change many lines; and these sets had a lot of visual noise with sigils and hash references. The result is a lot easier to read. The next few commits will continue this process. Note that the generated tables are unchanged by this commit. It has no effect on runtime processing. That will be true of the next commits as well. It became obvious in doing this that the rule for Perl_Tailored_HSpace does not belong in the 3's, but comes immediately before that. Arbitrarily use '2z'
And pass the result to the subroutine. This is in preparation for this value to be needed in additional places.
These cells exist so that code is less likely to need to be changed when a new Unicode release comes along. Currently it doesn't matter at all what is in those cells, because they are never read. But future commits will want to make sure they don't refer to DFAs that are obsolete, since references to those could be undefined symbols that would abort the compilation. The choice of 0 or 1 to put in the cells was arbitrary; I know of no reason to prefer one or the other.
This now matches the order that Unicode gives; for easier checking that our code matches their demands.
Instead of having to loop through all the cells of a row or column, this commit uses '*' to represent the whole thing. This is more in keeping with the text of the Unicode rules, which just leave things blank when they mean everything.
This follows up on the previous commit which allowed simply specifying an entire row or column. This adds the ability to specify a list.
This new function allows removing loops from the main code
And use it in one instance. Previous commits have added the ability to pass multiple items simply to the functions that work on rows and columns. This now gives the ability to complement the set of the multiple items passed.
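The row and column helpers described in the last few commit messages can be pictured with a minimal sketch. It is written in Python purely for illustration; the class names, the `expand` and `set_cells` helpers, and the action strings are all hypothetical and are not the names used in the actual generator.

```python
# Hypothetical break classes; the real tables use Unicode property values.
CLASSES = ["AL", "NU", "SP", "CR", "LF"]


def expand(spec, complement=False):
    """Turn '*', a single class name, or a list of names into a list of classes.

    With complement=True, return every class that was NOT named.
    """
    if spec == "*":
        names = list(CLASSES)
    elif isinstance(spec, str):
        names = [spec]
    else:
        names = list(spec)
    if complement:
        names = [c for c in CLASSES if c not in names]
    return names


def set_cells(table, before, after, action,
              complement_before=False, complement_after=False):
    """Set every cell selected by the two specs to the given action."""
    for b in expand(before, complement_before):
        for a in expand(after, complement_after):
            table[b][a] = action


# Start with every pair allowing a break, then apply a few made-up rules.
table = {b: {a: "break" for a in CLASSES} for b in CLASSES}
set_cells(table, "CR", "LF", "no_break")                    # one cell
set_cells(table, "*", "SP", "no_break")                     # a whole column
set_cells(table, ["AL", "NU"], ["AL", "NU"], "no_break")    # lists on both sides
```

Routing every assignment through one function means the loops, and any later bookkeeping, live in a single place rather than at each call site.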
This is separated out from the previous commit because it is tricky XXX
If both branches of an else lead to the same result, skip the else and set the result unconditionally. That's what this commit does for DFAs that get the same value if they succeed as when they don't. There is one current case where the DFA can return an anomalous result, so it can't be optimized out. Add a field to the hash entry defining that entry, so it doesn't get optimized.
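As an illustration of that optimization, here is a small Python sketch. The DFA names, the `on_match`/`on_fail` fields, and the `no_optimize` flag are assumptions made up for this example; the real hash entries may be laid out differently.

```python
# Hypothetical DFA definitions: the result when the DFA matches and when it fails.
dfas = {
    "DFA_a": {"on_match": "no_break", "on_fail": "break"},
    "DFA_b": {"on_match": "break", "on_fail": "break"},
    # Same result either way, but flagged so that it is still run.
    "DFA_c": {"on_match": "break", "on_fail": "break", "no_optimize": True},
}


def resolve(name):
    """Return a constant action if the DFA cannot change the outcome.

    Otherwise return the DFA's name, meaning it must be run at match time.
    """
    entry = dfas[name]
    if entry["on_match"] == entry["on_fail"] and not entry.get("no_optimize"):
        return entry["on_match"]
    return name
```

Here `resolve("DFA_b")` collapses to the plain action, while `DFA_a` and the flagged `DFA_c` are kept as DFAs.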
A couple of commits ago, the last necessarily-hard-coded DFA enum besides 0 and 1 was removed. This allows for all the rest to be assigned by using the value of an incrementing variable. This makes it easy to add DFAs in the middle of existing ones, as will happen as future Unicode releases come our way.
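A rough Python sketch of that numbering scheme follows. The names and the meanings attached to 0 and 1 are placeholders invented for the example, not the identifiers in the generated header.

```python
def assign_enums(dfa_names, fixed):
    """Give each DFA the next free integer after the hard-coded values."""
    values = dict(fixed)
    next_value = max(fixed.values()) + 1
    for name in dfa_names:
        values[name] = next_value
        next_value += 1
    return values


# Only 0 and 1 are fixed; everything else follows from list order.
FIXED = {"ACTION_0": 0, "ACTION_1": 1}          # placeholder names
enums = assign_enums(["DFA_x", "DFA_y", "DFA_z"], FIXED)

# Inserting a new DFA in the middle just renumbers the ones after it.
enums_later = assign_enums(["DFA_x", "DFA_new", "DFA_y", "DFA_z"], FIXED)
```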
I think it would be a shame to ship a new perl release that didn't have the latest Unicode standard. Something breaking based on a new Unicode version seems unlikely to be found in things like BBC reports. It seems more likely to be found only in a stable release, and should only impact programs that were already doing something incorrect. So I'm inclined to go ahead with merging this PR before 5.42. However, this is a rather large change to
Nit: The "Temporarily skip" commit and its revert should be removed before merging.
It is suspicious that a test involving a number that doesn't fit in a 32 bit integer fails on a 32 bit perl.
This PR is repeatedly failing one test in the 'linux i386/ubuntu' run on our GH CI. I re-started that particular run, but am observing the same error as in the original run.
This is just for legibility of reading the rules
The numeric value for U+5146 changed in 15.1
In Unicode 15.1, the ideograph U+4EAC now has a numeric value, and that value is 10 quadrillion (1e+16). This is the first instance in Unicode of an integer not fitting in a 32 bit word, as this requires 54 bits. One of the tests in UCD.t requires round-trip equality in converting from string to number and back; skip it for this case and any future similar ones. I find it interesting that U+4EAC is listed as having the meaning "capital city".
Now that we are ready to use a new Unicode version, we have to regenerate everything. This was temporarily turned off earlier in this branch to speed up testing, as it was known these values wouldn't change until now.
This program generates tables for the Break properties that are somewhat human readable. Before this commit, just the heading line for a column determined its width. This commit factors in the maximum width of any cell in the column as well. It used to be that this required a separate pass, and so wasn't done. But now that separate pass is required anyway for other reasons, and it is simple to add to it this check.
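The width calculation can be sketched in a few lines of Python. The function name and the sample data are hypothetical; the point is only that each column ends up as wide as its widest entry, heading included, and no wider.

```python
def column_widths(headings, rows):
    """Width of each column: the longest of its heading and all its cells."""
    widths = [len(h) for h in headings]
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(str(cell)))
    return widths


def format_row(cells, widths):
    """Right-justify each cell to its column width, separated by one space."""
    return " ".join(str(c).rjust(w) for c, w in zip(cells, widths))


headings = ["^", "AL", "NU"]
rows = [[0, 1, 0], [1, 12, 0]]
widths = column_widths(headings, rows)
lines = [format_row(headings, widths)] + [format_row(r, widths) for r in rows]
```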
This includes updates to a few perl files that need to know the current Unicode version, and regenerating perl files that depend on the Unicode data.
This had been turned off in this branch to speed up compilation, and hence development. The code mostly changed in this branch is the same as in ASCII anyway. It could have become an issue only if someone tried to bisect on an EBCDIC machine, which I don't believe has happened in decades, if ever.
This temporary commit has now served its purpose.
The character '9' has the numeric value 9. We're used to characters only taking on numeric values of 0-9. But some scripts have characters whose numeric values are something else, typically when the script doesn't use decimal positional notation. A fairly common case is that the script will have a symbol for 10, one for 20, etc. Roman numerals are a case, known to many people in the West, that has single characters signifying values other than 0-9. (I recently learned that it wasn't until Shakespeare's time that England taught school children anything but Roman numerals.) Some scripts, typically East Asian, have characters, typically ideographs, that symbolize very large numbers. Unicode 15.1 introduced for the first time one that has a value that doesn't fit in a 32 bit word. The build that was failing was a 32 bit one. And the test that was failing is just one that introduces underscores between digits and verifies that the underscores are ignored, as Unicode requires, in pattern matches. The test is simplistic, and expects that a round trip of string to number and back yields the same string. That won't be true for a value that doesn't fit in a word, because it is converted to an NV. So the test isn't valid on such a value, and the fix was to simply skip it.
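The failure mode can be imitated with a short Python sketch. It is only a model: the `round_trips` helper is hypothetical, and the `%.15g` format is an assumption standing in for how a floating-point value might be turned back into a string once the number no longer fits in a native integer.

```python
def round_trips(text, int_bits):
    """Model a string -> number -> string round trip on a build whose
    native signed integers are int_bits wide."""
    value = int(text)
    if value.bit_length() < int_bits:
        # Fits in a signed native integer, so it stringifies digit for digit.
        return str(value) == text
    # Too big: fall back to floating point (assumed %.15g-style output).
    return "%.15g" % float(value) == text


TEN_QUADRILLION = "10000000000000000"   # the numeric value given for U+4EAC

fits_32 = round_trips(TEN_QUADRILLION, 32)
fits_64 = round_trips(TEN_QUADRILLION, 64)
```

Under these assumptions the 64-bit case gets the same digits back, while the 32-bit case gets an exponent form, so a plain string comparison fails even though the number itself is unchanged.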
Awaiting signoff from a PSC member for this to be in 5.41.11.
With a bit of private encouragement, I merged this to get it in 5.42. @karenetheridge said she would issue a new release should problems with this arise.
This branch adds support for Unicode 15.1, then 16.0
The process for upgrading to new Unicode releases is mostly automatic and non-problematic, with the exception of the break position algorithms used in regular expression patterns: \X, \b{gcb}, \b{wb}, \b{lb}, and \b{sb}. sb hasn't been changed in a long time, and this PR doesn't affect it. The others are affected.
The new releases are described in https://www.unicode.org/versions/Unicode15.1.0/ and https://www.unicode.org/versions/Unicode16.0.0/
I assert that no correctly functioning program should expect that an unassigned code point that is eligible for assignment will stay that way. And that is what these releases are mostly about: creating characters where before there were none.
The best case for saying that this pull request changes the behavior of correctly functioning programs is U+5146, which is a Han script character meaning trillion in Taiwan and Japan, and million in mainland China. Prior to 15.1 the trillion meaning was what Unicode took; starting with 15.1, it takes on the mainland meaning of million.
Other than this, the changes (as opposed to additions) involve the \X (\b{gcb}) and especially the \b{lb} constructs. The latter has significant improvements for Indic languages, which Unicode says bring its Standard into much better compliance with what native speakers expect. There are also changes to the line breaking algorithm for text enclosed in the « » quotation marks used in France.
This series of commits is lengthy, but all but a few are exclusively about the tables used by regexec.c to determine if a break of the given type is permissible at this position in the parse string. The first many are to improve our infrastructure to make it easier to update these properties in the future. As a result, the code actually doing the version updates is minimal. The tables affected by the improvement commits are under source control, and can be seen to be unaffected by them, hence with no behavior change at all; any changes are explained in the corresponding commit message, and are mostly bug fixes.
The crux of the first set of commits is to make the code in 'mk_invlists' closely resemble the text in the Unicode documents that spell out the rules. In essence, those documents define pseudo-code. This makes it far easier to compare what we have with what they say to do. I have WIP to actually parse the text of the documents, but that is for later. The documents are https://www.unicode.org/reports/tr29 and https://www.unicode.org/reports/tr14/. The culmination of that is to change this file so that the rules are ordered the same way as in the documents. There are certain advantages to doing things in reverse order, so that the lowest priority rule is applied first, and then overwritten by higher priority rules as needed. But it made it hard to compare what we have with what Unicode has. So these commits use the Unicode ordering and do the extra bookkeeping that was avoided by the old ordering.
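The difference between the two orderings can be shown with a toy Python sketch. The three classes and four rules are invented for the example and are not actual Unicode rules; the extra bookkeeping in the second function is modeled as simply remembering which cells an earlier, higher priority rule has already claimed.

```python
CLASSES = ["AL", "NU", "SP"]

# Made-up rules in document order, highest priority first.
# Each is (class before, class after, action); "*" means any class.
RULES = [
    ("*", "SP", "no_break"),
    ("SP", "*", "break"),
    ("AL", "NU", "no_break"),
    ("*", "*", "break"),
]


def matches(spec, cls):
    return spec == "*" or spec == cls


def build_reverse(rules):
    """Old style: apply the lowest priority first and let later ones overwrite."""
    table = {}
    for before, after, action in reversed(rules):
        for b in CLASSES:
            for a in CLASSES:
                if matches(before, b) and matches(after, a):
                    table[(b, a)] = action
    return table


def build_in_order(rules):
    """Document order: the first rule to claim a cell wins, so track claimed cells."""
    table = {}
    for before, after, action in rules:
        for b in CLASSES:
            for a in CLASSES:
                if matches(before, b) and matches(after, a) and (b, a) not in table:
                    table[(b, a)] = action
    return table


reverse_table = build_reverse(RULES)
ordered_table = build_in_order(RULES)
```

Both functions produce the same table; the second reads top to bottom in the same order as the rule list it was built from.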
Most of the rules for deciding if a break is possible don't require context. You don't break a word or a line between alphabetic characters (in Western languages anyway). You don't need to look further than the two characters on either side of the current position. Other pairs require more context to decide. In "123." do the digits form a word? The answer is maybe. If the character after the dot is a space, yes, they are considered to form a word and the dot is probably indicating the end of a sentence. If the character after the dot is 4, the dot is part of the word "123.4". For rules that require context, small DFAs are needed, and are placed in regexec.c.
Unicode writes its rules assuming that the DFAs are stackable: at any given position, you can look behind, and if you find a quote you do one thing; if instead you find a parenthesis, you do another thing, etc. Perl got away without having to implement this full generality until 15.1 really made it necessary. The crux of this branch is changing to use stacked DFAs. This removed the need for the existing workarounds that compensated for the lack of that generality, and makes adding new DFAs in the same spot much easier.
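The "123." example can be sketched in Python as a pair table whose cells hold either a fixed answer or the name of a context check. Everything here is a simplified assumption: the three classes, the table contents, and the two one-character look-ahead and look-behind checks, which stand in for the small DFAs and do not model stacking.

```python
def classify(ch):
    """Very rough stand-in for a break property lookup."""
    if ch.isdigit():
        return "NU"
    if ch == ".":
        return "MID"
    if ch.isspace():
        return "SP"
    return "XX"


# Cells hold a fixed action or the name of a context check to run.
PAIR_TABLE = {
    ("NU", "NU"): "no_break",
    ("NU", "MID"): "need_digit_after",
    ("MID", "NU"): "need_digit_before",
}


def is_word_break(text, pos):
    """Is a break allowed between text[pos-1] and text[pos]?"""
    if pos == 0 or pos == len(text):
        return True                      # always break at the edges of the text
    before, after = classify(text[pos - 1]), classify(text[pos])
    action = PAIR_TABLE.get((before, after), "break")
    if action == "need_digit_after":
        # digit followed by '.': stay joined only if another digit follows the dot
        joined = pos + 1 < len(text) and classify(text[pos + 1]) == "NU"
        return not joined
    if action == "need_digit_before":
        # '.' followed by digit: stay joined only if a digit precedes the dot
        joined = pos >= 2 and classify(text[pos - 2]) == "NU"
        return not joined
    return action == "break"
```

With this sketch, `is_word_break("123. ", 3)` allows a break before the dot, while `is_word_break("123.4", 3)` does not, matching the two cases described above.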