@@ -1311,53 +1311,50 @@ sub _Perl_CCC_non0_non230 {
13111311# BREAK_PROPERTIES
13121312
13131313# All but the Sentence Break properties are implemented by two-dimensional
1314- # tables. (That one does not lend itself to tabular lookup, and is rarely
1315- # changed, so it is all done in code in regexec.c.) Unicode publishes
1316- # properties which assign a break class to every Unicode code point, even ones
1317- # that haven't been assigned to be characters. (Perl uses that class for all
1318- # non-Unicode code points.) Unicode also publishes rules for breaking based
1319- # on those break classes. Here we create tables for each break property that
1320- # for a string xy, which have break classes x' and y', we tell whether a break
1321- # is allowed between x and y or not. The rows of this table are the various
1322- # x'; the columns, the y'. Often the table entry will be just 0 or 1. But
1323- # increasingly in newer Unicode versions, more context is needed to make this
1324- # determination, and the table entry will be an enum (packed with other
1325- # information) that corresponds to a hand-crafted DFA in regexec.c that gets
1326- # executed.
1314+ # tables, with additional small DFAs for when the tables are insufficient.
1315+ # (SB does not lend itself to tabular lookup, and is rarely changed, so it is
1316+ # all done in code in regexec.c.) Unicode publishes properties which assign a
1317+ # break class to every Unicode code point, even ones that haven't been
1318+ # assigned to be characters. (Perl uses that class for all non-Unicode code
1319+ # points.) Unicode also publishes rules for breaking based on those break
1320+ # classes. Here we create tables for each break property that for a string
1321+ # xy, which have break classes x' and y', we tell whether a break is allowed
1322+ # between x and y or not. The rows of this table are the various x'; the
1323+ # columns, the y'. Often the table entry will be just the booleans 0 or 1.
1324+ # But increasingly in newer Unicode versions, more context is needed to make
1325+ # this determination. Looking around at the context requires a DFA. Each of
1326+ # these is hand-coded in regexec.c, and is identified by a number which is
1327+ # a case: in a switch() statement there. This program creates #defines for
1328+ # those DFA numbers. XXX an enhancement would be to make these enums. The
1329+ # (x,y) cell contents when a DFA is needed are described below.
13271330#
1328- # Unicode used to publish a table itself for the Line Break property, but
1329- # abandoned it as it got more complicated. However, on their website in the
1330- # UCD data files, in the subdirectory 'auxiliary', there are files like
1331- # 'LineBreakTest.html' that do show annotated pairwise tables. Unicode no
1332- # longer feels constrained to make their rules easy to implement this way.
1333- # Perl wants to keep using the table, as it makes it easier to find the break
1334- # status in the middle of the string instead of having to start each time at
1335- # the beginning, and a goodly number of the possibilities are 0 or 1 anyway,
1336- # without needing the DFA. But this makes it a pain to update to a new
1337- # Unicode release when they add rules. An example is in Unicode 15.1, where
1338- # new GCB rules make use of a new property, Indic_Conjunct_Break that is
1339- # unrelated to GCB. In order for Perl to continue using the table, we have to
1340- # make new equivalence classes in GCB for the Indic property values. This
1341- # would mean we need all combinations of the intersections
1342- # GCB1_Indic1, GCB1_Indic2, ... GCBn_Indic1, # GCB2_Indic1, ...
1343- # Fortunately all but 4 of these intersections are empty in 15.1. But a
1344- # future release might change that, and this would have to be manually
1345- # compensated for. The rules that involve GCB1 now have to change to also
1346- # include GCB1_Indic1, GCB1_Indic2, ...
1331+ # The Unicode rules are listed in UAX #14 and UAX #29 in priority order for
1332+ # each type of break. When context is needed, more than one DFA may apply to
1333+ # a given cell. For example, in the Line Break property, when x is a space,
1334+ # and y is almost anything else, we have to look behind to see what came
1335+ # before the space. (Usually we have to back up to the first non-space when
1336+ # there are multiple spaces in a row.) If that non-space is a quote we likely
1337+ # will have a different rule than if it is a right parenthesis. For all cells
1338+ # in this type of situation, this program creates a chain of DFAs to apply in
1339+ # priority order. The first one that matches the situation is used; if none
1340+ # do, there is a fallback 0 or 1 that ends the chain.
13471341#
1348- # The code in this file populate the tables based on data output from
1349- # mktables. The Unicode rules are listed in UAX #14 and UAX #29 in priority
1350- # order for each type of line break. Suppose you want to determine if there
1351- # is a break between x and y. You start at rule #1, and see if it applies.
1352- # If not, you proceed to rule #2, and so on, stopping at the first match.
1342+ # This program creates a linear array of all the chains strung together. What
1343+ # gets stored in the (x,y) cell of the main table is the index into this array
1344+ # where the first DFA number for its chain is stored.
13531345#
1354- # This works well when the cells unconditionally return break/no-break (1 or
1355- # 0). But consider the case that we apply rule #a which requires a DFA. If
1356- # that fails to match we're supposed to try rule #a+1, #a+2, ..., stopping at
1357- # the first match. The table is constructed so that the final rule matches
1358- # everything, so the process is guaranteed to halt. And it likely will halt
1359- # earlier at the first unconditional match. Now this generates a chain of
1360- # DFAs for regexec.c to follow, stopping at the first successful match.
1346+ # Unicode no longer feels constrained to make their rules easy to implement
1347+ # in a pair-wise table. An example is in Unicode 15.1, where new GCB rules
1348+ # make use of a new property, Indic_Conjunct_Break that is unrelated to GCB.
1349+ # In order for Perl to continue using the table, we have to make new
1350+ # equivalence classes in GCB for the Indic property values. Thus we would
1351+ # need to split the code points in class GCBx into the ones that are in
1352+ # GCBx-nonIndic, the ones that are in GCBx-Indic1, the ones that are in
1353+ # GCBx-Indic2, .... And class GCBx would be subdivided into the appropriate
1354+ # subclasses. (It turns out that many of these don't contain any code points,
1355+ # so aren't actually needed.) It is now possible to tell mktables what a split
1356+ # should be, and it takes care of the rest, passing to this program the
1357+ # results, in a data structure.
13611358
13621359# These functions access the cells of a break table, converting any mnemonics
13631360# to numeric. They need $enums to be able to do this.
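
The chain layout described in the new comment can be pictured from the consuming side with a small C sketch. This is not the code in regexec.c and not what this program actually emits: the break classes, the DFA numbers, the table contents, and the names `chains`, `cell`, `run_dfa`, and `break_allowed` are all hypothetical, and the two toy rules are invented for illustration rather than taken from UAX #14. It assumes only what the comment states: a cell holds an index into a linear array of chains, each chain lists DFA numbers in priority order, and each chain ends with a fallback 0 or 1.

```c
#include <stddef.h>

/* Hypothetical break classes; the real ones come from mktables. */
enum brk_class { CL_QU, CL_CP, CL_SP, CL_AL, CL_COUNT };

/* Hypothetical DFA numbers.  0 and 1 are the fallbacks that end a chain. */
enum { FALLBACK_NOBREAK = 0, FALLBACK_BREAK = 1,
       DFA_AFTER_QUOTE = 2, DFA_AFTER_CLOSE = 3 };

/* All chains strung together in one linear array. */
static const unsigned char chains[] = {
    FALLBACK_NOBREAK,                                 /* index 0 */
    FALLBACK_BREAK,                                   /* index 1 */
    DFA_AFTER_QUOTE, DFA_AFTER_CLOSE, FALLBACK_BREAK  /* index 2..4 */
};

/* Rows are x', columns are y'; each cell is an index into chains[]. */
static const unsigned char cell[CL_COUNT][CL_COUNT] = {
    /*          QU CP SP AL */
    /* QU */  { 0, 0, 0, 0 },
    /* CP */  { 1, 0, 0, 1 },
    /* SP */  { 2, 0, 0, 2 },
    /* AL */  { 1, 0, 0, 0 },
};

/* Toy DFAs: back up over the run of spaces that ends at pos-1, then look
 * at the class before it.  Returns 0 or 1 on a match, -1 on no match. */
static int run_dfa(int dfa, const enum brk_class *s, size_t pos)
{
    size_t i = pos - 1;
    while (i > 0 && s[i] == CL_SP)
        i--;
    if (s[i] == CL_SP)
        return -1;              /* nothing but spaces before pos */

    switch (dfa) {
    case DFA_AFTER_QUOTE: return s[i] == CL_QU ? 0 : -1;
    case DFA_AFTER_CLOSE: return s[i] == CL_CP ? 0 : -1;
    default:              return -1;
    }
}

/* Is a break allowed between s[pos-1] and s[pos]?  Requires pos >= 1. */
static int break_allowed(const enum brk_class *s, size_t pos)
{
    size_t i = cell[s[pos - 1]][s[pos]];
    for (;;) {
        unsigned char entry = chains[i++];
        if (entry <= FALLBACK_BREAK)
            return entry;       /* fallback 0 or 1 ends the chain */
        int r = run_dfa(entry, s, pos);
        if (r >= 0)
            return r;           /* first DFA that matches wins */
    }
}
```

In this sketch a cell that needs no context simply points at a one-entry chain holding 0 or 1, so every lookup goes through the same loop. Whether the generated tables handle plain booleans that way is not stated in the comment.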
@@ -2436,23 +2433,6 @@ ()
24362433 }
24372434 );
24382435
2439- # The result is really just true or false. But we follow along with tr14,
2440- # creating a rule which is false for something like X SP* X. That gets
2441- # encoding 2. The rest of the dfas are synthetic ones that indicate
2442- # some context handling is required. These each are added to the
2443- # underlying 0, 1, or 2, instead of replacing them, so that the underlying
2444- # value can be retrieved. Actually only rules from 7 through 18 (which
2445- # are the ones where space matter) are possible to have 2 added to them.
2446- # The others below add just 0 or 1. It might be possible for one
2447- # synthetic rule to be added to another, yielding a larger value. This
2448- # doesn't happen in the Unicode 8.0 rule set, and as you can see from the
2449- # names of the middle grouping below, it is impossible for that to occur
2450- # for them because they all start with mutually exclusive classes. That
2451- # the final rule can't be added to any of the others isn't obvious from
2452- # its name, so it is assigned a power of 2 higher than the others can get
2453- # to so any addition would preserve all data. (And the code will reach an
2454- # assert(0) on debugging builds should this happen.)
2455-
24562436 my $lb_enum = 2;
24572437 my %lb_dfas = (
24582438 LB_NOBREAK => {