Commit 9a48315

mk_invlists: Update comments
1 parent 1b7d992 commit 9a48315

5 files changed: +45 -65 lines changed

charclass_invlists.inc

Lines changed: 1 addition & 1 deletion
@@ -456716,5 +456716,5 @@ static const U8 WB_dfa_table[] = {
  * 63f771c327e92574fbd77919586079c38f669058a5e6b67ccec385ef8fcde882 lib/unicore/version
  * 0a6b5ab33bb1026531f816efe81aea1a8ffcd34a27cbea37dd6a70a63d73c844 regen/charset_translations.pl
  * c7ff8e0d207d3538c7feb4a1a152b159e5e902d20293b303569ea8323e84633e regen/mk_PL_charclass.pl
- * 271cf09abfa390b652f60dd7b6a2769ea1fecc80d74cc68d02dfe8678a43da62 regen/mk_invlists.pl
+ * 6f140fe16685fe5d0e81e2984af81342aff5eaba309991002eaca94d032b2ecc regen/mk_invlists.pl
  * ex: set ro ft=c: */

lib/unicore/uni_keywords.pl

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

regen/mk_invlists.pl

Lines changed: 41 additions & 61 deletions
@@ -1311,53 +1311,50 @@ sub _Perl_CCC_non0_non230 {
 # BREAK_PROPERTIES
 
 # All but the Sentence Break properties are implemented by two-dimensional
-# tables. (That one does not lend itself to tabular lookup, and is rarely
-# changed, so it is all done in code in regexec.c.) Unicode publishes
-# properties which assign a break class to every Unicode code point, even ones
-# that haven't been assigned to be characters. (Perl uses that class for all
-# non-Unicode code points.) Unicode also publishes rules for breaking based
-# on those break classes. Here we create tables for each break property that
-# for a string xy, which have break classes x' and y', we tell whether a break
-# is allowed between x and y or not. The rows of this table are the various
-# x'; the columns, the y'. Often the table entry will be just 0 or 1. But
-# increasingly in newer Unicode versions, more context is needed to make this
-# determination, and the table entry will be an enum (packed with other
-# information) that corresponds to a hand-crafted DFA in regexec.c that gets
-# executed.
+# tables, with additional small DFAs for when the tables are insufficient.
+# (SB does not lend itself to tabular lookup, and is rarely changed, so it is
+# all done in code in regexec.c.) Unicode publishes properties which assign a
+# break class to every Unicode code point, even ones that haven't been
+# assigned to be characters. (Perl uses that class for all non-Unicode code
+# points.) Unicode also publishes rules for breaking based on those break
+# classes. Here we create tables for each break property that for a string
+# xy, which have break classes x' and y', we tell whether a break is allowed
+# between x and y or not. The rows of this table are the various x'; the
+# columns, the y'. Often the table entry will be just the booleans 0 or 1.
+# But increasingly in newer Unicode versions, more context is needed to make
+# this determination. Looking around at the context requires a DFA. Each of
+# these is hand-coded in regexec.c, and is identified by a number which is
+# a case: in a switch() statement there. This program creates #defines for
+# those DFA numbers. XXX an enhancement would be to make these enums. The
+# (x,y) cell contents when a DFA is needed are described below.
 #
-# Unicode used to publish a table itself for the Line Break property, but
-# abandoned it as it got more complicated. However, on their website in the
-# UCD data files, in the subdirectory 'auxiliary', there are files like
-# 'LineBreakTest.html' that do show annotated pairwise tables. Unicode no
-# longer feels constrained to make their rules easy to implement this way.
-# Perl wants to keep using the table, as it makes it easier to find the break
-# status in the middle of the string instead of having to start each time at
-# the beginning, and a goodly number of the possibilities are 0 or 1 anyway,
-# without needing the DFA. But this makes it a pain to update to a new
-# Unicode release when they add rules. An example is in Unicode 15.1, where
-# new GCB rules make use of a new property, Indic_Conjunct_Break that is
-# unrelated to GCB. In order for Perl to continue using the table, we have to
-# make new equivalence classes in GCB for the Indic property values. This
-# would mean we need all combinations of the intersections
-# GCB1_Indic1, GCB1_Indic2, ... GCBn_Indic1, # GCB2_Indic1, ...
-# Fortunately all but 4 of these intersections are empty in 15.1. But a
-# future release might change that, and this would have to be manually
-# compensated for. The rules that involve GCB1 now have to change to also
-# include GCB1_Indic1, GCB1_Indic2, ...
+# The Unicode rules are listed in UAX #14 and UAX #29 in priority order for
+# each type of break. When context is needed, more than one DFA may apply to
+# a given cell. For example, in the Line Break property, when x is a space,
+# and y is almost anything else, we have to look behind to see what came
+# before the space. (Usualy we have to back up to the first non-space when
+# there are multiple spaces in a row.) If that non-space is a quote we likely
+# will have a different rule than if it is a right parenthesis. For all cells
+# in this type of situation, this program creates a chain of DFAs to apply in
+# priority order. The first one that matches the situation is used; if none
+# do, there is a fallback 0 or 1 that ends the chain.
 #
-# The code in this file populate the tables based on data output from
-# mktables. The Unicode rules are listed in UAX #14 and UAX #29 in priority
-# order for each type of line break. Suppose you want to determine if there
-# is a break between x and y. You start at rule #1, and see if it applies.
-# If not, you proceed to rule #2, and so on, stopping at the first match.
+# This program creates a linear array of all the chains strung together. What
+# gets stored in the (x,y) cell of the main table is the index into this array
+# where the first DFA number for its chain is stored.
 #
-# This works well when the cells unconditionally return break/no-break (1 or
-# 0). But consider the case that we apply rule #a which requires a DFA. If
-# that fails to match we're supposed to try rule #a+1, #a+2, ..., stopping at
-# the first match. The table is constructed so that the final rule matches
-# everything, so the process is guaranteed to halt. And it likely will halt
-# earlier at the first unconditional match. Now this generates a chain of
-# DFAs for regexec.c to follow, stopping at the first successful match.
+# Unicode no longer feels constrained to make their rules easy to implement
+# in a pair-wise table. An example is in Unicode 15.1, where new GCB rules
+# make use of a new property, Indic_Conjunct_Break that is unrelated to GCB.
+# In order for Perl to continue using the table, we have to make new
+# equivalence classes in GCB for the Indic property values. Thus we would
+# need to split the code points in class GCBx into the ones that are in
+# GCBx-nonIndic, the ones that are in GCBx-Indic1, the ones that are in
+# GCBx-Indic2, .... And class GCBx would be subdivided into the appropriate
+# subclasses. (It turns out that many of these don't contain any code points,
+# so aren't actually needed) It is now possible to tell mktables what a split
+# should be, and it takes care of the rest, passing to this program the
+# results, in a data structure.
 
 # These functions access the cells of a break table, converting any mnemonics
 # to numeric. They need $enums to be able to do this.
@@ -2436,23 +2433,6 @@ ()
 }
 );
 
-# The result is really just true or false. But we follow along with tr14,
-# creating a rule which is false for something like X SP* X. That gets
-# encoding 2. The rest of the dfas are synthetic ones that indicate
-# some context handling is required. These each are added to the
-# underlying 0, 1, or 2, instead of replacing them, so that the underlying
-# value can be retrieved. Actually only rules from 7 through 18 (which
-# are the ones where space matter) are possible to have 2 added to them.
-# The others below add just 0 or 1. It might be possible for one
-# synthetic rule to be added to another, yielding a larger value. This
-# doesn't happen in the Unicode 8.0 rule set, and as you can see from the
-# names of the middle grouping below, it is impossible for that to occur
-# for them because they all start with mutually exclusive classes. That
-# the final rule can't be added to any of the others isn't obvious from
-# its name, so it is assigned a power of 2 higher than the others can get
-# to so any addition would preserve all data. (And the code will reach an
-# assert(0) on debugging builds should this happen.)
-
 my $lb_enum = 2;
 my %lb_dfas = (
 LB_NOBREAK => {
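
The chain lookup described in the new BREAK_PROPERTIES comment above can be sketched in C roughly as follows. This is a minimal illustration, not the code in regexec.c. The class names, DFA names, toy rules, and the cell encoding (0 and 1 are final answers, any larger value v means "chain starts at chains[v - 2]") are all assumptions made for the example. The comment itself only says that a cell holds an index into a linear array of chains and that each chain ends in a fallback 0 or 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical break classes for illustration; not Perl's real enums. */
enum brk_class { CL_OTHER, CL_SPACE, CL_QUOTE, CL_CLOSE_PAREN, CL_COUNT };

/* Assumed numbering: 0 and 1 are final answers, DFA numbers start at 2. */
enum { NOBREAK = 0, BREAK = 1, DFA_AFTER_QUOTE = 2, DFA_AFTER_CLOSE_PAREN = 3 };

/* A DFA either decides the question or reports that it does not apply. */
enum dfa_result { DFA_NO_MATCH = -1, DFA_NOBREAK = 0, DFA_BREAK = 1 };

/* All chains strung together in one linear array.  Each chain lists DFA
 * numbers in priority order and ends with its fallback 0 or 1. */
static const unsigned char chains[] = {
    /* chain at index 0 */ DFA_AFTER_QUOTE, DFA_AFTER_CLOSE_PAREN, BREAK,
};

/* Assumed cell encoding: a chain starting at chains[i] is stored as i + 2. */
#define CHAIN(i) ((unsigned char)((i) + 2))

/* Rows are the class of x, columns the class of y.  Only one cell needs
 * context in this toy table; every other cell defaults to 0 (NOBREAK). */
static const unsigned char table[CL_COUNT][CL_COUNT] = {
    [CL_SPACE] = { [CL_OTHER] = CHAIN(0) },
};

/* Toy DFAs: back up from x over a run of spaces to the first non-space,
 * then decide based on what that class is. */
static enum dfa_result run_dfa(int dfa, const enum brk_class *cls, size_t x)
{
    size_t i = x;
    while (i > 0 && cls[i] == CL_SPACE)
        i--;
    if (cls[i] == CL_SPACE)
        return DFA_NO_MATCH;            /* nothing but spaces before y */

    switch (dfa) {
    case DFA_AFTER_QUOTE:
        return cls[i] == CL_QUOTE ? DFA_NOBREAK : DFA_NO_MATCH;
    case DFA_AFTER_CLOSE_PAREN:
        return cls[i] == CL_CLOSE_PAREN ? DFA_NOBREAK : DFA_NO_MATCH;
    default:
        return DFA_NO_MATCH;
    }
}

/* Is a break allowed between cls[x] and cls[x + 1]?  Returns 0 or 1. */
static int break_allowed(const enum brk_class *cls, size_t x)
{
    unsigned char cell = table[cls[x]][cls[x + 1]];
    if (cell <= BREAK)
        return cell;                    /* plain boolean cell */

    assert((size_t)(cell - 2) < sizeof chains);
    for (const unsigned char *p = &chains[cell - 2]; ; p++) {
        if (*p <= BREAK)
            return *p;                  /* fallback ends the chain */
        enum dfa_result r = run_dfa(*p, cls, x);
        if (r != DFA_NO_MATCH)
            return (int)r;              /* first matching DFA wins */
    }
}
```

With these toy rules, calling break_allowed() on the classes QUOTE SPACE SPACE OTHER at the second space walks the chain, stops at the first DFA, and reports no break. For OTHER SPACE OTHER neither DFA matches and the chain's fallback applies.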

regexp_constants.h

Lines changed: 1 addition & 1 deletion
@@ -83,5 +83,5 @@
  * 63f771c327e92574fbd77919586079c38f669058a5e6b67ccec385ef8fcde882 lib/unicore/version
  * 0a6b5ab33bb1026531f816efe81aea1a8ffcd34a27cbea37dd6a70a63d73c844 regen/charset_translations.pl
  * c7ff8e0d207d3538c7feb4a1a152b159e5e902d20293b303569ea8323e84633e regen/mk_PL_charclass.pl
- * 271cf09abfa390b652f60dd7b6a2769ea1fecc80d74cc68d02dfe8678a43da62 regen/mk_invlists.pl
+ * 6f140fe16685fe5d0e81e2984af81342aff5eaba309991002eaca94d032b2ecc regen/mk_invlists.pl
  * ex: set ro ft=c: */

uni_keywords.h

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.
