Commit b3b9b25

Support casing characters which map into multiple code points (bug#24603)
Implement the unconditional special casing rules defined in the Unicode standard.

Among other things, they deal with cases where a single code point is replaced by multiple ones because a single-character equivalent does not exist (e.g. the ‘ﬁ’ ligature turning into ‘FI’) or is not commonly used (e.g. ß turning into SS).

* admin/unidata/SpecialCasing.txt: New data file pulled from the Unicode standard distribution.

* admin/unidata/README: Mention SpecialCasing.txt.

* admin/unidata/unidata-gen.el (unidata-gen-table-special-casing, unidata-gen-table-special-casing--do-load): New functions generating the ‘special-uppercase’, ‘special-lowercase’ and ‘special-titlecase’ character Unicode properties from the SpecialCasing.txt Unicode data file.

* src/casefiddle.c (struct casing_str_buf): New structure representing the short strings used to handle one-to-many character mappings.
(case_character_impl): New function which can handle one-to-many character mappings.
(case_character, case_single_character): Wrappers for the above function. The former may map one character to multiple (or no) code points, while the latter does what the former used to do (i.e. handles one-to-one mappings only).
(do_casify_natnum, do_casify_unibyte_string, do_casify_unibyte_region): Use case_single_character.
(do_casify_multibyte_string, do_casify_multibyte_region): Support new features of case_character.
(do_casify_region): Update to reflect do_casify_multibyte_string changes.
(casify_word): Handle the situation where casing changes the length of a word in characters, which affects where the end of the word is.
(upcase, capitalize, upcase-initials): Update documentation to mention limitations when working on characters.

* test/src/casefiddle-tests.el (casefiddle-tests-char-properties): Add test cases for the newly introduced character properties.
(casefiddle-tests-casing): Update test cases which are now passing.

* test/lisp/char-fold-tests.el (char-fold--ascii-upcase, char-fold--ascii-downcase): New functions which behave like the old ‘upcase’ and ‘downcase’.
(char-fold--test-match-exactly): Use the new functions. This is needed because otherwise ﬁ and similar characters are turned into their multi-character representation.

* doc/lispref/strings.texi: Describe the issue with casing characters versus strings.

* doc/lispref/nonascii.texi: Describe the new character properties.
1 parent 2c87dab commit b3b9b25

9 files changed: +679 -128 lines changed

admin/unidata/README

Lines changed: 4 additions & 0 deletions
@@ -24,3 +24,7 @@ http://www.unicode.org/Public/8.0.0/ucd/Blocks.txt
 NormalizationTest.txt
 http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
 2016-07-16
+
+SpecialCasing.txt
+http://unicode.org/Public/UNIDATA/SpecialCasing.txt
+2016-03-03

admin/unidata/SpecialCasing.txt

Lines changed: 281 additions & 0 deletions
Large diffs are not rendered by default.

admin/unidata/unidata-gen.el

Lines changed: 81 additions & 0 deletions
@@ -268,6 +268,42 @@ Property value is a character or nil.
 The value nil means that the actual property value of a character
 is the character itself."
      string)
+    (special-uppercase
+     2 unidata-gen-table-special-casing "uni-special-uppercase.el"
+     "Unicode unconditional special casing mapping.
+
+Property value is (possibly empty) string or nil. The value nil denotes that
+`uppercase' property should be consulted instead. A string denotes what
+sequence of characters given character maps into.
+
+This mapping includes language- and context-independent special casing rules
+defined by Unicode only. It also does not include association which would
+duplicate information from `uppercase' property."
+     nil)
+    (special-lowercase
+     0 unidata-gen-table-special-casing "uni-special-lowercase.el"
+     "Unicode unconditional special casing mapping.
+
+Property value is (possibly empty) string or nil. The value nil denotes that
+`lowercase' property should be consulted instead. A string denotes what
+sequence of characters given character maps into.
+
+This mapping includes language- and context-independent special casing rules
+defined by Unicode only. It also does not include association which would
+duplicate information from `lowercase' property."
+     nil)
+    (special-titlecase
+     1 unidata-gen-table-special-casing "uni-special-titlecase.el"
+     "Unicode unconditional special casing mapping.
+
+Property value is (possibly empty) string or nil. The value nil denotes that
+`titlecase' property should be consulted instead. A string denotes what
+sequence of characters given character maps into.
+
+This mapping includes language- and context-independent special casing rules
+defined by Unicode only. It also does not include association which would
+duplicate information from `titlecase' property."
+     nil)
     (mirroring
      unidata-gen-mirroring-list unidata-gen-table-character "uni-mirrored.el"
      "Unicode bidi-mirroring characters.
@@ -1083,6 +1119,51 @@ Property value is a symbol `o' (Open), `c' (Close), or `n' (None)."
     table))
 
 
+
+
+(defvar unidata-gen-table-special-casing--cache nil
+  "Cached value for `unidata-gen-table-special-casing' function.")
+
+(defun unidata-gen-table-special-casing--do-load ()
+  (let (result)
+    (with-temp-buffer
+      (insert-file-contents (expand-file-name "SpecialCasing.txt" unidata-dir))
+      (goto-char (point-min))
+      (while (not (eobp))
+        ;; Ignore empty lines and comments.
+        (unless (or (eq (char-after) ?\n) (eq (char-after) ?#))
+          (let ((line (split-string
+                       (buffer-substring (point) (progn (end-of-line) (point)))
+                       ";" "")))
+            ;; Ignore entries with conditions, i.e. those with six values.
+            (when (= (length line) 5)
+              (let ((ch (string-to-number (pop line) 16)))
+                (setcdr (cddr line) nil) ; strip comment
+                (push
+                 (cons ch
+                       (mapcar (lambda (entry)
+                                 (mapcar (lambda (n) (string-to-number n 16))
+                                         (split-string entry)))
+                               line))
+                 result)))))
+        (forward-line)))
+    result))
+
+(defun unidata-gen-table-special-casing (prop &rest ignore)
+  (let ((table (make-char-table 'char-code-property-table))
+        (prop-idx (unidata-prop-index prop)))
+    (set-char-table-extra-slot table 0 prop)
+    (mapc (lambda (entry)
+            (let ((ch (car entry)) (v (nth prop-idx (cdr entry))))
+              ;; If character maps to a single character, the mapping is already
+              ;; covered by regular casing property. Don’t store those.
+              (when (/= (length v) 1)
+                (set-char-table-range table ch (apply 'string v)))))
+          (or unidata-gen-table-special-casing--cache
+              (setq unidata-gen-table-special-casing--cache
+                    (unidata-gen-table-special-casing--do-load))))
+    table))
+
 
 (defun unidata-describe-general-category (val)
   (cdr (assq val
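
The loader in this hunk keeps only the unconditional entries of SpecialCasing.txt, i.e. lines with five `;`-separated fields (code point, lower, title, upper, trailing comment), and skips those that carry a sixth, condition field. As a rough illustration of that filtering on a single line, here is a minimal sketch. `my-special-casing-parse-line` is a hypothetical helper, not part of Emacs, and unlike the committed code it returns the mappings as strings rather than as lists of code points.

```elisp
(defun my-special-casing-parse-line (line)
  "Parse LINE of SpecialCasing.txt into (CHAR LOWER TITLE UPPER).
LOWER, TITLE and UPPER are strings.  Return nil for empty lines,
comments and conditional entries.  Hypothetical helper for
illustration only; it is not part of Emacs."
  (unless (or (string= line "") (eq (aref line 0) ?#))
    (let ((fields (split-string line ";" t)))
      ;; Unconditional entries have exactly five fields; a sixth field
      ;; holds a condition such as Final_Sigma or a language tag.
      (when (= (length fields) 5)
        (cons (string-to-number (car fields) 16)
              (mapcar (lambda (field)
                        ;; Each mapping is a space-separated list of
                        ;; hexadecimal code points.
                        (apply #'string
                               (mapcar (lambda (hex)
                                         (string-to-number hex 16))
                                       (split-string field))))
                      (list (nth 1 fields) (nth 2 fields) (nth 3 fields))))))))

(my-special-casing-parse-line
 "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S")
;; ⇒ (223 "ß" "Ss" "SS")

(my-special-casing-parse-line
 "03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA")
;; ⇒ nil, because the entry is conditional
```

Note that the lower-case field of the first entry is a single character (ß itself), which is why `unidata-gen-table-special-casing` stores nothing for it in the `special-lowercase` table.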

doc/lispref/nonascii.texi

Lines changed: 23 additions & 0 deletions
@@ -619,6 +619,29 @@ Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property.
 character of a word needs to be capitalized. The value of this
 property is a single character. For unassigned codepoints, the value
 is @code{nil}, which means the character itself.
+
+@item special-uppercase
+Corresponds to Unicode language- and context-independent special upper-casing
+rules. The value of this property is a string (which may be empty). For
+example, the mapping for @code{U+00DF} (@sc{latin small letter sharp s}) is
+@code{"SS"}. For characters with no special mapping, the value is @code{nil},
+which means the @code{uppercase} property needs to be consulted instead.
+
+@item special-lowercase
+Corresponds to Unicode language- and context-independent special lower-casing
+rules. The value of this property is a string (which may be empty). For
+example, for @code{U+0130} (@sc{latin capital letter i with dot above})
+the value is @code{"i\u0307"} (i.e. a 2-character string consisting of @sc{latin
+small letter i} followed by @sc{combining dot above}). For characters with no
+special mapping, the value is @code{nil}, which means the @code{lowercase}
+property needs to be consulted instead.
+
+@item special-titlecase
+Corresponds to Unicode unconditional special title-casing rules. The value of
+this property is a string (which may be empty). For example, for
+@code{U+FB01} (@sc{latin small ligature fi}) the value is @code{"Fi"}. For
+characters with no special mapping, the value is @code{nil}, which means
+the @code{titlecase} property needs to be consulted instead.
 @end table
 
 @defun get-char-code-property char propname
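
The lookup order these entries describe (the special property first, then the simple one, then the character itself) can be sketched as follows. This is only an illustration, assuming an Emacs that defines the `special-uppercase` property (per etc/NEWS in this commit, Emacs 26.1). `my-full-upcase-char` is a hypothetical name, and the sketch reads the Unicode properties directly, ignoring the buffer's case table that the real casing functions consult.

```elisp
(defun my-full-upcase-char (char)
  "Return the upper-case form of CHAR as a string.
Hypothetical helper for illustration; it is not part of Emacs and
does not consult the current case table."
  (let ((special (get-char-code-property char 'special-uppercase))
        (simple (get-char-code-property char 'uppercase)))
    (cond
     ;; A string (possibly empty): a one-to-many special mapping.
     (special special)
     ;; A character: the ordinary one-to-one mapping.
     (simple (string simple))
     ;; nil in both properties: the character maps to itself.
     (t (string char)))))

(my-full-upcase-char #xDF)  ; U+00DF ⇒ "SS"
(my-full-upcase-char ?a)    ; ⇒ "A"
(my-full-upcase-char ?1)    ; ⇒ "1"
```

Because an empty string is non-nil in Emacs Lisp, the first `cond` clause also covers special mappings that delete a character.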

doc/lispref/strings.texi

Lines changed: 27 additions & 0 deletions
@@ -1177,6 +1177,33 @@ When the argument to @code{upcase-initials} is a character,
 @end example
 @end defun
 
+Note that case conversion is not a one-to-one mapping of codepoints,
+and the length of the result may differ from the length of the
+argument. Furthermore, because passing a character forces the return
+type to be a character, the functions are unable to perform proper
+substitution, and the result may differ from that of converting
+a one-character string. For example:
+
+@example
+@group
+(upcase "ﬁ") ; note: single character, ligature "fi"
+ @result{} "FI"
+@end group
+@group
+(upcase ?ﬁ)
+ @result{} 64257 ; i.e. ?ﬁ
+@end group
+@end example
+
+To avoid this, a character must first be converted into a string,
+using the @code{string} function, before being passed to one of the
+casing functions. Of course, no assumptions about the length of the
+result may be made.
+
+Mappings for such special cases are taken from the
+@code{special-uppercase}, @code{special-lowercase} and
+@code{special-titlecase} character properties. @xref{Character Properties}.
+
 @xref{Text Comparison}, for functions that compare strings; some of
 them ignore case differences, or can optionally ignore case differences.
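
The workaround described in this hunk (convert the character to a string before casing it) can be wrapped in a small helper. The sketch below assumes an Emacs that includes this change. `my-upcase-char-to-string` is a hypothetical name, not an Emacs function, and the results shown are the ones stated in the manual text above.

```elisp
(defun my-upcase-char-to-string (char)
  "Return the upper-case form of CHAR as a string.
The result may contain more than one character.  Hypothetical
helper for illustration; it is not part of Emacs."
  (upcase (string char)))

;; A character argument forces a character result, so the ligature
;; U+FB01 cannot expand and is returned unchanged.
(upcase #xFB01)                             ; ⇒ 64257

;; Going through a one-character string allows the expansion.
(my-upcase-char-to-string #xFB01)           ; ⇒ "FI"
(length (my-upcase-char-to-string #xFB01))  ; ⇒ 2
```

Callers should treat the return value as a string of arbitrary length rather than indexing its first character.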

etc/NEWS

Lines changed: 11 additions & 6 deletions
@@ -355,12 +355,17 @@ same as in modes where the character is not whitespace.
 Instead of only checking the modification time, Emacs now also checks
 the file's actual content before prompting the user.
 
-** Title case characters are properly cased (from and into).
-'upcase', 'upcase-region' et al. convert title case characters (such
-as the single character "ǲ") into their upper case form (such as "Ǳ").
-As a downside, 'capitalize' and 'upcase-initials' produce awkward
-words where first character is upper rather than title case, e.g.,
-"Ǆungla" instead of "ǅungla".
+** Various casing improvements.
+
+*** 'upcase', 'upcase-region' et al. convert title case characters
+(such as ǲ) into their upper case form (such as Ǳ).
+
+*** 'capitalize', 'upcase-initials' et al. make use of title-case forms
+of initial characters (correctly producing for example ǅungla instead
+of incorrect Ǆungla).
+
+*** Characters which turn into multiple ones when cased are correctly handled.
+For example, ﬁ ligature is converted to FI when upper cased.
 
 
 * Changes in Specialized Modes and Packages in Emacs 26.1
