Ucdxml and TR42 by jowilco · Pull Request #859 · unicode-org/unicodetools

jowilco · 2024-06-06T23:38:16Z

PR to make it easy to see what changes have been made to support UCDXML.

jowilco · 2024-10-16T21:14:44Z

Comment on June 6 is no longer valid - we're now ready for review.

jowilco · 2024-10-16T21:16:42Z

@macchiati @eggrobin @markusicu - Please can you review?

unicodetools/src/main/resources/org/unicode/props/IndexUnicodeProperties.txt

unicodetools/src/main/resources/org/unicode/props/ExtraPropertyAliases.txt

docs/ucdxml.md

unicodetools/src/main/java/org/unicode/props/PropertyParsingInfo.java

markusicu · 2025-02-04T20:23:38Z

unicodetools/src/main/resources/org/unicode/props/ExtraPropertyAliases.txt

 EDCM ; Emoji_DCM
 EKDDI ; Emoji_KDDI
 ESB ; Emoji_SB
+EVS ; emoji_variation_sequence


Are these additional property names all specified/documented? Are they visible in the generated UCDXML files?

We should avoid inventing public names that could become hard to change, without UTC approval.

I'm not quite sure how to answer your question.

Yes, the emoji variation sequences appear in the generated UCDXML files. For example:
<standardized-variant cps="0023 FE0E" desc="text style" when=""/>
<standardized-variant cps="0023 FE0F" desc="emoji style" when=""/>

Neither EKDDI nor Emoji_KDDI is defined in TR51, and neither is defined in EmojiSources.txt. I'm not sure where else these names would be defined. I just tried to come up with an alias and a name to support the values in ucd/emoji/emoji-variation-sequences.txt.

Most important: We should not publish names that are not approved by the UTC. Or at least not more than what UAX42 published already.

Second: If we can avoid parsing files like EmojiSources.txt and NormalizationCorrections.txt, and especially inventing properties for them, then let's do so.

> Or at least not more than what UAX42 published already.

Both of these (EmojiSources.txt -> variation sequence, NormalizationCorrections.txt) are already part of what Eric had defined in UAX42. Not sure what you want me to do at this point.

markusicu · 2025-02-04T20:24:02Z

unicodetools/src/main/resources/org/unicode/props/ExtraPropertyAliases.txt

+ncCorrected ; NC_Corrected
+ncOriginal ; NC_Original
+ncVersion ; NC_Version


What are these?

Similar to your question about EVS/emoji_variation_sequence, these were added to support a small section of Normalization Corrections (https://www.unicode.org/reports/tr42/#d1e4038). For example,
<normalization-correction cp="F951" old="96FB" new="964B" version="3.2.0"/>

The source for these elements is NormalizationCorrections.txt, and that is where the names come from (Original (erroneous) decomposition, Corrected decomposition, Version of Unicode for...)

Again, I accept that I'm creating new aliases, and perhaps this is the single use-case for them.
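To make the mapping concrete, here is a minimal sketch of how one line of NormalizationCorrections.txt could be turned into the element quoted above. It assumes the usual semicolon-separated UCD layout with a trailing `#` comment and four fields (code point, original decomposition, corrected decomposition, version). The class and method names are hypothetical and are not taken from this PR.

```java
public class NormalizationCorrectionSketch {
    /** Converts one data line into a normalization-correction element (illustrative only). */
    static String toElement(String line) {
        // Drop the trailing comment, if any.
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        // Assumed field order: code point; original; corrected; version.
        String[] fields = data.split("\\s*;\\s*");
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        return String.format(
                "<normalization-correction cp=\"%s\" old=\"%s\" new=\"%s\" version=\"%s\"/>",
                fields[0], fields[1], fields[2], fields[3]);
    }

    public static void main(String[] args) {
        // Sample line written to match the element quoted in this thread.
        System.out.println(toElement("F951;96FB;964B;3.2.0 # Corrigendum 3"));
    }
}
```

In this sketch the three invented property names would correspond to the `old`, `new`, and `version` attributes.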

As Asmus has been pointing out, we have a history of publishing data files with "data" but without formally defining corresponding properties. We should skip them if possible. We should definitely not publish unsanctioned property and value names.

I understand that this file is internal to unicodetools. If we don't need these novelties for UCDXML, nor for other Unicode Tools work, then I prefer we don't add them even here.

They are needed for UAX42. See https://www.unicode.org/reports/tr42/#d1e4038

> If we don't need these novelties for UCDXML,

As John points out, this ship has sailed, but

> nor for other Unicode Tools work,

in any case I have been adding such things pretty systematically because it is useful to have them in the tools in some form, even if that form is unstable and not fully designed.

I am not sure we need to invent a short alias (e.g., EVS, ncWhatever) though, and if we do not need it, I think we should avoid it; just make it the same as the long name (so, in this file, Something ; Something rather than sth ; Something).

@eggrobin That seems reasonable to me.
so:
emoji_variation_sequence ; emoji_variation_sequence

and
normalization_correction_original ; normalization_correction_original
normalization_correction_corrected ; normalization_correction_corrected
normalization_correction_version ; normalization_correction_version

I'll wait for @markusicu to agree before making the change.
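For illustration, the proposed convention can be checked mechanically. The sketch below assumes each alias line has exactly two fields, `short ; long`, as in the snippets quoted in this thread; the real ExtraPropertyAliases.txt may allow more. `AliasLineSketch` and its methods are hypothetical names, not part of unicodetools.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AliasLineSketch {
    /** Parses "short ; long" lines into a map from short alias to long name. */
    static Map<String, String> parse(String text) {
        Map<String, String> aliases = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split("\\s*;\\s*");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'short ; long': " + line);
            }
            aliases.put(parts[0], parts[1]);
        }
        return aliases;
    }

    /** True if the entry invents a short alias that differs from the long name. */
    static boolean hasDistinctShortAlias(Map.Entry<String, String> entry) {
        return !entry.getKey().equals(entry.getValue());
    }

    public static void main(String[] args) {
        String text = "ESB ; Emoji_SB\n"
                + "emoji_variation_sequence ; emoji_variation_sequence\n";
        for (Map.Entry<String, String> e : parse(text).entrySet()) {
            System.out.println(e.getValue() + " distinct short alias: " + hasDistinctShortAlias(e));
        }
    }
}
```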

ok

Fixed in #1030

unicodetools/src/main/resources/org/unicode/uax42/fragments/block.xml

unicodetools/src/main/java/org/unicode/xml/XMLProperties.java

markusicu · 2025-02-05T00:31:52Z

unicodetools/src/main/java/org/unicode/xml/UcdPropertyDetail.java

FYI: Wow, this is long and tedious!
Is this by chance generated (if so, how?) -- or else generatable?

This does seem a bit worrisome from a maintenance standpoint: we add properties reasonably often (especially provisional Unihan ones), and we will likely forget about this.
Some of it seems derivable: the cjkness of a property is tested passim in the tools by seeing if the short alias starts with "cjk" (Yes, the short alias. "cjk" is short for "k". We're all mad here.).
Some of it seems independent, but ideally it would be moved into one of the data files (IndexUnicodeProperties?).
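As a rough sketch of the derivable part, the check described above amounts to a prefix test on the short alias. The helper name below is hypothetical, and the exact form of the test in the tools (for example, case handling) is an assumption.

```java
public class CjkAliasSketch {
    /** Returns true if the short alias marks a CJK (Unihan) property. */
    static boolean isCjkProperty(String shortAlias) {
        // Assumed convention: the short alias of a k-property starts with "cjk".
        return shortAlias.startsWith("cjk");
    }

    public static void main(String[] args) {
        // cjkTotalStrokes is the short alias of kTotalStrokes; sc is the short alias of Script.
        System.out.println(isCjkProperty("cjkTotalStrokes"));
        System.out.println(isCjkProperty("sc"));
    }
}
```

If a table such as UcdPropertyDetail stored this flag explicitly, a check like this could validate it rather than relying on manual upkeep.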

It wasn't exactly generated, but I automated it somewhat via a spreadsheet. There was some research needed to determine the correct version number.

I'm hoping that we'll now only need to append, not regenerate.

@eggrobin

> Some of it seems independent, but ideally it would be moved

I was somewhat surprised that the version when each property was added was not defined elsewhere.

The rest of the values are to a certain extent convenience, but isCJKShowIfEmpty is to handle somewhat arbitrary decisions that Eric made when deciding whether an empty cjk property should be shown.

> I was somewhat surprised that the version when each property was added was not defined elsewhere.

Yes, I have been annoyed by that. VersionedProperty checks whether the property assignment is the same for all code points as a proxy for the property not actually being defined...

unicodetools/unicodetools/src/main/java/org/unicode/text/UCD/VersionedProperty.java

Line 136 in f068b26

if (property == null || property.isTrivial()) {
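The proxy described here can be sketched as follows. This is not the actual `isTrivial()` implementation in unicodetools. It is a hypothetical stand-in that treats a property as trivial when every code point maps to the same value, and `TrivialPropertySketch` and the `IntFunction` lookup are illustrative assumptions.

```java
import java.util.Objects;
import java.util.function.IntFunction;

public class TrivialPropertySketch {
    /** True if the property assigns the same value to every code point. */
    static boolean isTrivial(IntFunction<String> valueOf) {
        String first = valueOf.apply(0);
        for (int cp = 1; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!Objects.equals(first, valueOf.apply(cp))) {
                return false; // at least two code points differ
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A property with no data in a given version: same (empty) value everywhere.
        System.out.println(isTrivial(cp -> ""));
        // A property with at least one distinguishing assignment.
        System.out.println(isTrivial(cp -> cp == 0x41 ? "Lu" : "Cn"));
    }
}
```

The weakness of such a proxy is that a property that is defined but uniformly assigned would be indistinguishable from one that does not exist yet, which is why an explicit "version added" field is attractive.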

unicodetools/src/main/java/org/unicode/xml/UCDXMLWriter.java

unicodetools/src/main/java/org/unicode/xml/UCDDataResolver.java

unicodetools/src/main/java/org/unicode/xml/GeneratePropertyValues.java

jowilco added 5 commits June 26, 2024 13:48

Rebase

b0656d8

Initial checkin for UcdXML

3ce611a

Interim checkin: implemented groups

0ba5996

Rebase

7764f6c

Ran GenerateEnums

7e161a6

jowilco force-pushed the ucdxml branch from 066326e to 7e161a6 Compare June 26, 2024 21:17

jowilco added 8 commits June 26, 2024 14:22

Fixing a broken rebase

d609d92

Fixing a broken rebase

cb314e8

Added support for comparing different ucdxml files

776e00e

Ran spotless

8b870a6

Added support for the generation of UAX42

d612e96

Added note about NFD

f552e63

Spotless code cleanup

242f22b

Merge branch 'unicode-org:main' into ucdxml

e625ff0

jowilco requested review from eggrobin, macchiati and markusicu October 16, 2024 21:11

jowilco changed the title ~~Ucdxml preview~~ Ucdxml and TR42 Oct 16, 2024

jowilco marked this pull request as ready for review October 16, 2024 21:13

eggrobin and others added 6 commits October 17, 2024 16:48

Support remap rules in the segmenter (unicode-org#949)

109fcb4

We are not using Java 1.4 anymore. (unicode-org#950)

bf10f7d

Test InCB=Extend for Gujarati Shadda (unicode-org#957)

250884c

* Shadda on a half-form * Regenerate UCD

Only run Maven cache workflow on the upstream repo (unicode-org#959)

fc59f0d

fix typo (unicode-org#960)

c5ad635

Allow redundant lines in data files to facilitate merging (unicode-org#962)

f32aee2

eggrobin reviewed Nov 12, 2024

jowilco and others added 2 commits November 12, 2024 07:55

Implemented review comments from eggrobin

6ee2467

markusicu reviewed Feb 5, 2025

jowilco added 27 commits February 6, 2025 10:22

Rebase

daadb73

Initial checkin for UcdXML

d1a9a58

Interim checkin: implemented groups

a0be079

Rebase

4b138f4

Ran GenerateEnums

3e95155

Fixing a broken rebase

14ebb86

Added support for comparing different ucdxml files

b4d5d86

Ran spotless

a186e69

Added support for the generation of UAX42

c4e2513

Added note about NFD

1bf2b4b

Spotless code cleanup

f17d222

Implemented review comments from eggrobin

4b204a7

Updates from Marcus's review comments

3cd4ba4

Merged main

095095f

Rebase

cdb405a

Initial checkin for UcdXML

8757b50

Interim checkin: implemented groups

42dd048

Rebase

1bac8b4

Fixing a broken rebase

c12f06a

Added support for comparing different ucdxml files

1026cc0

Ran spotless

d9ff4e5

Added support for the generation of UAX42

531f33c

Added note about NFD

56c0bd8

Spotless code cleanup

a567758

Implemented review comments from eggrobin

36d40c6

Updates from Marcus's review comments

8216505

Merge branch 'ucdxml' of https://github.com/jowilco/unicodetools into ucdxml

d2d85ff

jowilco closed this Jul 8, 2025

jowilco deleted the ucdxml branch July 8, 2025 19:47
