Ucdxml and TR42 #859

Closed
jowilco wants to merge 134 commits into unicode-org:main from jowilco:ucdxml

Contributor

jowilco commented Jun 6, 2024

PR to make it easy to see what changes have been made to support UCDXML.

jowilco changed the title from "Ucdxml preview" to "Ucdxml and TR42" on Oct 16, 2024
jowilco marked this pull request as ready for review October 16, 2024 21:13
Contributor Author

jowilco commented Oct 16, 2024

Comment on June 6 is no longer valid - we're now ready for review.

Contributor Author

jowilco commented Oct 16, 2024

@macchiati @eggrobin @markusicu - Please can you review?

jowilco and others added 2 commits November 12, 2024 07:55
* UnicodeData.txt from the proposal

* lb=AL like nearby noons

* Script=Arabic

* ArabicShaping.txt

* Regenerate UCD

* Comparison with other نs

* New syntax

* Ignore Unicode_1_Name

* Ignore Block too

* Expect what the comment says we expect.
EDCM ; Emoji_DCM
EKDDI ; Emoji_KDDI
ESB ; Emoji_SB
EVS ; emoji_variation_sequence
Member

Are these additional property names all specified/documented? Are they visible in the generated UCDXML files?

We should avoid inventing public names that could become hard to change, without UTC approval.

Contributor Author

I'm not quite sure how to answer your question.

Yes, the variation sequences for emoji have been displayed in the UCDXML files. For example,
<standardized-variant cps="0023 FE0E" desc="text style" when=""/>
<standardized-variant cps="0023 FE0F" desc="emoji style" when=""/>

Neither EKDDI nor Emoji_KDDI is defined in TR51, nor in EmojiSources.txt. I'm not sure where else these names would be defined. I just tried to come up with an alias and a name to support the values that are in ucd/emoji/emoji-variation-sequences.txt.
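
For readers who want to check this against a generated file, here is a minimal Python sketch that reads `standardized-variant` elements like the two quoted above. The wrapping `standardized-variants` root and the absence of an XML namespace are assumptions made to keep the fragment self-contained; real UCDXML files declare a namespace, so the tag lookup would need adjusting. `read_variants` is a hypothetical helper, not part of unicodetools.

```python
import xml.etree.ElementTree as ET

# The two elements from the comment, wrapped in an assumed root element
# so that the fragment parses on its own.
FRAGMENT = """
<standardized-variants>
  <standardized-variant cps="0023 FE0E" desc="text style" when=""/>
  <standardized-variant cps="0023 FE0F" desc="emoji style" when=""/>
</standardized-variants>
"""


def read_variants(xml_text):
    """Return (code point tuple, description) pairs, one per variant element."""
    root = ET.fromstring(xml_text)
    variants = []
    for element in root.iter("standardized-variant"):
        # "cps" holds space-separated hexadecimal code points.
        code_points = tuple(int(cp, 16) for cp in element.get("cps").split())
        variants.append((code_points, element.get("desc")))
    return variants


variants = read_variants(FRAGMENT)
```

Each entry pairs the base character and variation selector with the style description, which is the information a reviewer would compare against emoji-variation-sequences.txt.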

Member

Most important: We should not publish names that are not approved by the UTC. Or at least not more than what UAX42 published already.

Second: If we can avoid parsing files like EmojiSources.txt and NormalizationCorrections.txt, and especially inventing properties for them, then let's do so.

Contributor Author

> Or at least not more than what UAX42 published already.

Both of these (EmojiSources.txt -> variation sequence, NormalizationCorrections.txt) are already part of what Eric had defined in UAX42. Not sure what you want me to do at this point.

Comment on lines +158 to +160
ncCorrected ; NC_Corrected
ncOriginal ; NC_Original
ncVersion ; NC_Version
Member

what are these?

Contributor Author

Similar to your question about EVS/emoji_variation_sequence, these were added to support a small section of Normalization Corrections (https://www.unicode.org/reports/tr42/#d1e4038). For example,
<normalization-correction cp="F951" old="96FB" new="964B" version="3.2.0"/>

The source for these elements is NormalizationCorrections.txt, and that is where the names come from (Original (erroneous) decomposition, Corrected decomposition, Version of Unicode for...)

Again, I accept that I'm creating new aliases, and perhaps this is the single use-case for them.
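
As a rough illustration of that mapping, the Python sketch below turns one semicolon-separated record into a `normalization-correction` element with the attributes shown in the example. The input layout (code point; original; corrected; version, with an optional `#` comment) is assumed from the field names listed above, the trailing comment in the sample line is made up, and `correction_element` is a hypothetical helper rather than anything in unicodetools.

```python
import xml.etree.ElementTree as ET


def correction_element(line):
    """Build a normalization-correction element from one data record."""
    # Drop any trailing "#" comment, then split the four fields.
    data = line.split("#", 1)[0].strip()
    cp, original, corrected, version = (field.strip() for field in data.split(";"))
    return ET.Element(
        "normalization-correction",
        {"cp": cp, "old": original, "new": corrected, "version": version},
    )


# Sample record matching the element quoted above; the comment text is invented.
element = correction_element("F951;96FB;964B;3.2.0 # illustrative comment")
serialized = ET.tostring(element, encoding="unicode")
```

The three values that received the provisional `nc*` names correspond to the `old`, `new`, and `version` attributes here.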

Member

As Asmus has been pointing out, we have a history of publishing data files with "data" but without formally defining corresponding properties. We should skip them if possible. We should definitely not publish unsanctioned property and value names.

I understand that this file is internal to unicodetools. If we don't need these novelties for UCDXML, nor for other Unicode Tools work, then I prefer we don't add them even here.

Contributor Author

Member

> If we don't need these novelties for UCDXML,

As John points out, this ship has sailed, but

> nor for other Unicode Tools work,

in any case I have been adding such things pretty systematically because it is useful to have them in the tools in some form, even if that form is unstable and not fully designed.

I am not sure we need to invent a short alias (e.g., EVS, ncWhatever) though, and if we do not need it, I think we should avoid it; just make it the same as the long name (so, in this file, Something ; Something rather than sth ; Something).

Contributor Author

@eggrobin That seems reasonable to me.
so:
emoji_variation_sequence ; emoji_variation_sequence

and
normalization_correction_original ; normalization_correction_original
normalization_correction_corrected ; normalization_correction_corrected
normalization_correction_version ; normalization_correction_version

I'll wait for @markusicu to agree before making the change.
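
To make the proposal concrete, here is a small Python sketch that reads `short ; long` alias lines and reports which entries use the long name as their own short alias. It assumes only the two-field layout visible in the lines quoted in this thread (any further fields are ignored) and `#` comments; `parse_aliases` is a hypothetical helper, not the parser used by unicodetools.

```python
def parse_aliases(lines):
    """Map each short alias to its long name, skipping blanks and comments."""
    aliases = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        short, long_name = fields[0], fields[1]
        aliases[short] = long_name
    return aliases


# Lines taken from this thread: one existing entry and the proposed ones.
LINES = [
    "EDCM ; Emoji_DCM",
    "emoji_variation_sequence ; emoji_variation_sequence",
    "normalization_correction_original ; normalization_correction_original",
    "normalization_correction_corrected ; normalization_correction_corrected",
    "normalization_correction_version ; normalization_correction_version",
]

ALIASES = parse_aliases(LINES)
# Entries where no separate short alias was invented.
SAME_AS_LONG = sorted(short for short, long_name in ALIASES.items() if short == long_name)
```

With the proposed lines, every new entry lands in `SAME_AS_LONG`, so no additional short names are introduced.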

Contributor Author

See #1049

Member

> @eggrobin That seems reasonable to me.
> so:
> emoji_variation_sequence ; emoji_variation_sequence
>
> and
> normalization_correction_original ; normalization_correction_original
> normalization_correction_corrected ; normalization_correction_corrected
> normalization_correction_version ; normalization_correction_version
>
> I'll wait for @markusicu to agree before making the change.

ok

Contributor Author

Fixed in #1030

Member

FYI: Wow, this is long and tedious!
Is this by chance generated (if so, how?) -- or else generatable?

Member

eggrobin Feb 7, 2025

This does seem a bit worrisome from a maintenance standpoint: we add properties reasonably often (especially provisional Unihan ones), and we will likely forget about this.
Some of it seems derivable: the cjkness of a property is tested passim in the tools by seeing if the short alias starts with "cjk" (Yes, the short alias. "cjk" is short for "k". We're all mad here.).
Some of it seems independent, but ideally it would be moved into one of the data files (IndexUnicodeProperties?).
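
The short-alias convention mentioned above can be sketched in a couple of lines of Python. This is only an illustration of the prefix test as described in the comment; the function name is hypothetical, and whether the tools compare case-sensitively is an assumption.

```python
def looks_cjk(short_alias):
    """Treat a property as CJK when its short alias starts with "cjk"."""
    # Assumption: the comparison is case-sensitive.
    return short_alias.startswith("cjk")


# (short alias, long name) pairs; note the test uses the short alias only.
EXAMPLES = [("cjkRSUnicode", "kRSUnicode"), ("sc", "Script")]
CJK_FLAGS = {long_name: looks_cjk(short) for short, long_name in EXAMPLES}
```

Applying the same test to the long name `kRSUnicode` would return False, which is why the check has to go through the short alias.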

Contributor Author

It wasn't exactly generated, but I automated it somewhat via a spreadsheet. There was some research needed to determine the correct version number.

I'm hoping that we'll now only need to append, not regenerate.

Contributor Author

@eggrobin

> Some of it seems independent, but ideally it would be moved

I was somewhat surprised that the version when each property was added was not defined elsewhere.

The rest of the values are, to a certain extent, a convenience, but isCJKShowIfEmpty handles the somewhat arbitrary decisions that Eric made about whether an empty cjk property should be shown.

Member

eggrobin Feb 7, 2025

> I was somewhat surprised that the version when each property was added was not defined elsewhere.

Yes, I have been annoyed by that. VersionedProperty checks whether the property assignment is the same for all code points as a proxy for the property not actually being defined...

if (property == null || property.isTrivial()) {
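
The proxy described above can be sketched in Python as follows. This is not the unicodetools implementation of `isTrivial()`; it only illustrates the idea of treating a property as undefined in a given version when every code point carries the same value. The dictionary representation and both function names are assumptions made for the example.

```python
def is_trivial(assignment):
    """True when all code points map to one value (or there are none)."""
    return len(set(assignment.values())) <= 1


def seems_defined(assignment):
    """Proxy: a property counts as defined only if it is present and non-trivial."""
    return assignment is not None and not is_trivial(assignment)


# Hypothetical per-version data: code point -> property value.
BEFORE = {0x41: "No", 0x42: "No", 0x43: "No"}   # same value everywhere
AFTER = {0x41: "Yes", 0x42: "No", 0x43: "No"}   # at least one distinct value
```

The weakness of the proxy is visible here: a property that genuinely has one value for every code point would be reported as undefined, which is why an explicit "added in version" record would be more reliable.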

jowilco closed this Jul 8, 2025
jowilco deleted the ucdxml branch July 8, 2025 19:47
9 participants