Conversation
Comment on June 6 is no longer valid - we're now ready for review.
@macchiati @eggrobin @markusicu - Please can you review?
* Shadda on a half-form
* Regenerate UCD
unicodetools/src/main/resources/org/unicode/props/IndexUnicodeProperties.txt
unicodetools/src/main/resources/org/unicode/props/ExtraPropertyAliases.txt
unicodetools/src/main/resources/org/unicode/props/ExtraPropertyAliases.txt
* UnicodeData.txt from the proposal
* lb=AL like nearby noons
* Script=Arabic
* ArabicShaping.txt
* Regenerate UCD
* Comparison with other نs
* New syntax
* Ignore Unicode_1_Name
* Ignore Block too
* Expect what the comment says we expect.
EDCM ; Emoji_DCM
EKDDI ; Emoji_KDDI
ESB ; Emoji_SB
EVS ; emoji_variation_sequence
There was a problem hiding this comment.
Are these additional property names all specified/documented? Are they visible in the generated UCDXML files?
We should avoid inventing public names that could become hard to change, without UTC approval.
There was a problem hiding this comment.
I'm not quite sure how to answer your question.
Yes, the emoji variation sequences are visible in the UCDXML files. For example,
<standardized-variant cps="0023 FE0E" desc="text style" when=""/>
<standardized-variant cps="0023 FE0F" desc="emoji style" when=""/>
Neither EKDDI nor Emoji_KDDI is defined in TR51 or in EmojiSources.txt, and I'm not sure where else these names would be defined. I just tried to come up with an alias and a name to support the values that are in ucd/emoji/emoji-variation-sequences.txt.
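As a rough illustration of how one line of emoji-variation-sequences.txt could map to the standardized-variant element shown above, here is a minimal Python sketch. The function name `variant_element` is hypothetical, the input layout (`code points ; description ; # comment`) is an assumption about the data file, and this is not the unicodetools implementation.

```python
from typing import Optional
from xml.sax.saxutils import quoteattr


def variant_element(line: str) -> Optional[str]:
    """Turn one data line into a <standardized-variant> element string.

    Assumed input layout: "0023 FE0E  ; text style;  # (1.1) NUMBER SIGN".
    Returns None for blank or comment-only lines.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    cps = " ".join(fields[0].split())  # normalize spacing between code points
    desc = fields[1]
    # "when" is left empty, matching the example elements above.
    return f'<standardized-variant cps={quoteattr(cps)} desc={quoteattr(desc)} when=""/>'
```

Under these assumptions, the line for `0023 FE0E` would reproduce the first element quoted in the comment.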
There was a problem hiding this comment.
Most important: We should not publish names that are not approved by the UTC. Or at least not more than what UAX42 published already.
Second: If we can avoid parsing files like EmojiSources.txt and NormalizationCorrections.txt, and especially inventing properties for them, then let's do so.
There was a problem hiding this comment.
> Or at least not more than what UAX42 published already.
Both of these (EmojiSources.txt -> variation sequence, NormalizationCorrections.txt) are already part of what Eric had defined in UAX42. Not sure what you want me to do at this point.
ncCorrected ; NC_Corrected
ncOriginal ; NC_Original
ncVersion ; NC_Version
There was a problem hiding this comment.
Similar to your question about EVS/emoji_variation_sequence, these were added to support a small section of Normalization Corrections (https://www.unicode.org/reports/tr42/#d1e4038). For example,
<normalization-correction cp="F951" old="96FB" new="964B" version="3.2.0"/>
The source for these elements is NormalizationCorrections.txt, and that is where the names come from (Original (erroneous) decomposition, Corrected decomposition, Version of Unicode for...)
Again, I accept that I'm creating new aliases, and perhaps this is the single use-case for them.
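To make the mapping from NormalizationCorrections.txt to the element above concrete, here is a small Python sketch. The names `NormalizationCorrection`, `parse_correction`, and `to_element` are hypothetical, and the four-field, semicolon-separated line layout is an assumption about the data file rather than something taken from the tools.

```python
from typing import NamedTuple, Optional


class NormalizationCorrection(NamedTuple):
    cp: str         # code point being corrected
    original: str   # original (erroneous) decomposition
    corrected: str  # corrected decomposition
    version: str    # Unicode version of the correction


def parse_correction(line: str) -> Optional[NormalizationCorrection]:
    """Parse one assumed data line such as 'F951;96FB;964B;3.2.0 # note'."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    cp, old, new, version = (field.strip() for field in data.split(";"))
    return NormalizationCorrection(cp, old, new, version)


def to_element(c: NormalizationCorrection) -> str:
    """Render the record in the shape of the element quoted above."""
    return (f'<normalization-correction cp="{c.cp}" old="{c.original}" '
            f'new="{c.corrected}" version="{c.version}"/>')
```

The three fields other than the code point correspond to the three aliases under discussion (original, corrected, version).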
There was a problem hiding this comment.
As Asmus has been pointing out, we have a history of publishing data files with "data" but without formally defining corresponding properties. We should skip them if possible. We should definitely not publish unsanctioned property and value names.
I understand that this file is internal to unicodetools. If we don't need these novelties for UCDXML, nor for other Unicode Tools work, then I prefer we don't add them even here.
There was a problem hiding this comment.
They are needed for UAX42. See https://www.unicode.org/reports/tr42/#d1e4038
There was a problem hiding this comment.
> If we don't need these novelties for UCDXML,
As John points out, this ship has sailed, but
> nor for other Unicode Tools work,
in any case I have been adding such things pretty systematically because it is useful to have them in the tools in some form, even if that form is unstable and not fully designed.
I am not sure we need to invent a short alias (e.g., EVS, ncWhatever) though, and if we do not need it, I think we should avoid it; just make it the same as the long name (so, in this file, Something ; Something rather than sth ; Something).
There was a problem hiding this comment.
@eggrobin That seems reasonable to me.
so:
emoji_variation_sequence ; emoji_variation_sequence
and
normalization_correction_original ; normalization_correction_original
normalization_correction_corrected ; normalization_correction_corrected
normalization_correction_version ; normalization_correction_version
I'll wait for @markusicu to agree before making the change.
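For reference, a minimal Python sketch of reading `short ; long` lines and spotting the entries where no separate short alias was invented. The function names are hypothetical, and the two-field layout with `#` comments is an assumption about this file; the real parser in unicodetools may differ.

```python
from typing import Dict, List


def parse_aliases(text: str) -> Dict[str, str]:
    """Map each short alias to its long name (assumed 'short ; long' lines)."""
    aliases: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        short, long_name = (part.strip() for part in line.split(";", 1))
        aliases[short] = long_name
    return aliases


def identity_aliases(aliases: Dict[str, str]) -> List[str]:
    """Names whose short alias is simply the long name."""
    return sorted(short for short, long_name in aliases.items() if short == long_name)


sample = """\
ESB ; Emoji_SB
emoji_variation_sequence ; emoji_variation_sequence
normalization_correction_version ; normalization_correction_version
"""
print(identity_aliases(parse_aliases(sample)))
```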
There was a problem hiding this comment.
ok
There was a problem hiding this comment.
FYI: Wow, this is long and tedious!
Is this by chance generated (if so, how?) -- or else generatable?
There was a problem hiding this comment.
This does seem a bit worrisome from a maintenance standpoint: we add properties reasonably often (especially provisional Unihan ones), and we will likely forget about this.
Some of it seems derivable: the cjkness of a property is tested passim in the tools by seeing if the short alias starts with "cjk" (Yes, the short alias. "cjk" is short for "k". We're all mad here.).
Some of it seems independent, but ideally it would be moved into one of the data files (IndexUnicodeProperties?).
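A sketch of the derivable part, in Python for brevity: the CJK flag is computed from the short alias instead of being written out per property. The function names are hypothetical, and the sample aliases (`cjkRSUnicode`, `cjkIRG_GSource`) are given as assumed examples of the `cjk` short-alias convention described above.

```python
from typing import Dict


def is_cjk_property(short_alias: str) -> bool:
    """Apply the convention described above: the short alias starts with 'cjk'."""
    return short_alias.startswith("cjk")


def derive_cjk_flags(aliases: Dict[str, str]) -> Dict[str, bool]:
    """Compute the CJK flag for every long name from a short -> long alias table."""
    return {long_name: is_cjk_property(short) for short, long_name in aliases.items()}


# Assumed sample entries; not a complete or authoritative alias table.
sample_aliases = {
    "cjkRSUnicode": "kRSUnicode",
    "cjkIRG_GSource": "kIRG_GSource",
    "sc": "Script",
}
print(derive_cjk_flags(sample_aliases))
```

Deriving the flag this way would mean a newly added provisional Unihan property needs no separate entry for it.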
There was a problem hiding this comment.
It wasn't exactly generated, but I automated it somewhat via a spreadsheet. There was some research needed to determine the correct version number.
I'm hoping that we'll now only need to append, not regenerate.
There was a problem hiding this comment.
> Some of it seems independent, but ideally it would be moved
I was somewhat surprised that the version when each property was added was not defined elsewhere.
The rest of the values are to a certain extent convenience, but isCJKShowIfEmpty is to handle somewhat arbitrary decisions that Eric made when deciding whether an empty cjk property should be shown.
There was a problem hiding this comment.
> I was somewhat surprised that the version when each property was added was not defined elsewhere.
Yes, I have been annoyed by that. VersionedProperty checks whether the property assignment is the same for all code points as a proxy for the property not actually being defined...
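The proxy described here can be sketched as follows. This is a Python illustration with hypothetical names and toy data, not the actual VersionedProperty code: a property is treated as "not defined" in a version when every code point has the same value.

```python
from typing import Mapping


def looks_undefined(assignments: Mapping[int, str]) -> bool:
    """Proxy check: True when all code points share one value (or none exist)."""
    return len(set(assignments.values())) <= 1


# Toy data, purely illustrative.
before_property_existed = {0x41: "", 0x42: "", 0x4E00: ""}
after_property_existed = {0x41: "", 0x42: "", 0x4E00: "1.0"}

print(looks_undefined(before_property_existed))
print(looks_undefined(after_property_existed))
```

The weakness of such a proxy is that it cannot tell "not defined yet" apart from a property that is defined but happens to have a single value everywhere, which is one reason an explicit record of the version in which each property was added is useful.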
PR to make it easy to see what changes have been made to support UCDXML.