Commit 62c771b

UCDXML and TR42 v2 (#1030)
* Merged changes manually from ucdxml
* Review changes from Markus
* Merged changes manually from ucdxml
* Review changes from Markus
* Ran spotless
* Ran GenerateEnums
* More review changes from Markus
* Ran spotless
* Use default values where possible
* Corrections from Markus's review
1 parent 4fced4d commit 62c771b

88 files changed: 15282 additions and 34 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ perf-*.xml
 test-*.xml
 
 # Directories
+.idea/
 .settings/
 .vs/
 .vscode/

docs/ucdxml.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# UCDXML

Generating and validating the UCDXML files and their corresponding UAX42 report involves three separate processes:

1. Generate the UCDXML files.
2. (Optional) Compare the generated UCDXML files against each other (e.g., flat vs. grouped) or against
   previous versions.
3. Generate UAX42. This involves three steps:

   1. Generate the property value fragments. The updated versions should live in
      unicodetools/src/main/resources/org/unicode/uax42/fragments
   2. Generate the index.html and index.rnc files for UAX42.
   3. (Optional) Validate the UCDXML files using index.rnc.

## Generate UCDXML files

- You can generate flat or grouped versions of UCDXML.
- You can generate UCDXML files for:
  - the full range of code points
  - the Unihan code points
  - code points that are not Unihan code points

Each command below selects one combination of `--range` (ALL, UNIHAN, or NOUNIHAN) and `--output` (FLAT or GROUPED):

```
mvn compile exec:java '-Dexec.mainClass="org.unicode.xml.UCDXML"' '-Dexec.args="--range ALL --output FLAT"' -DCLDR_DIR=$(cd ../cldr; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
mvn compile exec:java '-Dexec.mainClass="org.unicode.xml.UCDXML"' '-Dexec.args="--range UNIHAN --output FLAT"' -DCLDR_DIR=$(cd ../cldr; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
mvn compile exec:java '-Dexec.mainClass="org.unicode.xml.UCDXML"' '-Dexec.args="--range NOUNIHAN --output FLAT"' -DCLDR_DIR=$(cd ../cldr; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
mvn compile exec:java '-Dexec.mainClass="org.unicode.xml.UCDXML"' '-Dexec.args="--range ALL --output GROUPED"' -DCLDR_DIR=$(cd ../cldr; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
mvn compile exec:java '-Dexec.mainClass="org.unicode.xml.UCDXML"' '-Dexec.args="--range UNIHAN --output GROUPED"' -DCLDR_DIR=$(cd ../cldr; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
mvn compile exec:java '-Dexec.mainClass="org.unicode.xml.UCDXML"' '-Dexec.args="--range NOUNIHAN --output GROUPED"' -DCLDR_DIR=$(cd ../cldr; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
```
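
The six commands differ only in their `--range` and `--output` values, so they can be derived from two short lists. The sketch below is not part of this commit: `UcdxmlCommands` is a hypothetical helper that only assembles and prints the command lines, using the flag values and system properties documented above. It does not invoke Maven or any unicodetools code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of unicodetools: it only builds the command
// strings from the documented flag values and prints them.
final class UcdxmlCommands {
    // Values taken from the documented --range and --output options.
    static final String[] RANGES = {"ALL", "UNIHAN", "NOUNIHAN"};
    static final String[] OUTPUTS = {"FLAT", "GROUPED"};

    /** Builds one mvn command per (output, range) pair, in the documented order. */
    static List<String> commands() {
        List<String> result = new ArrayList<>();
        for (String output : OUTPUTS) {
            for (String range : RANGES) {
                result.add(
                        "mvn compile exec:java"
                                + " '-Dexec.mainClass=\"org.unicode.xml.UCDXML\"'"
                                + " '-Dexec.args=\"--range " + range
                                + " --output " + output + "\"'"
                                + " -DCLDR_DIR=$(cd ../cldr; pwd)"
                                + " -DUNICODETOOLS_GEN_DIR=$(cd ../Generated; pwd)"
                                + " -DUNICODETOOLS_REPO_DIR=$(pwd)");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Print the commands so they can be reviewed or pasted into a shell.
        commands().forEach(System.out::println);
    }
}
```

Printing rather than executing keeps the sketch free of side effects. The shell substitutions such as `$(pwd)` are left for the shell to expand when a command is pasted.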

## Compare UCDXML files

After generating UCDXML files, you can compare:

- Different versions of the same type (range and output) of UCDXML file
- Grouped and flat versions of the same code point range

```
mvn compile exec:java '-Dexec.mainClass="org.unicode.xml.CompareUCDXML"' '-Dexec.args="-a {path to file} -b {path to file}"'
```

## Generate UAX42

### Step 1 - Generate property value fragments

```
mvn compile exec:java '-Dexec.mainClass="org.unicode.xml.GeneratePropertyValues"' -DCLDR_DIR=$(cd ../cldr ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
```

The UAX42 fragments live in unicodetools/src/main/resources/org/unicode/uax42/fragments

### Step 2 - Generate UAX42 index.html and index.rnc

```
mvn xml:transform -f $(cd ./unicodetools/src/main/resources/org/unicode/uax42; pwd) -Doutputdir=$(cd ../Generated/uax42; pwd)
```

### Step 3 - Validate generated UCDXML files

You'll need a [RELAX NG](https://relaxng.org/) schema validator.
This example uses [jing-trang](https://github.com/relaxng/jing-trang).

1. Clone and build [jing-trang](https://github.com/relaxng/jing-trang).
2. Run the following (`-c` tells jing that the schema uses the RELAX NG compact syntax):
   ```
   java -jar C:\_git\jing-trang\build\jing.jar -c UNICODETOOLS_REPO_DIR\uax\uax42\output\index.rnc <path to UCDXML file>
   ```

Note that the UCDXML file must be saved in NFD, because the Unihan syntax regular expressions expect NFD.
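
One way to produce an NFD copy of a file before validation is the JDK's `java.text.Normalizer`. The sketch below is not part of this commit: `NfdConverter` is a hypothetical name, the file is assumed to be UTF-8, and it is read fully into memory, which may be impractical for the largest UCDXML files. The JDK normalizes according to the Unicode version it ships with, which may lag behind the UCD version being generated, so a normalizer from a current ICU4J release may be a better fit.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.Normalizer;

// Hypothetical helper, not part of unicodetools.
final class NfdConverter {
    /** Returns the canonical decomposition (NFD) of the given text. */
    static String toNfd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: NfdConverter <input.xml> <output.xml>");
            return;
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        // Assumption: the input is UTF-8 and small enough to hold in memory.
        String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        Files.write(out, toNfd(text).getBytes(StandardCharsets.UTF_8));
    }
}
```

The resulting output file can then be passed to jing in place of the original.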

unicodetools/src/main/java/org/unicode/props/UcdProperty.java

Lines changed: 7 additions & 0 deletions
@@ -91,6 +91,7 @@ public enum UcdProperty {
     Named_Sequences_Prov(PropertyType.Miscellaneous, "NSP"),
     Standardized_Variant(PropertyType.Miscellaneous, null, ValueCardinality.Unordered, "SV"),
     Unicode_1_Name(PropertyType.Miscellaneous, "na1"),
+    emoji_variation_sequence(PropertyType.Miscellaneous, "emoji_variation_sequence"),
     kAlternateHanYu(PropertyType.Miscellaneous, "cjkAlternateHanYu"),
     kAlternateJEF(PropertyType.Miscellaneous, "cjkAlternateJEF"),
     kAlternateKangXi(PropertyType.Miscellaneous, "cjkAlternateKangXi"),
@@ -242,6 +243,12 @@ public enum UcdProperty {
     kZhuang(PropertyType.Miscellaneous, null, ValueCardinality.Unordered, "cjkZhuang"),
     kZhuangNumeric(
             PropertyType.Miscellaneous, null, ValueCardinality.Unordered, "cjkZhuangNumeric"),
+    normalization_correction_corrected(
+            PropertyType.Miscellaneous, "normalization_correction_corrected"),
+    normalization_correction_original(
+            PropertyType.Miscellaneous, "normalization_correction_original"),
+    normalization_correction_version(
+            PropertyType.Miscellaneous, "normalization_correction_version"),
 
     // Catalog
     Age(PropertyType.Catalog, Age_Values.class, null, "age"),

unicodetools/src/main/java/org/unicode/props/UcdPropertyValues.java

Lines changed: 4 additions & 0 deletions
@@ -766,6 +766,7 @@ public static East_Asian_Width_Values forName(String name) {
     // Emoji_DCM
     // Emoji_KDDI
     // Emoji_SB
+    // emoji_variation_sequence
     // Equivalent_Unified_Ideograph
     // FC_NFKC_Closure
     public enum General_Category_Values implements Named {
@@ -1789,6 +1790,9 @@ public static NFKD_Quick_Check_Values forName(String name) {
         }
     }
 
+    // normalization_correction_corrected
+    // normalization_correction_original
+    // normalization_correction_version
     public enum Numeric_Type_Values implements Named {
         Decimal("De"),
         Digit("Di"),

unicodetools/src/main/java/org/unicode/tools/emoji/LoadImage.java

Lines changed: 2 additions & 1 deletion
@@ -891,7 +891,8 @@ public static void doSb(String outputDir) throws IOException {
         // try {
         // copy(new URL(url), new File(outputDir + "/sb","sb_" + code + ".gif"));
         //// BufferedImage sourceImage = ImageIO.read(new URL(url));
-        //// writeImage(sourceImage,outputDir + "/sb","sb_" + code, "gif");
+        //// writeImage(sourceImage,outputDir + "/sb","sb_" + code,
+        // "gif");
         // System.out.println(code);
         // } catch (Exception e) {
         // System.out.println("Skipping " + code);
