Commit 5f2992b

more updates for 15.1 data workflow
1 parent 3ab01ec commit 5f2992b

4 files changed, with 50 additions and 44 deletions.

docs/build.md

Lines changed: 14 additions & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -217,9 +217,7 @@ See the top level `pom.xml` under `<properties>`.
217217

218218
The input data files for the Unicode Tools are checked into the repo since
219219
2012-dec-21:
220-
221-
* <https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd>
222-
* <https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd>
220+
* https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/
223221

224222
This is inside the unicodetools file tree, and the Java code has been updated to
225223
assume that. Any old Eclipse setup needs its path variables checked.
@@ -242,7 +240,9 @@ Starting with Unicode 15, we are developing most of the Unicode data files
242240
in this Unicode Tools project, and publish them to the Public folder
243241
only for alpha/beta/final releases.
244242
That is, we are reversing the flow of files.
245-
(See [issue #144](https://github.com/unicode-org/unicodetools/issues/144).)
243+
244+
See [data workflow](data-workflow.md). (Based on
245+
[issue #144](https://github.com/unicode-org/unicodetools/issues/144).)
246246

247247
We are also no longer generating and posting files with version suffixes.
248248
(We now generate files into an output folder with the Unicode version number.)
@@ -255,7 +255,7 @@ unversioned "dev" folders in this repo.
255255

256256
#### Unicode 15.1+ workflow
257257

258-
See data-workflow.md .
258+
See [data workflow](data-workflow.md).
259259

260260
### Unicode 15.0.0 changes
261261

@@ -374,10 +374,10 @@ to generate new files). For all the new ones:
374374
Make a pull request to incorporate these updates, and upload the generated files
375375
in a way that can be shared with ucd-dev.
376376

377-
Unicode 15 TODO:
378-
We plan to
377+
Unicode 15+:
379378
- make a commit for changes in input data files
380379
- copy the output files back into the input folders, review, and commit again
380+
381381
... instead of posting draft files elsewhere and re-ingesting them later.
382382

383383
Ideally, diff the files to check for any discrepancies. The script will do this
@@ -530,13 +530,16 @@ If there are new break rules (or changes), see
530530
Unicode.
531531
4. On Windows you can run these BATs to compare files: TODO??
532532
533-
### Upload for Ken Whistler & editorial committee
533+
### Upload for Ken Whistler & other reviewers
534534
535-
Unicode 15 TODO: See above; commit new input data, run tools, review output, copy back to input, commit, pull request...
535+
Unicode 15+: See above; commit new input data, run tools, review output, copy back to input, commit, pull request...
536536
537537
1. Check diffs for problems
538-
2. First drop for a version: Upload **all** files
539-
3. Subsequent drop for a version: Upload *only modified* files
538+
2. Ask for reviews on the pull request.
539+
3. For & during alpha & beta we publish whole snapshots of multiple repo data folders
540+
using publication scripts: See [data workflow](data-workflow.md).
541+
542+
We no longer post files to FTP folders, nor publish individual files without consistent changes in others.
540543
541544
### Invariant Checking
542545

docs/inputdata.md

Lines changed: 32 additions & 20 deletions
@@ -6,7 +6,9 @@ Starting with Unicode 15, we are developing most of the Unicode data files
66
in this Unicode Tools project, and publish them to the Public folder
77
only for alpha/beta/final releases.
88
That is, we are reversing the flow of files.
9-
(See [issue #144](https://github.com/unicode-org/unicodetools/issues/144).)
9+
10+
See [data workflow](data-workflow.md). (Based on
11+
[issue #144](https://github.com/unicode-org/unicodetools/issues/144).)
1012

1113
We are also no longer generating and posting files with version suffixes.
1214

@@ -15,6 +17,34 @@ and we continue to ingest them as before.
1517

1618
## Source Files
1719

20+
*Starting with Unicode 15.1, the “source of truth” for most data files is in the repo,
21+
and most of this section is obsolete. See [data workflow](data-workflow.md).
22+
The biggest exception is Unihan.zip, which we don't track in the repo; see the Unihan section below.
23+
Also, it's still useful to delete the BIN files/folders after changing data files.*
24+
25+
### Unihan
26+
27+
You may need to manually change the "Unihan-8.0.0d2 Folder" to "Unihan".
28+
29+
Unzip the Unihan.zip file into a "Unihan" subfolder.
30+
31+
Starting with Unicode 13, we split the Unihan data into single-property files
32+
and parse those.
33+
34+
Run the script that is checked in at
35+
[py/splitunihan.py](../py/splitunihan.py)
36+
with one argument, the path to the Unihan folder.
37+
38+
Ignore or delete the Unihan\*.txt files now. Do not check them into the tools
39+
any more.
40+
41+
Check for new and no-longer-present files (Unihan properties).
42+
`git add` and `git rm` as necessary.
43+
44+
### Fetching files from Public
45+
46+
Only for Unicode 15.0 and earlier:
47+
1848
The source files that you will need for a release such as 8.0.0 are in:
1949

2050
* [ftp://unicode.org/Public/8.0.0/ucd](ftp://unicode.org/Public/8.0.0/ucd)
@@ -68,6 +98,7 @@ files have the version suffix.
6898
### Removing Suffixes
6999

70100
Only for Unicode 14 and earlier:
101+
71102
For the ucd and uca files, you will have to remove the suffixes.
72103

73104
Tip: On Linux, you can remove version suffixes on the command line like this:
@@ -134,25 +165,6 @@ $ cd {workspace}/unicodetools/data/ucd/staging
134165
$ ../../desuffixucd.py .
135166
```
136167

137-
### Unihan
138-
139-
You may need to manually change the "Unihan-8.0.0d2 Folder" to "Unihan".
140-
141-
Unzip the Unihan.zip file into a "Unihan" subfolder.
142-
143-
Starting with Unicode 13, we split the Unihan data into single-property files
144-
and parse those.
145-
146-
Run the script that is checked in at
147-
[py/splitunihan.py](../py/splitunihan.py)
148-
with one argument, the path to the Unihan folder.
149-
150-
Ignore or delete the Unihan\*.txt files now. Do not check them into the tools
151-
any more.
152-
153-
Check for new and no-longer-present files (Unihan properties).
154-
`git add` and `git rm` as necessary.
155-
156168
## Original data file setup instructions
157169

158170
### 2. Download all of the UnicodeData files for each version into UCD_DIR.

docs/security.md

Lines changed: 0 additions & 7 deletions
@@ -2,13 +2,6 @@
22

33
## Modifying
44

5-
Create new revision directory, such as .../unicodetools/data/security/6.3.0. The
6-
folder will match the version of the UCD used (perhaps with an incrementing 3rd
7-
field).
8-
9-
* As usual, use `git cp` to copy the previous directory to the new one. Do not
10-
just "mkdir" and copy the files!
11-
125
To add or fix xidmodifications, look at source/removals.txt.
136

147
To add or fix confusables, there are multiple source files. Many were

docs/uca/index.md

Lines changed: 4 additions & 6 deletions
@@ -7,11 +7,12 @@ the character properties are pretty stable (coming up on the beta),
77
Ken inserts all of the new characters into the default sort order.
88

99
For a few releases, he has documented his incremental progress with valuable notes
10-
sent to the ucd-dev mailing list.
10+
sent to the properties mailing list (formerly the ucd-dev list).
1111
Markus has been taking the incremental file changes, and the notes, into this repo.
1212

1313
See the history of commits that changed decomps.txt and allkeys.txt.
1414
(We lost some of that history in the Unicode server crash of 2020.)
15+
- For UCA 15.1 see https://github.com/unicode-org/unicodetools/pull/403
1516
- For UCA 15 see https://github.com/unicode-org/unicodetools/pull/246
1617
- For UCA 14 see https://github.com/unicode-org/unicodetools/pull/71
1718
- For the collection of notes for UCA 10 see ducet.md.
@@ -34,12 +35,9 @@ for the CLDR/ICU FractionalUCA.txt data.
3435
2. We also need the UCA/DUCET files in
3536
https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/uca/dev
3637
When they become first available for a new version, or when they are updated:
37-
1. Note that the following steps are probably no longer necessary.
38-
Instead, we get the updated files from Ken, or we run the sifter tool, and
38+
1. We get the updated files from Ken, or we run the sifter tool, and
3939
update the files in .../data/uca/dev.
40-
1. Download UCA files (mostly allkeys.txt) from
41-
`https://www.unicode.org/Public/UCA/{beta version}/`
42-
1. Run `desuffixucd.py` (see the [inputdata](../inputdata.md) page)
40+
1. Download Ken's UCA files (allkeys.txt & decomps.txt).
4341
1. Update the input files for the UCA tools, at
4442
{this repo}/unicodetools/data/uca/dev
4543
3. You will use `org.unicode.text.UCA.Main` as your main class.
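
Note on the Unihan steps in docs/inputdata.md above: after splitting, the text says to "Check for new and no-longer-present files (Unihan properties)" and then `git add` and `git rm` as necessary. A minimal sketch of that check follows, assuming the split files sit in one flat folder and that a copy of the previous folder is available for comparison. The function name `compare_file_sets` and both folder arguments are hypothetical and are not part of the repo's scripts.

```python
from pathlib import Path


def compare_file_sets(old_dir, new_dir):
    """Return (added, removed) file names between two flat folders.

    Hypothetical helper for illustration only: it compares file names,
    not file contents, and does not descend into subfolders.
    """
    old_names = {p.name for p in Path(old_dir).iterdir() if p.is_file()}
    new_names = {p.name for p in Path(new_dir).iterdir() if p.is_file()}
    added = sorted(new_names - old_names)      # candidates for `git add`
    removed = sorted(old_names - new_names)    # candidates for `git rm`
    return added, removed
```

The two returned lists are only candidates. Review them by hand before running `git add` or `git rm`, since a renamed property file shows up once in each list.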
