Description
Attached is data_make.py, a python3 script designed to combine & replace data/data_generator.rb & data/charwidths.jl and support both UCD 13 & 14. Also attached is utf8proc.c.patch, a small change to utf8proc.c needed to support UCD 14.
Here are some of its features:
- Written in Python (easier to read & support?), only uses (a little) sed. Tested on Python 3.7.4.
- Doesn’t use an unspecified version of Ruby.
- Doesn’t use an unspecified version of Julia.
- Doesn’t require a previously built, unspecified, version of libutf8proc.
- Runs to completion in 5-6 seconds (about 10x as fast as data_generator.rb).
- Passes all utf8proc tests.
- No changes to the public API.
- Can generate a byte-for-byte identical utf8proc_data.c file compared to that contained in utf8proc 2.6.1.
- Can generate an equivalent utf8proc_data.c source file that is over 1.1 MB smaller.
- Writes informative header comments to the generated file.
- Can process the latest UCD 14-dev data files and generate a utf8proc_data.c that passes all current tests. [Due to the increased size of the UCD 14 data I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow. This requires a small patch to utf8proc.c.]
- Can halve the size of utf8proc_stage1table. (Saves 4352 bytes.)
- Can be used to create a utf8proc_properties table that is > 64,000 bytes smaller.
- Doesn’t need data/Uppercase.txt, data/Lowercase.txt or data/CharWidths.txt files.
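The 4352-byte saving quoted for utf8proc_stage1table is consistent with some simple arithmetic. The sketch below assumes the table holds one 16-bit entry per 256-code-point block, matching the `uc >> 8` lookup in utf8proc.c. It only checks the numbers and does not show how data_make.py achieves the reduction.

```python
# Assumption: utf8proc_stage1table has one 16-bit entry per block of
# 256 code points (the stage-1 index is uc >> 8).
CODE_POINTS = 0x110000   # U+0000 .. U+10FFFF
BLOCK_SIZE = 256         # code points covered by one stage-1 entry
ENTRY_BYTES = 2          # assumed 16-bit entries


def stage1_entries(code_points=CODE_POINTS, block_size=BLOCK_SIZE):
    """Number of stage-1 entries needed to cover the code space."""
    return code_points // block_size


def stage1_bytes(entry_bytes=ENTRY_BYTES):
    """Size in bytes of the full stage-1 table under the assumptions above."""
    return stage1_entries() * entry_bytes


if __name__ == "__main__":
    full = stage1_bytes()
    print("entries:", stage1_entries())
    print("full size:", full, "bytes")
    print("saved by halving:", full - full // 2, "bytes")
```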
Building with the (still in development) UCD 14 requires a new Makefile. I haven’t supplied that here as UCD 14 is still in a state of flux & the URLs are changing. (I can supply one if requested.)
UCD 14 has increased the size of the generated data. I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow. This requires the small patch to utf8proc.c contained in utf8proc.c.patch. With the patch applied, utf8proc.c still works with the original utf8proc_data.c as well as with the new-format UCD 13 & 14 data.
To use:
- Download & unpack a clean copy of utf8proc-2.6.1.tar.gz.
- Unpack & copy the attached data_make.py & utf8proc.c.patch into the utf8proc-2.6.1 dir.
- Run make -kC data to download the UCD 13 data files. [It’s OK if CharWidths.txt is not made.]
- Run patch < utf8proc.c.patch
- Run ./data_make.py --verbose --format=1 --output=utf8proc_data.c
- Run make check
Usage is:
data_make.py [-v|--verbose] [-f#|--format=#] [--fix26] [--cmap] [-o ‹out-file›|--output=‹out-file›] [‹data-dir›]
If unspecified, the output file is utf8proc_data.out.c.
If unspecified, the input data-dir is ./data.
If --format=0 alone is used (the default) then the output file should be identical to the original utf8proc_data.c file.
If --fix26 is used then the fixes described in issue #226 are applied to the tables.
If --cmap is used then the utf8proc_sequences table is split & the utf8proc_casemap table added. This requires the utf8proc.c.patch to be applied.
If --format=1 is used then --fix26 & --cmap are implied and the output file uses the new compact source form.
Using UCD 14 automatically forces --format=1 (thus --fix26 & --cmap too).
Using --verbose reports the options in effect & successful generation of the output file.
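The way the options imply one another can be sketched as follows. This is not the code from data_make.py, just a minimal argparse illustration of the rules described above. The function names and the ucd_major parameter are hypothetical; the real script presumably works out the UCD version from the data files.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the usage line above."""
    p = argparse.ArgumentParser(prog="data_make.py")
    p.add_argument("-v", "--verbose", action="store_true")
    p.add_argument("-f", "--format", type=int, default=0)
    p.add_argument("--fix26", action="store_true")
    p.add_argument("--cmap", action="store_true")
    p.add_argument("-o", "--output", default="utf8proc_data.out.c")
    p.add_argument("data_dir", nargs="?", default="./data")
    return p


def resolve_options(argv, ucd_major=13):
    """Parse argv, then apply the implication rules.

    ucd_major is an assumed stand-in for the detected UCD version.
    """
    opts = build_parser().parse_args(argv)
    if ucd_major >= 14:
        opts.format = 1      # UCD 14 forces --format=1
    if opts.format == 1:
        opts.fix26 = True    # --format=1 implies --fix26
        opts.cmap = True     # ... and --cmap
    return opts


if __name__ == "__main__":
    print(resolve_options(["--verbose", "--format=1",
                           "--output=utf8proc_data.c"]))
```

With these rules, passing --cmap alone against UCD 13 leaves --format at 0, while any invocation against UCD 14 ends up with all three settings enabled.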