New data file generator with support for UCD 13 & 14

Attached is `data_make.py`, a Python 3 script designed to combine & replace `data/data_generator.rb` & `data/charwidths.jl` while supporting both UCD 13 & 14.  Also attached is `utf8proc.c.patch`, a small change to `utf8proc.c` needed to support UCD 14.

# Here are some of its features:
* Written in Python (easier to read & support?), only uses (a little) sed.  Tested on Python 3.7.4.
* Doesn’t use an unspecified version of Ruby.
* Doesn’t use an unspecified version of Julia.
* Doesn’t require a previously built, unspecified, version of libutf8proc.
* Runs to completion in 5-6 seconds (about 10x as fast as data_generator.rb).
* Passes all utf8proc tests.
* No changes to the public API.
* Can generate a utf8proc_data.c file that is byte-for-byte identical to the one shipped in utf8proc 2.6.1.
* Can generate an equivalent utf8proc_data.c source file that is over 1.1 MB smaller.
* Writes informative header comments to the generated file.
* Can process the latest UCD 14-dev data files and generate a utf8proc_data.c that passes all current tests.
  [Because the UCD 14 data is larger, I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow.  This requires a small patch to utf8proc.c.]
* Can halve the size of utf8proc_stage1table.  (Saves 4352 bytes.)
* Can be used to create a utf8proc_properties table that is more than 64,000 bytes smaller.
* Doesn’t need data/Uppercase.txt, data/Lowercase.txt or data/CharWidths.txt files.
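
The byte-for-byte claim above is easy to check independently. The following is a minimal sketch, not part of `data_make.py`; the helper name `files_identical` is made up for illustration, and it assumes you still have the untouched `utf8proc_data.c` from the 2.6.1 tarball next to a file generated with `--format=0`.

```python
import filecmp


def files_identical(path_a, path_b):
    """Return True if the two files contain exactly the same bytes."""
    # shallow=False forces a content comparison instead of trusting
    # the os.stat() signature (size, mtime) of the two files.
    return filecmp.cmp(path_a, path_b, shallow=False)


if __name__ == "__main__":
    import sys

    # Example: python3 compare.py utf8proc_data.c utf8proc_data.out.c
    same = files_identical(sys.argv[1], sys.argv[2])
    print("identical" if same else "different")
```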

Building with the (still in development) UCD 14 requires a new Makefile.  I haven’t supplied one here because UCD 14 is still in a state of flux & the URLs are changing.  (I can supply one if requested.)
UCD 14 has increased the size of the generated data.  I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow.  This requires the small patch to utf8proc.c contained in utf8proc.c.patch.  With the patch applied, utf8proc.c still works with the original utf8proc_data.c as well as with the new format UCD 13 & 14 data.
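
To illustrate the kind of index overflow meant here, the sketch below assumes the sequence table is addressed through 16-bit unsigned offsets; that width is an assumption for illustration, as are the helper names `flatten` and `fits_uint16`. It is not the code used in `data_make.py`. It only shows why one large table can exceed the index range while two smaller tables, each with its own offsets, do not.

```python
UINT16_MAX = 0xFFFF  # largest value a 16-bit unsigned index can hold (assumed width)


def flatten(sequences):
    """Concatenate code point sequences into one table.

    Returns (table, offsets), where offsets[i] is the start of sequence i.
    """
    table, offsets = [], []
    for seq in sequences:
        offsets.append(len(table))
        table.extend(seq)
    return table, offsets


def fits_uint16(offsets):
    """True if every start offset can be stored in a 16-bit unsigned field."""
    return all(off <= UINT16_MAX for off in offsets)


# Made-up data: 40,000 two-element sequences overflow a single table,
# but each half fits once it gets its own table and offset space.
sequences = [[0x41, 0x301]] * 40000
_, all_offsets = flatten(sequences)
_, first_half = flatten(sequences[:20000])
_, second_half = flatten(sequences[20000:])
print(fits_uint16(all_offsets), fits_uint16(first_half), fits_uint16(second_half))
```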

# To use:
1. Download & unpack a clean copy of `utf8proc-2.6.1.tar.gz`.
2. Unpack & copy the attached `data_make.py` & `utf8proc.c.patch` into the `utf8proc-2.6.1` dir.
3. Run `make -kC data` to download the UCD 13 data files.  [It’s OK if CharWidths.txt is not made.]
4. Run `patch < utf8proc.c.patch`.
5. Run `./data_make.py --verbose --format=1 --output=utf8proc_data.c`
6. Run `make check`.

# Usage is:
```
data_make.py [-v|--verbose] [-f#|--format=#] [--fix26] [--cmap] [-o ‹out-file›|--output=‹out-file›] [‹data-dir›]
```
If unspecified, the output file is `utf8proc_data.out.c`.
If unspecified, the input data-dir is `./data`.
If `--format=0` alone is used (the default) then the output file should be identical to the original `utf8proc_data.c` file.
If `--fix26` is used then the fixes described in issue #226 are applied to the tables.
If `--cmap` is used then the `utf8proc_sequences` table is split & the `utf8proc_casemap` table added.  This requires the utf8proc.c.patch to be applied.
If `--format=1` is used then `--fix26` & `--cmap` are implied and the output file uses the new compact source form.
Using UCD 14 automatically forces `--format=1` (thus `--fix26` & `--cmap` too).
Using `--verbose` reports the options in effect & successful generation of the output file.  
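
The way the options imply one another can be summarised in a few lines of `argparse`. This is only a sketch of the rules described above, not the actual parsing code in `data_make.py`; the function names and the `ucd_major` parameter are hypothetical (how the script detects the UCD version is not shown here).

```python
import argparse


def build_parser():
    """Parser mirroring the usage line above (a sketch, not the real script)."""
    p = argparse.ArgumentParser(prog="data_make.py")
    p.add_argument("-v", "--verbose", action="store_true")
    p.add_argument("-f", "--format", type=int, default=0)
    p.add_argument("--fix26", action="store_true")
    p.add_argument("--cmap", action="store_true")
    p.add_argument("-o", "--output", default="utf8proc_data.out.c")
    p.add_argument("data_dir", nargs="?", default="./data")
    return p


def resolve_options(argv, ucd_major=13):
    """Parse argv and apply the implication rules.

    ucd_major is a hypothetical stand-in for the detected UCD version.
    """
    opts = build_parser().parse_args(argv)
    if ucd_major >= 14:       # assumption: UCD 14 (or later) forces format 1
        opts.format = 1
    if opts.format == 1:      # format 1 implies --fix26 and --cmap
        opts.fix26 = True
        opts.cmap = True
    return opts


if __name__ == "__main__":
    print(resolve_options(["--verbose", "--format=1", "--output=utf8proc_data.c"]))
```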

[data_make.zip](https://github.com/JuliaStrings/utf8proc/files/7108551/data_make.zip)

