New data file generator with support for UCD 13 & 14 #227

@chris0e3

Attached is data_make.py, a Python 3 script designed to combine & replace data/data_generator.rb & data/charwidths.jl and to support both UCD 13 & 14. Also attached is utf8proc.c.patch, a small change to utf8proc.c needed to support UCD 14.

Here are some of its features:

  • Written in Python (easier to read & support?), only uses (a little) sed. Tested on Python 3.7.4.
  • Doesn’t use an unspecified version of Ruby.
  • Doesn’t use an unspecified version of Julia.
  • Doesn’t require a previously built, unspecified, version of libutf8proc.
  • Runs to completion in 5-6 seconds (about 10x as fast as data_generator.rb).
  • Passes all utf8proc tests.
  • No changes to the public API.
  • Can generate a byte-for-byte identical utf8proc_data.c file compared to that contained in utf8proc 2.6.1.
  • Can generate an equivalent utf8proc_data.c source file that is over 1.1 MB smaller.
  • Writes informative header comments to the generated file.
  • Can process the latest UCD 14-dev data files and generate a utf8proc_data.c that passes all current tests.
    [Due to the increased size of the UCD 14 data I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow. This requires a small patch to utf8proc.c.]
  • Can halve the size of utf8proc_stage1table. (Saves 4352 bytes.)
  • Can be used to create a utf8proc_properties table that is > 64,000 bytes smaller.
  • Doesn’t need data/Uppercase.txt, data/Lowercase.txt or data/CharWidths.txt files.
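As a rough illustration of the stage-1 table figure above, here is a minimal sketch of a two-stage lookup table in Python. It assumes 256-code-point blocks, as in the existing utf8proc lookup (`uc >> 8` and `uc & 0xFF`). The names `build_two_stage` and `lookup` are hypothetical and are not taken from data_make.py. The arithmetic only shows that 4352 bytes is what one byte per stage-1 entry amounts to; how the script actually achieves the saving is not stated here.

```python
# Sketch only: helper names are hypothetical, not from data_make.py.
MAX_CODE_POINT = 0x10FFFF
BLOCK_SIZE = 256  # assumption: one stage-1 entry per 256 code points

# One stage-1 entry per block of the Unicode code space.
stage1_entries = (MAX_CODE_POINT + 1) // BLOCK_SIZE

# Assumption: entries shrink from 2 bytes to 1 byte each.
saving = stage1_entries * 2 - stage1_entries * 1


def build_two_stage(values, block_size=BLOCK_SIZE):
    """Split values into blocks, storing each distinct block only once."""
    stage1, stage2, seen = [], [], {}
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        if block not in seen:
            seen[block] = len(stage2)  # offset of this block in stage2
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1, stage2, uc, block_size=BLOCK_SIZE):
    """Return the value stored for code point uc."""
    return stage2[stage1[uc // block_size] + (uc % block_size)]
```

Identical blocks share one copy in the second stage, which is why the first-stage table stays small relative to the full code space.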

Building with the (still in development) UCD 14 requires a new Makefile. I haven’t supplied that here as UCD 14 is still in a state of flux & the URLs are changing. (I can supply one if requested.)
UCD 14 has increased the size of the generated data, so I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow. This requires the small patch to utf8proc.c contained in utf8proc.c.patch. With the patch applied, utf8proc.c still works with both the original utf8proc_data.c and the new-format UCD 13 & 14 data.

To use:

  1. Download & unpack a clean copy of utf8proc-2.6.1.tar.gz.
  2. Unpack & copy the attached data_make.py & utf8proc.c.patch into the utf8proc-2.6.1 dir.
  3. Run make -kC data to download the UCD 13 data files. [It’s OK if CharWidths.txt is not made.]
  4. Run patch < utf8proc.c.patch.
  5. Run ./data_make.py --verbose --format=1 --output=utf8proc_data.c
  6. Run make check.

Usage is:

data_make.py [-v|--verbose] [-f#|--format=#] [--fix26] [--cmap] [-o ‹out-file›|--output=‹out-file›] [‹data-dir›]

If unspecified the output file is utf8proc_data.out.c.
If unspecified the input data-dir is ./data.
If --format=0 alone is used (the default) then the output file should be identical to the original utf8proc_data.c file.
If --fix26 is used then the fixes described in issue #226 are applied to the tables.
If --cmap is used then the utf8proc_sequences table is split & the utf8proc_casemap table added. This requires the utf8proc.c.patch to be applied.
If --format=1 is used then --fix26 & --cmap are implied and the output file uses the new compact source form.
Using UCD 14 automatically forces --format=1 (thus --fix26 & --cmap too).
Using --verbose reports the options in effect & successful generation of the output file.
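
The option rules above can be sketched with argparse. This is only an illustration of how the implications combine, not the code in data_make.py. The function name `parse_options` and the `ucd_major` parameter are hypothetical, and how the script detects the UCD version is not described here.

```python
import argparse


def parse_options(argv, ucd_major=13):
    """Parse the documented options and apply the implied settings.

    ucd_major is a hypothetical stand-in for however the UCD version
    is detected from the data files.
    """
    parser = argparse.ArgumentParser(prog="data_make.py")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("-f", "--format", type=int, default=0)
    parser.add_argument("--fix26", action="store_true")
    parser.add_argument("--cmap", action="store_true")
    parser.add_argument("-o", "--output", default="utf8proc_data.out.c")
    parser.add_argument("data_dir", nargs="?", default="./data")
    opts = parser.parse_args(argv)

    # Using UCD 14 forces --format=1.
    if ucd_major >= 14:
        opts.format = 1
    # --format=1 implies --fix26 and --cmap.
    if opts.format == 1:
        opts.fix26 = True
        opts.cmap = True
    return opts
```

For example, `parse_options(["-f1"])` yields the same settings as passing `--format=1 --fix26 --cmap` explicitly.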

data_make.zip
