Reduce size of Unicode tables #145219
Conversation
@Kmeakin For that last change to use
https://godbolt.org/z/ef5ExG5Eo
I will open an issue against LLVM asking them to fix the latter. In the meantime, can we at least merge the commits that remove ASCII characters from the tables?
Commit 15acb0e introduced a panic when running `./x run tools/unicode-table-generator`. Fix it by undoing one of the refactors.
To make changes in table size obvious from git diffs
Include the sizes of the `to_lowercase` and `to_uppercase` tables in the total size calculations.
The `merge_ranges` function was very complicated and hard to understand. Fortunately, we can use `slice::chunk_by` to achieve the same thing.
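The range-merging idea can be sketched with `slice::chunk_by` (stable since Rust 1.77). This is an illustrative standalone version, not the generator's actual code: group runs of contiguous ranges, then collapse each run into one range.

```rust
use std::ops::Range;

// Hypothetical sketch of `merge_ranges` built on `slice::chunk_by`.
// Input must be sorted and non-overlapping.
fn merge_ranges(ranges: &[Range<u32>]) -> Vec<Range<u32>> {
    ranges
        // Keep consecutive ranges in the same chunk while each one
        // starts exactly where the previous one ended.
        .chunk_by(|a, b| a.end == b.start)
        // Collapse each chunk into a single covering range.
        .map(|chunk| chunk[0].start..chunk[chunk.len() - 1].end)
        .collect()
}

fn main() {
    let merged = merge_ranges(&[0..3, 3..5, 7..9, 9..10]);
    assert_eq!(merged, vec![0..5, 7..10]);
    println!("{merged:?}");
}
```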
Rewrite `generate_tests` to be more idiomatic.
The ASCII subset of Unicode is fixed and will never change, so we don't need to generate tables for it with every new Unicode version. This saves a few bytes of static data and speeds up `char::is_control` and `char::is_grapheme_extended` on ASCII inputs. Since the table lookup functions exported from the `unicode` module will give nonsensical results on ASCII input (and in fact will panic in debug mode), I had to add some private wrapper methods to `char` which check for ASCII-ness first.
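The wrapper pattern described above might look roughly like this sketch. The names (`is_control_char`, `non_ascii_table_lookup`) are hypothetical, and the generated table lookup is stood in for by the hard-coded C1 control range:

```rust
// Illustrative ASCII fast path: handle ASCII inline, and only consult
// the (non-ASCII-only) generated lookup for everything else.
fn is_control_char(c: char) -> bool {
    if c.is_ascii() {
        // ASCII is fixed forever: its control chars are U+0000..=U+001F and U+007F.
        c.is_ascii_control()
    } else {
        non_ascii_table_lookup(c)
    }
}

// Stand-in for the generated table lookup, which assumes non-ASCII input.
fn non_ascii_table_lookup(c: char) -> bool {
    debug_assert!(!c.is_ascii());
    // The non-ASCII part of the Cc category is the C1 controls.
    matches!(c as u32, 0x80..=0x9F)
}

fn main() {
    assert!(is_control_char('\u{0007}')); // BEL, ASCII control
    assert!(is_control_char('\u{0085}')); // NEL, C1 control
    assert!(!is_control_char('a'));
    println!("ok");
}
```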
`Cased` is a derived property - it is the union of the `Lowercase` property, the `Uppercase` property, and the `Titlecase_Letter` general category. We already have lookup tables for `Lowercase` and `Uppercase`, and `Titlecase_Letter` is very small. So instead of duplicating a lookup table for `Cased`, just test each of those properties in turn. This will probably be slower than the old approach, but it is not a public API: it is only used in `string::to_lower` when deciding whether a Greek "sigma" should be mapped to `ς` or to `σ`. This is a very rare case, so it should not be performance sensitive.
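The union check might look roughly like the sketch below. The `is_titlecase_letter` helper and its hard-coded handful of `Lt` codepoints are illustrative (the actual PR reuses generated lookups; the full `Lt` category is only about 31 codepoints):

```rust
// Sketch: derive `Cased` from properties we already have, instead of
// keeping a dedicated table for it.
fn is_cased(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase() || is_titlecase_letter(c)
}

// Illustrative subset of the Titlecase_Letter (Lt) category:
// the titlecased digraphs and a few Greek letters.
fn is_titlecase_letter(c: char) -> bool {
    matches!(
        c,
        '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}'
        | '\u{1FBC}' | '\u{1FCC}' | '\u{1FFC}'
    )
}

fn main() {
    assert!(is_cased('a'));
    assert!(is_cased('A'));
    assert!(is_cased('\u{01C5}')); // Dž, a titlecase letter
    assert!(!is_cased('1'));
    println!("ok");
}
```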
This was the PR that added the lookup table. The trade-off is execution speed versus the 256-byte table. Unfortunately, there aren't any benches in-tree, but the author of that PR provided a repo they instrumented with criterion. Some of those examples could be included, as well as benchmarks on the current corpora. I suspect the performance of the `match` is still a bit slower than the current table-based implementation.
If the number of codepoint ranges in a set is sufficiently small, it may be better to simply use a `match` expression rather than a lookup table. The instructions to implement the `match` may be slightly bigger than the table they replace (hard to predict; it depends on the architecture and whatever optimizations LLVM applies), but in return we eliminate the lookup tables and avoid the slower binary search.
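The two encodings of a small set can be contrasted directly. The three ranges below are placeholders chosen for illustration, not taken from the PR's tables:

```rust
// Illustrative small set of codepoint ranges, stored as (start, end) pairs.
const RANGES: [(u32, u32); 3] = [(0x0300, 0x036F), (0x1AB0, 0x1AFF), (0x20D0, 0x20FF)];

// Table encoding: binary search by range start, then check the end bound.
// This mirrors what the generated tables do today.
fn contains_table(c: char) -> bool {
    let c = c as u32;
    match RANGES.binary_search_by_key(&c, |&(start, _)| start) {
        Ok(_) => true,                  // exactly at a range start
        Err(0) => false,                // below the first range
        Err(i) => c <= RANGES[i - 1].1, // inside the preceding range?
    }
}

// `match` encoding: no static data, just a few compares.
fn contains_match(c: char) -> bool {
    matches!(c as u32, 0x0300..=0x036F | 0x1AB0..=0x1AFF | 0x20D0..=0x20FF)
}

fn main() {
    // The two encodings agree on every codepoint.
    for c in (0u32..=0x10FFFF).filter_map(char::from_u32) {
        assert_eq!(contains_table(c), contains_match(c));
    }
    println!("encodings agree");
}
```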
Since it's changing approach somewhat, nominating for team discussion -- especially in hopes that someone remembers the past work done on the tables and static size and such and can comment whether things were tried in the past.
According to https://www.unicode.org/policies/stability_policy.html#Property_Value, the set of codepoints in `Cc` will never change. So we can hard-code the patterns to match against instead of using a table.
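Because `Cc` is covered by the stability policy, the check can be written out directly; the two ranges are the C0 controls and DEL plus the C1 controls. A self-contained sketch, cross-checked against the standard library:

```rust
// Hard-coded Cc check: Unicode guarantees this category will never
// gain or lose codepoints, so no generated table is needed.
fn is_control_hardcoded(c: char) -> bool {
    // U+0000..=U+001F (C0 controls) and U+007F..=U+009F (DEL + C1 controls).
    matches!(c as u32, 0x00..=0x1F | 0x7F..=0x9F)
}

fn main() {
    // Agree with std's table-based `char::is_control` over a prefix
    // of the codepoint space that covers both ranges.
    for c in (0u32..=0x2FF).filter_map(char::from_u32) {
        assert_eq!(is_control_hardcoded(c), c.is_control());
    }
    println!("matches char::is_control");
}
```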
@bors2 try @rust-timer queue
Finished benchmarking commit (ce72b92): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below.

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @bors rollup=never.

Instruction count: Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

Max RSS (memory usage): Results (primary -2.9%, secondary -2.4%). A less reliable metric. May be of interest, but not used to determine the overall result above.

Cycles: Results (secondary -2.3%). A less reliable metric. May be of interest, but not used to determine the overall result above.

Binary size: Results (primary -0.0%, secondary -0.0%). A less reliable metric. May be of interest, but not used to determine the overall result above.

Bootstrap: 468.291s -> 469.372s (0.23%)
This PR has accumulated quite a few separate optimizations:
I think it would be best to split this into separate PRs so the impact of each optimization can be evaluated separately. @scottmcm, @joshtriplett is that ok with you?
☔ The latest upstream changes (presumably #145388) made this pull request unmergeable. Please resolve the merge conflicts.
We discussed this during this week's T-libs meeting. We'd like to see the refactorings of the table generator split out; they seem useful on their own. Regarding the optimizations, we'd like to see evidence that they provide measurable size benefits in programs that actually exercise that code, and that they improve, or at least do not significantly regress, performance. For ASCII fast paths we also want to see the perf impact on non-ASCII inputs.
Follow up to #145027.

Shave a few bytes from the tables by:
- replacing `Cased` with `Titlecase_letter`: 31420 bytes to 31050 bytes
- using `match` expressions for sufficiently small sets: 31050 bytes to 30754 bytes