Reduce size of Unicode tables #145219
Conversation
@Kmeakin For that last change to use
https://godbolt.org/z/ef5ExG5Eo
I will open an issue against LLVM asking them to fix the latter. In the meantime, can we at least merge the commits that remove ASCII characters from the tables?
Commit 15acb0e introduced a panic when running `./x run tools/unicode-table-generator`. Fix it by undoing one of the refactors.
To make changes in table size obvious from git diffs
Include the sizes of the `to_lowercase` and `to_uppercase` tables in the total size calculations.
The `merge_ranges` function was very complicated and hard to understand. Fortunately, we can use `slice::chunk_by` to achieve the same thing.
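The range-merging idea can be sketched with `slice::chunk_by` (stable since Rust 1.77). This is an illustrative standalone version, not the generator's actual code: group runs of contiguous ranges, then collapse each run into one range.

```rust
use std::ops::Range;

// Hypothetical sketch of `merge_ranges` built on `slice::chunk_by`.
// Input must be sorted and non-overlapping.
fn merge_ranges(ranges: &[Range<u32>]) -> Vec<Range<u32>> {
    ranges
        // Keep consecutive ranges in the same chunk while each one
        // starts exactly where the previous one ended.
        .chunk_by(|a, b| a.end == b.start)
        // Collapse each chunk into a single covering range.
        .map(|chunk| chunk[0].start..chunk[chunk.len() - 1].end)
        .collect()
}

fn main() {
    let merged = merge_ranges(&[0..3, 3..5, 7..9, 9..10]);
    assert_eq!(merged, vec![0..5, 7..10]);
    println!("{merged:?}");
}
```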
Rewrite `generate_tests` to be more idiomatic.
The ASCII subset of Unicode is fixed and will never change, so we don't need to generate tables for it with every new Unicode version. This saves a few bytes of static data and speeds up `char::is_control` and `char::is_grapheme_extended` on ASCII inputs. Since the table lookup functions exported from the `unicode` module will give nonsensical results on ASCII input (and in fact will panic in debug mode), I had to add some private wrapper methods to `char` which check for ASCII-ness first.
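The wrapper pattern described above might look roughly like this sketch. The names (`is_control_char`, `non_ascii_table_lookup`) are hypothetical, and the generated table lookup is stood in for by the hard-coded C1 control range:

```rust
// Illustrative ASCII fast path: handle ASCII inline, and only consult
// the (non-ASCII-only) generated lookup for everything else.
fn is_control_char(c: char) -> bool {
    if c.is_ascii() {
        // ASCII is fixed forever: its control chars are U+0000..=U+001F and U+007F.
        c.is_ascii_control()
    } else {
        non_ascii_table_lookup(c)
    }
}

// Stand-in for the generated table lookup, which assumes non-ASCII input.
fn non_ascii_table_lookup(c: char) -> bool {
    debug_assert!(!c.is_ascii());
    // The non-ASCII part of the Cc category is the C1 controls.
    matches!(c as u32, 0x80..=0x9F)
}

fn main() {
    assert!(is_control_char('\u{0007}')); // BEL, ASCII control
    assert!(is_control_char('\u{0085}')); // NEL, C1 control
    assert!(!is_control_char('a'));
    println!("ok");
}
```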
`Cased` is a derived property - it is the union of the `Lowercase` property, the `Uppercase` property, and the `Titlecase_Letter` general category. We already have lookup tables for `Lowercase` and `Uppercase`, and `Titlecase_Letter` is very small. So instead of duplicating a lookup table for `Cased`, just test each of those properties in turn. This will probably be slower than the old approach, but it is not a public API: it is only used in `string::to_lower` when deciding whether a Greek "sigma" should be mapped to `ς` or to `σ`. This is a very rare case, so it should not be performance sensitive.
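The union check might look roughly like the sketch below. The `is_titlecase_letter` helper and its hard-coded handful of `Lt` codepoints are illustrative (the actual PR reuses generated lookups; the full `Lt` category is only about 31 codepoints):

```rust
// Sketch: derive `Cased` from properties we already have, instead of
// keeping a dedicated table for it.
fn is_cased(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase() || is_titlecase_letter(c)
}

// Illustrative subset of the Titlecase_Letter (Lt) category:
// the titlecased digraphs and a few Greek letters.
fn is_titlecase_letter(c: char) -> bool {
    matches!(
        c,
        '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}'
        | '\u{1FBC}' | '\u{1FCC}' | '\u{1FFC}'
    )
}

fn main() {
    assert!(is_cased('a'));
    assert!(is_cased('A'));
    assert!(is_cased('\u{01C5}')); // Dž, a titlecase letter
    assert!(!is_cased('1'));
    println!("ok");
}
```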
This was the PR that added the lookup table. The trade-off is execution speed versus the 256-byte table. Unfortunately, there aren't any benches in-tree, but the author of that PR provided a repo they instrumented with criterion. Some of those examples could be included, as well as benchmarks on the current corpora. I suspect the performance of the `match` is still a bit slower than the current table-based implementation.
If the number of codepoint ranges in a set is sufficiently small, it may be better to simply use a `match` expression rather than a lookup table. The instructions to implement the `match` may be slightly bigger than the table they replace (hard to predict; it depends on the architecture and whatever optimizations LLVM applies), but in return we eliminate the lookup tables and avoid the slower binary search.
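The two encodings of a small set can be contrasted directly. The three ranges below are placeholders chosen for illustration, not taken from the PR's tables:

```rust
// Illustrative small set of codepoint ranges, stored as (start, end) pairs.
const RANGES: [(u32, u32); 3] = [(0x0300, 0x036F), (0x1AB0, 0x1AFF), (0x20D0, 0x20FF)];

// Table encoding: binary search by range start, then check the end bound.
// This mirrors what the generated tables do today.
fn contains_table(c: char) -> bool {
    let c = c as u32;
    match RANGES.binary_search_by_key(&c, |&(start, _)| start) {
        Ok(_) => true,                  // exactly at a range start
        Err(0) => false,                // below the first range
        Err(i) => c <= RANGES[i - 1].1, // inside the preceding range?
    }
}

// `match` encoding: no static data, just a few compares.
fn contains_match(c: char) -> bool {
    matches!(c as u32, 0x0300..=0x036F | 0x1AB0..=0x1AFF | 0x20D0..=0x20FF)
}

fn main() {
    // The two encodings agree on every codepoint.
    for c in (0u32..=0x10FFFF).filter_map(char::from_u32) {
        assert_eq!(contains_table(c), contains_match(c));
    }
    println!("encodings agree");
}
```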
Since it's changing approach somewhat, nominating for team discussion -- especially in hopes that someone remembers the past work done on the tables and static size and such and can comment whether things were tried in the past.
According to https://www.unicode.org/policies/stability_policy.html#Property_Value, the set of codepoints in `Cc` will never change. So we can hard-code the patterns to match against instead of using a table.
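Because `Cc` is covered by the stability policy, the check can be written out directly; the two ranges are the C0 controls and DEL plus the C1 controls. A self-contained sketch, cross-checked against the standard library:

```rust
// Hard-coded Cc check: Unicode guarantees this category will never
// gain or lose codepoints, so no generated table is needed.
fn is_control_hardcoded(c: char) -> bool {
    // U+0000..=U+001F (C0 controls) and U+007F..=U+009F (DEL + C1 controls).
    matches!(c as u32, 0x00..=0x1F | 0x7F..=0x9F)
}

fn main() {
    // Agree with std's table-based `char::is_control` over a prefix
    // of the codepoint space that covers both ranges.
    for c in (0u32..=0x2FF).filter_map(char::from_u32) {
        assert_eq!(is_control_hardcoded(c), c.is_control());
    }
    println!("matches char::is_control");
}
```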
@bors2 try @rust-timer queue
Finished benchmarking commit (ce72b92): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below.

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @bors rollup=never.

Instruction count: Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

Max RSS (memory usage): Results (primary -2.9%, secondary -2.4%). A less reliable metric. May be of interest, but not used to determine the overall result above.

Cycles: Results (secondary -2.3%). A less reliable metric. May be of interest, but not used to determine the overall result above.

Binary size: Results (primary -0.0%, secondary -0.0%). A less reliable metric. May be of interest, but not used to determine the overall result above.

Bootstrap: 468.291s -> 469.372s (0.23%)
This PR has accumulated quite a few separate optimizations:
I think it would be best to split this into separate PRs so the impact of each optimization can be evaluated separately. @scottmcm, @joshtriplett is that ok with you?
☔ The latest upstream changes (presumably #145388) made this pull request unmergeable. Please resolve the merge conflicts.
We discussed this during this week's T-libs meeting. We'd like to see the refactorings of the table generator split out; they seem useful on their own. Regarding the optimizations, we'd like to see evidence that they provide measurable size benefits in programs that actually exercise that code, and that they improve, or at least do not significantly regress, performance. For ASCII fast paths we also want to see the perf impact on non-ASCII inputs.
Follow up to #145027.

Shave a few bytes from the tables by:
- replacing `Cased` with `Titlecase_letter`: 31420 bytes to 31050 bytes
- using `match` expressions for sufficiently small sets: 31050 bytes to 30754 bytes