compare trie with packtab #4
base: tajp-icu4x-data
These binaries were used to compare sizes
vec![benchmark_fn("Composite lookup", move |b| {
    b.iter(|| {
        black_box(lookup::checksum_packtab(samples));
Change this to checksum_trie to compare performance with tango
use icu_provider::{DataMarker, DataRequest, DynamicDataProvider};
#[test]
fn packtab_matches_trie() {
Test to ensure that both packtab and trie return the same values
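For readers who want the shape of such a check without the ICU4X plumbing, here is a minimal sketch. It is not the test in this PR: `first_mismatch` is a hypothetical helper, and the two lookups are passed in as plain closures standing in for the PackTab and trie lookups. It walks every value from 0 to 0x10FFFF (surrogates included) and reports the first code point where the two disagree.

```rust
/// Return the first code point at which two lookup functions disagree, if any.
/// Hypothetical helper: `a` and `b` stand in for the PackTab and trie lookups.
fn first_mismatch(a: impl Fn(u32) -> u8, b: impl Fn(u32) -> u8) -> Option<u32> {
    // Exhaustive scan of the whole code point range, surrogates included.
    (0..=0x10FFFFu32).find(|&cp| a(cp) != b(cp))
}
```

Returning the offending code point instead of a bare boolean makes a failing assertion easier to debug.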
This is the raw data fed to PackTab
I studied the ICU CodePointTrie (aka UCPTrie) a bit at: https://unicode-org.github.io/icu/design/struct/utrie#ucptrie--codepointtrie My observations about how the two designs compare:
I think, ideally, the packTab algorithm should be contributed to ICU and possibly replace the UCPTrie implementation. The builder could then run the packTab algorithm and store the optimal table.
One reason the ICU team has not pursued this might be that, for the lookup to be fast, the final expression has to go through a compiler, i.e. the shift and mask values must be known at compile time. So we can't really store and load the shape of the partition and have lookup tables for that shape come from provided data. That is why the UCPTrie partition shapes are fixed by design: code that supports arbitrary partitions is significantly slower than code for a fixed partition. In other words, the data has to be compiled to get performant code, and that does not fit cleanly into a code-vs-data model.
packTab solves this by being a code generator. Since UCPTrie can also be used for code generation (as is the case at Parley), there should be an ICU builder that generates the code that is compiled to access the data tables. In other words: for each data table, if we are compiling it into code anyway, we might as well generate the optimal function to access that data. Another way to look at it is that UCPTrie doesn't make use of the fact that the compiler can optimize the table access code for this specific data table, whereas packTab does. It would be interesting to see a Java implementation that uses the JRE's compiler to compile table access code produced by a UCPTrie builder. :D
My point is, I think I understand why UCPTrie performs so badly, and yes, for the code-generation case packTab code can be faster because (1) it is based on the optimal partition, which minimizes data memory and so is more cache-friendly, and (2) it uses branch-free arithmetic operations on a variable and a constant, which translate to one instruction each.
Excuse my thinking aloud.
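The fixed-shape versus data-driven-shape distinction can be sketched roughly as follows. This is an illustration only, not code emitted by packTab or taken from UCPTrie: the tables, `SHIFT`, `MASK`, `Shape`, `lookup_fixed` and `lookup_dynamic` are all made-up names over a toy 64-entry domain. The first function has its shift and mask baked in as constants; the second has to read the shape at run time.

```rust
// Hypothetical toy table covering code points 0..64, split into 8 blocks of 8.
const SHIFT: u32 = 3;
const MASK: u32 = (1 << SHIFT) - 1;

// Maps each block of 8 code points to one of two deduplicated data blocks.
static BLOCK_INDEX: [u8; 8] = [0, 1, 0, 0, 1, 0, 0, 1];
static BLOCKS: [u8; 16] = [
    0, 0, 0, 0, 0, 0, 0, 0, // data block 0
    1, 1, 2, 2, 3, 3, 0, 0, // data block 1
];

/// Fixed shape: SHIFT and MASK are compile-time constants, so the compiler
/// can fold them into the shift/and instructions. Requires cp < 64.
fn lookup_fixed(cp: u32) -> u8 {
    let block = BLOCK_INDEX[(cp >> SHIFT) as usize] as u32;
    BLOCKS[((block << SHIFT) | (cp & MASK)) as usize]
}

/// Data-driven shape: the shift is only known at run time, so it has to be
/// loaded and the mask recomputed on every call.
struct Shape {
    shift: u32,
}

/// Same lookup as above, but with the partition shape supplied as data.
/// Requires cp to be within the domain covered by `index`.
fn lookup_dynamic(shape: &Shape, index: &[u8], blocks: &[u8], cp: u32) -> u8 {
    let mask = (1u32 << shape.shift) - 1;
    let block = index[(cp >> shape.shift) as usize] as u32;
    blocks[((block << shape.shift) | (cp & mask)) as usize]
}
```

Both functions return the same values for the same tables; the difference is only in what the compiler knows when it generates the access code.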
@sffc - I'm curious about your thoughts here and whether a change like this would be accepted by ICU4X (and how that might work). (Also happy to schedule a call to chat amongst ourselves sometime in the next several weeks.) You can read more about PackTab at this link: http://github.com/harfbuzz/packtab
TL;DR: PackTab searches for a table layout that minimises binary size and generates fully bit-packed lookup code for it.
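To give a feel for what bit-packing means at the lowest level, here is a minimal sketch that stores values in the range 0..=3 four to a byte and reads them back with a shift and a mask. It is a hand-written illustration, not PackTab output; `pack_2bit` and `get_2bit` are hypothetical names, and the real generator chooses bit widths and nesting levels per table.

```rust
/// Pack values in 0..=3 four per byte, lowest-order bits first.
/// Hypothetical helper for illustration; values above 3 are truncated.
fn pack_2bit(values: &[u8]) -> Vec<u8> {
    values
        .chunks(4)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                .fold(0u8, |byte, (i, &v)| byte | ((v & 0b11) << (i * 2)))
        })
        .collect()
}

/// Read entry `i` back out: pick the byte, shift the entry down, mask it.
fn get_2bit(packed: &[u8], i: usize) -> u8 {
    (packed[i >> 2] >> ((i & 3) << 1)) & 0b11
}
```

A table of 2-bit values stored this way takes a quarter of the space of a plain byte array, at the cost of one extra shift and mask per lookup.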
This PR tries to understand the differences in performance and binary size between ICU4X's CodePointTrie and PackTab-generated code.
Results
Binary Size (🏆 PackTab - 56 kB cheaper)
Details:
The PackTab raw data is ~40 kB; the .postcard ICU4X trie is ~70 kB. I think the 16 kB difference in binary size for the binaries is due to ICU4X pulling in more of the std library (which should be absorbed by any compellingly complex consumer).
Lookup (🏆 PackTab - ~37% faster)
Lookup w/ unsafe PackTab: (🏆 PackTab - ~64% faster)
We also tested against an unsafe variant (harfbuzz/packtab#6) and produced even better results.
NOTE: Results may vary with different lookup ranges. This benchmark was fairly simple.
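As a rough illustration of where the safe and unsafe variants differ, consider a two-level lookup where the second index depends on table data, so the compiler cannot prove it is in bounds. The sketch below is an assumption about the general technique, not the code from harfbuzz/packtab#6: `INDEX`, `DATA`, `lookup_safe` and `lookup_unchecked` are made-up names over a toy 16-entry domain (inputs wrap modulo 16).

```rust
// Hypothetical toy tables: 4 blocks of 4 entries, two distinct data blocks.
static INDEX: [u8; 4] = [0, 1, 1, 0];
static DATA: [u8; 8] = [0, 0, 0, 0, 7, 7, 8, 9];

/// Safe variant: the DATA access is bounds-checked, because the index is
/// derived from the contents of INDEX.
fn lookup_safe(cp: u32) -> u8 {
    let block = INDEX[(cp >> 2) as usize & 3] as usize;
    DATA[block * 4 + (cp & 3) as usize]
}

/// Unchecked variant: skips the bounds check on the DATA access.
fn lookup_unchecked(cp: u32) -> u8 {
    let block = INDEX[(cp >> 2) as usize & 3] as usize;
    // SAFETY: every INDEX entry is 0 or 1 and (cp & 3) <= 3, so the index is
    // at most 1 * 4 + 3 = 7, which is less than DATA.len().
    unsafe { *DATA.get_unchecked(block * 4 + (cp & 3) as usize) }
}
```

The unchecked version is only sound because the generator controls both tables; a hand-edited table could silently break the invariant stated in the SAFETY comment.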
Steps to reproduce
lookup::checksum_trie(samples, composite)
parley/parley_bench/src/benches.rs, line 140 in 8dbcb14
cd parley_bench && cargo export target/benchmarks -- bench --bench=main
cargo bench -q --bench=main -- compare target/benchmarks/main
NOTE: The current commit uses the unsafe PackTab variant. I think we would use this because of its significantly improved performance over performing the bounds checking.