
Conversation

@taj-p commented Nov 18, 2025

This PR investigates the differences in performance and binary size between ICU4X's CodePointTrie and PackTab-generated code.

Results

Binary Size (🏆 PackTab - 56 kB cheaper)


Details:

The raw PackTab data is ~40 kB; the .postcard ICU4X trie is ~70 kB. I think the 16 kB difference in the binaries themselves is due to ICU4X pulling in more of the std library (a cost that should be absorbed by any sufficiently complex consumer).

Lookup (🏆 PackTab - ~37% faster)


Lookup w/ unsafe PackTab (🏆 PackTab - ~64% faster)

We also tested against an unsafe variant (harfbuzz/packtab#6), which produced even better results.


NOTE: Results may vary with different lookup ranges. This benchmark was fairly simple.

Steps to reproduce

  1. Check out this branch.
  2. Set this line to lookup::checksum_trie(samples, composite):

     black_box(lookup::checksum_packtab(samples));

  3. Export the bench for use by Tango: cd parley_bench && cargo export target/benchmarks -- bench --bench=main
  4. Revert step 2.
  5. Compare PackTab with the trie: cargo bench -q --bench=main -- compare target/benchmarks/main

NOTE: The current commit uses the unsafe PackTab variant. I think we would use this variant because of its significantly improved performance over the bounds-checked version.
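
For context, the safe and unsafe variants differ only in whether the generated nibble accessor bounds-checks its slice index. Below is a minimal sketch of that difference; the table contents and function names are hypothetical and this is not the actual patch from harfbuzz/packtab#6.

// Illustrative only: a packTab-style 4-bit accessor with and without the
// slice bounds check. Data and names are made up for this sketch.

static EXAMPLE_U8: [u8; 4] = [0x21, 0x43, 0x65, 0x87];

// Safe variant: indexing is bounds-checked (a branch) on every lookup.
fn b4_safe(a: &[u8], i: usize) -> u8 {
    (a[i >> 1] >> ((i & 1) << 2)) & 15
}

// Unsafe variant: the caller guarantees i >> 1 < a.len(), so the bounds
// check and its branch are elided.
unsafe fn b4_unchecked(a: &[u8], i: usize) -> u8 {
    // SAFETY: the caller upholds i >> 1 < a.len().
    unsafe { (*a.get_unchecked(i >> 1) >> ((i & 1) << 2)) & 15 }
}

fn main() {
    assert_eq!(b4_safe(&EXAMPLE_U8, 3), 4);
    // SAFETY: 3 >> 1 == 1, which is within EXAMPLE_U8's bounds.
    assert_eq!(unsafe { b4_unchecked(&EXAMPLE_U8, 3) }, 4);
}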

Inline review comment (@taj-p): These binaries were used to compare sizes.


vec![benchmark_fn("Composite lookup", move |b| {
    b.iter(|| {
        black_box(lookup::checksum_packtab(samples));

Inline review comment (@taj-p): Change this to checksum_trie to compare performance with Tango.

use icu_provider::{DataMarker, DataRequest, DynamicDataProvider};

#[test]
fn packtab_matches_trie() {

Inline review comment (@taj-p): Test to ensure that both packtab and trie return the same values.

Inline review comment (@taj-p): This is the raw data fed to PackTab.

@behdad commented Nov 19, 2025

I studied the ICU CodePointTrie (aka UCPTrie) a bit at:

https://unicode-org.github.io/icu/design/struct/utrie#ucptrie--codepointtrie

My observations about how the two designs compare:

  • CodePointTrie has a direct-access array for ASCII, then a "faster" path for the BMP, then a fallback path covering all codepoints. If similarly fast lookups for the lower codepoints are desired, packTab can be run three times on truncated parts of the codepoint space, optimized more aggressively in each, and the results overlaid.
  • However, I think one of the slow points of the CodePointTrie in your testing is that most codepoints fall on the slow path of the CPTrie, behind two conditionals, whereas the entire packTab code is branch-free (a schematic contrast follows this list).
  • Finally, the packTab tables are optimized for size. Smaller tables also interact better with the CPU caches, making the branchless code quite fast.
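
To make the contrast concrete, here is a rough Rust sketch of the two lookup shapes described above. It is schematic only: the table layouts, index arithmetic, and function names are made up for illustration and do not match the actual ICU4X CodePointTrie internals or real packTab output.

// Schematic contrast between a multi-path trie lookup and a packTab-style
// lookup. Tables and constants are hypothetical.

// CodePointTrie-style lookup: ASCII fast path, BMP path, then a fallback
// for supplementary codepoints -- two conditionals sit in front of the
// path taken by most higher codepoints.
fn trie_style_get(cp: u32, index: &[u16], data: &[u8]) -> u8 {
    if cp < 0x80 {
        data[cp as usize] // direct ASCII access
    } else if cp < 0x1_0000 {
        let block = index[(cp >> 6) as usize]; // "faster" BMP path
        data[(block as u32 + (cp & 0x3F)) as usize]
    } else {
        // slower multi-stage path for everything above the BMP
        let i1 = index[(cp >> 12) as usize];
        let i2 = index[(i1 as u32 + ((cp >> 6) & 0x3F)) as usize];
        data[(i2 as u32 + (cp & 0x3F)) as usize]
    }
}

// packTab-style lookup: the shifts and masks are compile-time constants,
// and one straight-line expression serves every codepoint (real packTab
// output is fully branch-free; the safe indexing here still carries
// bounds checks).
fn packtab_style_get(cp: u32, stage1: &[u8], stage2: &[u8]) -> u8 {
    let b = stage1[((cp >> 4) & 0xFF) as usize] as u32;
    stage2[((b << 4) | (cp & 0xF)) as usize]
}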

I think, ideally, the packTab algorithm should be contributed to ICU and possibly replace the UCPTrie implementation. The builder could then run the packTab algorithm and store the optimal table. One reason this has not been pursued by the ICU team might be that, for this to be fast, you need to compile the final expression with a compiler; i.e., you want the shift and mask values to be known at compile time. So we can't really store and load the shape of the partition, and use lookup tables for that shape, from provided data. That is why the UCPTrie partition shapes are fixed by the design: supporting arbitrary partitions is significantly slower than code for a fixed partition.
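
A small sketch of that compile-time vs. data-driven distinction, with hypothetical names: the first lookup has its partition shape baked in as constants the compiler can fold into immediates, while the second loads the same shape from deserialized data on every call, which is the situation a pure code-vs-data design is stuck with.

// Hypothetical two-stage lookup, shown with the partition shape known at
// compile time versus loaded from data at run time.

// Shape fixed at compile time: SHIFT and MASK become immediates and the
// compiler can specialize the whole expression for this one table.
const SHIFT: u32 = 4;
const MASK: u32 = (1u32 << SHIFT) - 1;

fn get_static_shape(cp: u32, stage1: &[u8], stage2: &[u8]) -> u8 {
    let b = stage1[(cp >> SHIFT) as usize] as u32;
    stage2[((b << SHIFT) | (cp & MASK)) as usize]
}

// Shape carried in the data: every lookup must first load the shift and
// mask, and the compiler cannot fold them, so arbitrary partitions end up
// slower than code generated for one fixed partition.
struct Shape {
    shift: u32,
    mask: u32,
}

fn get_dynamic_shape(cp: u32, shape: &Shape, stage1: &[u8], stage2: &[u8]) -> u8 {
    let b = stage1[(cp >> shape.shift) as usize] as u32;
    stage2[((b << shape.shift) | (cp & shape.mask)) as usize]
}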

That is, we have to run the data through the compiler to get performant code, and that's not cleanly possible in a code-vs-data model. packTab solves this by being a code generator. Since UCPTrie can also be used for code generation (as in the Parley case), there should be an ICU builder that generates code that is compiled together with the data tables it accesses. In other words: for each data table, if you are compiling it into code anyway, we might as well generate the optimal function code to access this data.

Another way to look at it is that UCPTrie doesn't make use of the fact that the table-access code can be optimized by the compiler for this specific data table, whereas packTab does. It would be interesting to see a Java implementation that uses the JRE's compiler to compile table-access function code emitted by the UCPTrie builder. :D

My point is, I think I understand why UCPTrie performs so badly here, and yes, for the code-generation case, packTab code can be faster because (1) it is based on the optimal partition, which minimizes the data size and is therefore more cache-friendly, and (2) it uses branch-free arithmetic operations between a variable and a constant, each translating to a single instruction.

Excuse my thinking aloud.

@taj-p commented Nov 19, 2025


@sffc - I'm curious about your thoughts here and whether a change like this would be accepted by ICU4X (and how that might work). (Also happy to schedule a call to chat amongst ourselves sometime in the next several weeks.)

You can read more about PackTab at these links:

http://github.com/harfbuzz/packtab
https://docs.google.com/document/d/1Xq3owVt61HVkJqbLFHl73il6pcTy6PdPJJ7bSouQiQw/preview

TL;DR: PackTab searches for a packing that minimises binary size and generates fully bit-packed lookup code:

# Example: Unicode character categories with repeated patterns
# Values are all multiples of 5 in range [100, 135]
data = [
    100, 105, 110, 115, 120, 125, 130, 135,  # 0-7
    100, 105, 110, 115, 120, 125, 130, 135,  # 8-15 (repeat)
    105, 105, 105, 105, 120, 120, 120, 120,  # 16-23 (patterns)
    100, 100, 135, 135, 115, 115, 125, 125,  # 24-31 (patterns)
    110, 110, 110, 110, 110, 110, 110, 110,  # 32-39 (all same)
]

// PackTab generated code:

static category_u8: [u8; 20]=
[
   16, 50, 84,118, 16, 50, 84,118, 17, 17, 68, 68,  0,119, 51, 85,
   34, 34, 34, 34,
];

fn category_b4 (a: &[u8], i: usize) -> u8
{
  (a[i>>1]>>((i&1)<<2))&15
}
pub(crate) fn category_get (u: usize) -> u8
{
  if u<40 { 100+5*category_b4(&category_u8,(u) as usize) } else { 100 }
}

// What packTab Discovered:
// All values are multiples of 5 (100, 105, 110, 115, ...)
// All values ≥ 100 (minimum is 100)
 
// Formula: original_value = 100 + 5 * stored_value
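
As a quick sanity check (assuming the generated table and functions above are compiled into a crate), the accessor reproduces the original data array and falls back to the default for out-of-range indices:

fn main() {
    assert_eq!(category_get(0), 100);  // data[0]
    assert_eq!(category_get(1), 105);  // data[1]
    assert_eq!(category_get(16), 105); // start of the 105/120 pattern block
    assert_eq!(category_get(39), 110); // last entry of the all-110 block
    assert_eq!(category_get(40), 100); // out of range -> default value
}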
