@alexdowad
Contributor

mbstring's Unicode case conversion is table-driven, using minimal perfect hash tables. However, for small codepoint values, we bypass the hashtable lookup and just use hard-coded conversion logic (i.e. adding 0x20 to, or subtracting 0x20 from, codepoints in the appropriate ASCII range).

For upcasing and downcasing, the conditional which selects this fast path had already been tuned so that as many codepoint values as possible take it. For case folding, however, this had not been done.

This will give a small performance boost for case-folding Unicode text which includes non-breaking spaces, symbols like ¥ or ™, or accented Latin characters (used in many European languages).
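To illustrate the idea described above, here is a minimal sketch of a fold routine with a widened fast path. It is not the mbstring code: the names `fold_codepoint`, `fold_table_lookup` and `FOLD_FAST_PATH_LIMIT` are hypothetical, the table lookup is a stub that knows a single entry, and the limit of 0xB5 is an assumption, chosen because U+00B5 MICRO SIGN is the first codepoint above ASCII with an entry in Unicode's CaseFolding.txt.

```c
#include <stdint.h>

/* Assumption: U+00B5 MICRO SIGN is the first codepoint above ASCII which
 * has a case folding, so everything below it can skip the table. */
#define FOLD_FAST_PATH_LIMIT 0xB5u

/* Hypothetical stand-in for the hashtable lookup (the "slow path").
 * It knows only one mapping: MICRO SIGN -> GREEK SMALL LETTER MU. */
static uint32_t fold_table_lookup(uint32_t cp)
{
    return (cp == 0xB5u) ? 0x3BCu : cp;
}

static uint32_t fold_codepoint(uint32_t cp)
{
    if (cp < FOLD_FAST_PATH_LIMIT) {
        /* Fast path: only ASCII 'A'..'Z' change; they fold by adding 0x20. */
        if (cp >= 0x41u && cp <= 0x5Au) {
            return cp + 0x20u;
        }
        /* Everything else below the limit folds to itself. */
        return cp;
    }
    /* Slow path: fall back to the table. */
    return fold_table_lookup(cp);
}
```

With a limit of 0x80, codepoints such as U+00A0 (non-breaking space) or U+00A5 (¥) would go to the table only to come back unchanged; with the wider limit they return immediately.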

FYA @youkidearitai @ndossche @cmb69


@youkidearitai youkidearitai left a comment


@youkidearitai
Contributor

The entry 00B5; C; 03BC; # MICRO SIGN is special, but after it, it seems possible to fold the accented capital letters by simply adding 0x20.
What do you think?

00C0; C; 00E0; # LATIN CAPITAL LETTER A WITH GRAVE
00C1; C; 00E1; # LATIN CAPITAL LETTER A WITH ACUTE
00C2; C; 00E2; # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3; C; 00E3; # LATIN CAPITAL LETTER A WITH TILDE
00C4; C; 00E4; # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5; C; 00E5; # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6; C; 00E6; # LATIN CAPITAL LETTER AE
00C7; C; 00E7; # LATIN CAPITAL LETTER C WITH CEDILLA
00C8; C; 00E8; # LATIN CAPITAL LETTER E WITH GRAVE
00C9; C; 00E9; # LATIN CAPITAL LETTER E WITH ACUTE
00CA; C; 00EA; # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB; C; 00EB; # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC; C; 00EC; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; C; 00ED; # LATIN CAPITAL LETTER I WITH ACUTE
00CE; C; 00EE; # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF; C; 00EF; # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0; C; 00F0; # LATIN CAPITAL LETTER ETH
00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE
00D2; C; 00F2; # LATIN CAPITAL LETTER O WITH GRAVE
00D3; C; 00F3; # LATIN CAPITAL LETTER O WITH ACUTE
00D4; C; 00F4; # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5; C; 00F5; # LATIN CAPITAL LETTER O WITH TILDE
00D6; C; 00F6; # LATIN CAPITAL LETTER O WITH DIAERESIS
00D8; C; 00F8; # LATIN CAPITAL LETTER O WITH STROKE
00D9; C; 00F9; # LATIN CAPITAL LETTER U WITH GRAVE
00DA; C; 00FA; # LATIN CAPITAL LETTER U WITH ACUTE
00DB; C; 00FB; # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC; C; 00FC; # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD; C; 00FD; # LATIN CAPITAL LETTER Y WITH ACUTE
00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
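The entries listed above could be expressed as one more range check. This is only a sketch of the suggestion, not code from the PR; `is_latin1_capital` and `fold_latin1_capital` are hypothetical names. Note that U+00D7 (MULTIPLICATION SIGN) sits inside the range but is absent from the list, so it has to be excluded.

```c
#include <stdint.h>

/* True for U+00C0..U+00DE, except U+00D7 MULTIPLICATION SIGN, which is
 * inside the range but has no case folding. */
static int is_latin1_capital(uint32_t cp)
{
    return cp >= 0xC0u && cp <= 0xDEu && cp != 0xD7u;
}

/* Folds the capitals listed above by adding 0x20; any other codepoint is
 * returned unchanged (a real implementation would continue to the table). */
static uint32_t fold_latin1_capital(uint32_t cp)
{
    return is_latin1_capital(cp) ? cp + 0x20u : cp;
}
```

For example, U+00C0 (À) would fold to U+00E0 (à), while U+00D7 (×) would be left alone.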

@alexdowad
Contributor Author

> The entry 00B5; C; 03BC; # MICRO SIGN is special, but after it, it seems possible to fold the accented capital letters by simply adding 0x20. What do you think?

That's an interesting idea. It does seem that there are more ranges which could possibly be handled without doing a hashtable lookup. The problem is that as we keep adding more conditional tests to select different "fast paths", we make the "slow path" slower, because it has to go through all those tests before finally falling back to the hashtable.

As with everything, if you are interested in adding another fast path, I think testing would be a good idea.
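One way such testing might be approached is a small timing harness that runs a candidate fold function over the same input many times, once with text that hits the new fast path and once with text that always falls through to the table. The sketch below is an illustration only: `time_fold`, `fold_fn` and `fold_ascii_only` are hypothetical names, and `clock()` measures CPU time coarsely, so many rounds and repeated runs would be needed for meaningful numbers.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature shared by the fold variants being compared. */
typedef uint32_t (*fold_fn)(uint32_t);

/* Example candidate: ASCII-only fast path, everything else unchanged. */
static uint32_t fold_ascii_only(uint32_t cp)
{
    return (cp >= 0x41u && cp <= 0x5Au) ? cp + 0x20u : cp;
}

/* Runs `fn` over `input` for `rounds` passes and returns elapsed CPU
 * seconds. The XOR of all results is written to `*sink` so the compiler
 * cannot discard the calls as dead code. */
static double time_fold(fold_fn fn, const uint32_t *input, size_t len,
                        int rounds, uint32_t *sink)
{
    uint32_t acc = 0;
    clock_t start = clock();
    for (int r = 0; r < rounds; r++) {
        for (size_t i = 0; i < len; i++) {
            acc ^= fn(input[i]);
        }
    }
    clock_t end = clock();
    *sink = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Comparing the timings for both kinds of input, before and after adding a range check, would show whether the gain on the fast path outweighs the extra test on the slow path.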

@alexdowad alexdowad closed this Jan 10, 2026
@alexdowad alexdowad deleted the fold branch January 10, 2026 04:11
@alexdowad
Contributor Author

Thanks to @youkidearitai for review. I merged this tweak.

I have another optimization for case folding, which I hope to open another PR for soon (maybe tomorrow?)

@ndossche
Member

Pretty neat!
