Use fast path in more cases when doing case folding with mb_convert_case #20889

alexdowad · 2026-01-10T01:04:56Z

mbstring's Unicode case conversion is table-driven, using Minimal Perfect Hash tables. However, for small codepoint values, we bypass the hashtable lookup and just use hard-coded conversion logic (i.e. adding or subtracting 0x20 from the appropriate ASCII range).

For upcasing and downcasing, we had already optimized the conditional which sends execution down this fast path, to use the fast path for as many codepoint values as possible. However, for case folding, this had not been done.

This will give a small performance boost for case-folding Unicode text which includes non-breaking spaces, symbols like ¥ or ™, or accented Latin characters (used in many European languages).

FYA @youkidearitai @ndossche @cmb69
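To make the fast path concrete, here is a minimal C sketch of simple (one-to-one) case folding with a widened fast-path condition. It is not the code from this PR. The function names, the 0xB5 cutoff (chosen because MICRO SIGN, U+00B5, is the first non-ASCII entry in CaseFolding.txt), and the stubbed table lookup are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the hash-table lookup (the "slow path").
 * Only two entries from CaseFolding.txt are filled in for illustration. */
uint32_t fold_slow_sketch(uint32_t cp)
{
    assert(cp >= 0xB5); /* everything below this is handled by the fast path */
    switch (cp) {
    case 0xB5: return 0x3BC; /* MICRO SIGN -> GREEK SMALL LETTER MU */
    case 0xC0: return 0xE0;  /* LATIN CAPITAL LETTER A WITH GRAVE */
    default:   return cp;    /* a real table covers far more codepoints */
    }
}

/* Simple case folding with a fast path for small codepoint values. */
uint32_t fold_sketch(uint32_t cp)
{
    if (cp < 0xB5) { /* assumed cutoff, see the note above */
        if (cp >= 'A' && cp <= 'Z')
            return cp + 0x20; /* ASCII capitals fold to lowercase */
        return cp; /* e.g. NBSP (0xA0) and YEN SIGN (0xA5) fold to themselves */
    }
    return fold_slow_sketch(cp);
}
```

With a cutoff like this, a codepoint such as 0xA0 or 0xA5 returns after a couple of integer comparisons instead of going through a hash computation and table probe.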

youkidearitai

Confirmed from https://www.unicode.org/Public/17.0.0/ucd/CaseFolding.txt . Looks good to me.

youkidearitai · 2026-01-10T03:35:13Z

However, 00B5; C; 03BC; # MICRO SIGN is special. Beyond it, though, it may be possible to fold the accented capital letters by adding 0x20.
What do you think?

00C0; C; 00E0; # LATIN CAPITAL LETTER A WITH GRAVE
00C1; C; 00E1; # LATIN CAPITAL LETTER A WITH ACUTE
00C2; C; 00E2; # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3; C; 00E3; # LATIN CAPITAL LETTER A WITH TILDE
00C4; C; 00E4; # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5; C; 00E5; # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6; C; 00E6; # LATIN CAPITAL LETTER AE
00C7; C; 00E7; # LATIN CAPITAL LETTER C WITH CEDILLA
00C8; C; 00E8; # LATIN CAPITAL LETTER E WITH GRAVE
00C9; C; 00E9; # LATIN CAPITAL LETTER E WITH ACUTE
00CA; C; 00EA; # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB; C; 00EB; # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC; C; 00EC; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; C; 00ED; # LATIN CAPITAL LETTER I WITH ACUTE
00CE; C; 00EE; # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF; C; 00EF; # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0; C; 00F0; # LATIN CAPITAL LETTER ETH
00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE
00D2; C; 00F2; # LATIN CAPITAL LETTER O WITH GRAVE
00D3; C; 00F3; # LATIN CAPITAL LETTER O WITH ACUTE
00D4; C; 00F4; # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5; C; 00F5; # LATIN CAPITAL LETTER O WITH TILDE
00D6; C; 00F6; # LATIN CAPITAL LETTER O WITH DIAERESIS
00D8; C; 00F8; # LATIN CAPITAL LETTER O WITH STROKE
00D9; C; 00F9; # LATIN CAPITAL LETTER U WITH GRAVE
00DA; C; 00FA; # LATIN CAPITAL LETTER U WITH ACUTE
00DB; C; 00FB; # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC; C; 00FC; # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD; C; 00FD; # LATIN CAPITAL LETTER Y WITH ACUTE
00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
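The listing above covers U+00C0 through U+00DE with one gap: U+00D7 (MULTIPLICATION SIGN) has no entry. A range check for the suggested +0x20 rule could look like the following sketch. The function names are hypothetical and nothing here is taken from mbstring itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* True for the Latin-1 capitals in the listing: U+00C0..U+00DE minus U+00D7. */
bool is_latin1_capital_sketch(uint32_t cp)
{
    return cp >= 0xC0 && cp <= 0xDE && cp != 0xD7;
}

/* Folds the listed capitals by adding 0x20; leaves every other codepoint
 * unchanged (a real implementation would fall back to the table instead). */
uint32_t fold_latin1_capital_sketch(uint32_t cp)
{
    return is_latin1_capital_sketch(cp) ? cp + 0x20 : cp;
}
```

The explicit exclusion of 0xD7 matters: adding 0x20 to it would turn MULTIPLICATION SIGN into DIVISION SIGN (U+00F7), which is not a case pair.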

alexdowad · 2026-01-10T04:10:35Z

> However, 00B5; C; 03BC; # MICRO SIGN is special. Beyond it, though, it may be possible to fold the accented capital letters by adding 0x20. What do you think?

That's an interesting idea. It does seem that there are more ranges which could possibly be handled without doing a hashtable lookup. The problem is that as we keep adding more conditional tests to select different "fast paths", we make the "slow path" slower, because it has to go through all those tests before finally falling back to the hashtable.

As with everything, if you are interested in adding another fast path, I think testing would be a good idea.
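The trade-off described above can be sketched in C as follows. This is an illustration only: the function names, the 0xB5 cutoff, and the identity stub standing in for the hash-table lookup are assumptions, and the counter exists purely to show how many fast-path checks a codepoint goes through before a result is produced.

```c
#include <stdint.h>

/* Hypothetical stub for the hash-table lookup. It returns its input
 * unchanged, which is NOT what a real folding table would do. */
uint32_t table_lookup_stub(uint32_t cp)
{
    return cp;
}

/* Folding with two chained fast paths. *checks receives the number of
 * fast-path conditions evaluated before the function returned. */
uint32_t fold_chained_sketch(uint32_t cp, int *checks)
{
    *checks = 1;
    if (cp < 0xB5) /* fast path 1: ASCII and the low Latin-1 symbols */
        return (cp >= 'A' && cp <= 'Z') ? cp + 0x20 : cp;

    *checks = 2;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) /* fast path 2: Latin-1 capitals */
        return cp + 0x20;

    /* Slow path: every codepoint that gets here has paid for both checks. */
    return table_lookup_stub(cp);
}
```

A Greek or Cyrillic letter, for example, reaches the table only after failing both conditions, so each extra fast path adds a small fixed cost to all text outside the ranges it covers. Whether that cost is worth it depends on the input mix, which is why measuring is the sensible next step.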

alexdowad · 2026-01-10T04:11:59Z

Thanks to @youkidearitai for review. I merged this tweak.

I have another optimization for case folding, which I hope to open another PR for soon (maybe tomorrow?)

ndossche · 2026-01-10T09:46:09Z

Pretty neat!

alexdowad requested a review from youkidearitai as a code owner January 10, 2026 01:04

github-actions bot added the Extension: mbstring label Jan 10, 2026

youkidearitai approved these changes Jan 10, 2026

alexdowad closed this Jan 10, 2026

alexdowad deleted the fold branch January 10, 2026 04:11
