-
Notifications
You must be signed in to change notification settings - Fork 7.9k
[RFC] Add a locale for grapheme case-insensitive functions #18792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
I have a probably stupid question but I try to understand the use-case: Do you have an example where the locale is relevant to determine uppercase or lowercase characters? From what I understand for example the lower-case letter So what do we need a locale for if we want to find Or is the idea to also find If the later, then IMO that is faaaaar more than "just" adding a locale to grapheme_functions. |
@heiglandreas
Yes, right.
Yes, later it is. I think below:
Thanks for give to me to example. From https://unicode-org.github.io/icu/userguide/transforms/casemappings.html#full-language-specific-case-mapping , Turkish |
It's midnight. I'll fix this morning. thanks. |
I find this highly problematic as it completely changes the way the This is not to say that it isn't a valuable addition! But I think this should be a separate (set of) functions. So far the This is what most people expect when they pass a string to an This change here though not only compares without considering the case, it also does some character replacements based on the locale. So it is basically doing the following under the hood: $string = 'Aarhus';
$comparison = 'år';
$normalizedString = Transliterator::create('dk-lower')->transliterate($string);
$normalizedComparison = Transliterator::create('dk-lower')->transliterate($comparison);
grapheme_strpos($normalizedString, $normalizedComparison); Where all I would expect is grapheme_strpos(
mb_strtolower('Aarhus'),
mb_strtolower('år')
); To give you an example why that might become really messy: In german we have two distinct different words: When comparing them with a german locale they would not be identical. Fine. But when comparing them with a for example US-locale they would suddenly be the same. Why? Because in languages that do not have the By nonchalantly doing more than a case-insensitive comparison suddenly Or - for the Octoberfest fans: A This is a hard no from my side! |
@heiglandreas I saw your example code, so I understand this PR is many problem.
I'm going to trying to ICU depends, However, this means many problems. Your information is valuable. This RFC moved back to "Under Discussion". |
Dear @heiglandreas (or anyone) Are you satisfied with the current behavior of I think need to locale from your comment. Because changes to behavior of regions. |
I do see the issue with especially the turkish alphabet where the Capital letter "I" (U+0049) is associated with the lower case letter "ı" (U+0131) and the Capital letter "İ" (U+0131) is associated with the lower case letter "i" (U+0069) - so essentially matching the "wrong" letters to one another. TBH I am not sure this is something that can be handled by a standard locale as "en-EN" as that is region-specific and does - in it's general form - not say anything about the used alphabet. So for example "sr-RS" (Serbian as spoken in Serbia) does not say whether to use the cyrilic or the latin alphabet. One would need to be explicit and use "sr-Latn-RS" to also include the alphabet. Similarily would you need to explicitly specify whether to use the Latin, the Arabic or the Cyrilic alphabet in Azerbaijani like via "az-Latn-AZ" or "az-Cyr-AZ". The best option IMO with the current tools would be something like this: $t = Transliterator::create('tr-Lower');
var_dump(grapheme_strpos(
$t->transliterate('İ'),
$t->transliterate('i')
)); But this uses an ID from the ICUs Transliterator which has only very little to do with a locale. So from my point of view a new set of functions (names something like So in this case a Sidenote: mb_strtolower returns grapheme_stripos('İ', 'i̇'); will output |
@heiglandreas Just a moment, please. |
ccc1392
to
10b43c2
Compare
I add the strength for grapheme functions. Also affect to Ideographic characters. So that means affect to CJK characters(Japanese says Kanji(漢字)). |
This RFC is accepted. So ready for review. |
acfcfde
to
7602b2e
Compare
@php/release-managers-85 Is this need your approvals? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the ping. RM wise, no objections. Leaving the implemention review to others.
Feels worth it to add a single line into NEWS as well before merging.
7602b2e
to
9d79710
Compare
No objections from me I may have a better look tomorrow. |
@youkidearitai please note that my review is not technical, just RM process. So getting a technical approval is also required :) |
ext/intl/grapheme/grapheme_string.c
Outdated
@@ -200,17 +202,19 @@ PHP_FUNCTION(grapheme_stripos) | |||
PHP_FUNCTION(grapheme_strrpos) | |||
{ | |||
char *haystack, *needle; | |||
size_t haystack_len, needle_len; | |||
char *locale = ""; | |||
size_t haystack_len, needle_len, locale_len; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
little inconsistency here, maybe locale_len = 0;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Added.
intl | ||
--FILE-- | ||
<?php | ||
var_dump(grapheme_stripos("abc", "abc", 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add more tests with invalid locales please ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fine by me. let s wait another approval before merging though.
Add
$locale
parameter for grapheme case insensitive functions.RFC: https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive