This repository was archived by the owner on Jul 22, 2024. It is now read-only.
Unicode word count #3
Open
The code calls str_word_count in many places. The problem with this function is that it is not Unicode-aware: when it is given Unicode (UTF-8) strings it returns an excessively high number of words, which can make the spam detector miss some spam.
The only substitute function that worked fine for me was this one:
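/**
 * Returns the number of words in a Unicode (UTF-8) string.
 *
 * @param string $string Text to analyse
 * @param int    $mode   0 returns the word count; any other value returns the list of words
 * @return array|int
 */
function utf8WordCount($string, $mode = 0) {
    // Create the word break iterator once and reuse it across calls.
    static $it = NULL;
    if (is_null($it)) {
        $it = IntlBreakIterator::createWordInstance(ini_get('intl.default_locale'));
    }
    $l = 0; // offset where the current segment starts
    $it->setText($string);
    $ret = $mode == 0 ? 0 : array();
    if (IntlBreakIterator::DONE != ($u = $it->first())) {
        do {
            // Skip segments that are not words (whitespace, punctuation).
            if (IntlBreakIterator::WORD_NONE != $it->getRuleStatus()) {
                if ($mode == 0) {
                    ++$ret;
                } else {
                    // Boundaries are offsets into the UTF-8 byte string, hence substr().
                    $ret[] = substr($string, $l, $u - $l);
                }
            }
            $l = $u;
        } while (IntlBreakIterator::DONE != ($u = $it->next()));
    }
    return $ret;
}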
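To see the difference on a concrete string, here is a minimal, self-contained sketch that puts the two approaches side by side. It assumes the intl extension is loaded and that the file is saved as UTF-8. The helper name countWordsIntl, the 'en_US' locale and the sample text are made up for illustration and are not part of the project.

```php
<?php
// Hypothetical helper (name and default locale are illustrative only).
// It counts words the same way as utf8WordCount() in mode 0.
function countWordsIntl(string $text, string $locale = 'en_US'): int
{
    $it = IntlBreakIterator::createWordInstance($locale);
    $it->setText($text);

    $count = 0;
    $it->first(); // position the iterator at the start of the text
    while ($it->next() !== IntlBreakIterator::DONE) {
        // The rule status describes the segment that ends at the current
        // boundary; WORD_NONE marks whitespace and punctuation.
        if ($it->getRuleStatus() !== IntlBreakIterator::WORD_NONE) {
            $count++;
        }
    }
    return $count;
}

// Sample text mixing accented Latin and Cyrillic words.
$text = "naïve café, Привет мир";

// str_word_count() works on bytes and depends on the current locale,
// so its result for this string varies between setups.
echo "str_word_count: ", str_word_count($text), PHP_EOL;
echo "countWordsIntl: ", countWordsIntl($text), PHP_EOL;
```

Running this against real spam samples should show how far off the current counts are. The sketch builds a new iterator on every call to stay short; the static iterator in the function above avoids that cost.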