This repository was archived by the owner on Jul 22, 2024. It is now read-only.

Unicode word count #3

@creativefctr

The code calls str_word_count in many places. The problem is that this function is not Unicode-aware: when given a Unicode string it returns an excessively high word count, which causes the spam detector to miss some spam.
The only substitute that worked well for me was this function, which requires the intl extension:

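/**
 * Returns the number of words in a Unicode (UTF-8) string, or the words themselves.
 *
 * @param string $string UTF-8 encoded text
 * @param int    $mode   0 returns the word count; any other value returns an array of words
 * @return array|int
 */
function utf8WordCount($string, $mode = 0) {
    // Reuse one break iterator across calls.
    static $it = NULL;

    if (is_null($it)) {
        $it = IntlBreakIterator::createWordInstance(ini_get('intl.default_locale'));
    }

    $l = 0; // offset where the current segment starts
    $it->setText($string);
    $ret = $mode == 0 ? 0 : array();
    if (IntlBreakIterator::DONE != ($u = $it->first())) {
        do {
            // WORD_NONE marks spaces and punctuation; skip those segments.
            if (IntlBreakIterator::WORD_NONE != $it->getRuleStatus()) {
                if ($mode == 0) {
                    ++$ret;
                } else {
                    // The segment between the previous and current boundary is one word.
                    $ret[] = substr($string, $l, $u - $l);
                }
            }
            $l = $u;
        } while (IntlBreakIterator::DONE != ($u = $it->next()));
    }

    return $ret;
}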
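
For installations without the intl extension, a rougher fallback might look like the sketch below. This is not the approach above: it swaps the break iterator for a PCRE pattern in /u mode that counts runs of letters, digits and combining marks. The name regexWordCount is hypothetical, and the pattern will undercount in scripts that do not put spaces between words (Chinese, Japanese, Thai), so treat it as an approximation only.

```php
<?php
// Hypothetical fallback, not part of the original code base.
// Counts runs of Unicode letters, digits and combining marks.
function regexWordCount(string $text): int
{
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    $count = preg_match_all('/[\p{L}\p{N}\p{M}]+/u', $text);

    // preg_match_all() returns false on error (e.g. invalid UTF-8 input).
    return $count === false ? 0 : $count;
}

$sample = 'Привет мир';

var_dump(str_word_count($sample)); // byte-oriented, result depends on the locale
var_dump(regexWordCount($sample)); // int(2)
```

Because the pattern only looks at character classes, apostrophes and hyphens split words here ("don't" counts as two), which the break iterator handles better.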
