Description
Basic Information
The problem is hard to see: long words containing multi-byte characters never make it into log_search_words; they are silently dropped.
There are lots of subtleties, but the core issue is that a substring is taken with a function that is not multi-byte safe.
The sequence of events:
- Given a long multi-byte word in a new topic subject, like this: 三藩市道德委員會收到投訴:針對政治捐款「打包」組織
- It is passed through text2words, Subs.php, line 5354
- From there, it is passed to truncate, Load.php, line 225
- On line 225, a byte-based (non-mb) substr is taken, leaving a corrupt, invalid UTF-8 final character: 三藩市道德委�
- The now-invalid UTF-8 string is passed to $smcFunc['strlen'], Load.php, line 182
- The preg_match on line 182 fails, passing null to PHP's strlen(), also on line 182
- strlen() issues the warning "Passing null to parameter # 1 ($string) of type string is deprecated", which is suppressed
- The word is never stored.
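The truncation step in the sequence above can be reproduced outside SMF. The following is a minimal Python sketch, not the SMF code: the helper names are hypothetical, and the 20-byte limit is an assumption chosen because it yields the same six intact characters plus a broken seventh as shown above.

```python
# Illustration only: mimics a byte-based substr on a UTF-8 string.
SUBJECT = "三藩市道德委員會收到投訴:針對政治捐款「打包」組織"
LIMIT = 20  # assumed byte limit, not taken from the SMF source


def byte_truncate(text: str, limit: int) -> bytes:
    """Cut the UTF-8 encoding at `limit` bytes, ignoring character boundaries."""
    return text.encode("utf-8")[:limit]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


cut = byte_truncate(SUBJECT, LIMIT)
# Each of the leading CJK characters takes 3 bytes, so 20 bytes ends
# two bytes into the seventh character and the result is not valid UTF-8.
print(is_valid_utf8(cut))
# Decoding leniently shows the intact prefix followed by a replacement character.
print(cut.decode("utf-8", errors="replace"))
```

Any consumer that requires valid UTF-8 (such as a regex in Unicode mode) will reject the truncated bytes, which matches the failure described on line 182.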
Note: if text2words is called during a background task, an error is logged:
Cron error: 8192: strlen(): Passing null to parameter # 1 ($string) of type string is deprecated (load.php, line 182)
This error is suppressed in the app, because deprecation errors are still suppressed in index.php, but it is not suppressed in cron.php.
Similar (but different) report: #6405
Bigger issue? The term above isn't actually a word; it's a sentence...
This issue exists in both 2.1 and 3.0. Even after cutting over to UTF8MB4 in 3.0, it may still exist, depending on whether and how the SMF truncate function is rewritten.
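For reference, here is a minimal sketch of what a boundary-safe byte truncation could look like. It is written in Python purely to show the technique; the function name is hypothetical and this is not a proposed patch to the SMF truncate function. In PHP, the mbstring function mb_strcut offers similar byte-limited, character-boundary-aware cutting and might be one option for a rewrite.

```python
# Illustration only: truncate to a byte budget without splitting a character.
SUBJECT = "三藩市道德委員會收到投訴:針對政治捐款「打包」組織"


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of `text` whose UTF-8 encoding fits in `max_bytes`."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    end = max_bytes
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the first excluded
    # byte is a continuation byte, the cut falls inside a character, so
    # back up until the cut lands on a lead byte.
    while end > 0 and (encoded[end] & 0xC0) == 0x80:
        end -= 1
    return encoded[:end].decode("utf-8")


print(truncate_utf8(SUBJECT, 20))
```

With a 20-byte budget this keeps the six complete characters (18 bytes) and drops the partial seventh, so the result stays valid UTF-8 and can be measured and stored normally.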
Steps to reproduce
- Create a new post with this in the subject: 三藩市道德委員會收到投訴:針對政治捐款「打包」組織
- Post it
Expected result
A word in log_search_words
Actual result
No words in log_search_words
Version/Git revision
3.0 alpha 2 & 2.1.4
Database Engine
All
Database Version
8.4
PHP Version
8.3.8
Logs
No response
Additional Information
No response