
[2.1]: Long multi-byte words dropped in log_search_words #8312

@sbulen

Basic Information

The problem here is hard to see: long words containing multi-byte characters never make it into log_search_words; they are silently dropped.

There are lots of subtleties here, but the core issue is that a substring is taken with a function that is not multi-byte safe.

The sequence of events:

  • Start with a long multi-byte word in a new topic subject, like this: 三藩市道德委員會收到投訴:針對政治捐款「打包」組織
  • It is passed through text2words (Subs.php, line 5354)
  • From there, it is passed to truncate (Load.php, line 225)
  • On line 225, a (non-mb) substr is taken, which cuts through the middle of a character and leaves a corrupt, invalid UTF-8 final character: 三藩市道德委�
  • The now-invalid UTF-8 string is passed to $smcFunc['strlen'] (Load.php, line 182)
  • The preg_match on line 182 fails, passing null to PHP's strlen(), also on line 182
  • strlen issues the warning "Passing null to parameter # 1 ($string) of type string is deprecated", which is suppressed
  • The word is never stored.
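
The truncation step in the sequence above can be illustrated at the byte level. This is only a sketch in Python, not SMF's PHP code: the 20-byte limit is an assumption chosen so that the cut lands inside a character, and `is_valid_utf8` is a hypothetical helper. Each of the leading characters in the subject occupies 3 bytes in UTF-8, so a cut at byte 20 keeps six whole characters plus two stray bytes of the seventh.

```python
# Illustration only: SMF is written in PHP; this mirrors a byte-based substr.
subject = "三藩市道德委員會收到投訴:針對政治捐款「打包」組織"
MAX_BYTES = 20  # assumed limit, not taken from SMF's source

raw = subject.encode("utf-8")
cut = raw[:MAX_BYTES]  # byte-based cut, ignoring character boundaries


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(raw))  # the full subject is valid UTF-8
print(is_valid_utf8(cut))  # the byte-truncated copy is not
# Decoding with replacement shows the damaged tail as a replacement character.
print(cut.decode("utf-8", errors="replace"))
```

Any UTF-8-aware function that receives the truncated bytes is entitled to reject them, which is consistent with the failure described for line 182.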

Note: if text2words is called during a background task, an error is logged:
Cron error: 8192: strlen(): Passing null to parameter # 1 ($string) of type string is deprecated (load.php, line 182)

This error is suppressed in the app, because deprecation errors are still suppressed in index.php, but they are not suppressed in cron.php.

Similar (but different) report: #6405

Bigger issue? The above term isn't actually a word, it's a sentence...

This issue exists in both 2.1 & 3.0. Even after cutting over to UTF8MB4 in 3.0, it may still exist, depending on whether and how the SMF truncate function is rewritten.
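
For reference, one way a byte-limited truncation can respect character boundaries is to back up past UTF-8 continuation bytes (those of the form 10xxxxxx) before cutting. The following is a minimal sketch in Python under that assumption. `truncate_utf8` is a hypothetical name, it assumes a non-negative `max_bytes`, and it does not describe how SMF's truncate function is or will be implemented.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    end = max_bytes
    # raw[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut is mid-character, so back up
    # until the cut sits just before a lead byte.
    while end > 0 and (raw[end] & 0xC0) == 0x80:
        end -= 1
    return raw[:end].decode("utf-8")


subject = "三藩市道德委員會收到投訴:針對政治捐款「打包」組織"
print(truncate_utf8(subject, 20))  # 三藩市道德委
```

With the assumed 20-byte limit, the result is six whole characters (18 bytes) instead of a string ending in a partial character.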

Steps to reproduce

  1. Create a new post with this in the subject: 三藩市道德委員會收到投訴:針對政治捐款「打包」組織
  2. Post it

Expected result

A word in log_search_words

Actual result

No words in log_search_words

Version/Git revision

3.0 alpha 2 & 2.1.4

Database Engine

All

Database Version

8.4

PHP Version

8.3.8

Logs

No response

Additional Information

No response


Labels

Charset/Encoding: UTF8 & mb4 encoding related issues
