Search issue - multi-byte words truncated #6405

@sbulen

Description

If a post contains multi-byte characters, they are converted to HTML entities, which can take up to 8 or 9 bytes per character. The problem is that search words are truncated at 20 characters, so any run of multi-byte characters can end up truncated in the middle of an HTML entity, e.g.:
image

Note the truncated HTML entities, e.g., &# and &#66. This is from the log_search_subjects table.
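A minimal Python sketch of the effect, not SMF's actual code: `to_numeric_entities` and `truncate_naive` are hypothetical helpers, and the only figure taken from this report is the 20-character limit. It encodes the first word of the sample post and cuts the result at 20 characters.

```python
MAX_WORD_LEN = 20  # truncation length described in this issue


def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal HTML entity (assumed behavior)."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


def truncate_naive(word):
    """Cut the encoded word at a fixed length, ignoring entity boundaries."""
    return word[:MAX_WORD_LEN]


# First word of the sample post: U+1033C, U+10330, U+10332 (Gothic letters).
word = "\U0001033C\U00010330\U00010332"
encoded = to_numeric_entities(word)
truncated = truncate_naive(encoded)

print(encoded)    # &#66364;&#66352;&#66354;  (24 characters for a 3-letter word)
print(truncated)  # &#66364;&#66352;&#66      (last entity is cut off)
```

Each entity here is 8 characters long, so the third one cannot fit within the limit and is left as a dangling &#66.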

This causes further issues down the road. For example, when running an HTML entity to UTF-8 conversion, you can get:
image
...as that word really isn't unique once it has been truncated.
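To illustrate why truncation breaks uniqueness, here is a small Python sketch. `stored_form` is a hypothetical stand-in for "encode to entities, then cut at 20 characters", and the second word is made up for the example. It does not reproduce the actual table schema or the conversion itself.

```python
MAX_WORD_LEN = 20  # truncation length described in this issue


def stored_form(word):
    """Hypothetical stand-in: entity-encode non-ASCII characters, then truncate."""
    encoded = "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in word)
    return encoded[:MAX_WORD_LEN]


# Two different three-letter words that differ only in their last letter.
word_a = "\U0001033C\U00010330\U00010332"  # ends in U+10332
word_b = "\U0001033C\U00010330\U0001033D"  # ends in U+1033D (made-up example)

# Group the original words by what would actually be stored.
stored = {}
for word in (word_a, word_b):
    stored.setdefault(stored_form(word), []).append(word)

collisions = {key: words for key, words in stored.items() if len(words) > 1}
print(collisions.keys())  # both words collapse to '&#66364;&#66352;&#66'
```

Because the distinguishing part of each word sits after the cut, both words map to the same stored string.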

This issue exists in 2.0 as well.

In 2.1, this issue is restricted to 4-byte characters, as characters shorter than 4 bytes are no longer converted to HTML entities, though existing entities may be carried forward during an upgrade.
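As a rough Python sketch of that distinction (an assumption about the behavior described, not the 2.1 source): `encode_four_byte_only` is a hypothetical helper that leaves 2- and 3-byte characters alone and converts only characters whose UTF-8 encoding is 4 bytes long.

```python
def encode_four_byte_only(text):
    """Convert only 4-byte UTF-8 characters to decimal HTML entities (assumed behavior)."""
    return "".join(
        f"&#{ord(c)};" if len(c.encode("utf-8")) == 4 else c
        for c in text
    )


# U+00E9 is 2 bytes, U+20AC is 3 bytes, U+1F600 (an emoji) is 4 bytes in UTF-8.
sample = "\u00e9\u20ac\U0001F600"
print(encode_four_byte_only(sample))  # first two characters unchanged, then &#128512;
```

The emoji alone becomes a 9-character entity, so a word with three such characters already exceeds the 20-character limit.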

Steps to reproduce

  1. Post this: 𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.
  2. Look at log_search_subjects for that post
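When inspecting the stored words, a check like the following Python sketch can flag entries that end in a cut-off entity. Everything here is an assumption for illustration: `stored_form` mimics the 2.1 behavior described above, `dangling_words` is a hypothetical helper, and splitting on whitespace is a simplification of however the search index really tokenizes.

```python
import re

MAX_WORD_LEN = 20  # truncation length described in this issue

# Matches "&", "&#" or "&#123" at the very end of a string, i.e. an entity
# that never reached its closing ";".
DANGLING_ENTITY = re.compile(r"&(?:#\d*)?$")


def stored_form(word):
    """Hypothetical stand-in: entity-encode 4-byte characters, then truncate."""
    encoded = "".join(c if ord(c) < 0x10000 else f"&#{ord(c)};" for c in word)
    return encoded[:MAX_WORD_LEN]


def dangling_words(text):
    """Return the stored forms that end in a truncated entity."""
    forms = (stored_form(w) for w in text.split())
    return [f for f in forms if DANGLING_ENTITY.search(f)]


# Three words from the sample post, written as escapes: a 3-letter word,
# a 4-letter word, and a 2-letter word.
post = (
    "\U0001033C\U00010330\U00010332 "
    "\U00010332\U0001033B\U00010334\U00010343 "
    "\U00010345\U0001033F"
)
for form in dangling_words(post):
    print(form)
```

The two-letter word encodes to 16 characters and fits, so only the longer words are reported.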

Environment (complete as necessary)

  • Version/Git revision: current
  • Database Type: mysql
  • Database Version: 5.7
  • PHP Version: 7.4

Additional information/references

4-byte characters are not common outside of emojis, certain symbols, and ancient texts...
But the SMF crowd is exactly the kinda crowd to use emojis, certain symbols, and ancient texts...

Metadata

Assignees

No one assigned
