Fix: UTF-8 character cut off to two "�" in segment wrapping (max_len) #3592
Description
Fixes incorrect segment splitting when using the max_len parameter with multi-byte UTF-8 characters (e.g., Chinese, Japanese, Arabic).

Problem
The current implementation in whisper_wrap_segment() uses strlen() to count bytes, not UTF-8 characters. When splitting segments at max_len, this can cut a multi-byte UTF-8 character in half, producing invalid byte sequences that are displayed as � (the U+FFFD replacement character).
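For reference, a minimal sketch of the byte-vs-character distinction (the helpers and their names below are illustrative, not the exact code in this patch): a UTF-8 continuation byte always matches the bit pattern 10xxxxxx, so counting characters and snapping a byte offset back to a character boundary both reduce to checking that pattern.

```cpp
#include <cstddef>
#include <string>

// A byte is a UTF-8 continuation byte iff its top two bits are 10.
static bool is_utf8_continuation(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

// Count UTF-8 characters (code points), unlike strlen(), which counts bytes.
static size_t utf8_strlen(const char * s) {
    size_t n = 0;
    for (; *s; ++s) {
        if (!is_utf8_continuation((unsigned char) *s)) {
            ++n;
        }
    }
    return n;
}

// Given a split position computed in bytes, move it back to the nearest
// character boundary so no multi-byte character is cut in half.
static size_t utf8_safe_split(const std::string & text, size_t byte_pos) {
    while (byte_pos > 0 && is_utf8_continuation((unsigned char) text[byte_pos])) {
        --byte_pos;
    }
    return byte_pos;
}
```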
Example (Chinese text)

Before fix:

{"Text": "这个时候面试官会给应聘者一定的时间,由应�"}, {"Text": "�者面试结束之后,面试人立即整理记录,根据求"}

After fix:
In Addition
Note that this changes the meaning of the "max_len" parameter: it now limits the number of UTF-8 characters per segment rather than the number of bytes.
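A hedged usage sketch of what that means for callers (field names are from whisper_full_params; the concrete numbers are my own assumption): for CJK text, where most characters take 3 bytes in UTF-8, the same max_len value now yields segments roughly three times as long in bytes as before.

```cpp
#include "whisper.h"

// Illustrative only: configure segment wrapping via max_len.
static whisper_full_params make_params() {
    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.token_timestamps = true; // max_len only takes effect with token timestamps
    params.max_len = 60;            // before: ~60 bytes; with this fix: 60 UTF-8 characters
    return params;
}
```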
To be honest, I only just discovered the problem; the code modification was Claude's recommendation, and I tested it.