@lordofriver

Description

Fixes incorrect segment splitting when using the `max_len` parameter with multi-byte UTF-8 characters (e.g., Chinese, Japanese, Arabic).

Problem

The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break a multi-byte UTF-8 character in the middle, resulting in invalid sequences displayed as `�` (U+FFFD, the replacement character).
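For context, here is a minimal sketch of the distinction (not the actual patch; `utf8_len` is a hypothetical helper, not a whisper.cpp function). In UTF-8, continuation bytes have the bit pattern `10xxxxxx`, so a byte starts a new character exactly when it is not a continuation byte:

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: number of UTF-8 characters (code points) in s,
// as opposed to strlen(), which returns the number of bytes.
// A byte starts a new character iff it is not a continuation byte
// (bit pattern 10xxxxxx, i.e. (b & 0xC0) == 0x80).
static size_t utf8_len(const char * s) {
    size_t n = 0;
    for (; *s; ++s) {
        if (((unsigned char) *s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

int main() {
    const char * text = "这个时候"; // 4 Chinese characters, 12 bytes in UTF-8
    printf("strlen  : %zu bytes\n", strlen(text));   // prints 12
    printf("utf8_len: %zu chars\n", utf8_len(text)); // prints 4
}
```

A byte-based `max_len` of, say, 10 would cut this string after 10 bytes, landing inside the fourth character and producing the `�` shown in the example below.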

Example (Chinese text)

Before fix:

{"Text": "这个时候面试官会给应聘者一定的时间,由应�"},
{"Text": "�者面试结束之后,面试人立即整理记录,根据求"}

After fix:

{"Text": "这个时候面试官会给应聘者一定的时间,由应聘"},
{"Text": "者面试结束之后,面试人立即整理记录,根据求"}

In Addition

This does change the meaning of the `max_len` parameter: it now limits the number of UTF-8 characters rather than the number of bytes.
To be honest, I only just found the problem; the code modification is Claude's recommendation, and I tested it.
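If changing the semantics of `max_len` is undesirable, an alternative approach (again only a sketch, not part of this PR) is to keep the byte-based limit but move the split point back to the previous character boundary:

```cpp
// Hypothetical alternative: max_len stays a byte count, but the cut
// never lands inside a multi-byte UTF-8 character. Given a desired
// split offset in bytes, step back over any continuation bytes
// (10xxxxxx) until we reach the start of a character.
static size_t utf8_safe_split(const char * s, size_t pos) {
    while (pos > 0 && ((unsigned char) s[pos] & 0xC0) == 0x80) {
        --pos;
    }
    return pos;
}
```

For example, splitting `"这个"` (6 bytes) at byte offset 4 would return 3, the boundary between the two characters. This keeps the existing byte semantics of `max_len` at the cost of slightly shorter segments for CJK text.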
