@lordofriver

Description

Fixes incorrect segment splitting when using the `max_len` parameter with multi-byte UTF-8 characters (e.g., Chinese, Japanese, Arabic).

Problem

The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break a multi-byte UTF-8 character in the middle, resulting in invalid sequences displayed as `�` (U+FFFD, the replacement character).
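For context, here is a minimal sketch of the distinction (not the actual patch; `utf8_len` is a hypothetical helper, not a whisper.cpp function). In UTF-8, continuation bytes have the bit pattern `10xxxxxx`, so a byte starts a new character exactly when it is not a continuation byte:

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: number of UTF-8 characters (code points) in s,
// as opposed to strlen(), which returns the number of bytes.
// A byte starts a new character iff it is not a continuation byte
// (bit pattern 10xxxxxx, i.e. (b & 0xC0) == 0x80).
static size_t utf8_len(const char * s) {
    size_t n = 0;
    for (; *s; ++s) {
        if (((unsigned char) *s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

int main() {
    const char * text = "这个时候"; // 4 Chinese characters, 12 bytes in UTF-8
    printf("strlen  : %zu bytes\n", strlen(text));   // prints 12
    printf("utf8_len: %zu chars\n", utf8_len(text)); // prints 4
}
```

A byte-based `max_len` of, say, 10 would cut this string after 10 bytes, landing inside the fourth character and producing the `�` shown in the example below.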

Example (Chinese text)

Before fix:

{"Text": "这个时候面试官会给应聘者一定的时间,由应�"},
{"Text": "�者面试结束之后,面试人立即整理记录,根据求"}

After fix:

{"Text": "这个时候面试官会给应聘者一定的时间,由应聘"},
{"Text": "者面试结束之后,面试人立即整理记录,根据求"}

In Addition

This does change the meaning of the `max_len` parameter: it now limits the number of UTF-8 characters rather than the number of bytes.
To be honest, I only just found the problem; the code modification is Claude's recommendation, and I tested it.
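If changing the semantics of `max_len` is undesirable, an alternative approach (again only a sketch, not part of this PR) is to keep the byte-based limit but move the split point back to the previous character boundary:

```cpp
// Hypothetical alternative: max_len stays a byte count, but the cut
// never lands inside a multi-byte UTF-8 character. Given a desired
// split offset in bytes, step back over any continuation bytes
// (10xxxxxx) until we reach the start of a character.
static size_t utf8_safe_split(const char * s, size_t pos) {
    while (pos > 0 && ((unsigned char) s[pos] & 0xC0) == 0x80) {
        --pos;
    }
    return pos;
}
```

For example, splitting `"这个"` (6 bytes) at byte offset 4 would return 3, the boundary between the two characters. This keeps the existing byte semantics of `max_len` at the cost of slightly shorter segments for CJK text.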
