Conversation

@kallewoof kallewoof commented Jul 29, 2025

When the model hits the token limit while generating multibyte UTF-8 content, the server crashes with an uncaught nlohmann::json_abi_v3_12_0::detail::type_error:

terminate called after throwing an instance of 'nlohmann::json_abi_v3_12_0::detail::type_error'
  what():  [json.exception.type_error.316] incomplete UTF-8 string; last byte: 0x99

Thread 1 "llama-server" received signal SIGABRT, Aborted.

The trace:

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140736686145536) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140736686145536) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140736686145536, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007fffef242476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fffef2287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007fffef6a2b9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fffef6ae20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fffef6ae277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fffef6ae4d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x0000555555641997 in nlohmann::json_abi_v3_12_0::detail::serializer<nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> >, void> >::dump_escaped (this=0x7fffffff8dd0, 
    s="ぶらぶらぶら", <incomplete sequence \351\231>, ensure_ascii=false)
    at /home/me/workspace/llama.cpp/common/../vendor/nlohmann/json.hpp:19326
#10 0x000055555561f0f9 in nlohmann::json_abi_v3_12_0::detail::serializer<nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> >, void> >::dump (this=0x7fffffff8dd0, val=..., pretty_print=false, ensure_ascii=false, indent_step=0, current_indent=0)
    at /home/me/workspace/llama.cpp/common/../vendor/nlohmann/json.hpp:18971
#11 0x000055555561ec3e in nlohmann::json_abi_v3_12_0::detail::serializer<nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> >, void> >::dump (this=0x7fffffff8dd0, val=..., pretty_print=false, ensure_ascii=false, indent_step=0, current_indent=0)
    at /home/me/workspace/llama.cpp/common/../vendor/nlohmann/json.hpp:18901
#12 0x00005555555fc8af in nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> >, void>::dump (this=0x55556ea26240, indent=-1, indent_char=32 ' ', ensure_ascii=false, error_handler=nlohmann::json_abi_v3_12_0::detail::error_handler_t::strict)
    at /home/me/workspace/llama.cpp/common/../vendor/nlohmann/json.hpp:21327
#13 0x0000555555759656 in common_chat_parse (input="ぶらぶらぶら", <incomplete sequence \351\231>, is_partial=true, 

Specifically, the crash happens on this line:

LOG_DBG("Parsed message: %s\n", common_chat_msgs_to_json_oaicompat<json>({msg}).at(0).dump().c_str());

(i.e. it is possible that this is a debug-only crash).

This PR originally added a UTF-8 truncator helper and truncated any unfinished sequence.

Note: that approach breaks strings that are not UTF-8 encoded. If llama.cpp also allows e.g. UTF-16 encoded strings, and there is no way to distinguish between the two, the truncator would need to e.g. verify that the string is UTF-8 before trimming it.

Update: the code now simply checks whether is_partial is set and skips the debug log if it is. This avoids the hassle of trying to determine whether the string (which may or may not be UTF-8 encoded) is truncated mid-sequence.

The github-actions bot added the python (python script changes) label on Jul 29, 2025.
@kallewoof force-pushed the 202507-broken-utf8 branch from 4c678da to 4a293ba on July 29, 2025 at 07:46.
@kallewoof

Sorry, I accidentally included some unrelated Python code, hence the github-actions python tag there.


CISC commented Jul 29, 2025

Can you check whether this is still a problem if you disable that logging and let msg propagate? If not, maybe only log when the msg is not partial.

@kallewoof

Can you check whether this is still a problem if you disable that logging and let msg propagate? If not, maybe only log when the msg is not partial.

There is no crash. Instead, the partial/invalid UTF-8 sequence is forwarded to the caller, e.g. "ぶらぶら�".
I guess we could skip the log and return the invalid UTF-8 as-is when we think it's broken. For non-UTF-8 content, the worst case is a lack of debug log messages.

@kallewoof

Pushed a commit which switches to simply dropping the log message and always leaving msg alone. This removes the risk of corrupting non-UTF-8 strings.


CISC commented Jul 29, 2025

I think you can remove truncate_incomplete_utf8 and just check is_partial.


kallewoof commented Jul 29, 2025

GGML_ASSERT(is_partial == (msg.content != truncate_incomplete_utf8(msg.content)));

asserts for me, so it seems the two are not entirely equivalent?

Looking at the uses of is_partial, they do not seem related to unfinished UTF-8 sequences either:

        auto new_msg = common_chat_parse(
            generated_text,
            /* is_partial= */ stop != STOP_TYPE_EOS,
            params.oaicompat_chat_syntax);


CISC commented Jul 29, 2025

the two are not entirely equivalent it seems?

Correct, but I think (I may be wrong) that a truncated sequence will always result in is_partial being true.


CISC commented Jul 29, 2025

Anyway, what I'm saying is that the worst-case scenario is disabling the log; it's not that important, and I'd rather drop it than add special handling for it.

@kallewoof

Good point. Adding that entire function just to decide whether to emit a debug log is way overkill.

@CISC merged commit 1a67fcc into ggml-org:master on Jul 29, 2025.
47 checks passed
@kallewoof deleted the 202507-broken-utf8 branch on July 30, 2025 at 11:51.