common: Yet another add GLM-4.5 tool calling support #15904
base: master
Conversation
Use |
Got a runtime error:
Looks like it is happening because of the "<10" characters in the generated text during function-call parsing. Probably it is trying to parse <10 as the beginning of an XML tag? |
@sbrnaderi From the log you provided, there isn’t anything unexpected. The JSON parse error occurs because I first try to parse arg_value as JSON; if that fails, it is parsed as a raw string. The failure log cannot be suppressed due to the design of llama.cpp. |
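A minimal sketch of the fallback just described, assuming nlohmann::json; parse_arg_value is a hypothetical helper name, not the actual PR code:

#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

// Try to interpret the raw <arg_value> text as JSON first; if that fails
// (e.g. the text is "<10"), keep it as a plain string instead.
static json parse_arg_value(const std::string & raw) {
    json parsed = json::parse(raw, /* callback */ nullptr, /* allow_exceptions */ false);
    if (!parsed.is_discarded()) {
        return parsed;   // valid JSON: number, object, array, bool, ...
    }
    return json(raw);    // not valid JSON: fall back to the raw string
}

With a fallback like this, the first parse attempt is expected to fail on non-JSON values, which is why a JSON parse-error line can show up in the log even when nothing is actually wrong.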
@hksdpc255 so, you are trying to parse the xml format from the GLM model to JSON, but I think what goes wrong here is that the "<10" part of the text is recognised as an xml tag. No?
|
@sbrnaderi Would you be able to share more logs or your prompt? The current log you shared doesn’t seem to have any problem, and additional details would help me figure out what’s going wrong. |
@sbrnaderi I guess your issue is fixed by the latest commit. |
@hksdpc255 thanks, I will try your new commit. |
I'm running this PR with the supplied chat template and it is working 👍 |
Also checked this PR and everything works perfectly with the provided Jinja template |
Parsing JSON in <arg_value> is pretty much broken on the current branch. The original patch will crash while streaming when the response ends with
This patch will fix the crash:
--- a/common/json-partial.cpp 2025-10-01 03:17:14.681184368 +0800
+++ b/common/json-partial.cpp 2025-10-01 03:15:35.623175731 +0800
@@ -183,7 +183,7 @@
} else if (can_parse(str + "\"" + closing)) {
// Was inside an object value string
str += (out.healing_marker.json_dump_marker = magic_seed) + "\"" + closing;
- } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\"" + closing)) {
+ } else if (!str.empty() && str[str.length() - 1] == '\\' && can_parse(str + "\\\"" + closing)) {
// Was inside an object value string after an escape
str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\"" + closing;
} else {
@@ -202,7 +202,7 @@
} else if (can_parse(str + "\"" + closing)) {
// Was inside an array value string
str += (out.healing_marker.json_dump_marker = magic_seed) + "\"" + closing;
- } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\"" + closing)) {
+ } else if (!str.empty() && str[str.length() - 1] == '\\' && can_parse(str + "\\\"" + closing)) {
// Was inside an array value string after an escape
str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\"" + closing;
} else if (!was_maybe_number() && can_parse(str + ", 1" + closing)) {
@@ -227,7 +227,7 @@
} else if (can_parse(str + "\": 1" + closing)) {
// Was inside an object key string
str += (out.healing_marker.json_dump_marker = magic_seed) + "\": 1" + closing;
- } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\": 1" + closing)) {
+ } else if (!str.empty() && str[str.length() - 1] == '\\' && can_parse(str + "\\\": 1" + closing)) {
// Was inside an object key string after an escape
str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\": 1" + closing;
} else {
@@ -253,7 +253,7 @@
if (can_parse(str + "\"")) {
// Was inside an string
str += (out.healing_marker.json_dump_marker = magic_seed) + "\"";
- } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\"")) {
+ } else if (!str.empty() && str[str.length() - 1] == '\\' && can_parse(str + "\\\"")) {
// Was inside an string after an escape
str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\"";
} else {
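For context, a minimal standalone illustration (not part of the patch; ends_with_escape is just an illustrative helper) of why the added !str.empty() guard matters: on an empty std::string, str[str.length() - 1] indexes with (size_t)-1, which is undefined behavior and may or may not crash depending on compiler and allocator — which is also why some builds mask it.

#include <cassert>
#include <string>

// Guard first, then index: the same pattern the patch above applies.
static bool ends_with_escape(const std::string & str) {
    return !str.empty() && str[str.length() - 1] == '\\';
}

int main() {
    assert(!ends_with_escape(""));                   // safe on an empty partial-JSON chunk
    assert( ends_with_escape("{\"key\": \"val\\"));  // trailing escape inside a string value
    assert(!ends_with_escape("{\"key\": 1"));
    return 0;
}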
Besides, thanks for your PR. It's special because it works with complex JSON schemas, with a quick hack like this:
--- a/common/json-schema-to-grammar.cpp 2025-10-01 00:22:00.744098340 +0800
+++ b/common/json-schema-to-grammar.cpp 2025-10-01 00:19:48.692716944 +0800
@@ -944,6 +944,9 @@
return _add_rule(rule_name, out.str());
} else if (schema.empty() || schema_type == "object") {
return _add_rule(rule_name, _add_primitive("object", PRIMITIVE_RULES.at("object")));
+ } else if (schema_type.is_null() && schema.contains("not") && schema["not"].is_object() && schema["not"].empty()) {
+ // librechat returns not:{}, which does nothing.
+ return "";
} else {
if (!schema_type.is_string() || PRIMITIVE_RULES.find(schema_type.get<std::string>()) == PRIMITIVE_RULES.end()) {
_errors.push_back("Unrecognized schema: " + schema.dump());
LibreChat passed a scrambled schema including {"not":{}}; this patch will ignore that. |
@DKingAlpha Thanks for pointing that out! It seems my compiler adds some extra padding to the string object, which ends up masking the string array underflow crash. |
@DKingAlpha I took a deeper look at your patch and had a question. It seems to modify some sections I hadn’t touched in the original code. |
No, I am using clang-20, if that helps to reproduce. Either this function (try_consume_json) is designed to run on a non-empty string, which means you need to change your code, or it's a bug in that part that was never triggered before. I prefer the latter. |
@DKingAlpha Would it still crash if you only patched the sections that my PR actually changed?
@@ -253,7 +253,7 @@
if (can_parse(str + "\"")) {
// Was inside an string
str += (out.healing_marker.json_dump_marker = magic_seed) + "\"";
- } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\"")) {
+ } else if (!str.empty() && str[str.length() - 1] == '\\' && can_parse(str + "\\\"")) {
// Was inside an string after an escape
str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\"";
} else { |
Line 253 is exactly the location that crashed on my side. But I patched all the other occurrences too; I mean, even without running into them, a static manual review shows the string should be checked before access. |
@DKingAlpha I believe that, at llama.cpp/common/json-partial.cpp line 140 (commit 4f15759), str is not empty. I’m considering changing my code from if (!healing_marker.empty() && err_loc.stack.empty()) to if (err_loc.position != 0 && !healing_marker.empty() && err_loc.stack.empty()). What do you think about this change? |
Sorry for the delay, was on vacation.
int main(int argc, char ** argv) {
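// note: this snippet relies on json_error_locator and nlohmann's json::sax_parse
// from common/json-partial.cpp being available in the same translation unit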
std::string str = "";
auto it = str.begin();
auto end = str.end();
json_error_locator err_loc;
auto start = it;
json::sax_parse(it, end, &err_loc);
printf("found error: %d\n", err_loc.found_error); // 1
printf("position: %zu\n", err_loc.position); // 0
printf("stack size: %zu\n", err_loc.stack.size()); // 0
return 0;
}
|
You know what, I guess we need to ping @ochafik. Please take a look if that's possible. |
@hksdpc255, thanks for working on this. How can I test if tool calling is working correctly? I tried using
Here are my llama.cpp arguments:
Here are llama.cpp logs from that opencode attempt:
Any ideas what might be missing here? UPDATE: |
The chat template should only affect the request phase; llama.cpp uses the template to format your input prompt. If there’s an issue with the Jinja template, my guess is that something might be missing. In my patch, the tool parser only recognizes |
This PR and the chat template in the first message were the only solution I've been able to find that allowed Opencode/Crush to work with GLM-4.5-Air and llama.cpp (although diff edits failed at an astonishing rate, other calls seem to work well enough). Trust me, I've tried everything, and none of the chat templates I found worked as well as this one for both Crush and OpenCode. I had applied the patch to the llama.cpp source code back when I found the PR; it built fine and things seemed OK. Now that I've switched to GLM 4.6, it requires a newer llama.cpp, but this patch will no longer build with the server. Cline still works fine with Unsloth's built-in chat template, but Opencode can't do any tool calls at all. Any recommendations on how to get the patch working in the latest llama.cpp release, or any advice on how to get Opencode working with GLM 4.6 tool calls? |
@aaronnewsome This PR has already merged the GLM 4.6 support patch from bartowski1182. Can you share any further info about your GLM 4.6 tool call issue? |
uname:
nvidia-smi:
docker run command:
start-llama - llama-server command (in container):
opencode:
Notice the tool call XML appears inline in the model response in the chat, and it continues all the way to the end of the file it's supposed to be creating, closes with another </tool_call>, but never actually writes anything. I've also tried the chat template at the top of this thread (with which I did have some success on 4.5-Air, after applying the patch) and the chat template from the z.ai huggingface page for GLM 4.6; neither allows tool calls to work with opencode or crush. opencode config:
"provider": {
"hawk": {
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://hawk.swift.local:8080/v1",
"includeUsage": true,
"timeout": 1000
},
"models": {
"GLM-4.6-UD-IQ2_XXS": {
"name": "glm-4.6",
"tool_call": true,
"reasoning": true,
"cost": {
"input": 0.10,
"output": 1.20
},
"options": {
"num_ctx": 120000,
"temperature": 1.0,
"top_p": 0.95,
"top_k": 40
}
}
}
}
} |
@aaronnewsome Is it working for you now? |
I rebuilt the container. Off topic: I tried claude code router (ccr) with claude code, and edits were pretty reliable; actually, most tool calls were pretty reliable, but holy wow is it slow. Now my main issue is that 4.6 IQ2_XXS is too big to run on the dual Pro 6000 without flash attention enabled, and enabling flash attention causes the entire system to reset when the GPUs are cooking up responses. So I'll be going back to GLM 4.5, where the Q4 fits with 128K context and flash attention off. I appreciate your work on creating this branch with much better tool calling than the main branch. As it works so much better for tool calling (at least for me), I wonder why these fixes haven't made it into the main project? |
The open-source GLM model may not be well-tuned for the diff format used by Opencode. You could try the official Opencode API provided by z.ai. It might yield better results than a locally quantized deployment. If the official API performs well, there might also be something wrong with my patch. You can paste the failing diff edits from your log — that would help me figure out what’s going on.
I think this should be reviewed by a maintainer before being merged into master. Unfortunately, it seems that no maintainer has had time to review it yet. |
I don't have any detailed logs from the llama.cpp crash that opencode causes, but here's the tail of the container log from the latest crash today. This is with Unsloth's Q4 GLM-4.5-Air and the hksdpc255:master llama.cpp:
Not every failed diff edit causes the server to crash, but when the server does crash, this is the most common error I see, |
@aaronnewsome You can run |
Opencode edits are failing because its edit tool does not return any content, and the template then replaces that empty result with
I replaced
with
and so far I didn't notice any issues. |
Here's what I see with -lv 1 before the crash:
|
@matbrez Thanks for the suggestion! I’ve applied your changes and they look good. |
@aaronnewsome Could you provide more logs so that they include three occurrences of |
This PR introduces an enhanced implementation of tool calling for GLM-4.5, building upon the existing contributions by @dhandhalyabhavik and @susmitds (see PR #15186).
Key improvements include:
Grammar-constrained tool-call outputs
The model’s tool-call messages are now rigorously enforced by a defined grammar, ensuring that generated calls are always well-formed and reliably parsed.
Streaming support for tool-call parsing
I have added streaming capabilities to the parser to handle tool-call messages as they’re generated. This enhancement enables more responsive and real-time interactions during inference (a rough sketch of the expected tool-call format is shown below).
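For reference, the kind of tool-call block that the grammar constrains and the streaming parser consumes looks roughly like this (sketched from memory of the GLM-4.5 chat template; get_weather, city, and days are made-up example names, and the exact tags come from the chat template):

<tool_call>get_weather
<arg_key>city</arg_key>
<arg_value>Beijing</arg_value>
<arg_key>days</arg_key>
<arg_value>3</arg_value>
</tool_call>

Each <arg_value> is first parsed as JSON and falls back to a raw string when that fails, as discussed earlier in the thread.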
Use this Jinja template while testing:
Although not yet implemented, I'm planning the following improvements:
Patch the Jinja template in common_chat_params_init_glm_4_5 to make it compatible with the original Unsloth GGUF chat template, and potentially even with the official chat template.
Add dedicated unit tests for grammar enforcement and streaming parsing.
Testing and feedback are welcome.
Suggested commit message after squashing commits: