add jinja template support #677
Conversation
Can someone please check if this works? Especially people using function/tool calling. Thanks!
I have a version of Gemma3 27B here which can see tools in mainline, so it could be related to how mainline does tools. Enabling a custom template, GLM-Air does not appear to be able to see tools with this PR, and the CLI prints the "tools param is not fully supported yet" message; besides that, the model appears to be working. Not having any jinja flag appears to trigger the error. Qwen3-30B-A3B-Instruct-2507 does see the tools without --jinja.
"tools param is not fully supported yet" is expected because this pr only adds support for jinja template. Tool calls support from mainline has not been fully integrated here. |
Seeing similar behaviour. Using this gist https://gist.github.com/RodriMora/099913a7cea971d1bd09c623fc12c7bf I tested against this PR. The result with --jinja:

Without --jinja:

Mainline llama.cpp with --jinja:
If you remove tool_choice, does it still report the error?
I had to remove both the tools=tools call and tool_choice="auto". But this was the response:

So
Hmmm... I can't even get it to load my model with the --jinja flag.
This is the quant here - I removed all flags, included
FYI, the model loads fine using the latest mainline branch that merged GLM-4.5 support. No
After pulling the latest commit with GLM4.5 support, GLM-4.5-Air-UD-Q2_K_XL.gguf works for me with the --jinja flag. I tested by building a messages list and a tools list, calling asyncio.run(aclient.chat.completions.create(...)), and inspecting the response.
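For anyone who wants to reproduce that kind of check, here is a minimal sketch, assuming an AsyncOpenAI client pointed at llama-server on localhost:8080; the base URL, model alias, and the get_weather tool are illustrative, not the exact script used above.

```python
# Hedged sketch of a tool-calling smoke test against llama-server's
# OpenAI-compatible API. URL, model alias, and the tool are illustrative.
import asyncio
from openai import AsyncOpenAI

aclient = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="none")

messages = [{"role": "user", "content": "What is the weather in Paris?"}]

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

async def run():
    return await aclient.chat.completions.create(
        model="glm-4.5-air",  # whatever alias the server was started with
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )

result_raw = asyncio.run(run())
# If the template and parser work, the assistant message should either answer
# directly or carry a structured tool call.
print(result_raw.choices[0].message)
```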
I'm confused @firecoperana, how are you using the
OK, I rebased your branch locally and re-compiled. I was successfully able to pass the --jinja flag and load the model. Tool calling works fine with the latest commits from main merged into this PR branch. As stated, please update your remote branch so others can easily test.
@firecoperana I took the liberty of rebasing on the latest main branch for easier testing (including GLM4-MoE).
Tested some agents with tool calling:
- Cline works with tool calling and MCPs
It looks like it's having problems doing tool calls in JSON format. As far as I know the chat_template uses XML tags for tool calling. It's possible this is an opencode problem: it doesn't like tool calls using XML and expects something like JSON natively.
I would look into opencode and see what tool calling format it expects and whether it supports XML. I have seen this problem pop up in other apps...
Text completion looks broken to me; the request fails with [json.exception.out_of_range.403] key 'messages' not found. It works fine on commit 7117c23, which was just before the jinja merge.
I believe you are sending the
Example of a qwen3-coder response (it has good support for tools; it seems GLM and gpt-oss don't atm) with mainline llama.cpp:
Response with ik_llama.cpp (pulled last commit in main branch):
The OpenAI-compliant tool call response should be something like this?:
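Since the example itself did not survive the thread, here is a sketch of the shape an OpenAI-compliant tool-call message takes; the id, function name, and arguments are illustrative. The key point is a structured tool_calls array with JSON-encoded arguments, rather than a call serialized as text inside content.

```python
# Sketch of the assistant message an OpenAI-style client expects when the
# model decides to call a tool. All field values are illustrative.
expected_message = {
    "role": "assistant",
    "content": None,  # no plain-text answer when a tool call is returned
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_weather",
                # arguments are a JSON-encoded string, not a nested object
                "arguments": "{\"city\": \"Paris\"}",
            },
        }
    ],
}
# The corresponding choice should also report finish_reason == "tool_calls".
```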
Just tested, can confirm. Chat completions is fine, but text completion is broken.
Can you give an example of how you do text completions? I have never used it. Tool calls are not expected to work like in mainline.
curl -X POST http://localhost:8080/completion -H "Content-Type: application/json" -d '{"prompt": "Once upon a time", "n_predict": 128}'
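The same request from Python, in case that is easier to script; this assumes the server is listening on localhost:8080 as in the curl command above.

```python
import requests

# Plain text completion against llama-server's /completion endpoint.
# The body carries "prompt" rather than "messages", so no chat template is used.
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Once upon a time", "n_predict": 128},
    timeout=120,
)
resp.raise_for_status()
# The generated text comes back in the "content" field of the response JSON.
print(resp.json().get("content"))
```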
Fixed in #684
Thanks for digging into all this, folks. @firecoperana I noticed another open mainline lcpp PR discussing GLM-4.5 tool calling here: ggml-org/llama.cpp#15186 Not sure if that affects anything here, as I tend to stick to
…n parsing
- Add new oaicompat_completion_params_parse() for simple completions
- Rename existing function to oaicompat_chat_completion_params_parse()
- Update /completions endpoint to use simple parser (no chat templates)
- Update /chat/completions endpoint to use chat parser (with tools support)
- Fixes compatibility issue introduced in tool calling PR ikawrakow#677
Incorporates fixes from upstream PR ikawrakow#684
I made some changes to make tool calling compliant with OpenAI's API, so it should now be compatible with more agents. If someone can test it, it's here (it also has the text completion fix):
I tested it with Qwen3-Coder-30B-A3B-Instruct (note: the original model uploaded 10 days ago had the chat template wrong for tool calling, and the update was pushed 3 days ago), so I'm using my own quants to test, as most HF quants are outdated I think.
Testing with both:
With my fork I get the same results as with mainline llama.cpp that I posted before; it works fine with Opencode, and it still works with Cline and Roocode.
I am using Qwen3-235B and the opening <think> tags are no longer returned, which breaks the jinja template when trying to do function calls:
'Value is not callable: null at row 39, column 78:\n {%- if '</think>' in content %}\n {%- set reasoning_content = ((content.split('</think>')|first).rstrip('\n').split('<think>')|last).lstrip('\n') %}'
Anyone else?
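As an aside, here is a rough Python equivalent of the two quoted template lines, assuming the elided strings are the </think> and <think> tags as in the stock Qwen3 chat template; it shows what those lines try to split out of an assistant message.

```python
# Rough Python equivalent of the quoted jinja lines (tag strings assumed to be
# <think>/</think>): split the reasoning from the visible answer.
content = "<think>Some chain of thought.</think>\nThe capital of France is Paris."

if "</think>" in content:
    reasoning_content = (
        content.split("</think>")[0].rstrip("\n").split("<think>")[-1].lstrip("\n")
    )
    content = content.split("</think>")[-1].lstrip("\n")
    print(reasoning_content)  # "Some chain of thought."
    print(content)            # "The capital of France is Paris."
```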
"no longer returned" seems to imply that you had this working before? I don't use tool/function call stuff, but which llama-server API endpoint are you using with your client e.g. the fwiw this PR was merged about 8 hours ago which effects some code possibly in the path for tool/function calling and JSON stuff: #684 So you could try rolling back one commit e.g. |
Using chat/completions. I should specify that this affects non-function-calling queries too. Before the jinja template update, the qwen3 model would properly send out the <think> tag (I checked out the commit you mentioned and it behaves the same as current HEAD). I deleted my previous ik_llama folder, but my last pull must have been last week and it was behaving as expected then. Mainline llama.cpp correctly outputs the token. I will try to take a look at how llama.cpp handles this tonight. This definitely feels like a consequence of using the jinja template too strictly.

Current ik_llama output to "What is the capital of France?":

INFO [format_partial_response_oaicompat] DEBUG: Streaming finish_reason check | tid="139886531170304" timestamp=1754925900 generated_text="Okay, the user asked, "What is the capital of France?" Hmm, this seems like a very basic geography question. Maybe they're a student doing homework, or someone testing if I know simple facts. \n\nBut wait—why would anyone ask this in 2024? It's one of the most well-known capitals globally. Could it be a trick? Like, maybe they're checking if I'll overcomplicate it? Or perhaps they're very young or new to learning geography. \n\nI should just answer directly: Paris. No need for fluff. But since they might be learning, I'll add one extra fact—like the Seine River—to make it slightly educational without overwhelming them. \n\n...Though part of me wonders if this is a bot testing response accuracy. Either way, short and correct is safest. \n\nTypes "Paris" then pauses \nShould I say "definitely Paris" to sound confident? Nah, that's overkill. Just clean and factual.\n\n\nThe capital of France is Paris. \n\nIt has been the political, cultural, and economic center of France for centuries and is renowned worldwide for landmarks like the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. 🇫🇷 \n\nFun fact: Paris is built along the Seine River and is often called "La Ville Lumière" (The City of Light) due to its early adoption of street lighting and its role as a center of education and ideas during the Age of Enlightenment." model_name="qwen3" tool_calls_count=0
Do you use the jinja flag when the opening tag is not returned? What do you send to chat completions?
This is my full command; it happens with or without --jinja:

./build/bin/llama-server --model /mnt/home_extend/models/unsloth_Qwen3-235B-A22B-Thinking-2507-GGUF/Q6_K/Qwen3-235B-A22B-Thinking-2507-Q6_K-00001-of-00004.gguf --temp 0.6 --top-k 20 --top-p 0.8 --min-p 0 --repeat-penalty 1.0 --presence-penalty 2.0 -fmoe -fa --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 99 --alias qwen3 --host 0.0.0.0 --port 8099 --no-mmap --ctx-size 120000 --override-tensor "blk.(?:[1-9]?[01235789]).ffn_.*_exps.weight=CPU" --jinja -ub 4096 -amb 4096
Do you have this issue when using the built-in webui?
Hello, reporting that GLM 4.5 Air's template didn't work with just the --jinja flag in Claude Code (using claude code router). I managed to fix it using the template below:
Found this working template in ggml-org/llama.cpp#15186. My final running command:
Funny story: I have an RTX 3090 and an RTX 2060 with 32GB DDR5 on an i7 14700F, but I managed to make GLM 4.5 Air runnable on my setup using the draft model I mentioned above:
< Without Draft model >
Hope this helps.
Thanks for the link to a working jinja chat template for Air. I opened a discussion on the huggingface repo here: https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/discussions/6 with your notes and some comments and feedback on your command. Cheers!
Tested with the example from ggml-org/llama.cpp#11016. It looks OK. Not sure whether it will conflict with existing function call features.