Conversation

@iSevenDays
Contributor

iSevenDays commented Jul 18, 2025


The implementation adds support for tool calls.

I think this feature is important because it allows users of ik_llama.cpp to use this backend with apps like Claude Code that require tool calls.

By using a simple proxy like this one https://github.com/1rgs/claude-code-proxy (I just found it on GitHub), I was able to connect Claude Code to ik_llama.cpp using the Kimi-K2 Q2 LLM provided by ubergarm.
In claude-code-proxy you just have to set OPENAI_API_BASE="http://192.168.0.24:8080/v1" in .env.


I had to port llama.cpp's function/tool call support. The most difficult parts were porting streaming and JSON healing.


Owner

@ikawrakow left a comment


Thank you for this! People have been asking for function calling support, but that is not something I'm very familiar with.

LGTM, but I would appreciate at least one other person testing.

I see your location is Leipzig. Have fond memories of this place, having spent 11 years there studying physics, doing a PhD, and staying for my first postdoc position.

@iSevenDays
Contributor Author

LGTM, but I would appreciate at least one other person testing.

Thanks! I've done the basic tests, but the model loads too slowly from my HDD, so I will test different use cases over the weekend.
I could make it work for the first request, but it seems that multiple requests don't currently work, or Kimi-K2 requires different prompting. I'll debug this more over the weekend and update the PR.

I see your location is Leipzig. Have fond memories of this place, having spent 11 years there studying physics, doing a PhD, and staying for my first postdoc position.

I live in a beautiful city, thanks! I've been living here for 3 years and have absolutely no regrets!

@iSevenDays changed the title from "Function calling support for Kimi-K2" to "[Draft] Function calling support for Kimi-K2" on Jul 18, 2025
@ubergarm
Contributor

I could make it work for the first request, but it seems that multiple requests don't work currently or Kimi-K2 requires a different prompting. I'll debug this more over the weekend and update the PR.

Oh hej, this is exciting! I believe we have a PR open for this (#407 (comment)) where some folks were trying to use a reverse proxy / wrapper to handle it, similar to claude-code-proxy perhaps.

I don't use tool calling myself, but I did notice when adding the Kimi-K2-Instruct PR that I left out one section for the chat endpoint for "role": "tool": ggml-org/llama.cpp#14654 (comment)

So if it expects llama-server to handle the template internally, that "role": "tool" might not be applied. But if you're using the text completions endpoint and doing your own template, it might not matter.
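
For reference, a request that carries a "role": "tool" turn back to the model looks roughly like this (illustrative values only; the tool name, id, and result are made up, and the server is assumed to expose the OpenAI-compatible endpoint on localhost:8080):

import requests

payload = {
    "model": "Kimi-K2",
    "messages": [
        {"role": "user", "content": "What's the weather in Leipzig?"},
        # The assistant previously asked to call a tool...
        {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_1", "type": "function",
             "function": {"name": "get_weather", "arguments": "{\"city\": \"Leipzig\"}"}},
        ]},
        # ...and this turn feeds the tool's result back; it is this turn the
        # chat template has to know how to render.
        {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 21}"},
    ],
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])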

@sousekd

sousekd commented Jul 18, 2025

@iSevenDays This seems relevant:

We've just fixed 2 bugs in Kimi-K2-Instruct huggingface repo. Please update the following files to apply the fix:

  • tokenizer_config.json: update chat-template so that it works for multi-turn tool calls.
  • tokenization_kimi.py: update encode method to enable encoding special tokens.

https://x.com/Kimi_Moonshot/status/1945050874067476962

@mtcl

mtcl commented Jul 19, 2025

This is very exciting! I would much rather use native function calling!

@iSevenDays
Contributor Author

iSevenDays commented Jul 19, 2025

I took a look at how llama.cpp implements tool calling support, and the task is much more complicated than I thought. Especially the streaming part.
I'll keep you updated.

@mtcl

mtcl commented Jul 19, 2025

I took a look at how llama.cpp implements tool calling support and the task is much more complicated than I thought. Especially, the streaming part.
I'll keep you updated.

That would be really amazing! ik_llama + tool calling will be a dream come true for me!

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components
- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation
- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser
- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse
@iSevenDays
Contributor Author

I had to port llama.cpp's function/tool call support.

Here is a branch of the Claude proxy that you can use with ik_llama.cpp and Claude Code.

Steps to test this PR:

  1. Clone https://github.com/iSevenDays/claude-code-proxy
  2. Run the proxy:
     uv run uvicorn server:app --host 0.0.0.0 --port 8082
  3. Open .env inside the Claude proxy and set:
     OPENAI_API_BASE="http://192.168.0.24:8080/v1"
     PREFERRED_PROVIDER="openai"
     BIG_MODEL="Kimi-K2"
     SMALL_MODEL="Kimi-K2"
  4. The model name is important, so set it to kimi-k2 to enable tool parsing in ik_llama.cpp.
  5. Test with Claude Code:
     ANTHROPIC_BASE_URL=http://localhost:8082 claude "list files"

I'm doing more tests in the meantime.
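
If you want to exercise the tool-call parsing without Claude Code or the proxy, here is a rough sketch of a direct request against the OpenAI-compatible endpoint (the server address, model alias, and tool schema below are placeholders; the documentation added in this PR has the exact request format):

import json
import requests

payload = {
    "model": "Kimi-K2",  # the served alias matters: it selects the tool-call parser
    "messages": [{"role": "user", "content": "List the files in the current directory."}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "list_files",
            "description": "List files in a directory",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
}
resp = requests.post("http://192.168.0.24:8080/v1/chat/completions", json=payload)
# A successful tool call should show up under choices[0].message.tool_calls
print(json.dumps(resp.json()["choices"][0]["message"], indent=2))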

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures
- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils
@iSevenDays
Contributor Author

I added Qwen3 tool calling support.
From my tests, Kimi-K2 uses tools better, while Qwen3 fails to use tools with Claude Code.
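
For reference, the Qwen3 format mentioned in the commit above wraps each call in <tool_call>...</tool_call> tags around a JSON object. A simplified sketch of the extraction (illustrative only; the real parser also handles streaming and partial JSON):

import json
import re

def parse_qwen3_tool_calls(text):
    """Extract <tool_call>{...}</tool_call> blocks from Qwen3 output (simplified sketch)."""
    calls = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        obj = json.loads(block)
        calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls

sample = '<tool_call>{"name": "list_files", "arguments": {"path": "."}}</tool_call>'
print(parse_qwen3_tool_calls(sample))
# [{'name': 'list_files', 'arguments': {'path': '.'}}]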

@iSevenDays
Contributor Author

@ikawrakow I have backported tool calling support. I'm not sure if I can make the PR smaller, because the feature in llama.cpp is quite complicated.
I'd be glad if somebody could also do real-world tests.

I suggest using the Kimi-K2 model with Claude Code, following these steps: #628 (comment)

It seems to work fine, at least it can call tools when I explicitly ask for it.

@ikawrakow
Owner

I think there was a lot of interest in this, so hopefully we will have a few people testing the PR. Hopefully today, so I can merge before going on vacation tomorrow.

@iSevenDays changed the title from "[Draft] Function calling support for Kimi-K2" to "Function calling support for Kimi-K2" on Jul 23, 2025
@iSevenDays
Contributor Author

@ikawrakow I'll be happy to work on your requests for this PR to get it merged.
I followed the strategy of porting llama.cpp as closely as possible.

@xldistance

Looking forward to Qwen3's tool calling.

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support
@iSevenDays
Contributor Author

I have added DeepSeek-R1 tool calling support.
The following LLM works just fine. It often takes 2 iterations to do the tool call, but Claude Code handles that automatically.

numactl --interleave=all ./build/bin/llama-server \
                         --alias DeepSeek-R1T2 \
                         --model /root/models/DeepSeek-TNG-R1T2-Chimera-GGUF/IQ3_KS/IQ3_KS/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
                         -rtr \
                         --ctx-size 102400 \
                         -ctk q8_0 \
                         -mla 3 -fa \
                         -amb 512 \
                         -fmoe \
                         --temp 0.6 \
                         --top_p 0.95 \
                         --n-gpu-layers 63 \
                         --override-tensor "blk\.([0-5])\.ffn_.*=CUDA0,exps=CPU" \
                         --parallel 1 \
                         --threads 16 \
                         --host 0.0.0.0 \
                         --port 8080 \
                         --min_p 0.01 \
                         --numa distribute \
                         --threads-batch 32 \
                         --no-mmap \
                         -b 8192 -ub 8192
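
For anyone curious what the parser has to deal with, here is a rough sketch of splitting DeepSeek R1 output into reasoning, tool calls, and plain content, based on the format quoted in the commit message above (illustrative only; the real implementation lives in common_chat_parser.cpp and also handles streaming and partial input):

import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
# One call looks like: <|tool▁calls▁begin|>function<|tool▁sep|>NAME```json{...}```<|tool▁call▁end|>
CALL_RE = re.compile(r"<\|tool▁sep\|>([^`\s]+)```json(.*?)```", re.DOTALL)
BLOCK_RE = re.compile(r"<\|tool▁calls▁begin\|>.*?<\|tool▁calls▁end\|>", re.DOTALL)

def split_deepseek_r1(text):
    reasoning = "\n".join(THINK_RE.findall(text))
    tool_calls = [{"name": name, "arguments": json.loads(args)}
                  for name, args in CALL_RE.findall(text)]
    # Whatever is left outside the think/tool-call markers is regular content
    content = BLOCK_RE.sub("", THINK_RE.sub("", text)).strip()
    return reasoning, tool_calls, content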

@xldistance

@iSevenDays The files json-partial.h, json-partial.cpp, regex-partial.h, and regex-partial.cpp are missing.

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality
@iSevenDays
Contributor Author

@xldistance thanks for the feedback; the files are there and compile successfully.

For those who are testing with Claude Code, here are my suggestions:

  • Kimi-K2 works best: it is very fast and uses tools.
  • DeepSeek-TNG-R1T2-Chimera works, but it times out too often on my Dell R740 with a 48GB 4090D.
  • Qwen3-235B-A22B-Instruct-2507-GGUF (pure-IQ4_KS from ubergarm) doesn't want to use tools.

@xldistance

@iSevenDays I use qwen3-coder-480b on top of ccr code

@iSevenDays
Contributor Author

@xldistance just make sure to set the correct LLM name in the .env and in llama-server.
I enabled name matching, e.g. the following names trigger additional tool-calling instructions in the system prompt to let the model know how to use tools properly. I ported the behavior from llama.cpp (llama.cpp uses a more complex system, by the way).
The following names would work:
Qwen3-235b
DeepSeek-R1
Kimi-K2
Kimi_K2

I'll check qwen3-coder-480b, which was recently uploaded: https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/IQ2_KS
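
A minimal sketch of the kind of name matching described above (the function and mapping are illustrative only, not the actual ik_llama.cpp code):

def detect_tool_call_format(model_name):
    """Map a served model name/alias to a tool-call format (illustrative sketch)."""
    name = model_name.lower().replace("_", "-")
    if "kimi-k2" in name:
        return "kimi_k2"
    if "qwen3" in name:
        return "qwen3"
    if "deepseek-r1" in name:
        return "deepseek_r1"
    return None  # no tool-call parsing or system-prompt injection

assert detect_tool_call_format("Kimi_K2") == "kimi_k2"
assert detect_tool_call_format("Qwen3-235b") == "qwen3"
assert detect_tool_call_format("DeepSeek-R1T2") == "deepseek_r1"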

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.
Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.
Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.
@ikawrakow
Owner

Well, I'll just merge it then.

@ikawrakow merged commit 3701fb1 into ikawrakow:main Jul 23, 2025
@iSevenDays
Contributor Author

@xldistance I found one issue with function tool calls when using an LLM with Claude Code.
Please check this PR #643 for the latest updates. Now I can use Qwen3 with the Claude Code proxy as well.

@randoentity

FWIW: tested and working with local Qwen. I haven't run into the issue above yet. I'm not using the proxy/router from above, though. Is there any way to make this work with Jinja templates, without the model name being hardcoded?

@mtcl

mtcl commented Jul 24, 2025

FWIW: tested and working with local qwen. Haven't run into the issue above yet. I'm not using the proxy/router from above though. Is there any way to make this work with jinja templates and not having the model name hardcoded?

What's the exact command that you used to start the server? Can you please share?

@randoentity

@mtcl There's nothing special to it: look at iSevenDays' example above and just use --alias Qwen3-235b instead (just "qwen" should be sufficient). Also check out the documentation added in this PR, as it has an example of what the request should look like. Note that the model name is significant.

@city96

city96 commented Jul 26, 2025

I did an update today and noticed token streaming wasn't working on latest master. I've tracked it down to this PR, with the commit right before it working fine.

When token streaming is disabled, the reply is generated as usual and appears once generation finishes. When I enable token streaming, the generation still finishes in the background, but I never get any output. I was testing with an old version of SillyTavern, but it is also reproducible in mikupad, which is probably easier to use for reproduction.

I get the same issue on Kimi, DeepSeek V3, and even random models like Gemma:

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-server -m /mnt/models/llm/gemma-3-27b-it-q6_k.gguf -c 16384 -ngl 99

@iSevenDays
Contributor Author

@city96 could you please check this PR #652, and could you please provide a minimal reproducible example, ideally with some small LLM? Then I could check and verify it quickly.

I'm currently testing the PR above with both streaming and non-streaming mode using the Kimi-K2 model, and I didn't notice any issues, but I would gladly help you resolve the issue if there was a regression.

@city96

city96 commented Jul 26, 2025

I tested your linked PR, but still saw the same problem. I think I found the issue, though. It's this change that this PR makes (screenshot of the diff omitted); on latest master that line is here, and changing it back fixes streaming:

{"content", ""}, // Empty - clean content provided via diffs

Not sure what the logic is in mainline llama.cpp for streaming, but I am using the text completion endpoint instead of the chat completion endpoint. I assume this is likely why it wasn't caught, since most people probably use the OpenAI-compatible one.

For a reproducible example, you can start the ik_llama.cpp server example with any model (I used Gemma 27B for testing, but any model should work). Connect to it via mikupad, enter a simple query, enable token streaming, then hit "predict" at the bottom. I can try to make a pure Python example as well if it helps.
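
For what it's worth, a minimal script along those lines, assuming the server example's /completion endpoint and its "data: "-prefixed streaming lines as in upstream llama.cpp:

import json
import requests

url = "http://localhost:8080/completion"
payload = {"prompt": "Once upon a time", "n_predict": 64, "stream": True}

with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = json.loads(line[len(b"data: "):])
        print(chunk.get("content", ""), end="", flush=True)
        if chunk.get("stop"):
            break
print()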


@iSevenDays
Contributor Author

iSevenDays commented Jul 26, 2025

@city96 could you please test the change in this PR #654?
I think you have correctly identified the issue, but I'll only be able to test that change later today.

@city96

city96 commented Jul 26, 2025

@iSevenDays I can confirm that the newest PR does indeed fix token streaming on the text completion endpoint for me, thank you.

@iSevenDays mentioned this pull request Jul 26, 2025
@imweijh

imweijh commented Jul 28, 2025


ANTHROPIC_AUTH_TOKEN="sk-localkey" ANTHROPIC_BASE_URL=http://localhost:8082 claude "list files"
This works for me.

@sayap
Contributor

sayap commented Jul 29, 2025

Hi, can you explain a bit what the purpose of this is:

        // Clean up extra whitespace
        content = std::regex_replace(content, std::regex(R"(\n\s*\n)"), "\n");
        
        // Trim leading/trailing whitespace
        content.erase(0, content.find_first_not_of(" \t\n\r"));
        content.erase(content.find_last_not_of(" \t\n\r") + 1);

It breaks the non-toolcall streaming output, e.g. with a simple prompt like this:

Write a Python function that converts from Celsius to Fahrenheit. Write another Python function that does the opposite. Return only a markdown codeblock with the 2 Python functions. No intro, no explanation, no example. Follow the PEP 8 style guide, i.e. with 2 empty lines between functions.

Due to the "Clean up extra whitespace" logic, there will be no empty lines between the 2 functions in the streaming output.
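
To see the effect in isolation, here is the same substitution expressed in Python for illustration (not the actual server code):

import re

pep8_snippet = (
    "def c_to_f(c):\n"
    "    return c * 9 / 5 + 32\n"
    "\n"
    "\n"
    "def f_to_c(f):\n"
    "    return (f - 32) * 5 / 9\n"
)
collapsed = re.sub(r"\n\s*\n", "\n", pep8_snippet)
print(collapsed)
# The two blank lines required by PEP 8 between the functions are gone:
# def c_to_f(c):
#     return c * 9 / 5 + 32
# def f_to_c(f):
#     return (f - 32) * 5 / 9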

@iSevenDays
Contributor Author

@sayap thank you for the bug report for the Qwen3 model!
There is an issue with Qwen3 when using tool calls.

Could you please check this PR #661? I've added additional tests to make sure we follow the original llama.cpp logic for tool parsing and that this will not come up again as a regression.

Note to other people: only the Qwen3 tool call parsing logic was affected.
