The extension doesn't function unless I'm hosting a model via llama.cpp itself. For example, I have a chat model hosted via KoboldCPP, which can emulate OpenAI's API as well as its own. It works fine with all the other tools I've tried, like Continue, but not with this extension. When I try to use something like "Edit selected text with AI", it just errors out. Enabling the setting … Is there guidance on how to make this work?
@TFWol For code completions you will need a llama.cpp server. The reason is that llama.vscode uses the /infill endpoint for better performance on local machines; as far as I know, the other providers don't offer an /infill endpoint.
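For context, here is a minimal sketch of what a direct call to that endpoint looks like, assuming llama-server on its default local port 8080 and a model with fill-in-the-middle support (the `input_prefix`/`input_suffix` field names are as documented for llama-server; adjust the URL for your setup). An OpenAI-compatible server such as KoboldCPP has no such route, which is why completions error out against it:

```typescript
// Minimal sketch: call llama.cpp's /infill endpoint directly.
// Assumes llama-server is running locally on its default port 8080.
async function infill(prefix: string, suffix: string): Promise<string> {
  const res = await fetch("http://127.0.0.1:8080/infill", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      input_prefix: prefix, // code before the cursor
      input_suffix: suffix, // code after the cursor
      n_predict: 64,        // cap on generated tokens
    }),
  });
  const data = await res.json();
  return data.content;      // the suggested completion text
}

// Example: ask the model to fill in a function body.
infill("function add(a: number, b: number) {\n  return ", ";\n}").then(console.log);
```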
How to make it work for chat-related functionality: for Chat with AI you will need a llama.cpp server running on the endpoint from the endpoint_chat setting. For the agent (tools) it is similar; just set the properties Endpoint_tools (required), Api_key_tools, and Ai_model. I hope in the next version it will be easier to configure. Thanks for asking.
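As a rough illustration, the relevant entries in VS Code's settings.json might look like the sketch below. The exact key names are assumptions inferred from the property names above (only the `llama-vscode.` prefix is confirmed, via the `use_openai_endpoint` setting), so check the extension's settings UI for the authoritative spelling:

```jsonc
{
  // Chat with AI: llama.cpp server endpoint (assumed key name)
  "llama-vscode.endpoint_chat": "http://127.0.0.1:8080",

  // Agent (tools): endpoint is required; key and model depend on the provider
  // (all three key names below are assumed spellings)
  "llama-vscode.endpoint_tools": "http://127.0.0.1:5001/v1",
  "llama-vscode.api_key_tools": "sk-placeholder",
  "llama-vscode.ai_model": "my-local-model"
}
```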
@TFWol For chat or the agent (tools) you don't strictly need llama.cpp; any OpenAI-compatible API should work. (Don't use llama-vscode.use_openai_endpoint. I have to remove it; it is for completion, but it is very slow.)
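For example, here is one way to confirm that an OpenAI-compatible server, such as KoboldCPP's emulation layer, answers chat requests before pointing the extension at it. The URL and model name below are placeholders for a local setup (KoboldCPP typically listens on port 5001 and serves the OpenAI API under /v1):

```typescript
// Sanity-check a local OpenAI-compatible chat endpoint.
async function checkChatEndpoint(baseUrl: string): Promise<void> {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "my-local-model", // placeholder; many local servers ignore it
      messages: [{ role: "user", content: "Say hello." }],
    }),
  });
  const data = await res.json();
  console.log(data.choices[0].message.content);
}

// KoboldCPP's OpenAI emulation usually serves under http://127.0.0.1:5001
checkChatEndpoint("http://127.0.0.1:5001");
```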
Currently the documentation is here. I know it is not enough; I will try to improve it.