
Khoj is reverting to downloading a model even if I add my own llama-cpp hosted server #1253

@DeutscheGabanna

Description


Server

  • Cloud (https://app.khoj.dev)
  • Self-Hosted Docker
  • Self-Hosted Python package
  • Self-Hosted source code

Clients

  • Web browser
  • Desktop/mobile app
  • Obsidian
  • Emacs
  • WhatsApp

OS

  • Windows
  • macOS
  • Linux
  • Android
  • iOS

Khoj version

latest

Describe the bug

Khoj tries to download a model from Hugging Face with this local configuration, instead of using the local API.

Current Behavior

(screenshot attached)

This is the Bielik configuration:

(screenshot attached)

I used the admin API key because I had no clue what else to put in the API key field, since it is a local, self-hosted llama-cpp-server.

I confirmed it works by doing:

curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Napisz puuuuu",
    "max_tokens": 32,
    "temperature": 0.2
  }'

{"index":0,"content":"uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu","tokens":[],"id_slot":3,"stop":true,"model":"Bielik-4.5B-v3.0-Instruct.Q8_0.gguf","tokens_predicted":32,"tokens_evaluated":8,"generation_settings":{"seed":4294967295,"temperature":0.20000000298023224,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":4096,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":32,"n_predict":32,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"thinking_forced_open":false,"samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"<s>Napisz puuuuu","has_new_line":false,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":39,"timings":{"cache_n":1,"prompt_n":7,"prompt_ms":275.595,"prompt_per_token_ms":39.37071428571429,"prompt_per_second":25.399589978047494,"predicted_n":32,"predicted_ms":5506.741,"predicted_per_token_ms":172.08565625,"predicted_per_second":5.8110595722588005}}
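Note that the `model` field in that response (`Bielik-4.5B-v3.0-Instruct.Q8_0.gguf`) is the name the server itself reports for the loaded GGUF, which is presumably the name an OpenAI-compatible client should be configured with. A quick check of extracting it (response trimmed to the relevant fields):

```python
import json

# Trimmed copy of the llama-cpp-server /completion response above;
# only the fields relevant here are kept.
response = json.loads(
    '{"index": 0, "content": "uuuu", "stop": true,'
    ' "model": "Bielik-4.5B-v3.0-Instruct.Q8_0.gguf"}'
)
print(response["model"])  # the name the server reports for the loaded model
```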

Yet Khoj tries to download the model from Hugging Face, as is evident from the logs:

ValueError: No file found in speakleash/Bielik-4.5B-v3.0-Instruct-GGUF that match *Q4_K_M.gguf

If I save the same local model under a different name, without the slash (/), it also doesn't work, with a different error:

ValueError: not enough values to unpack (expected 2, got 1)
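For what it's worth, that second traceback looks like ordinary tuple unpacking applied to a model name expected to be in Hugging Face `org/repo` form. A minimal reproduction (an assumption about Khoj's internals, not its actual code):

```python
# Sketch (hypothetical): a loader that assumes chat model names look like
# "org/repo" fails exactly this way when given a bare GGUF filename.
name = "Bielik-4.5B-v3.0-Instruct.Q8_0.gguf"  # no slash, so split yields 1 part
try:
    org, repo = name.split("/")  # expects exactly two parts
except ValueError as err:
    print(err)  # not enough values to unpack (expected 2, got 1)
```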

I tried again with this name, since it is a perfect match: speakleash/Bielik-4.5B-v3.0-Instruct.Q8_0.gguf

FileNotFoundError: speakleash/Bielik-4.5B-v3.0-Instruct.Q8_0.gguf (repository not found)

Expected Behavior

Khoj should use http://localhost:8080 as the API endpoint instead of downloading the model from Hugging Face.

I followed the manual very closely:

1. Install any OpenAI API compatible local AI model server like llama-cpp-server, Ollama, vLLM etc.
2. Add an AI model API on the admin panel.
3. Set the API URL field to the URL of your local AI model provider, like http://localhost:11434/v1/ for Ollama.
4. Restart the Khoj server to load models available on your local AI model provider.
5. If that doesn't work, you'll need to manually add the available chat model in the admin panel.
6. Set the newly added chat model as your preferred model in your user chat settings.
7. Start chatting with your local AI!
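For reference, a client configured this way would talk to the server through the OpenAI-compatible `/v1/chat/completions` route, not llama.cpp's native `/completion` endpoint I tested above. A sketch of the request shape (URL and model name taken from this setup; the field names are the standard OpenAI chat format, not anything Khoj-specific):

```python
import json

# Assumed local endpoint: llama-cpp-server's OpenAI-compatible route.
url = "http://localhost:8080/v1/chat/completions"

# Standard OpenAI-style chat payload; the "model" value should match the
# name the server reports for the loaded GGUF (see the /completion output).
payload = {
    "model": "Bielik-4.5B-v3.0-Instruct.Q8_0.gguf",
    "messages": [{"role": "user", "content": "Napisz puuuuu"}],
    "max_tokens": 32,
    "temperature": 0.2,
}
body = json.dumps(payload)
print(body)
```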

Reproduction Steps

Install on Arch using podman compose up. Run a separate llama-cpp-server instance (installed through the AUR) with the model https://huggingface.co/speakleash/Bielik-4.5B-v3.0-Instruct-GGUF: llama-cpp-server -m local_path_model.gguf. Copy the open port and add the model to the Khoj configuration.

Possible Workaround

No response

Additional Information

No response

Link to Discord or Github discussion

No response

Labels

fix