Problem calling Ollama's gpt-oss:20b model from Continue #7061
-
This is another program calling the same model; there it loads into the GPU.
-
@main1015 Hi, thank you for the detailed issue. I am wondering if this has something to do with keep-alive. How much VRAM and system RAM do you have? To test this, I think we can try:
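Setting keepAlive: 0 on the Ollama model entry, as sketched below. The exact key placement in Continue's config.yaml can vary by version, so treat this as a sketch rather than the exact snippet from the thread (the follow-up reply confirms keepAlive: 0 is the parameter that was tried):

```yaml
models:
  - name: gpt-oss 20b
    provider: ollama
    model: gpt-oss:20b
    keepAlive: 0  # 0 asks Ollama to unload the model right after each request
```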
-
Running Ollama on Ubuntu with an RTX A5000 (24 GB VRAM) and 128 GB system RAM. I added the keepAlive: 0 parameter and tried it, but it didn't seem to work.
-
Ollama recently updated and implemented some changes under the hood relating to memory prediction and allocation. Try this:
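The commenter's exact steps were not preserved in this copy of the thread. As a hedged sketch, one way to pick up the newer allocation behavior is to update Ollama and then confirm where the model actually loads; `ollama ps` and the install script are standard Ollama tooling, but this specific sequence is an assumption, not necessarily the original advice:

```bash
# Update Ollama via the official Linux install script, reload the model,
# and check where it actually landed.
curl -fsSL https://ollama.com/install.sh | sh
ollama --version                 # confirm you are on the updated build
ollama run gpt-oss:20b "hello"   # force a fresh load of the model
ollama ps                        # PROCESSOR column shows e.g. "100% GPU" or a CPU/GPU split
```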
I found this to work even through Continue, where similar-sized models would previously split between CPU and GPU when called from Continue, due to inflated memory allocation.


-
I'm currently experiencing an odd issue when using Continue to connect to the gpt-oss:20b model on Ollama. When Ollama loads this model, it appears to load into the CPU rather than the GPU. However, when I connect to the gpt-oss:20b model through other tools or code, it loads into the GPU, which is quite puzzling. When I captured the data from Continue's call to /api/chat and sent it to Ollama via Postman, the model still loaded into the GPU. Please help resolve this. The config.yaml configuration file for my Continue setup is as follows:
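The actual file did not survive in this copy of the thread. For reference, a representative Continue config.yaml for this kind of setup might look like the sketch below; the model block is an assumption about a typical Continue + Ollama configuration, not the author's actual file:

```yaml
# Representative sketch only; the original config.yaml was not captured,
# so the entries below are assumptions about a typical Continue + Ollama setup.
name: Local Assistant
version: 1.0.0
models:
  - name: gpt-oss 20b
    provider: ollama
    model: gpt-oss:20b
    roles:
      - chat
```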