Windows

Set up a llama.cpp server for Windows

  1. Download the model file qwen2.5-coder-1.5b-q8_0.gguf from https://huggingface.co/ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF/blob/main/qwen2.5-coder-1.5b-q8_0.gguf (a command-line alternative is sketched after these steps).
  2. Download the release archive for Windows from https://github.com/ggerganov/llama.cpp/releases and extract it. For NVIDIA GPUs, pick a CUDA build of the archive.
  3. Run the llama.cpp server.
    3.1 CPU only
    Put the model qwen2.5-coder-1.5b-q8_0.gguf in the folder with the extracted files and start the llama.cpp server from a command window:
    llama-server.exe -m qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
    3.2 NVIDIA GPU with the latest CUDA installed
    Put the model qwen2.5-coder-1.5b-q8_0.gguf in the folder with the extracted files and start the llama.cpp server from a command window:
    llama-server.exe -m qwen2.5-coder-1.5b-q8_0.gguf --port 8012 --n-gpu-layers 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
    Note: -c is an alias of --ctx-size, so only --ctx-size 0 (use the model's full context) is passed here.
    Now you can start using the llama-vscode extension. A quick way to verify the server is running is shown below.
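
If you prefer the command line for step 1, curl.exe (bundled with Windows 10 and later) can download the model directly. This is a minimal sketch; the /resolve/ URL below is assumed to be the direct-download form of the /blob/ page linked in step 1:

    rem Download the model into the current folder, following redirects
    curl -L -o qwen2.5-coder-1.5b-q8_0.gguf https://huggingface.co/ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF/resolve/main/qwen2.5-coder-1.5b-q8_0.gguf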
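
To confirm the server is up before connecting the extension, you can query its built-in HTTP endpoints from a second command window. A minimal check, assuming the default 127.0.0.1 address and the port 8012 used in the commands above:

    rem Health check - returns {"status": "ok"} once the model has loaded
    curl http://127.0.0.1:8012/health

    rem Small test completion - the server should reply with generated text in JSON
    curl http://127.0.0.1:8012/completion -H "Content-Type: application/json" -d "{\"prompt\": \"def fib(n):\", \"n_predict\": 32}"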

More details about the llama.cpp server