igardev edited this page Jan 21, 2025 · 27 revisions

Setup instructions for llama.cpp server

For Linux

  1. Download the release files for your OS from https://github.com/ggerganov/llama.cpp/releases (or build from source).
  2. Download the LLM model and run llama.cpp server (combined in one command)
    2.1. No GPUs: run the following command from a shell:
    llama-server --hf-repo ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF --hf-file qwen2.5-coder-7b-q8_0.gguf --port 8012 -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
    2.2. With Nvidia GPUs and the latest CUDA installed
  • If you have more than 16GB VRAM, run the following command from a shell:
    llama-server --hf-repo ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF --hf-file qwen2.5-coder-7b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
  • If you have less than 16GB VRAM, run the following command from a shell:
    llama-server --hf-repo ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF --hf-file qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
    On the first run the model file is not yet available locally, so it is downloaded (this can take some time); after that the llama.cpp server starts.
    You can now start using the llama-vscode extension. Enjoy!
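
The choice between the two GPU commands above can be sketched as a small shell helper. This is only an illustration: the function names `model_size` and `server_command` are hypothetical, the VRAM size is passed in by hand as a whole number of GB, and treating exactly 16GB as the smaller case is an assumption, since the instructions only cover "more than" and "less than" 16GB.

```shell
# Hypothetical helper: pick the model size from the VRAM amount (whole GB).
# The 16GB threshold comes from the instructions above; exactly 16GB
# falling into the smaller case is an assumption.
model_size() {
  if [ "$1" -gt 16 ]; then
    echo "7"
  else
    echo "1.5"
  fi
}

# Print (do not run) the llama-server command for the given VRAM size.
server_command() {
  size="$(model_size "$1")"
  echo "llama-server --hf-repo ggml-org/Qwen2.5-Coder-${size}B-Q8_0-GGUF --hf-file qwen2.5-coder-${size}b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256"
}

# Example: show the command for a 24GB card.
#   server_command 24
```

Printing the command instead of running it makes it easy to check the flags before starting the server.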

For Mac

Prerequisites - Homebrew

  1. Install llama.cpp with the command
    brew install llama.cpp
  2. Download the LLM model and run llama.cpp server (combined in one command)
  • If you have more than 16GB VRAM:
    llama-server --hf-repo ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF --hf-file qwen2.5-coder-7b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
  • If you have less than 16GB VRAM:
    llama-server --hf-repo ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF --hf-file qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
    On the first run the model file is not yet available locally, so it is downloaded (this can take some time); after that the llama.cpp server starts.
    You can now start using the llama-vscode extension.
    Enjoy!
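
Because the first start also downloads the model, the server may not answer right away. A minimal sketch for waiting until it is ready, assuming the server runs on the local machine (127.0.0.1) on port 8012 as in the commands above, and using the `/health` endpoint of llama-server. The function names `health_url` and `wait_for_server`, the 30 attempts, and the 2-second pause are illustrative choices, not part of the original instructions.

```shell
# Build the health-check URL. Host 127.0.0.1 is an assumption (local server);
# port 8012 matches the --port value used in the commands above.
health_url() {
  echo "http://${1:-127.0.0.1}:${2:-8012}/health"
}

# Poll the server until it answers or the attempts run out.
# Defaults (30 attempts, 2 seconds apart) are arbitrary illustrative values.
wait_for_server() {
  attempts="${1:-30}"
  while [ "$attempts" -gt 0 ]; do
    # -s: quiet, -f: treat HTTP errors (e.g. model still loading) as failure
    if curl -sf "$(health_url)" > /dev/null; then
      echo "server is ready"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 2
  done
  echo "server did not respond" >&2
  return 1
}

# Usage, in a second terminal after starting llama-server:
#   wait_for_server
```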

For Windows

  1. Download file qwen2.5-coder-1.5b-q8_0.gguf from https://huggingface.co/ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF/blob/main/qwen2.5-coder-1.5b-q8_0.gguf
  2. Download the release files for Windows from https://github.com/ggerganov/llama.cpp/releases and extract them.
  3. Run llama.cpp server
    3.1 No GPUs
    Put the model file qwen2.5-coder-1.5b-q8_0.gguf in the folder with the extracted files and start the llama.cpp server from a command window:
    llama-server.exe -m qwen2.5-coder-1.5b-q8_0.gguf --port 8012 -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
    3.2 With Nvidia GPUs and the latest CUDA installed
    Put the model file qwen2.5-coder-1.5b-q8_0.gguf in the folder with the extracted files and start the llama.cpp server from a command window:
    llama-server.exe -m qwen2.5-coder-1.5b-q8_0.gguf --port 8012 --n-gpu-layers 99 -fa -ub 1024 -b 1024 -dt 0.1 --ctx-size 0 --cache-reuse 256
    You can now start using the llama-vscode extension.
    Enjoy!

For all operating systems: if you have better hardware (GPUs), you can use bigger models from https://huggingface.co/ggml-org such as qwen2.5-coder-3b-q8_0.gguf, qwen2.5-coder-7b-q8_0.gguf, or qwen2.5-coder-14b-q8_0.gguf. Any FIM-compatible model supported by llama.cpp can be used.
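
To see what FIM (fill-in-the-middle) means in practice, a request can be sent by hand to the running server. The sketch below builds a minimal JSON body for the `/infill` endpoint of llama-server, with the text before the cursor in `input_prefix` and the text after it in `input_suffix`. The helper name `infill_payload`, the default of 32 tokens, and the sample code strings are illustrative assumptions. The helper does no JSON escaping, so it only works for simple strings.

```shell
# Hypothetical helper: build a minimal JSON body for the /infill endpoint.
# Assumption: prefix and suffix contain no quotes, backslashes, or newlines,
# because no JSON escaping is done here.
#   $1 = text before the cursor, $2 = text after the cursor,
#   $3 = maximum number of tokens to generate (default 32, arbitrary)
infill_payload() {
  printf '{"input_prefix": "%s", "input_suffix": "%s", "n_predict": %d}\n' \
    "$1" "$2" "${3:-32}"
}

# Usage (needs a running server on port 8012, as started above):
#   curl -s http://127.0.0.1:8012/infill \
#     -H "Content-Type: application/json" \
#     -d "$(infill_payload 'x = 1 + ' ' # sum' 8)"
```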
More details about llama.cpp server
