Windows
Used for
- code completion
LLM type
- FIM (fill in the middle)
Instructions
`winget install llama.cpp`
OR
Download the llama.cpp release files for Windows from the releases page. For CPU, use llama-&lt;version&gt;-bin-win-cpu-&lt;arch&gt;.zip. For Nvidia GPUs: llama-&lt;version&gt;-bin-win-cuda-x64.zip and, if you don't have CUDA drivers installed, also cudart-llama-bin-win-cuda*-x64.zip.
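If you prefer scripting the download, here is a minimal sketch using curl and tar (both ship with recent Windows). The release tag and asset name below are placeholders; substitute the actual file names from the releases page:

```
curl -L -o llama.zip https://github.com/ggml-org/llama.cpp/releases/download/<tag>/llama-<version>-bin-win-cpu-x64.zip
:: tar on Windows 10+ can extract zip archives
tar -xf llama.zip
```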
No GPUs:
`llama-server.exe --fim-qwen-1.5b-default --port 8012`
With Nvidia GPUs and the latest CUDA installed:
`llama-server.exe --fim-qwen-1.5b-default --port 8012 -ngl 99`
If you've installed llama.cpp with winget, you can skip the .exe suffix and use just `llama-server` in the commands.
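To confirm the server is up before connecting the extension, you can query llama-server's built-in health endpoint:

```
curl http://127.0.0.1:8012/health
```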
Now you can start using the llama-vscode extension for code completion.
More details about the llama.cpp server are available in the llama.cpp documentation.
Used for
- Chat with AI
- Chat with AI with project context
- Edit with AI
- Generate commit message
LLM type
- Chat Models
Instructions
Same as for the code completion server, but use a chat model and slightly different parameters.
CPU-only:
`llama-server.exe -hf qwen2.5-coder-1.5b-instruct-q8_0.gguf --port 8011`
With Nvidia GPUs and CUDA drivers installed:
- more than 16GB VRAM:
`llama-server.exe -hf qwen2.5-coder-7b-instruct-q8_0.gguf --port 8011 -np 2 -ngl 99`
- less than 16GB VRAM:
`llama-server.exe -hf qwen2.5-coder-3b-instruct-q8_0.gguf --port 8011 -np 2 -ngl 99`
- less than 8GB VRAM:
`llama-server.exe -hf qwen2.5-coder-1.5b-instruct-q8_0.gguf --port 8011 -np 2 -ngl 99`
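Once the chat server is running, a quick sanity check against its OpenAI-compatible chat endpoint (adjust the port if you changed it):

```
curl http://127.0.0.1:8011/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}"
```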
Used for
- Chat with AI with project context
LLM type
- Embedding
Instructions
Same as for the code completion server, but use an embeddings model and slightly different parameters.
`llama-server.exe -hf nomic-embed-text-v2-moe-q8_0.gguf --port 8010 -ub 2048 -b 2048 --ctx-size 2048 --embeddings`
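To verify the embeddings server responds, you can call the OpenAI-compatible embeddings route, which llama-server exposes when started with `--embeddings` (a minimal check; adjust the port if you changed it):

```
curl http://127.0.0.1:8010/v1/embeddings -H "Content-Type: application/json" -d "{\"input\": \"hello world\"}"
```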