Step 1: Clone llama.cpp from GitHub.

```
git clone https://github.com/ggerganov/llama.cpp
```

Step 2: Move into the llama.cpp folder and build it. You can also add hardware-specific flags (for example, `-DGGML_CUDA=ON` for Nvidia GPUs).

```
cd llama.cpp
cmake -B build # optionally, add -DGGML_CUDA=ON to activate CUDA
cmake --build build --config Release
```

Note: For other hardware support (for example, AMD ROCm or Intel SYCL), please refer to [llama.cpp's build guide](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).
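
For instance, a ROCm build might look like the sketch below. The `GGML_HIP` flag name is an assumption based on recent llama.cpp versions and may differ in yours, so double-check the linked guide.

```bash
# Assumed flags for an AMD ROCm (HIP) build; verify against llama.cpp's build guide
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release
```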

Once built, you can use `llama-cli` or `llama-server` as follows:

```bash
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```

Note: You can explicitly add `-no-cnv` to run the CLI in raw completion mode (non-chat mode).
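
For example, a raw completion run might look like the following sketch (the prompt and token count are just illustrative):

```bash
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 -no-cnv -p "The capital of France is" -n 64
```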

Additionally, you can invoke an OpenAI-compatible chat completions endpoint directly using the llama.cpp server.
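
As a minimal sketch, you can start the server with the same `-hf` flag used above (by default it listens on port 8080):

```bash
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```

With the server running, send a standard OpenAI-style request body, for example: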

```bash
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello! Who are you?"}
        ]
    }'
```

Replace the `-hf` argument with any valid Hugging Face Hub repo name - off you go! 🦙
