
Commit 6bd39bb

Vaibhavs10, osanseviero, and pcuenca authored
Update gguf-llamacpp.md (#1438)
* Update gguf-llamacpp.md

* up.

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>

---------

Co-authored-by: Omar Sanseviero <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
1 parent 0383e4e commit 6bd39bb

File tree

1 file changed: +30 -3 lines changed


docs/hub/gguf-llamacpp.md

Lines changed: 30 additions & 3 deletions
# GGUF usage with llama.cpp

> [!TIP]
> You can now deploy any llama.cpp compatible GGUF on Hugging Face Endpoints, read more about it [here](https://huggingface.co/docs/inference-endpoints/en/others/llamacpp_container)

Llama.cpp allows you to download and run inference on a GGUF simply by providing a Hugging Face repo path and the file name. llama.cpp downloads the model checkpoint and automatically caches it. The location of the cache is defined by the `LLAMA_CACHE` environment variable; read more about it [here](https://github.com/ggerganov/llama.cpp/pull/7826).
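For example, if you want the cached checkpoints to live in a specific directory, you can export `LLAMA_CACHE` before running any of the commands below. This is just an illustrative sketch; the path used here is arbitrary, not a default:

```bash
# Optional: store the GGUFs that llama.cpp downloads in a custom directory
# (~/llama-cache is an arbitrary example path, not the default location)
mkdir -p ~/llama-cache
export LLAMA_CACHE=~/llama-cache
```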
You can install llama.cpp through brew (works on Mac and Linux), or you can build it from source. There are also pre-built binaries and Docker images that you can [check in the official documentation](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage).

### Option 1: Install with brew

```bash
brew install llama.cpp
```
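If you want to confirm the install worked before moving on, a quick sanity check is to look for the binaries brew puts on your PATH (a minimal check; the exact set of binaries shipped can vary between llama.cpp releases):

```bash
# Verify the main llama.cpp binaries are available on your PATH
which llama-cli llama-server
llama-cli --help | head -n 5
```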
### Option 2: Build from source

Step 1: Clone llama.cpp from GitHub.

```bash
git clone https://github.com/ggerganov/llama.cpp
```
Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag along with other hardware-specific flags (for example, `LLAMA_CUDA=1` for Nvidia GPUs on Linux).

```bash
cd llama.cpp && LLAMA_CURL=1 make
```
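For instance, on a Linux machine with an Nvidia GPU you would combine the hardware-specific flag mentioned above with the same command (a sketch; pick the flags that match your hardware):

```bash
# Build with CUDA offloading enabled, keeping remote-download support
cd llama.cpp && LLAMA_CURL=1 LLAMA_CUDA=1 make
```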
Once installed, you can use the `llama-cli` or `llama-server` as follows:

```bash
llama-cli \
  --hf-repo lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3-8B-Instruct-Q8_0.gguf \
  -p "You are a helpful assistant" -cnv
```

Note: You can remove `-cnv` to run the CLI in chat completion mode.
Additionally, you can invoke an OpenAI-spec chat completions endpoint directly using the llama.cpp server:

```bash
llama-server \
  --hf-repo lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3-8B-Instruct-Q8_0.gguf
```
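Once `llama-server` is running, any OpenAI-style client can talk to it. Here is a minimal sketch with `curl`, assuming the server is listening on its default `localhost:8080` binding:

```bash
# Call the OpenAI-compatible chat completions route exposed by llama-server
# (adjust the host/port if you changed the server's defaults)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant"},
          {"role": "user", "content": "What is GGUF?"}
        ]
      }'
```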

0 commit comments
