---
title: Launch the LLM Server
weight: 4

### FIXED, DO NOT MODIFY

The GGUF model format, introduced by the Llama.cpp team, uses compression and quantization to reduce weight precision to 4-bit integers, significantly decreasing computational and memory demands and making Arm CPUs effective for LLM inference.
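
As a rough illustration of the savings: at about 4.5 bits per weight (Q4_0 stores 4-bit values plus a per-block scale), an 8-billion-parameter model occupies roughly 4.5-5 GB on disk, versus about 16 GB for the same weights at FP16. You can sanity-check this against the downloaded file:

```bash
# The 4-bit GGUF file should be roughly 4.5-5 GB for an 8B-parameter
# model, compared with ~16 GB at 16-bit precision.
ls -lh dolphin-2.9.4-llama3.1-8b-Q4_0.gguf
```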

### Requantize the model weights

To requantize the model, run:

```bash
./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8
```

This outputs a new file, `dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf`, which contains reconfigured weights that allow `llama-cli` to use SVE 256 and MATMUL_INT8 support.

This requantization is optimal specifically for Graviton3. For Graviton2, requantize to the `Q4_0_4_4` format instead, and for Graviton4, `Q4_0_4_8` is the most suitable format.
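
On those Graviton generations, the same command applies with the target format swapped in. A sketch based on the formats named above (the output filenames are illustrative):

```bash
# Graviton2: requantize to the Q4_0_4_4 layout.
./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_4_4.gguf Q4_0_4_4

# Graviton4: requantize to the Q4_0_4_8 layout.
./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_4_8.gguf Q4_0_4_8
```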

### Start the LLM Server

You can use the `llama.cpp` server program and send requests through an OpenAI-compatible API. This allows you to develop applications that interact with the LLM multiple times without having to repeatedly start and stop it. Additionally, you can access the server over the network from a machine other than the one hosting the LLM.

Start the server from the command line; it listens on port 8080:
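
A minimal sketch of the launch command, assuming the `llama-server` binary built alongside `llama-quantize` sits in the current directory:

```bash
# Load the requantized model and serve HTTP requests on port 8080.
./llama-server -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf --port 8080
```

Once the server is up, you can exercise the OpenAI-compatible API with a plain HTTP request, for example:

```bash
# Send a single chat completion request to the local server;
# llama-server uses the loaded model, so no model field is required.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```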