
Commit 14f64da

Merge branch 'ggerganov:master' into cuda-build-doc
2 parents f70b514 + 5555c0c commit 14f64da

38 files changed: +576 / -556 lines

.github/workflows/server.yml

Lines changed: 1 addition & 1 deletion

@@ -79,7 +79,7 @@ jobs:
       # Setup nodejs (to be used for verifying bundled index.html)
       - uses: actions/setup-node@v4
         with:
-          node-version: 22
+          node-version: '22.11.0'
 
       - name: Verify bundled index.html
         id: verify_server_index_html

common/arg.cpp

Lines changed: 8 additions & 1 deletion

@@ -591,7 +591,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         [](common_params & params) {
             params.ctx_shift = false;
         }
-    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_NO_CONTEXT_SHIFT"));
+    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_IMATRIX}).set_env("LLAMA_ARG_NO_CONTEXT_SHIFT"));
     add_opt(common_arg(
         {"--chunks"}, "N",
         string_format("max number of chunks to process (default: %d, -1 = all)", params.n_chunks),
@@ -1711,6 +1711,13 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             params.public_path = value;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_STATIC_PATH"));
+    add_opt(common_arg(
+        {"--no-webui"},
+        string_format("Disable the Web UI (default: %s)", params.webui ? "enabled" : "disabled"),
+        [](common_params & params) {
+            params.webui = false;
+        }
+    ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_NO_WEBUI"));
     add_opt(common_arg(
         {"--embedding", "--embeddings"},
         string_format("restrict to only support embedding use case; use only with dedicated embedding models (default: %s)", params.embedding ? "enabled" : "disabled"),
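The new option gives the server an API-only mode: for example `llama-server -m model.gguf --no-webui`, or equivalently `LLAMA_ARG_NO_WEBUI=1` in the environment (the model path here is only illustrative).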

examples/quantize/README.md

Lines changed: 1 addition & 1 deletion

@@ -81,7 +81,7 @@ Several quantization methods are supported. They differ in the resulting model d
 - [#4930 - imatrix for all k-quants](https://github.com/ggerganov/llama.cpp/pull/4930)
 - [#4951 - imatrix on the GPU](https://github.com/ggerganov/llama.cpp/pull/4957)
 - [#4969 - imatrix for legacy quants](https://github.com/ggerganov/llama.cpp/pull/4969)
-- [#4996 - k-qunats tuning](https://github.com/ggerganov/llama.cpp/pull/4996)
+- [#4996 - k-quants tuning](https://github.com/ggerganov/llama.cpp/pull/4996)
 - [#5060 - Q3_K_XS](https://github.com/ggerganov/llama.cpp/pull/5060)
 - [#5196 - 3-bit i-quants](https://github.com/ggerganov/llama.cpp/pull/5196)
 - [quantization tuning](https://github.com/ggerganov/llama.cpp/pull/5320), [another one](https://github.com/ggerganov/llama.cpp/pull/5334), and [another one](https://github.com/ggerganov/llama.cpp/pull/5361)

examples/server/README.md

Lines changed: 10 additions & 9 deletions

@@ -146,6 +146,7 @@ The project is under active development, and we are [looking for feedback and co
 | `--host HOST` | ip address to listen (default: 127.0.0.1)<br/>(env: LLAMA_ARG_HOST) |
 | `--port PORT` | port to listen (default: 8080)<br/>(env: LLAMA_ARG_PORT) |
 | `--path PATH` | path to serve static files from (default: )<br/>(env: LLAMA_ARG_STATIC_PATH) |
+| `--no-webui` | disable the Web UI<br/>(env: LLAMA_ARG_NO_WEBUI) |
 | `--embedding, --embeddings` | restrict to only support embedding use case; use only with dedicated embedding models (default: disabled)<br/>(env: LLAMA_ARG_EMBEDDINGS) |
 | `--reranking, --rerank` | enable reranking endpoint on server (default: disabled)<br/>(env: LLAMA_ARG_RERANKING) |
 | `--api-key KEY` | API key to use for authentication (default: none)<br/>(env: LLAMA_API_KEY) |
@@ -302,23 +303,23 @@ mkdir llama-client
 cd llama-client
 ```
 
-Create a index.js file and put this inside:
+Create an index.js file and put this inside:
 
 ```javascript
-const prompt = `Building a website can be done in 10 simple steps:`;
+const prompt = "Building a website can be done in 10 simple steps:"
 
-async function Test() {
+async function test() {
     let response = await fetch("http://127.0.0.1:8080/completion", {
-        method: 'POST',
+        method: "POST",
         body: JSON.stringify({
             prompt,
-            n_predict: 512,
+            n_predict: 64,
         })
     })
    console.log((await response.json()).content)
 }
 
-Test()
+test()
 ```
 
 And run it:
@@ -380,7 +381,7 @@ Multiple prompts are also supported. In this case, the completion result will be
 `n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. The number excludes the BOS token.
 By default, this value is set to `0`, meaning no tokens are kept. Use `-1` to retain all tokens from the prompt.
 
-`stream`: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to `true`.
+`stream`: Allows receiving each predicted token in real-time instead of waiting for the completion to finish (uses a different response format). To enable this, set to `true`.
 
 `stop`: Specify a JSON array of stopping strings.
 These words will not be included in the completion, so make sure to add them to the prompt for the next iteration. Default: `[]`
@@ -441,11 +442,11 @@ These words will not be included in the completion, so make sure to add them to
 
 `samplers`: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. Default: `["dry", "top_k", "typ_p", "top_p", "min_p", "xtc", "temperature"]` - these are all the available values.
 
-`timings_per_token`: Include prompt processing and text generation speed information in each response. Default: `false`
+`timings_per_token`: Include prompt processing and text generation speed information in each response. Default: `false`
 
 **Response format**
 
-- Note: When using streaming mode (`stream`), only `content` and `stop` will be returned until end of completion.
+- Note: In streaming mode (`stream`), only `content` and `stop` will be returned until end of completion. Responses are sent using the [Server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html) standard. Note: the browser's `EventSource` interface cannot be used due to its lack of `POST` request support.
 
 - `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has the following structure:
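A minimal client-side sketch of the streaming note above (not part of this commit): assuming the server emits standard `data: <json>` SSE lines and leaving out re-assembly of lines split across read chunks, plain `fetch` in Node 18+ can stand in for the `EventSource` interface:

```javascript
// Sketch only: stream /completion with fetch, since EventSource
// cannot send POST requests. Assumes "data: <json>" SSE lines and
// skips buffering of lines split across read() chunks.
async function streamCompletion() {
    const response = await fetch("http://127.0.0.1:8080/completion", {
        method: "POST",
        body: JSON.stringify({
            prompt: "Building a website can be done in 10 simple steps:",
            n_predict: 64,
            stream: true,
        })
    })
    const reader = response.body.getReader()
    const decoder = new TextDecoder()
    while (true) {
        const { done, value } = await reader.read()
        if (done) break
        for (const line of decoder.decode(value, { stream: true }).split("\n")) {
            if (!line.startsWith("data: ")) continue
            const data = JSON.parse(line.slice("data: ".length))
            process.stdout.write(data.content) // print each token as it arrives
            if (data.stop) return
        }
    }
}

streamCompletion()
```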

examples/server/public/index.html

Lines changed: 118 additions & 80 deletions
Large diffs are not rendered by default.

examples/server/server.cpp

Lines changed: 17 additions & 13 deletions

@@ -3815,20 +3815,24 @@ int main(int argc, char ** argv) {
     // Router
     //
 
-    // register static assets routes
-    if (!params.public_path.empty()) {
-        // Set the base directory for serving static files
-        bool is_found = svr->set_mount_point("/", params.public_path);
-        if (!is_found) {
-            LOG_ERR("%s: static assets path not found: %s\n", __func__, params.public_path.c_str());
-            return 1;
-        }
+    if (!params.webui) {
+        LOG_INF("Web UI is disabled\n");
     } else {
-        // using embedded static index.html
-        svr->Get("/", [](const httplib::Request &, httplib::Response & res) {
-            res.set_content(reinterpret_cast<const char*>(index_html), index_html_len, "text/html; charset=utf-8");
-            return false;
-        });
+        // register static assets routes
+        if (!params.public_path.empty()) {
+            // Set the base directory for serving static files
+            bool is_found = svr->set_mount_point("/", params.public_path);
+            if (!is_found) {
+                LOG_ERR("%s: static assets path not found: %s\n", __func__, params.public_path.c_str());
+                return 1;
+            }
+        } else {
+            // using embedded static index.html
+            svr->Get("/", [](const httplib::Request &, httplib::Response & res) {
+                res.set_content(reinterpret_cast<const char*>(index_html), index_html_len, "text/html; charset=utf-8");
+                return false;
+            });
+        }
     }
 
     // register API routes
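Note that the `params.webui` guard wraps only the static-asset routing; the API routes registered just below are unaffected, so with `--no-webui` a GET on `/` returns 404 (exactly what the new test asserts) while endpoints such as `/completion` keep working.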

examples/server/tests/unit/test_basic.py

Lines changed: 18 additions & 0 deletions

@@ -1,4 +1,5 @@
 import pytest
+import requests
 from utils import *
 
 server = ServerPreset.tinyllama2()
@@ -76,3 +77,20 @@ def test_load_split_model():
     })
     assert res.status_code == 200
     assert match_regex("(little|girl)+", res.body["content"])
+
+
+def test_no_webui():
+    global server
+    # default: webui enabled
+    server.start()
+    url = f"http://{server.server_host}:{server.server_port}"
+    res = requests.get(url)
+    assert res.status_code == 200
+    assert "<html>" in res.text
+    server.stop()
+
+    # with --no-webui
+    server.no_webui = True
+    server.start()
+    res = requests.get(url)
+    assert res.status_code == 404

examples/server/tests/utils.py

Lines changed: 3 additions & 0 deletions

@@ -72,6 +72,7 @@ class ServerProcess:
     disable_ctx_shift: int | None = False
     draft_min: int | None = None
     draft_max: int | None = None
+    no_webui: bool | None = None
 
     # session variables
     process: subprocess.Popen | None = None
@@ -158,6 +159,8 @@ def start(self, timeout_seconds: int = 10) -> None:
             server_args.extend(["--draft-max", self.draft_max])
         if self.draft_min:
             server_args.extend(["--draft-min", self.draft_min])
+        if self.no_webui:
+            server_args.append("--no-webui")
 
         args = [str(arg) for arg in [server_path, *server_args]]
         print(f"bench: starting server with: {' '.join(args)}")

examples/server/utils.hpp

Lines changed: 1 addition & 1 deletion

@@ -333,7 +333,7 @@ static std::string llama_get_chat_template(const struct llama_model * model) {
     if (res < 2) {
         return "";
     } else {
-        std::vector<char> model_template(res, 0);
+        std::vector<char> model_template(res + 1, 0);
         llama_model_meta_val_str(model, template_key.c_str(), model_template.data(), model_template.size());
         return std::string(model_template.data(), model_template.size() - 1);
     }
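The `res + 1` makes room for the terminating NUL: `llama_model_meta_val_str` follows the usual snprintf-style contract of returning the string length while writing at most `buf_size` bytes including the terminator, so a buffer of exactly `res` bytes silently dropped the last character of the chat template. The `model_template.size() - 1` on the next line then strips the NUL from the returned `std::string`.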

examples/server/webui/index.html

Lines changed: 81 additions & 43 deletions

@@ -15,7 +15,7 @@
     <!-- sidebar -->
     <div class="drawer-side h-screen lg:h-screen z-50 lg:max-w-64">
       <label for="toggle-drawer" aria-label="close sidebar" class="drawer-overlay"></label>
-      <div class="flex flex-col bg-base-200 min-h-full max-w-[calc(100vw-2em)] py-4 px-4">
+      <div class="flex flex-col bg-base-200 min-h-full max-w-64 py-4 px-4">
         <div class="flex flex-row items-center justify-between mb-4 mt-4">
           <h2 class="font-bold ml-4">Conversations</h2>
@@ -120,51 +120,25 @@ <h2 class="font-bold ml-4">Conversations</h2>
             {{ messages.length === 0 ? 'Send a message to start' : '' }}
           </div>
           <div v-for="msg in messages" class="group">
-            <div :class="{
-              'chat': true,
-              'chat-start': msg.role !== 'user',
-              'chat-end': msg.role === 'user',
-            }">
-              <div :class="{
-                'chat-bubble markdown': true,
-                'chat-bubble-base-300': msg.role !== 'user',
-              }">
-                <!-- textarea for editing message -->
-                <template v-if="editingMsg && editingMsg.id === msg.id">
-                  <textarea
-                    class="textarea textarea-bordered bg-base-100 text-base-content w-[calc(90vw-8em)] lg:w-96"
-                    v-model="msg.content"></textarea>
-                  <br/>
-                  <button class="btn btn-ghost mt-2 mr-2" @click="editingMsg = null">Cancel</button>
-                  <button class="btn mt-2" @click="editUserMsgAndRegenerate(msg)">Submit</button>
-                </template>
-                <!-- render message as markdown -->
-                <vue-markdown v-else :source="msg.content" />
-              </div>
-            </div>
-
-            <!-- actions for each message -->
-            <div :class="{'text-right': msg.role === 'user'}" class="mx-4 mt-2 mb-2">
-              <!-- user message -->
-              <button v-if="msg.role === 'user'" class="badge btn-mini show-on-hover" @click="editingMsg = msg" :disabled="isGenerating">
-                ✍️ Edit
-              </button>
-              <!-- assistant message -->
-              <button v-if="msg.role === 'assistant'" class="badge btn-mini show-on-hover mr-2" @click="regenerateMsg(msg)" :disabled="isGenerating">
-                🔄 Regenerate
-              </button>
-              <button v-if="msg.role === 'assistant'" class="badge btn-mini show-on-hover mr-2" @click="copyMsg(msg)" :disabled="isGenerating">
-                📋 Copy
-              </button>
-            </div>
+            <message-bubble
+              :config="config"
+              :msg="msg"
+              :key="msg.id"
+              :is-generating="isGenerating"
+              :edit-user-msg-and-regenerate="editUserMsgAndRegenerate"
+              :regenerate-msg="regenerateMsg"></message-bubble>
           </div>
 
           <!-- pending (ongoing) assistant message -->
-          <div id="pending-msg" class="chat chat-start">
-            <div v-if="pendingMsg" class="chat-bubble markdown chat-bubble-base-300">
-              <span v-if="!pendingMsg.content" class="loading loading-dots loading-md"></span>
-              <vue-markdown v-else :source="pendingMsg.content" />
-            </div>
+          <div id="pending-msg" class="group">
+            <message-bubble
+              v-if="pendingMsg"
+              :config="config"
+              :msg="pendingMsg"
+              :key="pendingMsg.id"
+              :is-generating="isGenerating"
+              :edit-user-msg-and-regenerate="() => {}"
+              :regenerate-msg="() => {}"></message-bubble>
           </div>
         </div>
@@ -227,6 +201,10 @@ <h3 class="text-lg font-bold mb-6">Settings</h3>
         <details class="collapse collapse-arrow bg-base-200 mb-2 overflow-visible">
           <summary class="collapse-title font-bold">Advanced config</summary>
           <div class="collapse-content">
+            <div class="flex flex-row items-center mb-2">
+              <input type="checkbox" class="checkbox" v-model="config.showTokensPerSecond" />
+              <span class="ml-4">Show tokens per second</span>
+            </div>
             <label class="form-control mb-2">
               <!-- Custom parameters input -->
               <div class="label inline">Custom JSON config (For more info, refer to <a class="underline" href="https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md" target="_blank" rel="noopener noreferrer">server documentation</a>)</div>
@@ -247,6 +225,66 @@ <h3 class="text-lg font-bold mb-6">Settings</h3>
 
     </div>
 
+
+    <!-- Template to be used as message bubble -->
+    <template id="message-bubble">
+      <div :class="{
+        'chat': true,
+        'chat-start': msg.role !== 'user',
+        'chat-end': msg.role === 'user',
+      }">
+        <div :class="{
+          'chat-bubble markdown': true,
+          'chat-bubble-base-300': msg.role !== 'user',
+        }">
+          <!-- textarea for editing message -->
+          <template v-if="editingContent !== null">
+            <textarea
+              class="textarea textarea-bordered bg-base-100 text-base-content w-[calc(90vw-8em)] lg:w-96"
+              v-model="editingContent"></textarea>
+            <br/>
+            <button class="btn btn-ghost mt-2 mr-2" @click="editingContent = null">Cancel</button>
+            <button class="btn mt-2" @click="editMsg()">Submit</button>
+          </template>
+          <template v-else>
+            <!-- show loading dots for pending message -->
+            <span v-if="msg.content === null" class="loading loading-dots loading-md"></span>
+            <!-- render message as markdown -->
+            <vue-markdown v-else :source="msg.content"></vue-markdown>
+            <!-- render timings if enabled -->
+            <div class="dropdown dropdown-hover dropdown-top mt-2" v-if="timings && config.showTokensPerSecond">
+              <div tabindex="0" role="button" class="cursor-pointer font-semibold text-sm opacity-60">Speed: {{ timings.predicted_per_second.toFixed(1) }} t/s</div>
+              <div class="dropdown-content bg-base-100 z-10 w-64 p-2 shadow mt-4">
+                <b>Prompt</b><br/>
+                - Tokens: {{ timings.prompt_n }}<br/>
+                - Time: {{ timings.prompt_ms }} ms<br/>
+                - Speed: {{ timings.prompt_per_second.toFixed(1) }} t/s<br/>
+                <b>Generation</b><br/>
+                - Tokens: {{ timings.predicted_n }}<br/>
+                - Time: {{ timings.predicted_ms }} ms<br/>
+                - Speed: {{ timings.predicted_per_second.toFixed(1) }} t/s<br/>
+              </div>
+            </div>
+          </template>
+        </div>
+      </div>
+      <!-- actions for each message -->
+      <div :class="{'text-right': msg.role === 'user', 'opacity-0': isGenerating}" class="mx-4 mt-2 mb-2">
+        <!-- user message -->
+        <button v-if="msg.role === 'user'" class="badge btn-mini show-on-hover" @click="editingContent = msg.content" :disabled="isGenerating">
+          ✍️ Edit
+        </button>
+        <!-- assistant message -->
+        <button v-if="msg.role === 'assistant'" class="badge btn-mini show-on-hover mr-2" @click="regenerateMsg(msg)" :disabled="isGenerating">
+          🔄 Regenerate
+        </button>
+        <button v-if="msg.role === 'assistant'" class="badge btn-mini show-on-hover mr-2" @click="copyMsg()" :disabled="isGenerating">
+          📋 Copy
+        </button>
+      </div>
+    </template>
+
+
     <!-- Template to be used by settings modal -->
     <template id="settings-modal-short-input">
       <label class="input input-bordered join-item grow flex items-center gap-2 mb-2">
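For orientation, a hedged sketch of how the extracted `message-bubble` template might be registered as a Vue 3 component. The props mirror the bindings in the markup above, but the local state and method bodies (`editingContent`, `timings`, `editMsg`, `copyMsg`) are assumptions inferred from the identifiers the template references, not the commit's actual code (which lives in the unrendered `public/index.html` diff):

```javascript
// Hypothetical sketch, NOT the commit's implementation: wiring the
// "#message-bubble" <template> up as a component whose props match the
// :config / :msg / :is-generating / :edit-user-msg-and-regenerate /
// :regenerate-msg bindings used in the markup above.
import { defineComponent } from 'vue';

const MessageBubble = defineComponent({
  template: '#message-bubble',
  props: ['config', 'msg', 'isGenerating', 'editUserMsgAndRegenerate', 'regenerateMsg'],
  data() {
    return { editingContent: null }; // non-null while this bubble is being edited
  },
  computed: {
    // assumed source of the tokens-per-second stats shown in the dropdown
    timings() {
      return this.msg.timings ?? null;
    },
  },
  methods: {
    editMsg() {
      // hand the edited content back to the parent and leave edit mode
      this.editUserMsgAndRegenerate({ ...this.msg, content: this.editingContent });
      this.editingContent = null;
    },
    copyMsg() {
      navigator.clipboard.writeText(this.msg.content);
    },
  },
});
```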
