examples/server/README.md (20 additions & 4 deletions)
@@ -236,9 +236,13 @@ npm i
 # to run the dev server
 npm run dev

-# to build the public/index.html
+# to build the public/index.html.gz
 npm run build
 ```
+After `public/index.html.gz` has been generated we need to generate the c++
+headers (like build/examples/server/index.html.gz.hpp) that will be included
+by server.cpp. This is done by building `llama-server` as described in the
+[build](#build) section above.

 NOTE: if you are using the vite dev server, you can change the API base URL to llama.cpp. To do that, run this code snippet in browser's console:

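For orientation, a minimal sketch of the build step the new paragraph above points to, assuming the standard CMake flow from the repository's build docs (exact flags vary by platform): building the `llama-server` target is what regenerates the C++ header from `public/index.html.gz`.

```sh
# Configure once, then build only the server target; this step converts the
# freshly built public/index.html.gz into the generated header
# (e.g. build/examples/server/index.html.gz.hpp) that server.cpp includes.
cmake -B build
cmake --build build --config Release --target llama-server
```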
@@ -456,7 +460,7 @@ These words will not be included in the completion, so make sure to add them to
 - Note: In streaming mode (`stream`), only `content`, `tokens` and `stop` will be returned until end of completion. Responses are sent using the [Server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html) standard. Note: the browser's `EventSource` interface cannot be used due to its lack of `POST` request support.

 - `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has a nested array `top_logprobs`. It contains at **maximum** `n_probs` elements:
-```json
+```
 {
 "content": "<the generated completion text>",
 "tokens": [ generated token ids if requested ],
@@ -557,7 +561,7 @@ If `with_pieces` is `true`:
 ```

 With input 'á' (utf8 hex: C3 A1) on tinyllama/stories260k
-```json
+```
 {
 "tokens": [
 {"id": 198, "piece": [195]}, // hex C3
@@ -572,6 +576,18 @@ With input 'á' (utf8 hex: C3 A1) on tinyllama/stories260k

 `tokens`: Set the tokens to detokenize.

+### POST `/apply-template`: Apply chat template to a conversation
+
+Uses the server's prompt template formatting functionality to convert chat messages to a single string expected by a chat model as input, but does not perform inference. Instead, the prompt string is returned in the `prompt` field of the JSON response. The prompt can then be modified as desired (for example, to insert "Sure!" at the beginning of the model's response) before sending to `/completion` to generate the chat response.
+
+*Options:*
+
+`messages`: (Required) Chat turns in the same format as `/v1/chat/completions`.
+
+**Response format**
+
+Returns a JSON object with a field `prompt` containing a string of the input messages formatted according to the model's chat template format.
+
 ### POST `/embedding`: Generate embedding of a given text

 > [!IMPORTANT]
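To illustrate the new endpoint documented in the hunk above, here is a hypothetical invocation, assuming a locally running `llama-server` on its default `localhost:8080`; the returned `prompt` string can be edited and then sent to `/completion`.

```sh
# Ask the server to apply the model's chat template without running inference.
curl http://localhost:8080/apply-template \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the best book?"}
          ]
        }'

# Expected shape of the response (contents depend on the loaded model's template):
# {"prompt": "<the conversation rendered as a single templated string>"}
```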
@@ -764,7 +780,7 @@ Same as the `/v1/embeddings` endpoint.
examples/server/tests/unit/test_chat_completion.py (15 additions & 0 deletions)
@@ -121,6 +121,21 @@ def test_chat_template():
     assert res.body["__verbose"]["prompt"] == "<s> <|start_header_id|>system<|end_header_id|>\n\nBook<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the best book<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+    assert res.body["prompt"] == "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a test.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hi there<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"