**`examples/server/README.md`**: 16 additions, 33 deletions
```diff
@@ -380,8 +380,6 @@ node index.js
 
 `cache_prompt`: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation), enabling this option can cause nondeterministic results. Default: `false`
 
-`system_prompt`: Change the system prompt (initial prompt of all slots), this is useful for chat applications. [See more](#change-system-prompt-on-runtime)
-
 `samplers`: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. Default: `["top_k", "tfs_z", "typical_p", "top_p", "min_p", "temperature"]` - these are all the available values.
 
 **Response format**
```
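A minimal `/completion` request exercising the `cache_prompt` and `samplers` options described in this hunk, as a sketch only: it assumes a server listening on the default `http://localhost:8080`, and the prompt text is made up.

```shell
# Send the same prompt twice with cache_prompt enabled; the second request
# re-uses the KV cache for the shared prefix. Samplers run in the listed order.
curl http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Building a website can be done in 10 simple steps:",
        "n_predict": 64,
        "cache_prompt": true,
        "samplers": ["top_k", "top_p", "temperature"]
    }'
```

The next hunk reworks the `/infill` and `/props` documentation: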
```diff
@@ -519,34 +517,41 @@ Requires a reranker model (such as [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3))
 Takes a prefix and a suffix and returns the predicted completion as stream.
 
-    *Options:*
+*Options:*
 
-    `input_prefix`: Set the prefix of the code to infill.
-
-    `input_suffix`: Set the suffix of the code to infill.
+- `input_prefix`: Set the prefix of the code to infill.
+- `input_suffix`: Set the suffix of the code to infill.
 
-    It also accepts all the options of `/completion` except `stream` and `prompt`.
+It also accepts all the options of `/completion` except `stream` and `prompt`.
```
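An `/infill` request sketched against these options (default server address assumed; the code fragment is illustrative, and `n_predict` is one of the `/completion` options the endpoint also accepts):

```shell
# Ask the model to fill in the function body between the prefix and suffix.
curl http://localhost:8080/infill \
    -H "Content-Type: application/json" \
    -d '{
        "input_prefix": "def fib(n):\n    ",
        "input_suffix": "\n    return result\n",
        "n_predict": 32
    }'
```

The same hunk continues with the `/props` changes: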
````diff
-- **GET** `/props`: Return current server settings.
+### **GET** `/props`: Get server global properties.
+
+This endpoint is public (no API key check). By default, it is read-only. To make a POST request that changes global properties, you need to start the server with `--props`.
 
 **Response format**
 
 ```json
 {
-  "assistant_name": "",
-  "user_name": "",
+  "system_prompt": "",
   "default_generation_settings": { ... },
   "total_slots": 1,
   "chat_template": ""
 }
 ```
 
-- `assistant_name` - the required assistant name to generate the prompt in case you have specified a system prompt for all slots.
-- `user_name` - the required anti-prompt to generate the prompt in case you have specified a system prompt for all slots.
+- `system_prompt` - the system prompt (initial prompt of all slots). Please note that this does not take the chat template into account; the prompt is prepended to the formatted prompt.
 - `default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
 - `total_slots` - the total number of slots for processing requests (defined by the `--parallel` option)
 - `chat_template` - the model's original Jinja2 prompt template
````
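The read-only form can be queried directly once the server is up (a sketch, default address assumed):

```shell
# No API key is needed; returns the fields documented above.
curl http://localhost:8080/props
```

The hunk then adds the write counterpart: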
```diff
+### POST `/props`: Change server global properties.
+
+To use this endpoint with the POST method, you need to start the server with `--props`.
+
+*Options:*
+
+- `system_prompt`: Change the system prompt (initial prompt of all slots). Please note that this does not take the chat template into account; the prompt is prepended to the formatted prompt.
+
```
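A sketch of the new write path (assumes the server was launched with `--props`; the prompt string is illustrative):

```shell
# Replace the shared system prompt for all slots at runtime.
curl -X POST http://localhost:8080/props \
    -H "Content-Type: application/json" \
    -d '{"system_prompt": "You are a concise, helpful assistant."}'
```

The hunk closes with unchanged context around the chat completions endpoint: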
```diff
 ### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API
 
 Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming modes are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with the OpenAI API spec are being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
```
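A minimal chat request against this endpoint, for completeness (a sketch; the messages are placeholders):

```shell
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a limerick about llamas."}
        ]
    }'
```

The last hunk drops the now-redundant walkthrough from "More examples":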
````diff
@@ -813,28 +818,6 @@ To know the `id` of the adapter, use GET `/lora-adapters`
 
 ## More examples
 
-### Change system prompt on runtime
-
-To use the server example to serve multiple chat-type clients while keeping the same system prompt, you can utilize the option `system_prompt`. This only needs to be used once.
-
-`prompt`: Specify a context that you want all connecting clients to respect.
-
-`anti_prompt`: Specify the word you want to use to instruct the model to stop. This must be sent to each client through the `/props` endpoint.
-
-`assistant_name`: The bot's name is necessary for each customer to generate the prompt. This must be sent to each client through the `/props` endpoint.
-
-```json
-{
-    "system_prompt": {
-        "prompt": "Transcript of a never ending dialog, where the User interacts with an Assistant.\nThe Assistant is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.\nUser: Recommend a nice restaurant in the area.\nAssistant: I recommend the restaurant \"The Golden Duck\". It is a 5 star restaurant with a great view of the city. The food is delicious and the service is excellent. The prices are reasonable and the portions are generous. The restaurant is located at 123 Main Street, New York, NY 10001. The phone number is (212) 555-1234. The hours are Monday through Friday from 11:00 am to 10:00 pm. The restaurant is closed on Saturdays and Sundays.\nUser: Who is Richard Feynman?\nAssistant: Richard Feynman was an American physicist who is best known for his work in quantum mechanics and particle physics. He was awarded the Nobel Prize in Physics in 1965 for his contributions to the development of quantum electrodynamics. He was a popular lecturer and author, and he wrote several books, including \"Surely You're Joking, Mr. Feynman!\" and \"What Do You Care What Other People Think?\".\nUser:",
-        "anti_prompt": "User:",
-        "assistant_name": "Assistant:"
-    }
-}
-```
-
-**NOTE**: You can do this automatically when starting the server by simply creating a .json file with these options and using the CLI option `-spf FNAME` or `--system-prompt-file FNAME`.
````
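The removed runtime workflow is superseded by the `--props` flag and `POST /props` documented above. For comparison, a sketch of the old and new startup (binary and file names assumed):

```shell
# Before this change (now removed): preload a structured system prompt from a file.
./llama-server -m model.gguf --system-prompt-file system.json

# After this change: start with --props, then set a plain-string system prompt
# at runtime via POST /props as shown earlier.
./llama-server -m model.gguf --props
```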
The diff also adjusts the guard in the server sources (the snippet appears to come from `examples/server/server.cpp`; the closing of the block is reconstructed):

```cpp
if (ctx_server.params.embedding || ctx_server.params.reranking) {
    // Completion endpoints are unavailable in embedding/reranking mode.
    res_error(res, format_error_response("This server does not support completions. Start it without `--embeddings` or `--reranking`", ERROR_TYPE_NOT_SUPPORTED));
    return;
}
```
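The observable effect of that guard, sketched (the flags come from the message above; the address is assumed and the response is abbreviated):

```shell
./llama-server -m model.gguf --embeddings &
# Any completion request is now rejected with ERROR_TYPE_NOT_SUPPORTED:
curl http://localhost:8080/completion -d '{"prompt": "hi"}'
# -> "This server does not support completions. Start it without `--embeddings` or `--reranking`"
```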