server: support multiple generations from one prompt (OAI "n" option) (ggml-org#17775)
* backend support
* server: support multiple generations from one prompt (OAI "n" option)
* fix invalid batch
* format oai
* clean up
* disable ctx shift
* add test
* update comments
* fix style
* add n_cmpl to docs [no ci]
* allow using both n_cmpl and n
`tools/server/README.md` (+2 −0)
```diff
@@ -493,6 +493,8 @@ Note for `multimodal_data` in JSON object prompts. This should be an array of st
 `n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. The number excludes the BOS token.
 By default, this value is set to `0`, meaning no tokens are kept. Use `-1` to retain all tokens from the prompt.
 
+`n_cmpl`: Number of completions to generate from the current prompt. If input has multiple prompts, the output will have N prompts times `n_cmpl` entries.
+
 `stream`: Allows receiving each predicted token in real-time instead of waiting for the completion to finish (uses a different response format). To enable this, set to `true`.
```
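To illustrate the documented behavior, here is a minimal sketch (prompt texts and `n_predict` value are hypothetical) of a request body using the new `n_cmpl` field, along with the entry count the README promises for the response:

```python
# Sketch with hypothetical values: a completion request using the new
# n_cmpl field. Per the README, the response contains
# (number of prompts) * n_cmpl entries.
import json

payload = {
    "prompt": ["Once upon a time", "The quick brown fox"],  # 2 prompts
    "n_cmpl": 3,       # 3 completions per prompt
    "n_predict": 16,   # hypothetical generation length
}

# expected number of entries in the response
expected_entries = len(payload["prompt"]) * payload["n_cmpl"]
print(json.dumps(payload))
print(expected_entries)
```

Per the last commit above, the OAI-style `n` field is also accepted alongside `n_cmpl`.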
`tools/server/server-context.cpp` (+80 −5)
```diff
@@ -35,7 +35,8 @@ constexpr int HTTP_POLLING_SECONDS = 1;
 // state diagram: https://github.com/ggml-org/llama.cpp/pull/9283
 enum slot_state {
     SLOT_STATE_IDLE,
-    SLOT_STATE_STARTED, // TODO: this state is only used for setting up the initial prompt processing; maybe merge it with launch_slot_with_task in the future
+    SLOT_STATE_WAIT_OTHER, // after assigning a task, but waiting for parent slot to process prompt
+    SLOT_STATE_STARTED,    // after assigning a task and about to process prompt
     SLOT_STATE_PROCESSING_PROMPT,
     SLOT_STATE_DONE_PROMPT,
     SLOT_STATE_GENERATING,
@@ -254,6 +255,15 @@ struct server_slot {
         generated_token_probs.push_back(token);
     }
 
+    // note: a slot can also be either a parent or a child
```
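The new `SLOT_STATE_WAIT_OTHER` state holds a child slot until its parent finishes prompt processing. The following is a standalone sketch of that transition rule, not the server's actual code: the enum values are taken from the diff above, while the `next_state` helper and its condition are assumptions introduced here for illustration.

```cpp
#include <cassert>

// Slot states as added in the diff above. The plain (unscoped) enum keeps
// the ordering, so states at or past DONE_PROMPT can be compared with >=.
enum slot_state {
    SLOT_STATE_IDLE,
    SLOT_STATE_WAIT_OTHER, // waiting for parent slot to process prompt
    SLOT_STATE_STARTED,    // about to process prompt
    SLOT_STATE_PROCESSING_PROMPT,
    SLOT_STATE_DONE_PROMPT,
    SLOT_STATE_GENERATING,
};

// Hypothetical helper: a child slot waiting on its parent may start once
// the parent has finished prompt processing; all other states are unchanged.
slot_state next_state(slot_state child, slot_state parent) {
    if (child == SLOT_STATE_WAIT_OTHER && parent >= SLOT_STATE_DONE_PROMPT) {
        return SLOT_STATE_STARTED;
    }
    return child;
}
```

This mirrors the idea in the comments: children share the parent's processed prompt, so they only begin once the parent reaches `SLOT_STATE_DONE_PROMPT`.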