tools/server/README.md (5 additions, 0 deletions)
@@ -159,6 +159,7 @@ The project is under active development, and we are [looking for feedback and co
 |`--path PATH`| path to serve static files from (default: )<br/>(env: LLAMA_ARG_STATIC_PATH) |
 |`--no-webui`| Disable the Web UI (default: enabled)<br/>(env: LLAMA_ARG_NO_WEBUI) |
 |`--embedding, --embeddings`| restrict to only support embedding use case; use only with dedicated embedding models (default: disabled)<br/>(env: LLAMA_ARG_EMBEDDINGS) |
+|`--truncate-embed`| allow truncation for embedding tasks to handle large inputs (default: disabled)<br/>(env: LLAMA_ARG_TRUNCATE_EMBED) |
 |`--reranking, --rerank`| enable reranking endpoint on server (default: disabled)<br/>(env: LLAMA_ARG_RERANKING) |
 |`--api-key KEY`| API key to use for authentication (default: none)<br/>(env: LLAMA_API_KEY) |
 |`--api-key-file FNAME`| path to file containing API keys (default: none) |
@@ -636,6 +637,8 @@ Returns a JSON object with a field `prompt` containing a string of the input mes
 The same as [the embedding example](../embedding) does.
+
+**Note**: By default, embedding tasks cannot be split across multiple batches for safety. For large inputs that exceed the batch size, use the `--truncate-embed` flag to enable automatic truncation. When truncation occurs, the `truncated` field in the response will indicate this.
 This endpoint requires that the model uses a pooling type different from `none`. The embeddings are normalized using the Euclidean norm.
 
+**Note**: By default, embedding tasks cannot be split across multiple batches for safety. For large inputs that exceed the batch size, use the `--truncate-embed` flag to enable automatic truncation. When truncation occurs, the `truncated` field in the response will indicate this.
+
 *Options:*
 
 See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-reference/embeddings).
tools/server/server.cpp

+    // Note: If the input was truncated (slot.truncated == true), this embedding
+    // represents only the processed portion of the original input
     for (int i = 0; i < batch.n_tokens; ++i) {
         if (!batch.logits[i] || batch.seq_id[i][0] != slot.id) {
             continue;
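
The README notes above promise a `truncated` field in the embedding response. A minimal sketch of how the per-slot flag could be surfaced there, assuming the nlohmann::json library the server already bundles; the helper name and exact response shape are illustrative, not this diff's actual code:

```cpp
#include <nlohmann/json.hpp>
#include <vector>

// Illustrative only: the real server assembles its responses elsewhere.
static nlohmann::json format_embedding_sketch(const std::vector<float> & embd, bool truncated) {
    return nlohmann::json {
        {"embedding", embd},      // possibly computed from a clipped prompt
        {"truncated", truncated}, // true when --truncate-embed shortened the input
    };
}
```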
@@ -3129,7 +3141,7 @@ struct server_context {
                 continue;
             }
 
-            if (!slot.can_split()) {
+            if (!slot.can_split(params_base.truncate_embed)) {
                 if (slot.n_prompt_tokens > n_ubatch) {
                     slot.release();
                     send_error(slot, "input is too large to process. increase the physical batch size", ERROR_TYPE_SERVER);
@@ -3146,7 +3158,8 @@ struct server_context {
             // if context shift is disabled, we make sure prompt size is smaller than KV size
             // TODO: there should be a separate parameter that control prompt truncation
             // context shift should be applied only during the generation phase
-            if (slot.n_prompt_tokens >= slot.n_ctx) {
+            // For embedding tasks, allow truncation even when context shift is disabled
+            if (slot.n_prompt_tokens >= slot.n_ctx && !slot.need_embd()) {
                 slot.release();
                 send_error(slot, "the request exceeds the available context size. try increasing the context size or enable context shift", ERROR_TYPE_INVALID_REQUEST);
+            // Warn specifically for embedding tasks about potential quality impact
+            if (slot.need_embd()) {
+                SLT_WRN(slot, "%s", "WARNING: Embedding input was truncated. The resulting embedding may not fully represent the original input. Consider increasing context size or reducing input length for better embedding quality.");
+            }
 
             GGML_ASSERT(slot.n_prompt_tokens < slot.n_ctx);
         }
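
Taken together, the two hunks above suggest that an over-long embedding prompt is clipped, the slot records that fact, and the warning is logged. A simplified, self-contained sketch of that flow; clipping to the leading tokens and the function name are assumptions (the real server may keep a head/tail split of the prompt):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Assumed flow only: clip an over-long embedding prompt, remember that it happened,
// and warn about the quality impact. The token type is simplified to int32_t.
static void truncate_embedding_prompt_sketch(std::vector<int32_t> & prompt_tokens,
                                             int n_ctx, bool & truncated) {
    if ((int) prompt_tokens.size() >= n_ctx) {
        prompt_tokens.resize(n_ctx - 1); // assumption: keep only the first n_ctx - 1 tokens
        truncated = true;                // reported later via the "truncated" response field
        std::fprintf(stderr,
            "WARNING: Embedding input was truncated. The resulting embedding may not fully "
            "represent the original input.\n");
    }
}
```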
@@ -3272,7 +3290,7 @@ struct server_context {
                 slot.n_prompt_tokens_processed = 0;
             }
 
-            if (!slot.can_split()) {
+            if (!slot.can_split(params_base.truncate_embed)) {
                 // cannot fit the prompt in the current batch - will try next iter
                 if (batch.n_tokens + slot.n_prompt_tokens > n_batch) {