
Commit 0b4aab8

Merge pull request #713 from corebonts/openai-compatibility
Improve OpenAI compatibility for /v1/* endpoints
2 parents 480b76f + 01cc848

8 files changed: +253 -3 lines

llamafile/server/client.cpp (2 additions, 0 deletions)

@@ -699,6 +699,8 @@ Client::dispatcher()
         return v1_completions();
     if (p1 == "v1/chat/completions")
         return v1_chat_completions();
+    if (p1 == "v1/models")
+        return v1_models();
     if (p1 == "slotz")
         return slotz();
     if (p1 == "flagz")

llamafile/server/client.h (2 additions, 0 deletions)

@@ -117,6 +117,8 @@ struct Client
     bool v1_chat_completions() __wur;
     bool get_v1_chat_completions_params(V1ChatCompletionParams*) __wur;
 
+    bool v1_models() __wur;
+
     bool slotz() __wur;
     bool flagz() __wur;
     bool db_chat(int64_t) __wur;

llamafile/server/doc/endpoints.md (7 additions, 3 deletions)

@@ -1,5 +1,9 @@
 # LLaMAfiler Endpoints Reference
 
-- [`/tokenize`](tokenize.md)
-- [`/embedding`](embedding.md)
-- [`/v1/chat/completions`](v1_chat_completions.md)
+- The [`/v1/tokenize`](tokenize.md) endpoint provides a robust interface for
+  converting text prompts into tokens.
+- The [`/v1/embedding`](embedding.md) endpoint provides a way to
+  transform textual prompts into numerical representations.
+- The [`/v1/chat/completions`](v1_chat_completions.md) endpoint lets you build a chatbot.
+- [`/v1/completions`](v1_completions.md) returns a predicted completion for a given prompt.
+- `/v1/models` returns basic model info, which is usually used by OpenAI clients for discovery and health checks.
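To make the discovery/health-check use of `/v1/models` concrete, here is a minimal client sketch (not part of this commit). It assumes a llamafiler instance listening on `http://localhost:8080` and the availability of libcurl; the URL and output handling are illustrative only.

```cpp
// Hypothetical discovery check against /v1/models, assuming llamafiler on localhost:8080.
#include <curl/curl.h>
#include <iostream>
#include <string>

// Append the response body into a std::string.
static size_t collect(char* data, size_t size, size_t nmemb, void* out) {
    static_cast<std::string*>(out)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl)
        return 1;
    std::string body;
    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/v1/models");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    CURLcode rc = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    if (rc != CURLE_OK)
        return 1;
    // Expected shape (per v1_models.cpp below): {"object":"list","data":[{"id":...,"object":"model",...}]}
    std::cout << body << "\n";
    return 0;
}
```

A connection failure or non-success status from this request is a reasonable health-check failure signal.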

llamafile/server/doc/v1_chat_completions.md (13 additions, 0 deletions)

@@ -78,6 +78,19 @@ This endpoint supports the following features:
   will be named delta instead. It's assumed the client will reconstruct
   the full conversation.
 
+- `stream_options`: `object|null`
+
+  Options for streaming the API response. This parameter is only
+  applicable when `stream: true` is also specified. Default is `null`.
+
+  - `include_usage`: `boolean|null`
+
+    Whether to include usage statistics in the streaming response. Default is `false`.
+
+    If set to `true`, a `usage` field with the usage information will be
+    included in an additional empty chunk. Note that all other chunks will
+    also contain this field, but with a `null` value.
+
 - `max_tokens`: `integer|null`
 
   Specifies an upper bound for the number of tokens that can be
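As a rough sketch of what a request using the new `stream_options` parameter looks like (not part of this commit), the body below is built with the repo's `jt::Json` type, as used in the C++ changes further down. That `jt::Json` accepts boolean assignment and that `toString()` yields a `std::string` are assumptions based on how the type is used there; the model name and message are placeholders.

```cpp
// Hypothetical sketch of a streaming chat request with usage reporting enabled.
// Assumes the llamafile source tree for "llamafile/json.h" (jt::Json).
#include "llamafile/json.h"
#include <cstdio>

using jt::Json;

int main() {
    Json req;
    req["model"] = "test";                           // echoed back; only one model is served today
    Json& msg = req["messages"][0];                  // array indexing, as in json["data"][0]
    msg["role"] = "user";
    msg["content"] = "Say hello.";
    req["stream"] = true;                            // assumed: bool assignment mirrors the bool parsing
    req["stream_options"]["include_usage"] = true;   // ask for usage in the final chunk
    std::puts(req.toString().c_str());               // assumed: toString() returns a std::string
    return 0;
}
```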
llamafile/server/doc/v1_completions.md (new file, 122 additions)

# LLaMAfiler Completions Endpoint

The `/v1/completions` endpoint generates text completions based on a
given prompt. It provides a flexible interface for text generation,
allowing customization of parameters such as temperature, top-p
sampling, and maximum tokens.

This endpoint supports the following features:

1. Deterministic outputs using a fixed seed
2. Streaming responses for real-time token generation
3. Configurable stopping criteria for token generation

## Request URIs

- `/v1/completions` (OpenAI API compatible)

## Request Methods

- `POST`

## Request Content Types

- `application/json` must be used.

## Request Parameters

- `model`: `string`

  Specifies the name of the model to run.

  Only a single model is currently supported, so this field is simply
  copied along to the response. In the future, this will matter.

  This field is required in the request.

- `prompt`: `string`

  The input text that the model will generate a completion for.

  This field is required.

- `stream`: `boolean|null`

  If this field is optionally set to true, then this endpoint will
  return a text/event-stream using HTTP chunked transfer encoding. This
  allows your chatbot to rapidly show text as it's being generated. The
  standard JSON response is slightly modified so that its message field
  will be named delta instead. It's assumed the client will reconstruct
  the full conversation.

- `stream_options`: `object|null`

  Options for streaming the API response. This parameter is only
  applicable when `stream: true` is also specified. Default is `null`.

  - `include_usage`: `boolean|null`

    Whether to include usage statistics in the streaming response. Default is `false`.

    If set to `true`, a `usage` field with the usage information will be
    included in an additional empty chunk. Note that all other chunks will
    also contain this field, but with a `null` value.

- `max_tokens`: `integer|null`

  Specifies an upper bound for the number of tokens that can be
  generated for this completion. This can be used to control compute
  and/or latency costs.

- `top_p`: `number|null`

  May optionally be used to set the `top_p` sampling parameter. This
  should be a floating point number. Setting this to 1.0 (the default)
  will disable this feature. Setting this to, for example, 0.1, would
  mean that only the top 10% probability tokens are considered.

  We generally recommend altering this or temperature but not both.

- `temperature`: `number|null`

  Configures the randomness level of generated text.

  This field may be set to a value between 0.0 and 2.0 inclusive. It
  defaults to 1.0. Lower numbers are more deterministic. Higher numbers
  mean more randomness.

  We generally recommend altering this or top_p but not both.

- `seed`: `integer|null`

  If specified, llamafiler will make its best effort to sample
  deterministically, even when temperature is non-zero. This means that
  repeated requests with the same seed and parameters should return the
  same result.

- `presence_penalty`: `number|null`

  Number between -2.0 and 2.0. Positive values penalize new tokens based
  on whether they appear in the text so far, increasing the model's
  likelihood to talk about new topics.

- `frequency_penalty`: `number|null`

  Number between -2.0 and 2.0. Positive values penalize new tokens based
  on their existing frequency in the text so far, decreasing the model's
  likelihood to repeat the same line verbatim.

- `user`: `string|null`

  A unique identifier representing your end-user, which can help
  llamafiler to monitor and detect abuse.

- `stop`: `string|array<string>|null`

  Specifies up to 4 stop sequences where the API will cease text generation.

## See Also

- [LLaMAfiler Documentation Index](index.md)
- [LLaMAfiler Endpoints Reference](endpoints.md)
- [LLaMAfiler Technical Details](technical_details.md)
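For illustration only (not part of this commit), a minimal non-streaming request against this endpoint could look like the sketch below. It assumes a llamafiler instance on `http://localhost:8080` and the availability of libcurl; the model name, prompt, and parameter values are placeholders that only use fields documented above.

```cpp
// Hypothetical POST to /v1/completions, assuming llamafiler on localhost:8080 and libcurl.
#include <curl/curl.h>
#include <iostream>
#include <string>

// Append the response body into a std::string.
static size_t collect(char* data, size_t size, size_t nmemb, void* out) {
    static_cast<std::string*>(out)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    const char* body =
        "{\"model\":\"test\","
        "\"prompt\":\"Once upon a time\","
        "\"max_tokens\":64,"
        "\"temperature\":0.8,"
        "\"stop\":[\"\\n\\n\"],"
        "\"stream\":false}";
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl)
        return 1;
    std::string response;
    struct curl_slist* hdrs = curl_slist_append(nullptr, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/v1/completions");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
    CURLcode rc = curl_easy_perform(curl);
    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    if (rc != CURLE_OK)
        return 1;
    std::cout << response << "\n";  // JSON object whose choices[0] carries the completion text
    return 0;
}
```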

llamafile/server/v1_chat_completions.cpp (29 additions, 0 deletions)

@@ -46,6 +46,7 @@ namespace server {
 struct V1ChatCompletionParams
 {
     bool stream = false;
+    bool stream_include_usage = false;
     long max_tokens = -1;
     long seed = _rand64();
     double top_p = 1;

@@ -276,6 +277,26 @@ Client::get_v1_chat_completions_params(V1ChatCompletionParams* params)
         if (!stream.isBool())
             return send_error(400, "stream field must be boolean");
         params->stream = stream.getBool();
+
+        // stream_options: object|null
+        //
+        // Options for the streaming response.
+        Json& stream_options = json["stream_options"];
+        if (!stream_options.isNull()) {
+            if (!stream_options.isObject())
+                return send_error(400, "stream_options field must be object");
+
+            // include_usage: bool|null
+            //
+            // Include usage also for streaming responses. The actual usage will be reported before
+            // the [DONE] message, but all chunks contain an empty usage field.
+            Json& include_usage = stream_options["include_usage"];
+            if (!include_usage.isNull()) {
+                if (!include_usage.isBool())
+                    return send_error(400, "include_usage field must be boolean");
+                params->stream_include_usage = include_usage.getBool();
+            }
+        }
     }
 
     // max_tokens: integer|null

@@ -570,6 +591,8 @@ Client::v1_chat_completions()
             return false;
         choice["delta"]["role"] = "assistant";
         choice["delta"]["content"] = "";
+        if (params->stream_include_usage)
+            response->json["usage"] = nullptr;
     }
 
     // prefill time

@@ -661,6 +684,12 @@ Client::v1_chat_completions()
     if (params->stream) {
         choice["delta"]["content"] = "";
         response->json["created"] = timespec_real().tv_sec;
+        if (params->stream_include_usage) {
+            Json& usage = response->json["usage"];
+            usage["prompt_tokens"] = prompt_tokens;
+            usage["completion_tokens"] = completion_tokens;
+            usage["total_tokens"] = completion_tokens + prompt_tokens;
+        }
         response->content = make_event(response->json);
         choice.getObject().erase("delta");
         if (!send_response_chunk(response->content))
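The comments in the diff above spell out the chunk protocol: when `include_usage` is requested, every streamed chunk carries a `usage` field, but only the final chunk before the `[DONE]` sentinel holds real token counts. Below is a small client-side sketch (not part of this commit) that filters an SSE stream piped in on stdin, for example from `curl -N`, and prints the chunk whose usage is populated; matching on the literal `"usage":null` assumes compact JSON serialization without spaces.

```cpp
// Hypothetical SSE filter: print the streamed chunk that carries non-null usage totals.
#include <iostream>
#include <string>

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        if (line.rfind("data: ", 0) != 0)   // SSE payload lines start with "data: "
            continue;
        std::string payload = line.substr(6);
        if (payload == "[DONE]")            // end-of-stream sentinel
            break;
        // Intermediate chunks carry "usage":null; only the last one has the totals.
        if (payload.find("\"usage\"") != std::string::npos &&
            payload.find("\"usage\":null") == std::string::npos)
            std::cout << payload << "\n";
    }
    return 0;
}
```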

llamafile/server/v1_completions.cpp (29 additions, 0 deletions)

@@ -46,6 +46,7 @@ struct V1CompletionParams
 {
     bool echo = false;
     bool stream = false;
+    bool stream_include_usage = false;
     long max_tokens = -1;
     long seed = _rand64();
     double top_p = 1;

@@ -248,6 +249,26 @@ Client::get_v1_completions_params(V1CompletionParams* params)
         if (!stream.isBool())
             return send_error(400, "stream field must be boolean");
         params->stream = stream.getBool();
+
+        // stream_options: object|null
+        //
+        // Options for the streaming response.
+        Json& stream_options = json["stream_options"];
+        if (!stream_options.isNull()) {
+            if (!stream_options.isObject())
+                return send_error(400, "stream_options field must be object");
+
+            // include_usage: bool|null
+            //
+            // Include usage also for streaming responses. The actual usage will be reported before
+            // the [DONE] message, but all chunks contain an empty usage field.
+            Json& include_usage = stream_options["include_usage"];
+            if (!include_usage.isNull()) {
+                if (!include_usage.isBool())
+                    return send_error(400, "include_usage field must be boolean");
+                params->stream_include_usage = include_usage.getBool();
+            }
+        }
     }
 
     // max_tokens: integer|null

@@ -441,6 +462,8 @@ Client::v1_completions()
         choice["delta"]["role"] = "assistant";
         choice["delta"]["content"] = "";
         response->json["created"] = timespec_real().tv_sec;
+        if (params->stream_include_usage)
+            response->json["usage"] = nullptr;
         response->content = make_event(response->json);
         choice.getObject().erase("delta");
         if (!send_response_chunk(response->content))

@@ -494,6 +517,12 @@ Client::v1_completions()
     if (params->stream) {
         choice["text"] = "";
         response->json["created"] = timespec_real().tv_sec;
+        if (params->stream_include_usage) {
+            Json& usage = response->json["usage"];
+            usage["prompt_tokens"] = prompt_tokens;
+            usage["completion_tokens"] = completion_tokens;
+            usage["total_tokens"] = completion_tokens + prompt_tokens;
+        }
         response->content = make_event(response->json);
         if (!send_response_chunk(response->content))
             return false;

llamafile/server/v1_models.cpp (new file, 49 additions)

// -*- mode:c++;indent-tabs-mode:nil;c-basic-offset:4;coding:utf-8 -*-
// vi: set et ft=cpp ts=4 sts=4 sw=4 fenc=utf-8 :vi
//
// Copyright 2024 Mozilla Foundation
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include "client.h"
#include "llama.cpp/llama.h"
#include "llamafile/json.h"
#include "llamafile/llamafile.h"
#include "llamafile/string.h"
#include <ctime>

using jt::Json;

namespace lf {
namespace server {

// Used as the reported model creation time.
static const time_t model_creation_time = time(0);

bool
Client::v1_models()
{
    jt::Json json;
    json["object"] = "list";
    Json& model = json["data"][0];
    model["id"] = stripext(basename(FLAG_model));
    model["object"] = "model";
    model["created"] = model_creation_time;
    model["owned_by"] = "llamafile";
    char* p = append_http_response_message(obuf_.p, 200);
    p = stpcpy(p, "Content-Type: application/json\r\n");
    return send_response(obuf_.p, p, json.toString());
}

} // namespace server
} // namespace lf
