107 changes: 101 additions & 6 deletions tools/server/server.cpp
@@ -77,6 +77,7 @@ enum oaicompat_type {
OAICOMPAT_TYPE_CHAT,
OAICOMPAT_TYPE_COMPLETION,
OAICOMPAT_TYPE_EMBEDDING,
OAICOMPAT_TYPE_API_CHAT
@ngxson ngxson (Collaborator), May 26, 2025:

If OAI does not support this API, then having the OAICOMPAT prefix here will be very confusing for other contributors who don't know much about the story of ollama.

Tbh, I think this is not very necessary, as most applications nowadays support the OAI-compat API. If they don't, you can have a proxy to convert the API; I bet someone has already made one.

Also, since OAI introduced the new Response API, I think we should keep things simple by only supporting the OAI specs (which have good support for reasoning and multimodal models). The API for ollama can be added if more users ask for it.
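For illustration, a minimal, non-streaming sketch of the proxy idea mentioned above, assuming cpp-httplib and nlohmann::json (the libraries llama-server already bundles). The endpoint paths and ollama's default port 11434 are real; the rest (field mapping, hosts, error handling) is illustrative only and not part of this PR.

#include <httplib.h>
#include <nlohmann/json.hpp>
#include <ctime>

using json = nlohmann::json;

int main() {
    httplib::Server proxy;

    proxy.Post("/api/chat", [](const httplib::Request & req, httplib::Response & res) {
        // translate an ollama-style /api/chat request into an OAI-compatible one
        json in  = json::parse(req.body);
        json out = {
            { "model",    in.value("model", "") },
            { "messages", in.at("messages") },   // same shape in both APIs
            { "stream",   false },               // streaming omitted in this sketch
        };

        httplib::Client backend("127.0.0.1", 8080); // llama-server
        auto r = backend.Post("/v1/chat/completions", out.dump(), "application/json");
        if (!r || r->status != 200) {
            res.status = 502;
            return;
        }

        // translate the OAI-compatible response back into the ollama shape
        json oai   = json::parse(r->body);
        json reply = {
            { "model",      oai.value("model", "") },
            { "created_at", std::time(nullptr) },
            { "message",    oai["choices"][0]["message"] },
            { "done",       true },
        };
        res.set_content(reply.dump(), "application/json");
    });

    proxy.listen("127.0.0.1", 11434); // ollama's default port
}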

@R-Dson R-Dson (Contributor, Author), May 26, 2025:

You are correct about the name; it should be changed.

If the endpoint is not part of the OAI API, should it instead be removed completely?

This was mostly added for cases where someone wants to swap out ollama for llama-server.

@ggerganov ggerganov (Member), May 26, 2025:

Could you explain why this API is needed? AFAIK this API is specific to ollama and OAI doesn't have it (they moved to the new Response API instead).

> Tbh, I think this is not very necessary, as most applications nowadays support the OAI-compat API.

I second that. The main question that should be asked is whether these ollama APIs enable any sort of new useful functionality compared to the existing standard APIs. If the answer is no, then these APIs should not exist in the first place and we should not support them.

As an example, we introduced the /infill API in llama-server, because the existing /v1/completions spec was not enough to support the needs for advanced local fill-in-the-middle use-cases (#9787).

Currently, there is rudimentary support added for /api/show, /api/tags and /api/chat, mainly because VS Code made the mistake of requiring them. As soon as this is fixed (microsoft/vscode#249605), these endpoints should be removed from llama-server.

@R-Dson R-Dson (Contributor, Author):

Thank you, that makes it clear why they currently exist.

};

// https://community.openai.com/t/openai-chat-list-of-error-codes-and-types/357791/11
@@ -676,6 +677,8 @@ struct server_task_result_cmpl_final : server_task_result {
return to_json_oaicompat();
case OAICOMPAT_TYPE_CHAT:
return stream ? to_json_oaicompat_chat_stream() : to_json_oaicompat_chat();
case OAICOMPAT_TYPE_API_CHAT:
return to_json_oaicompat_api_chat();
default:
GGML_ASSERT(false && "Invalid oaicompat_type");
}
@@ -858,6 +861,55 @@ struct server_task_result_cmpl_final : server_task_result {

return deltas;
}

json to_json_oaicompat_api_chat() {
// Ollama final response format (streaming or non-streaming)
std::time_t t = std::time(0);
std::string finish_reason = "none"; // default value
if (stop == STOP_TYPE_EOS || stop == STOP_TYPE_WORD) {
// Ollama uses "stop" for both EOS and word stops
finish_reason = "stop";
} else if (stop == STOP_TYPE_LIMIT) {
// Ollama uses "length" for limit stops
finish_reason = "length";
}

uint64_t prompt_ns = static_cast<uint64_t>(timings.prompt_ms * 1e6); // ms to ns
uint64_t predicted_ns = static_cast<uint64_t>(timings.predicted_ms * 1e6); // ms to ns

json res = {
{ "model", oaicompat_model },
{ "created_at", t },
{ "message",
{
{ "role", "assistant" },
{ "content", stream ? "" : content } // content is empty in final streaming chunk
} },
{ "done_reason", finish_reason },
{ "done", true },
// Add metrics from timings and other fields, converted to nanoseconds
{ "total_duration", prompt_ns + predicted_ns },
{ "load_duration", prompt_ns }, // Assuming load duration is prompt eval time
{ "prompt_eval_count", n_prompt_tokens },
{ "prompt_eval_duration", prompt_ns },
{ "eval_count", n_decoded },
{ "eval_duration", predicted_ns },
{ "prompt_tokens", n_prompt_tokens },
{ "completion_tokens", n_decoded },
{ "total_tokens", n_prompt_tokens + n_decoded },
{ "id_slot", id_slot },
{ "id", oaicompat_cmpl_id },
{ "system_fingerprint", build_info },
{ "object", "chat.completion" },
};

// Ollama non-streaming includes the full content in the final response
if (!stream) {
res["message"]["content"] = content;
}

return res;
}
};

struct server_task_result_cmpl_partial : server_task_result {
@@ -896,6 +948,8 @@ struct server_task_result_cmpl_partial : server_task_result {
return to_json_oaicompat();
case OAICOMPAT_TYPE_CHAT:
return to_json_oaicompat_chat();
case OAICOMPAT_TYPE_API_CHAT:
return to_json_oaicompat_api_chat();
default:
GGML_ASSERT(false && "Invalid oaicompat_type");
}
@@ -1007,6 +1061,24 @@ struct server_task_result_cmpl_partial : server_task_result {

return deltas;
}

json to_json_oaicompat_api_chat() {
// Ollama streaming partial response format
std::time_t t = std::time(0);
json res = {
{ "model", oaicompat_model },
{ "created_at", t },
{ "message",
{
{ "role", "assistant" },
{ "content", content } // partial content
} },
{ "done", false }
};
// Ollama streaming responses don't seem to include timings or logprobs per partial token
return res;
}
};
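To make the wire format produced by the two to_json_oaicompat_api_chat() helpers above concrete: when streaming, /api/chat emits one JSON object per line, a sequence of partial chunks with "done": false followed by a single final object carrying "done": true, the done_reason and the timing/usage fields. A tiny self-contained sketch with made-up values (assuming nlohmann::json; the field values are placeholders, not output of this PR):

#include <nlohmann/json.hpp>
#include <cstdio>
#include <ctime>

using json = nlohmann::json;

int main() {
    std::time_t t = std::time(nullptr);

    // a partial chunk: incremental content, no timings
    json partial = {
        { "model",      "some-model" },
        { "created_at", t },
        { "message",    { { "role", "assistant" }, { "content", "Hel" } } },
        { "done",       false },
    };

    // the final chunk: empty content (in streaming mode), stop reason and durations in ns
    json last = {
        { "model",             "some-model" },
        { "created_at",        t },
        { "message",           { { "role", "assistant" }, { "content", "" } } },
        { "done_reason",       "stop" },
        { "done",              true },
        { "prompt_eval_count", 12 },
        { "eval_count",        34 },
        { "total_duration",    1234567890ull },
    };

    // newline-delimited JSON: no "data: " prefix and no blank-line separators
    printf("%s\n%s\n", partial.dump().c_str(), last.dump().c_str());
}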

struct server_task_result_embd : server_task_result {
@@ -4294,22 +4366,35 @@ int main(int argc, char ** argv) {
json res_json = result->to_json();
if (res_json.is_array()) {
for (const auto & res : res_json) {
if (!server_sent_event(sink, "data", res)) {
// sending failed (HTTP connection closed), cancel the generation
return false;
// ollama's /api/chat does not conform to the SSE format
if (oaicompat == OAICOMPAT_TYPE_API_CHAT) {
std::string s = safe_json_to_str(res) + "\n";
if (!sink.write(s.data(), s.size())) {
return false;
}
} else {
if (!server_sent_event(sink, "data", res)) {
// sending failed (HTTP connection closed), cancel the generation
return false;
}
}
}
return true;
} else {
return server_sent_event(sink, "data", res_json);
if (oaicompat == OAICOMPAT_TYPE_API_CHAT) {
std::string s = safe_json_to_str(res_json) + "\n";
return sink.write(s.data(), s.size());
} else {
return server_sent_event(sink, "data", res_json);
}
}
}, [&](const json & error_data) {
server_sent_event(sink, "error", error_data);
}, [&sink]() {
// note: do not use req.is_connection_closed here because req is already destroyed
return !sink.is_writable();
});
if (oaicompat != OAICOMPAT_TYPE_NONE) {
if (oaicompat != OAICOMPAT_TYPE_NONE && oaicompat != OAICOMPAT_TYPE_API_CHAT) {
static const std::string ev_done = "data: [DONE]\n\n";
sink.write(ev_done.data(), ev_done.size());
}
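Because the /api/chat stream above is plain newline-delimited JSON rather than SSE, a client does not strip a "data: " prefix or wait for a "[DONE]" sentinel; it just splits the body on newlines and parses each line until it sees "done": true. A minimal client sketch under those assumptions, again using cpp-httplib and nlohmann::json (the body is read in full before parsing here; a real streaming consumer would use an incremental content receiver instead):

#include <httplib.h>
#include <nlohmann/json.hpp>
#include <cstdio>
#include <sstream>
#include <string>

using json = nlohmann::json;

int main() {
    httplib::Client cli("127.0.0.1", 8080); // llama-server

    json req = {
        { "model",    "some-model" },
        { "messages", { { { "role", "user" }, { "content", "Hello" } } } },
        // "stream" is omitted on purpose: the handler below defaults it to true for /api/chat
    };

    auto res = cli.Post("/api/chat", req.dump(), "application/json");
    if (!res || res->status != 200) {
        fprintf(stderr, "request failed\n");
        return 1;
    }

    // one JSON object per line; accumulate message.content until done == true
    std::string answer;
    std::istringstream body(res->body);
    for (std::string line; std::getline(body, line); ) {
        if (line.empty()) {
            continue;
        }
        json chunk = json::parse(line);
        answer += chunk["message"]["content"].get<std::string>();
        if (chunk.value("done", false)) {
            break;
        }
    }
    printf("%s\n", answer.c_str());
}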
@@ -4436,6 +4521,16 @@ int main(int argc, char ** argv) {
}

auto body = json::parse(req.body);
// Ollama-specific handling for the /api/chat endpoint
auto compat_type = OAICOMPAT_TYPE_CHAT;
if (req.path == "/api/chat") {
compat_type = OAICOMPAT_TYPE_API_CHAT;

// Set default stream to true for /api/chat
if (!body.contains("stream")) {
body["stream"] = true;
}
}
std::vector<raw_buffer> files;
json data = oaicompat_chat_params_parse(
body,
@@ -4448,7 +4543,7 @@ int main(int argc, char ** argv) {
files,
req.is_connection_closed,
res,
OAICOMPAT_TYPE_CHAT);
compat_type);
};

// same with handle_chat_completions, but without inference part