changelog : `llama-server` REST API

**Note: This issue was copied from [https://github.com/ggml-org/llama.cpp/issues/9291](https://github.com/ggml-org/llama.cpp/issues/9291)**

**Original Author:** @ggerganov
**Original Issue Number:** #9291
**Created:** 2024-09-03T06:56:11Z

---

# Overview

This is a list of changes to the public HTTP interface of the `llama-server` example. Collaborators are encouraged to edit this post in order to reflect important changes to the API that end up merged into the `master` branch.

If you are building a 3rd party project that relies on `llama-server`, it is recommended to follow this issue and check it carefully before upgrading to new versions.

See also:

- [Changelog for `libllama` API](https://github.com/ggerganov/llama.cpp/issues/9289)

## Recent API changes (most recent at the top)

| version | PR  | desc |
| ---     | --- | ---  |
| b6523 | #16109 | In stream mode, error events are now OAI-compatible |
| b6508 | #16052 | include usage statistics only when `stream_options.include_usage` is specified |
| b6399 | #15827 | added `return_progress` and `timings.cache_n` |
| b6243 | #15108 | Add multimodal support to `completions` and `embeddings` endpoints |
| b6205 | #15416 | Disable context shift by default |
| b5441 | #13660 | Remove `/metrics` fields related to KV cache tokens and cells` |
| b5441 | #13660 | Remove `/metrics` fields related to KV cache tokens and cells` |
| b5223 | #13174 | For chat competion, if last message is assistant, it will be a prefilled message |
| b4599 | #9639 | `/v1/chat/completions` now supports `tools` & `tool_choice` |
| TBD.  | #10974 | `/v1/completions` is now OAI-compat |
| TBD.  | #10783 | `logprobs` is now OAI-compat, default to pre-sampling probs |
| TBD.  | #10861 | `/embeddings` supports pooling type `none` |
| TBD.  | #10853 | Add optional `"tokens"` output to `/completions` endpoint |
| b4337 | #10803 | Remove `penalize_nl` |
| b4265 | #10626 | CPU docker images working directory changed to /app |
| b4285 | #10691 | (Again) Change `/slots` and `/props` responses |
| b4283 | #10704 | Change `/slots` and `/props` responses |
| b4027 | #10162 | `/slots` endpoint: remove `slot[i].state`, add `slot[i].is_processing` |
| b3912 | #9865 | Add option to time limit the generation phase |
| b3911 | #9860 | Remove self-extend support |
| b3910 | #9857 | Remove legacy system prompt support |
| b3897 | #9776 | Change default security settings, `/slots` is now disabled by default<br/>Endpoints now check for API key if it's set |
| b3887 | #9510 | Add `/rerank` endpoint |
| b3754 | #9459 | Add `[DONE]\n\n` in OAI stream response to match spec |
| b3721 | #9398 | Add `seed_cur` to completion response |
| b3683 | #9308 | Environment variable updated |
| b3599 | #9056 | Change `/health` and `/slots` |

*For older changes, use:*

```bash
git log --oneline -p b3599 -- examples/server/README.md
```

## Upcoming API changes

- TBD

version	PR	desc
b6523	ggml-org#16109	In stream mode, error events are now OAI-compatible
b6508	ggml-org#16052	include usage statistics only when `stream_options.include_usage` is specified
b6399	ggml-org#15827	added `return_progress` and `timings.cache_n`
b6243	ggml-org#15108	Add multimodal support to `completions` and `embeddings` endpoints
b6205	ggml-org#15416	Disable context shift by default
b5441	ggml-org#13660	Remove `/metrics` fields related to KV cache tokens and cells`
b5441	ggml-org#13660	Remove `/metrics` fields related to KV cache tokens and cells`
b5223	ggml-org#13174	For chat competion, if last message is assistant, it will be a prefilled message
b4599	ggml-org#9639	`/v1/chat/completions` now supports `tools` & `tool_choice`
TBD.	ggml-org#10974	`/v1/completions` is now OAI-compat
TBD.	ggml-org#10783	`logprobs` is now OAI-compat, default to pre-sampling probs
TBD.	ggml-org#10861	`/embeddings` supports pooling type `none`
TBD.	ggml-org#10853	Add optional `"tokens"` output to `/completions` endpoint
b4337	ggml-org#10803	Remove `penalize_nl`
b4265	ggml-org#10626	CPU docker images working directory changed to /app
b4285	ggml-org#10691	(Again) Change `/slots` and `/props` responses
b4283	ggml-org#10704	Change `/slots` and `/props` responses
b4027	ggml-org#10162	`/slots` endpoint: remove `slot[i].state`, add `slot[i].is_processing`
b3912	ggml-org#9865	Add option to time limit the generation phase
b3911	ggml-org#9860	Remove self-extend support
b3910	ggml-org#9857	Remove legacy system prompt support
b3897	ggml-org#9776	Change default security settings, `/slots` is now disabled by default Endpoints now check for API key if it's set
b3887	ggml-org#9510	Add `/rerank` endpoint
b3754	ggml-org#9459	Add `[DONE]\n\n` in OAI stream response to match spec
b3721	ggml-org#9398	Add `seed_cur` to completion response
b3683	ggml-org#9308	Environment variable updated
b3599	ggml-org#9056	Change `/health` and `/slots`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

changelog : `llama-server` REST API #245

Overview

Recent API changes (most recent at the top)

Upcoming API changes

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

changelog : llama-server REST API #245

Description

Overview

Recent API changes (most recent at the top)

Upcoming API changes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

changelog : `llama-server` REST API #245