
Feature Request: batched endpoints for tokenization #16458

@jozefRudy

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I am running the server as:

llama-server -m ./models/bge-small-en-v1.5-f16.gguf --host 0.0.0.0 --port 8081 --embedding -t 8 --embd-bge-small-en-default --pooling cls

Tokenization works as follows:

curl http://localhost:8081/tokenize \
        -H "Content-Type: application/json" \
        -d '{"content":["some random input a some more"] }'

This is useful for making sure we do not exceed the context window. We can even send tokens back (not only strings), as follows:

curl http://localhost:8081/v1/embeddings \
        -H "Content-Type: application/json" \
        -d '{"content": [[2070,6721,7953,1037,2070],[2062,1998,2070,2625]] }'

which is very useful.

Unfortunately, for the tokenization step there is no way to get the tokens for each input separately. E.g. if we send

curl http://localhost:8081/tokenize \
        -H "Content-Type: application/json" \
        -d '{"content":["some random input a some more", "and some less"] }'
{"tokens":[2070,6721,7953,1037,2070,2062,1998,2070,2625]}

Hence we can send a batched request (multiple inputs), but a single flattened vector comes out; we want one vector of tokens per input string.
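A per-input response could, for example, look like the sketch below. This is only a proposed output shape, not what the server returns today, and the exact token split (first six tokens to the first string) is an assumption:

curl http://localhost:8081/tokenize \
        -H "Content-Type: application/json" \
        -d '{"content":["some random input a some more", "and some less"] }'
{"tokens":[[2070,6721,7953,1037,2070,2062],[1998,2070,2625]]}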

Alternatively, we could support an option to auto-truncate instead of failing with an embedding model, similar to the n_keep parameter available for chat. Currently, if we exceed the context size, the embedding request fails. This would be cleaner, since generating embeddings from tokens vs. strings behaves slightly differently.
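For illustration only, a truncating call might look like the sketch below; an n_keep parameter on /v1/embeddings is an assumption here and does not exist yet:

# hypothetical: n_keep would truncate each input to fit the context window instead of failing
curl http://localhost:8081/v1/embeddings \
        -H "Content-Type: application/json" \
        -d '{"content": ["some arbitrarily long text"], "n_keep": 512}'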

Motivation

We don't want to run 50 requests to the server in parallel if we later plan to send a single batched request with 50 strings for embedding, since the embeddings endpoint does support batching. Alternatively (preferred), we could allow an n_keep parameter in the /embeddings endpoint so that arbitrarily long text can be sent without failing. A sketch of the current per-string workaround follows below.
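For completeness, the workaround today is one /tokenize request per string, e.g. (assuming /tokenize also accepts a single string as content):

# workaround today: one request per input string
for s in "some random input a some more" "and some less"; do
        curl -s http://localhost:8081/tokenize \
                -H "Content-Type: application/json" \
                -d "{\"content\": \"$s\"}"
done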

Possible Implementation

No response

    Labels

    enhancement (New feature or request), roadmap (Part of a roadmap project)
