
Feature Request: batched endpoints for tokenization #16458

@jozefRudy

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I am running the server as:

llama-server -m ./models/bge-small-en-v1.5-f16.gguf --host 0.0.0.0 --port 8081 --embedding -t 8 --embd-bge-small-en-default --pooling cls

Tokenization works as follows:

curl http://localhost:8081/tokenize \
        -H "Content-Type: application/json" \
        -d '{"content":["some random input a some more"] }'

This is useful for making sure we do not exceed the context window. We can even send tokens back (not only strings), as follows:

curl http://localhost:8081/v1/embeddings \
        -H "Content-Type: application/json" \
        -d '{"content": [[2070,6721,7953,1037,2070],[2062,1998,2070,2625]] }'

which is very useful.

Unfortunately, for the tokenization step there is no way to get the tokens for each input separately. E.g. if we send

curl http://localhost:8081/tokenize \
        -H "Content-Type: application/json" \
        -d '{"content":["some random input a some more", "and some less"] }'
{"tokens":[2070,6721,7953,1037,2070,2062,1998,2070,2625]}

Hence we can send a batched request (multiple inputs), but a single flattened vector comes out; we want one vector of tokens per input string.
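A per-input response could, for example, look like the sketch below. This is only a proposed output shape, not what the server returns today, and the exact token split (first six tokens to the first string) is an assumption:

curl http://localhost:8081/tokenize \
        -H "Content-Type: application/json" \
        -d '{"content":["some random input a some more", "and some less"] }'
{"tokens":[[2070,6721,7953,1037,2070,2062],[1998,2070,2625]]}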

Alternatively, we could support an option to auto-truncate instead of failing with an embedding model, similar to the n_keep parameter available for chat. Currently, if we exceed the context size, the embedding request fails. This would be cleaner, since generating embeddings from tokens vs. strings behaves slightly differently.
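For illustration only, a truncating call might look like the sketch below; an n_keep parameter on /v1/embeddings is an assumption here and does not exist yet:

# hypothetical: n_keep would truncate each input to fit the context window instead of failing
curl http://localhost:8081/v1/embeddings \
        -H "Content-Type: application/json" \
        -d '{"content": ["some arbitrarily long text"], "n_keep": 512}'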

Motivation

We don't want to run 50 requests to the server in parallel if we later plan to send a single batched request with 50 strings for embedding, since the embeddings endpoint does support batching. Alternatively (preferred), we could allow an n_keep parameter in the /embeddings endpoint so that arbitrarily long text can be sent without failing. A sketch of the current per-string workaround follows below.
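For completeness, the workaround today is one /tokenize request per string, e.g. (assuming /tokenize also accepts a single string as content):

# workaround today: one request per input string
for s in "some random input a some more" "and some less"; do
        curl -s http://localhost:8081/tokenize \
                -H "Content-Type: application/json" \
                -d "{\"content\": \"$s\"}"
done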

Possible Implementation

No response

    Labels

    enhancement (New feature or request), roadmap (Part of a roadmap project)
