Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
I am running the server as:

```shell
llama-server -m ./models/bge-small-en-v1.5-f16.gguf --host 0.0.0.0 --port 8081 --embedding -t 8 --embd-bge-small-en-default --pooling cls
```
Tokenization works like this:

```shell
curl http://localhost:8081/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": ["some random input a some more"]}'
```
This is useful if we want to make sure we do not exceed the context window. We can even send tokens back (not only strings), as follows:

```shell
curl http://localhost:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"content": [[2070,6721,7953,1037,2070],[2062,1998,2070,2625]]}'
```

which is very useful.
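To illustrate the "do not exceed the context window" check, here is a minimal client-side sketch. It assumes per-string token lists are already available (which is exactly what this issue asks `/tokenize` to return); `check_context` and the `n_ctx` value are hypothetical names, and 512 is only an assumed context size for a small embedding model.

```python
# Hypothetical helper: given one token list per input string and the
# model's context size, report which inputs would overflow before we
# send them to /v1/embeddings. The names here are illustrative only.

def check_context(token_lists, n_ctx=512):
    """Return the indices of inputs whose token count exceeds n_ctx."""
    return [i for i, toks in enumerate(token_lists) if len(toks) > n_ctx]

# The two token lists from the embeddings example above, with a tiny
# assumed context size of 4 so the first input (5 tokens) overflows.
too_long = check_context([[2070, 6721, 7953, 1037, 2070],
                          [2062, 1998, 2070, 2625]], n_ctx=4)
print(too_long)  # → [0]
```

With a real per-string `/tokenize` response, this check would be one request for the whole batch instead of one per string.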
Unfortunately, the tokenization step provides no way to get the tokens for each input separately. For example, if we send

```shell
curl http://localhost:8081/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": ["some random input a some more", "and some less"]}'
```

we get back

```
{"tokens":[2070,6721,7953,1037,2070,2062,1998,2070,2625]}
```

So we can send a batched request with multiple strings, but a single flat token vector comes out; we want one token vector per input string.
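Until `/tokenize` returns per-input token lists, the only client-side workaround is to call it once per string. A minimal sketch of that loop, where the `tokenize` callable stands in for an HTTP POST to `/tokenize` (injected here so the logic is runnable without a live server):

```python
# Workaround sketch: tokenize each input string separately so the
# string-to-tokens correspondence is preserved. `tokenize` is any
# callable mapping one string to its token list; in practice it would
# be an HTTP POST to the /tokenize endpoint, one request per string.

def tokenize_each(strings, tokenize):
    """Return one token list per input string, preserving order."""
    return [tokenize(s) for s in strings]

# A fake tokenizer (a dict lookup) stands in for the server here.
fake = {"ab": [1, 2], "c": [3]}
print(tokenize_each(["ab", "c"], fake.__getitem__))  # → [[1, 2], [3]]
```

Against a real server this is N round-trips for N strings, which is exactly the overhead a per-input `/tokenize` response would remove.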
Alternatively, the embedding endpoint could support an option to auto-truncate instead of failing, like the `n_keep` parameter available for chat. Currently, if the input surpasses the context size, the embedding request fails. This would be the cleaner solution, since generating embeddings from tokens vs. from strings differs slightly.
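The proposed auto-truncation can be approximated client-side today: truncate each token list before sending tokens (rather than strings) to `/v1/embeddings`. A sketch, where `truncate_to_ctx` is a hypothetical helper and `n_ctx` is the model's assumed context size:

```python
# Client-side sketch of the proposed behaviour: keep at most the first
# n_ctx tokens of each input before calling /v1/embeddings with token
# arrays. A server-side n_keep-style option would make this unnecessary
# (and would also work when sending plain strings).

def truncate_to_ctx(token_lists, n_ctx):
    """Truncate each token list to at most n_ctx tokens."""
    return [toks[:n_ctx] for toks in token_lists]

batch = [[2070, 6721, 7953, 1037, 2070], [2062, 1998, 2070, 2625]]
print(truncate_to_ctx(batch, 4))
# → [[2070, 6721, 7953, 1037], [2062, 1998, 2070, 2625]]
```

Note this simple head-truncation mirrors what an `n_keep`-style option might do for embeddings; the server could of course choose a different truncation policy.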
Motivation
We don't want to run 50 requests to the server in parallel if we later plan to send a single batched request with 50 strings for embedding, since the embedding endpoint already supports batching. Alternatively (preferred), we could allow an `n_keep` parameter on the `/embeddings` endpoint, so arbitrarily long text can be sent without failing.
Possible Implementation
No response