
Conversation

@ggerganov (Member)

fix #13689

Temporary workaround until batching logic in libllama is improved.

@aviallon (Contributor) commented May 22, 2025

It works very well. There is only one remaining issue (which may be on my side): when chunking the input with HF Tokenizers to ensure we feed at most n_ubatch/n_batch/n_ctx_per_seq tokens to the embedding model, the actual count always ends up around 1 to 3 tokens too large.
For instance, if HF Tokenizers predicts an input will be 512 tokens, llama.cpp may well count it as 513 tokens and return an error.

For now, I worked around this by simply adding a safety margin of 3 tokens.
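
A minimal sketch of this chunking-plus-margin workaround, assuming the `tokenizers` package (the model name, batch limit, and margin value below are illustrative, not the exact code used):

```python
# Illustrative sketch: chunk the input so each piece stays safely under the
# server-side limit, leaving headroom for extra tokens counted by llama.cpp.
from tokenizers import Tokenizer

N_UBATCH = 512       # example server-side limit (n_ubatch / n_batch / n_ctx_per_seq)
SAFETY_MARGIN = 3    # headroom for the 1-3 extra tokens observed above
LIMIT = N_UBATCH - SAFETY_MARGIN

tokenizer = Tokenizer.from_pretrained("BAAI/bge-small-en-v1.5")  # example embedding model

def chunk_text(text: str) -> list[str]:
    """Split `text` into pieces whose HF token count stays at or below LIMIT."""
    ids = tokenizer.encode(text, add_special_tokens=False).ids
    return [
        tokenizer.decode(ids[start:start + LIMIT])
        for start in range(0, len(ids), LIMIT)
    ]
```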

@ggerganov (Member, Author)

Most likely, when you tokenize with HF transformers you don't take into account special tokens such as BOS, EOS, CLS, etc. These are model-specific and are added automatically by llama-server.

Though it would be nice to track down the root cause; it's also possible that we are doing something wrong.
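
A minimal sketch of how to make the client-side count match what llama-server will see, assuming a `transformers` tokenizer (the model name is an illustrative example):

```python
# Illustrative sketch: include the special tokens that llama-server adds
# automatically when counting tokens on the HF side.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")  # example model

text = "some input to embed"
without_special = len(tok.encode(text, add_special_tokens=False))
with_special = len(tok.encode(text, add_special_tokens=True))
overhead = tok.num_special_tokens_to_add(pair=False)  # e.g. 2 for [CLS] ... [SEP]

# `with_special` is closer to what llama-server will actually process,
# and it typically equals `without_special + overhead`.
print(without_special, with_special, overhead)
```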

@ggerganov merged commit cc74d5b into master on May 22, 2025 (53 checks passed).
@ggerganov deleted the gg/server-fix-pooling-small-batches branch on May 22, 2025 at 13:33.
@aviallon (Contributor) commented May 22, 2025

@ggerganov I believe you are right about the cause. I'll experiment with that. Thank you very much for your skill and the incredibly quick turnaround.



Development

Successfully merging this pull request may close these issues.

GGML_ASSERT(seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == MEAN") failed
