
Conversation

@ggerganov
Member

fix #9530

When the temperature is non-positive, we can simply sample greedily by taking the token with the highest logit. In some cases, however, the probabilities of the secondary tokens are also required (e.g. llama-server displays candidate probs, llama-speculative performs stochastic speculative sampling). In those cases, we first filter the top sparams.n_probs tokens via a top-k sampler and then apply softmax only to them, which avoids sorting the full vocabulary.
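A minimal standalone sketch of this idea (not the actual llama.cpp implementation; names and the toy logits are illustrative): greedy sampling is a single linear scan over the logits, and when candidate probabilities are requested, only the top-k logits are partially sorted and normalized.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct token_prob { int id; float logit; float p; };

// greedy pick: no softmax, no sort - a single linear scan over the logits
static int sample_greedy(const std::vector<float> & logits) {
    return (int) (std::max_element(logits.begin(), logits.end()) - logits.begin());
}

// probabilities for the top-k candidates only (k ~ n_probs):
// O(n_vocab + k log k) instead of sorting/normalizing the whole vocabulary
static std::vector<token_prob> top_k_probs(const std::vector<float> & logits, int k) {
    std::vector<token_prob> cand(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        cand[i] = { (int) i, logits[i], 0.0f };
    }
    k = std::min<int>(k, (int) cand.size());
    std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
                      [](const token_prob & a, const token_prob & b) { return a.logit > b.logit; });
    cand.resize(k);

    // softmax over just the k survivors
    float max_l = cand[0].logit;
    float sum   = 0.0f;
    for (auto & c : cand) { c.p = std::exp(c.logit - max_l); sum += c.p; }
    for (auto & c : cand) { c.p /= sum; }
    return cand;
}

int main() {
    std::vector<float> logits = { 1.0f, 4.0f, 2.5f, 0.1f, 3.9f };
    printf("greedy token: %d\n", sample_greedy(logits));
    for (const auto & c : top_k_probs(logits, 3)) {
        printf("token %d  p = %.3f\n", c.id, c.p);
    }
}
```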

Also add perf timings to test-sampling to keep track of the performance of the samplers.
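For illustration, a tiny sketch of how such a timing could look (this is not the code in tests/test-sampling.cpp; the vocabulary size and loop count are made up):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> logits(32000, 0.0f); // illustrative vocab size
    logits[123] = 1.0f;

    const int n_iter = 1000;
    const auto t0 = std::chrono::high_resolution_clock::now();
    int best = 0;
    for (int i = 0; i < n_iter; ++i) {
        // greedy pick: argmax over the logits
        best = (int) (std::max_element(logits.begin(), logits.end()) - logits.begin());
    }
    const auto t1 = std::chrono::high_resolution_clock::now();

    const double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / n_iter;
    printf("greedy pick of token %d: %.2f us/iter\n", best, us);
}
```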

@github-actions github-actions bot added the testing (Everything test related) and examples labels on Sep 23, 2024
@ggerganov ggerganov merged commit b0f2736 into master Sep 24, 2024
1 check passed
@ggerganov ggerganov deleted the gg/sampling-faster-greedy branch September 24, 2024 06:03
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024

* sampling : avoid expensive softmax during greedy sampling

ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs

* Update tests/test-sampling.cpp

Co-authored-by: slaren <[email protected]>

* sampling : add clarifying comment [no ci]

---------

Co-authored-by: slaren <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024

Labels

examples · testing (Everything test related)


Development

Successfully merging this pull request may close these issues.

Bug: Lower performance in pre-built binary llama-server, Since llama-b3681-bin-win-cuda-cu12.2.0-x64

3 participants