Conversation

@danbev (Member) commented Sep 25, 2025

This commit adds a 'normalized' field to the llama_token_data_array
struct to indicate whether the probabilities have been computed and
normalized from the logits.

The motivation for this change is to avoid redundant normalization
calls in the sampling code, as the softmax calculation can be
expensive depending on the size of the llama_token_data array.

Samplers that modify logits or filter tokens (change the size) must set
normalized to false to invalidate cached probabilities. Samplers that
compute probabilities set it to true after normalization.

Refs: #9294 (review)

@github-actions bot added labels: testing (Everything test related), examples, server (Sep 25, 2025)
@MaggotHATE (Contributor) commented

Changing the order of samplers will cause confusion and a mismatch between llama.cpp and other LLM implementations, especially for such an essential sampler as temperature. I'd suggest keeping the order and changing the temperature sampler instead, if possible.

@danbev (Member, Author) commented Sep 29, 2025

Looking into this more, I don't think merging the two fields will work. We need to preserve the original logits because:

  1. When a sampler changes the array size (filters tokens), we need to renormalize the remaining probabilities
  2. At that point, the single score/value field already contains probabilities, not logits
  3. We can't accurately reconstruct logits from probabilities to recompute softmax

But perhaps we can still improve this by adding a flag to track normalization state:

struct llama_token_data_array {
    ...
    bool normalized;  // true if .p contains valid probabilities from current .logit values
};

This would allow samplers to:

  • Check the flag before calling softmax to avoid redundant recalculation
  • Understand the current state, i.e. whether the .p values are currently valid

Sampler responsibilities:

  • If a sampler modifies .logit values, it must set normalized = false
  • If a sampler changes the size of the token data array while normalized is true, it must call softmax again before returning (since the remaining probabilities no longer sum to 1.0)
  • If a sampler needs probabilities, it can check the normalized flag and only call softmax if it is false (see the sketch below)
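
A minimal sketch of how a sampler chain could follow these rules. The types and helpers below are simplified, hypothetical stand-ins, not the actual llama.cpp definitions:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Simplified stand-ins for llama_token_data / llama_token_data_array.
struct token_data { float logit; float p; };
struct token_data_array {
    std::vector<token_data> data;
    bool normalized = false; // true if .p holds valid probabilities for the current .logit values
};

// Compute probabilities from the logits and cache them in .p.
static void softmax(token_data_array & arr) {
    float max_logit = -INFINITY;
    for (const auto & t : arr.data) max_logit = std::max(max_logit, t.logit);
    float sum = 0.0f;
    for (auto & t : arr.data) { t.p = std::exp(t.logit - max_logit); sum += t.p; }
    for (auto & t : arr.data) t.p /= sum;
    arr.normalized = true;
}

// A sampler that modifies logits (e.g. temperature) must invalidate the cache.
static void apply_temperature(token_data_array & arr, float temp) {
    for (auto & t : arr.data) t.logit /= temp;
    arr.normalized = false;
}

// A sampler that needs probabilities recomputes them only if the cache is
// stale, and renormalizes if it shrinks the array.
static void apply_top_p(token_data_array & arr, float top_p) {
    if (!arr.normalized) softmax(arr);
    std::sort(arr.data.begin(), arr.data.end(),
              [](const token_data & a, const token_data & b) { return a.p > b.p; });
    float cum = 0.0f;
    size_t keep = arr.data.size();
    for (size_t i = 0; i < arr.data.size(); ++i) {
        cum += arr.data[i].p;
        if (cum >= top_p) { keep = i + 1; break; }
    }
    if (keep < arr.data.size()) {
        arr.data.resize(keep);
        softmax(arr); // size changed: the remaining .p values no longer sum to 1.0
    }
}
```

In this sketch, a probability-based sampler that follows a non-truncating one can reuse the cached .p values instead of recomputing them from the logits.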

So this should address the concern in the original comment:

Having two values per token is very confusing to me because some samplers operate on one or other, and this can lead to situations where a sampler modifies the probabilities, and the next one calls softmax which discards all the changes to the probabilities and computes them from the logits again.

If the second sampler checked the normalized field, it would not overwrite the probabilities set by the previous sampler. And if the second sampler updates the probabilities, or modifies the size of the token data array, it would recalculate the softmax and set the normalized field to true.

@MaggotHATE (Contributor) commented

I always thought that the end goal of reworking sampling was to standardize on logits only (which makes sense) and use probabilities as temporary/QoL values. Maybe I've forgotten something, but probabilities should always depend on logits, no?

Resorting changes positions (so logits + probabilities should be updated), and truncating changes the set of candidates (so logits + probabilities should be updated too). So far these are the two end results of all sampling algorithms. Am I missing something?
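
(To illustrate the truncation case with made-up numbers: if the normalized probabilities are {0.5, 0.3, 0.2} and a truncating sampler drops the last candidate, the survivors sum to 0.8 and must be rescaled to {0.625, 0.375}.)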

@danbev (Member, Author) commented Sep 29, 2025

@MaggotHATE Thanks for your comments on this!

Maybe I've forgotten something, but probabilities should always depend on logits, no?

I think my initial take on this task was simply wrong: if we only have one field, it should have been the logits.

So if I understood your comment above correctly, each sampler that needs probabilities could compute the softmax from the logits locally in its llama_sampler_apply function. We would then keep only the logit field in llama_token_data and remove the p field entirely?
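
A rough sketch of that logit-only alternative, with hypothetical simplified types (not llama.cpp's actual API):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// With only a logit per token, any sampler that needs probabilities would
// compute them locally and discard them when done (assumes logits is non-empty).
static std::vector<float> local_probs(const std::vector<float> & logits) {
    float max_logit = logits[0];
    for (float l : logits) max_logit = std::max(max_logit, l);

    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit); // subtract max for numerical stability
        sum += probs[i];
    }
    for (float & p : probs) p /= sum;
    return probs; // every sampler in the chain that needs probabilities repeats this pass
}
```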

However, softmax is a somewhat expensive operation, so having a flag (like normalized) to track whether valid probabilities are already available in .p could avoid redundant softmax calls. This would make .p essentially a cached/QoL value rather than independent state.

This is my first real dive into the samplers, so I may be missing some context or design considerations. Does the approach above align with what you were thinking, or is there a different direction I should be considering?

@MaggotHATE (Contributor) commented

However, softmax is a somewhat expensive operation, so having a flag (like normalized) to track whether valid probabilities are already available in .p could avoid redundant softmax calls. This would make .p essentially a cached/QoL value rather than independent state.

Yep, that's what I'd forgotten: the rework of sampling started because of the expensive softmax (at least partly for that reason).

I think the state approach is better than nothing: unnecessary calls will be eliminated, speeding up inference (especially for large chains of samplers). I'm not sure whether there's a better approach to the logits/probabilities structure, but maybe others will have ideas.

For a moment I thought about keeping logits const and working with probabilities only, but that would be a nightmare to optimize.

@danbev force-pushed the llama-token-data-one-value branch from 3271c6d to 17855ff (October 7, 2025 14:03)
@danbev changed the title from "llama : merge logit and p fields in llama_token_data" to "llama : add normalized field to llama_token_data_array struct" (Oct 7, 2025)
@danbev marked this pull request as ready for review (October 8, 2025 05:59)
@danbev requested a review from ggerganov as a code owner (October 8, 2025 05:59)
@ggerganov (Member) commented

I am not sure this change solves the original concern, because we still have the two values p and logit present. We would now have one more piece of state to keep track of: the normalized flag.

However, I'm not sure what would be a better approach.

@MaggotHATE (Contributor) commented

I am not sure this change solves the original concern, because we still have the two values p and logit present. We would now have one more piece of state to keep track of: the normalized flag.

However, I'm not sure what would be a better approach.

Is it even possible to standardize on one value?

  1. If logit is the one we keep, redistribution becomes obscured, because probabilities never exist as readable values. Even if probabilities are added as display-only values (on demand), it will require rewriting multiple samplers (top_p, typical, temp_ext, xtc, both mirostats, infill, maybe more).
  2. If p is the one we keep, the entire sampling chain becomes "detached" from the original logit values, and I'm not sure that's even possible.

@danbev (Member, Author) commented Oct 9, 2025

I am not sure this change solves the original concern, because we still have the two values p and logit present. We would now have one more piece of state to keep track of: the normalized flag.

I was not able to find a good way to merge these into a single field. This commit tries to address the possibility of samplers overwriting calculated probabilities, mentioned in #9294 (review). But we can always leave this as it is for now and reopen it at a later point if needed.
