Conversation

@ruivieira (Contributor) commented Oct 16, 2025

This PR is a proposal to add log probabilities (logprobs) support to llm-d-inference-sim following vLLM's architecture.

vLLM-Style LogprobsProcessor

Added a new LogprobsProcessor that follows vLLM's implementation (the same tokens always produce the same logprobs).

API coverage

  • /v1/completions: logprobs parameter (0-5)
  • /v1/chat/completions: logprobs boolean + top_logprobs integer
  • Per-token logprobs in SSE chunks
  • Response format matching vLLM/OpenAI (abbreviated example below)

Refer to #213
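
For illustration, an abbreviated chat completion logprobs fragment in the OpenAI/vLLM response format looks roughly like this (values are invented):

"logprobs": {
  "content": [
    {
      "token": "Hello",
      "logprob": -0.31,
      "top_logprobs": [
        { "token": "Hello", "logprob": -0.31 },
        { "token": "Hi", "logprob": -1.27 }
      ]
    }
  ]
}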

// Check cache stats
hits, misses, hitRate := processor.GetCacheStats()
if hits != 1 {
	t.Errorf("Expected 1 cache hit, got %d", hits)
}
Collaborator:

In all tests in this project we use the Expect function from onsi/gomega, e.g. Expect(err).NotTo(HaveOccurred()), Expect(totalBlocks).To(Equal(unusedBlocks)), etc.
Could you please change this to be consistent with the other tests?
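
For illustration, the same check in that style might look like this (a sketch; GetCacheStats is the function added in this PR):

// gomega-style assertion on the cache stats returned by GetCacheStats
hits, _, _ := processor.GetCacheStats()
Expect(hits).To(Equal(int64(1)))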

if i == len(genTokens)-1 && (finishReason == dataset.LengthFinishReason || finishReason == dataset.ToolsFinishReason) {
	finishReasonToSend = &finishReason
}

Collaborator:

This part does not change during the loop; it could be calculated once before the loop starts.
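
For example (a sketch; the surrounding loop over genTokens and the variable names are assumed from the quoted snippet):

// Computed once, before the loop: whether a finish reason accompanies the last chunk.
sendFinishReason := finishReason == dataset.LengthFinishReason || finishReason == dataset.ToolsFinishReason
for i := range genTokens {
	// ... per-token streaming work ...
	if i == len(genTokens)-1 && sendFinishReason {
		finishReasonToSend = &finishReason
	}
}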

Contributor Author:

Good point! Moved to outside.

@mayabar (Collaborator) commented Oct 16, 2025

@ruivieira Thank you for your contribution!
I have added some general comments regarding the project's conventions for now.
I will perform a more detailed review of the PR soon.

// API response, for chat completion. It sets either role, or token, or tool call info in the message.
func (s *VllmSimulator) createChatCompletionChunk(context *streamingContext, token string, tool *openaiserverapi.ToolCall,
-	role string, finishReason *string) openaiserverapi.CompletionRespChunk {
+	role string, finishReason *string, includeLogprobs bool, topK int) openaiserverapi.CompletionRespChunk {
Collaborator:

The boolean that defines whether logprobs should be returned and the top-k value are constant during request processing; we store such values in streamingContext.
If the token is empty there is no need to create the logprob-related fields; otherwise the relevant response part should be created.
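
For illustration, the fields could live on streamingContext roughly like this (a sketch; the field names are illustrative and the existing fields are omitted):

type streamingContext struct {
	// ... existing fields ...

	// includeLogprobs records whether the request asked for logprobs.
	includeLogprobs bool
	// topLogprobs is the number of top alternatives to return per token.
	topLogprobs int
}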

nPromptTokens: usageData.PromptTokens,
nCachedPromptTokens: reqCtx.CompletionReq.GetNumberOfCachedPromptTokens(),
requestID: req.GetRequestID(),
request: req,
Collaborator:

Please add shouldIncludeLogprobs and topK (maybe with a name that better explains the context) to streamingContext instead of the request.

(And thanks for adding the missing request id)

func (s *VllmSimulator) sendResponse(reqCtx *openaiserverapi.CompletionReqCtx, respTokens []string, toolCalls []openaiserverapi.ToolCall,
	modelName string, finishReason string, usageData *openaiserverapi.Usage) {
-	resp := s.createCompletionResponse(reqCtx.IsChatCompletion, respTokens, toolCalls, &finishReason, usageData, modelName,
+	resp := s.createCompletionResponse(reqCtx.CompletionReq, reqCtx.IsChatCompletion, respTokens, toolCalls, &finishReason, usageData, modelName,
Collaborator:

Please pass shouldIncludeLogprobs and topK instead of the request.

// modelName - display name returned to the client and used in metrics. It is either the first alias
// from --served-model-name (for a base-model request) or the LoRA adapter name (for a LoRA request).
-func (s *VllmSimulator) createCompletionResponse(isChatCompletion bool, respTokens []string, toolCalls []openaiserverapi.ToolCall,
+func (s *VllmSimulator) createCompletionResponse(req openaiserverapi.CompletionRequest, isChatCompletion bool, respTokens []string, toolCalls []openaiserverapi.ToolCall,
Collaborator:

Please pass shouldIncludeLogprobs and topK instead of the request.

type LogprobData struct {
	MainLogprob float64                                 `json:"main_logprob"`
	TopLogprobs []openaiserverapi.ChatCompletionLogProb `json:"top_logprobs"`
}
Collaborator:

Why do the struct and the fields have to be public? And why are the json tags needed?
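
For illustration, an unexported variant without tags might look like this (a sketch, assuming the type is only used inside the package and never marshalled directly):

// logprobData holds the chosen token's logprob and its top alternatives.
type logprobData struct {
	mainLogprob float64
	topLogprobs []openaiserverapi.ChatCompletionLogProb
}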

hits2, misses2, _ := processor.GetCacheStats()
Expect(hits2).To(BeNumerically(">=", 0))
Expect(misses2).To(BeNumerically(">=", 3))
}
Collaborator:

Could you please add tests for the simulator with logprobs: completions and chat completions requests with logprobs, with and without streaming, checking the responses?
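
For illustration, one such case might look roughly like this (a sketch in the ginkgo/gomega style used elsewhere in the project; the client setup, helper, and response field names are placeholders, not the project's actual test utilities):

It("returns logprobs for a non-streaming chat completion", func() {
	// chatRequestWithLogprobs is a hypothetical helper building a request with top_logprobs=3.
	resp, err := openaiClient.CreateChatCompletion(ctx, chatRequestWithLogprobs(3))
	Expect(err).NotTo(HaveOccurred())
	Expect(resp.Choices).NotTo(BeEmpty())
	Expect(resp.Choices[0].LogProbs).NotTo(BeNil())
	Expect(resp.Choices[0].LogProbs.Content[0].TopLogProbs).To(HaveLen(3))
})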

cacheMutex sync.RWMutex

// cacheHits and cacheMisses for metrics
cacheHits int64
Collaborator:

cacheHits and cacheMisses should be updated under the protection of the mutex, or be atomic counters (from the sync/atomic package).
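
For example, with atomic counters (a sketch; only the relevant fields are shown, and recordLookup is a hypothetical helper):

import "sync/atomic"

type LogprobsProcessor struct {
	// ... other fields ...
	cacheHits   atomic.Int64
	cacheMisses atomic.Int64
}

// recordLookup bumps the counters without needing cacheMutex.
func (lp *LogprobsProcessor) recordLookup(hit bool) {
	if hit {
		lp.cacheHits.Add(1)
	} else {
		lp.cacheMisses.Add(1)
	}
}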

}

// generateDeterministicLogprobs creates logprobs with deterministic values based on token content
// This follows vLLM's approach of consistent logprobs for the same token in similar contexts
Collaborator:

@ruivieira can you please explain how the context is taken into consideration here? As I understand it, a token should give the same logprob only in the same/similar context. What does context mean here? Isn't it the previous tokens? They are not passed to this function.

Comment on lines +132 to +135
for k := range lp.tokenCache {
	delete(lp.tokenCache, k)
	break
}
Collaborator:

Iteration over a map is not in the order in which items were added. If we want to remove the oldest item, we need to store the insertion time or the latest-usage time for each entry.
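
For example, with a timestamp per entry (a sketch; cachedEntry and the map value type are illustrative, not the PR's actual types, and it assumes the time package is imported):

// cachedEntry pairs the cached logprobs with the time they were last used.
type cachedEntry struct {
	logprobs LogprobData
	lastUsed time.Time
}

// evictOldest removes the least recently used entry, assuming tokenCache is map[string]cachedEntry.
func (lp *LogprobsProcessor) evictOldest() {
	var oldestKey string
	var oldestTime time.Time
	for k, v := range lp.tokenCache {
		if oldestKey == "" || v.lastUsed.Before(oldestTime) {
			oldestKey, oldestTime = k, v.lastUsed
		}
	}
	if oldestKey != "" {
		delete(lp.tokenCache, oldestKey)
	}
}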

}

// ClearCache clears the logprobs cache
func (lp *LogprobsProcessor) ClearCache() {
Collaborator:

Not used

}

// GetCacheStats returns cache performance statistics
func (lp *LogprobsProcessor) GetCacheStats() (hits, misses int64, hitRate float64) {
Collaborator:

Consider not adding a function that is only used in tests to this type; we could add it in an additional file, or choose not to use it in the tests at all.

@mayabar (Collaborator) commented Oct 19, 2025

@ruivieira Please see some more comments I added.

General question:
In the related issue #213 it is written:

The generated logprobs are synthetic and not mathematically valid. They are provided solely to enable llmd-sim to work with evaluation frameworks and tools that require logprobs in the API response. The focus is on API compatibility and integration testing, not on generating accurate probability distributions.

If the purpose is only to support the API without concern for the actual returned content, why not generate random responses (both tokens and probabilities) and keep the implementation minimal — without involving cache, hash calculations, or other overhead?
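
For illustration, the random, cache-free approach could be as small as this (a sketch; generateRandomLogprobs and its return shape are illustrative, not an existing API):

import "math/rand/v2"

// generateRandomLogprobs returns a plausible (negative) logprob for the chosen
// token plus topK alternatives that are never more likely than it.
func generateRandomLogprobs(topK int) (float64, []float64) {
	main := -rand.Float64() * 5
	tops := make([]float64, topK)
	for i := range tops {
		tops[i] = main - rand.Float64()*3
	}
	return main, tops
}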

@ruivieira (Contributor Author) commented

@mayabar That's a good point. My original implementation did take that approach, but ultimately I've pushed this one mainly for reproducibility and consistent (even if simulated) results.

I agree that, purely for API coverage, generating random responses is enough and avoids the overhead. I'll open a new PR with a new implementation and, if preferred, we can close this one?

@mayabar (Collaborator) commented Oct 20, 2025

@ruivieira I think continuing with this PR is the better way: you have a lot of relevant code here, and you can just remove all the cache-related code from the logprobs_processor and use random response generation.

@ruivieira (Contributor Author) commented

@mayabar I did not see your comment before opening #221 :)
Let me know if you still prefer to change this one; either way is fine for me. Thanks!

@mayabar (Collaborator) commented Oct 21, 2025

Let's continue with your new PR #221

@mayabar changed the title from "feat: Log probabilities support" to "feat: Log probabilities support (with cache)" on Oct 21, 2025
@mayabar (Collaborator) commented Oct 21, 2025

Replaced by #221

@mayabar closed this on Oct 21, 2025