Conversation

@huydt84 (Contributor) commented Jun 1, 2025

Issue: #13820

The Qwen team may be about to release rerankers based on Qwen3ForCausalLM, and the way those models perform ranking is quite similar to embedding extraction:

import torch

# model / tokenizer: the HF reranker loaded via transformers
# Token ids of the two answer tokens
token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")

def compute_logits(inputs, **kwargs):
    # Logits at the last position, i.e. for the token the model would generate next
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    # Softmax over just the "no"/"yes" pair; P("yes") is the relevance score
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

This is very different from the reranking API supported by llama.cpp, so this scenario should be handled by the /embeddings endpoint.

cc: @yuhao318

Now you can run it like this:

  • Start the llama.cpp server: llama-server -m qwen-reranker.gguf --embedding --pooling none ...
  • Adapt your HuggingFace ranking code along the following lines (sample code only):
import requests
import torch

...
# token_true_id / token_false_id: the "yes" / "no" token ids obtained from the
# tokenizer, as in the snippet above
pairs = [format_instruction(task, query, doc) for query, doc in zip(queries, documents)]

# Calculate embeddings (with this patch and --pooling none, the last per-token
# vector holds the output logits of the final position)
def get_logits_embeddings(content: list[str]):
    url = "http://localhost:8080/embeddings"
    headers = {"Content-Type": "application/json"}
    data = {"content": content}

    response = requests.post(url, headers=headers, json=data)
    json_response = response.json()
    # Keep only the last token's vector for each input
    return [json_response[i]['embedding'][-1] for i in range(len(json_response))]

# Get the scores
def compute_logits(embeddings):
    true_vector = embeddings[:, token_true_id]
    false_vector = embeddings[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

embeddings = get_logits_embeddings(pairs)
print("scores: ", compute_logits(torch.tensor(embeddings)))

@ngxson (Collaborator) left a comment

This implementation is incorrect. The way Qwen3-Reranker works is simply to take the output logits of the yes and no tokens and compare them. There is absolutely no need to patch the internal code of llama.cpp, as we already have llama_get_logits_ith for this purpose.

And we may not even need to do anything, since we already support returning raw logits via the API; client code can read the logits directly.

@ngxson (Collaborator) commented Jun 1, 2025

> And we may not even need to do anything, since we already support returning raw logits via the API; client code can read the logits directly.

llama-server is compatible with OAI logprobs
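
A minimal sketch of that client-side approach, assuming an OAI-compatible llama-server on localhost:8080 that mirrors the OpenAI logprobs schema (the message content below is a placeholder; the real reranker prompt template still matters, see further down):

import math
import requests

def score_pair(query: str, document: str) -> float:
    # Generate a single token and request the top candidates' logprobs
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": f"Query: {query}\nDocument: {document}"}],
            "max_tokens": 1,
            "logprobs": True,
            "top_logprobs": 20,  # "yes"/"no" must appear among the top candidates
        },
    ).json()
    top = resp["choices"][0]["logprobs"]["content"][0]["top_logprobs"]
    lp = {t["token"]: t["logprob"] for t in top}
    # Softmax over just the two answer tokens, as in the HF snippet above
    p_yes = math.exp(lp.get("yes", float("-inf")))
    p_no = math.exp(lp.get("no", float("-inf")))
    return p_yes / (p_yes + p_no) if (p_yes + p_no) > 0 else 0.0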

@huydt84 (Contributor, Author) commented Jun 1, 2025

> And we may not even need to do anything, since we already support returning raw logits via the API; client code can read the logits directly.

So we would use the completion API for this case?

@ngxson (Collaborator) commented Jun 1, 2025

Yes, and you also need the correct prompt.
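
For illustration, here is one possible definition of the format_instruction helper from the sample above, i.e. what "the correct prompt" could look like. Since the string is already fully templated, it would be sent to a plain completion endpoint rather than the chat one. The template strings mirror the Qwen3-Reranker model card and should be treated as an assumption:

PREFIX = (
    "<|im_start|>system\n"
    "Judge whether the Document meets the requirements based on the Query and "
    "the Instruct provided. Note that the answer can only be \"yes\" or \"no\"."
    "<|im_end|>\n"
    "<|im_start|>user\n"
)
SUFFIX = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

# Hypothetical helper; verify the exact template against the actual release
def format_instruction(task: str, query: str, doc: str) -> str:
    # The prompt ends exactly where the model emits its single yes/no token
    return f"{PREFIX}<Instruct>: {task}\n<Query>: {query}\n<Document>: {doc}{SUFFIX}"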

@huydt84 (Contributor, Author) commented Jun 1, 2025

Since this implementation is incorrect, I am closing the PR.
