Conversation

@yinggeh (Contributor) commented Oct 30, 2025

Enable vLLM to load embedding models and execute embedding requests.
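
For context, an embedding request to the new endpoint would follow the usual OpenAI /v1/embeddings contract. A minimal sketch, assuming a locally running frontend; the address and model name are placeholders:

import requests

response = requests.post(
    "http://localhost:9000/v1/embeddings",  # assumed frontend address
    json={
        "model": "my_embedding_model",  # hypothetical model name
        "input": "The quick brown fox",
        "dimensions": 256,  # optional; presumably maps to PoolingParams(dimensions=...)
    },
)
print(response.json()["data"][0]["embedding"][:4])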

@yinggeh force-pushed the yinggeh/tri-49-request-for-openai-compatible-api-endpoints-for-triton branch from 0acb76f to 7d06043 on October 30, 2025 01:12
@yinggeh force-pushed the yinggeh/tri-49-request-for-openai-compatible-api-endpoints-for-triton branch from 7d06043 to 2c3e148 on October 30, 2025 01:14
…end into yinggeh/tri-49-request-for-openai-compatible-api-endpoints-for-triton

@whoisj left a comment:

Left a few comments.

        pooling_params = PoolingParams(dimensions=dims, task="embed")
        return pooling_params

    def create_response(self, request_output):

@whoisj: Would be nice to have a type hint on request_output.

@yinggeh (author): Fixed.
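
A plausible shape of that fix, as a sketch; the exact vLLM output class is an assumption (it has been named EmbeddingRequestOutput or PoolingRequestOutput depending on the vLLM version):

# Sketch only; the output type name is an assumption.
from vllm.outputs import EmbeddingRequestOutput

def create_response(self, request_output: EmbeddingRequestOutput):
    ...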

        async for response in response_iterator:
            yield response

    def create_response(self, request_output_state, request_output, prepend_input):

@whoisj: Would be nice to have type hints on request_output_state, request_output, and prepend_input.

@yinggeh (author): Fixed.
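
Same idea for the streaming path, as a sketch; the concrete types are guesses from how the parameters appear to be used:

# Hinted signature sketch (all types are assumptions).
from typing import Any, Dict
from vllm.outputs import RequestOutput

def create_response(
    self,
    request_output_state: Dict[str, Any],  # per-request state carried across deltas
    request_output: RequestOutput,
    prepend_input: bool,
):
    ...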



class RequestBase:
    def __init__(self, request, executor_callback, output_dtype):

@whoisj: Would be nice to have type hints on request, executor_callback, and output_dtype.

@yinggeh (author): Fixed.
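
For the base class, a sketch along these lines; pb_utils.InferenceRequest is the request type Triton's Python backend normally hands to models, but every hint here is an assumption:

# Hinted constructor sketch (types are assumptions).
from typing import Callable

import numpy as np
import triton_python_backend_utils as pb_utils

class RequestBase:
    def __init__(
        self,
        request: "pb_utils.InferenceRequest",
        executor_callback: Callable,
        output_dtype: np.dtype,
    ):
        ...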

from abc import abstractmethod
from io import BytesIO

import numpy as np

@whoisj: Is using numpy (CPU) good enough? Do we want to leverage cupy (GPU)?

@yinggeh (author): My understanding is that the vLLM engine takes care of it.
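
That is, the GPU work (prompt to embedding) happens inside the vLLM engine; numpy only enters when the finished embedding is packed into a Triton output tensor. A rough sketch of that hand-off, with the attribute names assumed from vLLM's embedding output types:

import numpy as np

def embedding_to_array(request_output):
    # By the time the backend sees it, the embedding is already host-side
    # data (a list of floats in vLLM's embedding output types), so the
    # GPU-to-CPU copy has happened inside the engine.
    return np.asarray(request_output.outputs.embedding, dtype=np.float32)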

@yinggeh changed the base branch from main to r25.10 on October 30, 2025 22:46
"optional": True,
},
# Tentative input reserved for embedding requests in OpenAI-compatible frontend. Subject to change in the future.
# WARN: Triton client should never set this input. It is reserved for embedding requests in OpenAI-compatible frontend.

@pskiran1 (Member) commented Nov 3, 2025:

Why limit support to only the OpenAI frontend? Maybe we should also allow deploying embedding models using only the vLLM backend?

@yinggeh (author): That's a separate issue. The input prompt format for chat/completions differs from the embeddings format, so we need two different sets of configuration inputs for generate and embed.
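
To make the distinction concrete, the two paths would need separate auto-completed inputs, roughly as below. This is a hypothetical illustration; the input names are assumptions, not the backend's actual configuration:

GENERATE_INPUTS = [
    {"name": "text_input", "data_type": "TYPE_STRING", "dims": [1]},
    {"name": "sampling_parameters", "data_type": "TYPE_STRING", "dims": [1], "optional": True},
]
EMBED_INPUTS = [
    {"name": "text_input", "data_type": "TYPE_STRING", "dims": [1]},
    # Embeddings take pooling options (e.g. an output size) instead of sampling parameters.
    {"name": "embedding_dimensions", "data_type": "TYPE_INT32", "dims": [1], "optional": True},
]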

…d into yinggeh/tri-49-request-for-openai-compatible-api-endpoints-for-triton
@yinggeh changed the base branch from r25.10 to main on November 3, 2025 18:19

Labels

enhancement (New feature or request)


5 participants