Motivation
The `/v1/embeddings` endpoint is a standard OpenAI API supported by vLLM, SGLang, and TGI. Many downstream tools (LangChain, LlamaIndex, RAG pipelines) depend on it to generate text embeddings.

Currently lmdeploy's `/v1/embeddings` is a stub that returns `Unsupported by turbomind`. The infrastructure to pass `last_hidden_state` through the pipeline already exists at the high level (`Response`, `EngineOutput`, and `GenOut` all have the field), but the PyTorch engine's internal pipeline never populates it.
Related resources

- `EmbeddingsRequest`/`EmbeddingsResponse` protocol classes
- `output_last_hidden_state` in `GenerationConfig`
- TurboMind C++ engine support for `output_last_hidden_state`
- `/v1/encode` (tokenization)
- `/pooling` (pooling API)

Additional context
I have a working implementation on branch feat/embeddings-endpoint that:
- Replaces the stub with a real endpoint that calls the engine with `max_new_tokens=0` + `output_last_hidden_state='all'`, then mean-pools the hidden states
- Threads `last_hidden_states` through the PyTorch engine pipeline (`BatchedOutputs` → `InferOutput` → `EngineOutput`), since previously only TurboMind supported hidden state extraction
- Supports both `float` and `base64` encoding formats per the OpenAI spec
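For reference, the mean-pooling step can be sketched as follows; this is an illustrative standalone snippet (not code from the branch), assuming the engine hands back the per-token hidden states as a `(seq_len, hidden_size)` float array:

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray) -> np.ndarray:
    """Collapse per-token hidden states (seq_len, hidden_size)
    into a single embedding by averaging over the sequence axis."""
    return last_hidden_state.mean(axis=0)

# Hypothetical hidden states for a 4-token prompt with hidden size 8.
hidden = np.arange(32, dtype=np.float32).reshape(4, 8)
embedding = mean_pool(hidden)
assert embedding.shape == (8,)  # one vector per prompt
```

A real implementation would also need to mask out padding tokens before averaging when requests are batched.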
Changes: ~160 lines across 9 files (mostly plumbing existing types).
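On the encoding formats: the OpenAI spec returns the embedding either as a plain list of floats or, for `encoding_format="base64"`, as a base64 string over the little-endian float32 bytes of the vector. A minimal sketch of that choice (function name is illustrative, not from the branch):

```python
import base64
import numpy as np

def encode_embedding(vec, encoding_format: str):
    """Serialize an embedding per the OpenAI embeddings spec:
    a float list, or base64 over little-endian float32 bytes."""
    arr = np.asarray(vec, dtype="<f4")  # force little-endian float32
    if encoding_format == "base64":
        return base64.b64encode(arr.tobytes()).decode("ascii")
    return arr.tolist()

# Round-trip check: base64 decodes back to the same float32 values.
vec = [0.5, -1.25, 3.0]
b64 = encode_embedding(vec, "base64")
decoded = np.frombuffer(base64.b64decode(b64), dtype="<f4")
assert decoded.tolist() == vec
```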
Before opening a PR, I'd like to confirm:
- Is this feature direction aligned with the project (vs. focusing on the existing `/pooling` endpoint)?
- Any concerns about the PyTorch engine hidden states pipeline changes?
Happy to open a PR if the direction is approved.