
[Feature] Implement /v1/embeddings endpoint #4547

@ZhijunLStudio

Description

Motivation

The /v1/embeddings endpoint is a standard OpenAI API supported by vLLM, SGLang, and TGI. Many downstream tools (LangChain, LlamaIndex, RAG pipelines) depend on it to generate text embeddings.
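For context, this is the request/response shape the OpenAI embeddings spec defines (a minimal sketch; the model name and vector values are illustrative, not from lmdeploy):

```python
# Hypothetical request body for an OpenAI-compatible /v1/embeddings endpoint.
request = {
    "model": "my-model",         # illustrative model name
    "input": ["hello world"],    # a string or a list of strings
    "encoding_format": "float",  # "float" (default) or "base64"
}

# Response envelope per the OpenAI embeddings spec.
response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
    ],
    "model": "my-model",
    "usage": {"prompt_tokens": 2, "total_tokens": 2},
}
```

Downstream tools such as LangChain and LlamaIndex parse exactly this envelope, which is why a stub response breaks them.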

Currently lmdeploy's /v1/embeddings is a stub that returns "Unsupported by turbomind". The infrastructure for passing last_hidden_state through the pipeline already exists at the higher levels (Response, EngineOutput, and GenOut all have the field), but the PyTorch engine's internal pipeline never populates it.

Additional context

I have a working implementation on branch feat/embeddings-endpoint that:

  1. Replaces the stub with a real endpoint that calls the engine with max_new_tokens=0 and output_last_hidden_state='all', then mean-pools the hidden states
  2. Threads last_hidden_states through the PyTorch engine pipeline (BatchedOutputs → InferOutput → EngineOutput), since previously only TurboMind supported hidden state extraction
  3. Supports both float and base64 encoding formats per OpenAI spec

Changes: ~160 lines across 9 files (mostly plumbing existing types).
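For clarity, the pooling and encoding steps (items 1 and 3 above) amount to something like the following sketch. Function names and shapes are illustrative, not lmdeploy's actual API; it assumes the engine hands back token-level hidden states as a (seq_len, dim) array, and follows the OpenAI convention that base64 output is the little-endian float32 byte buffer:

```python
import base64

import numpy as np


def mean_pool(last_hidden_state, attention_mask=None):
    """Mean-pool token-level hidden states into one embedding vector."""
    hidden = np.asarray(last_hidden_state, dtype=np.float32)  # (seq_len, dim)
    if attention_mask is not None:
        # Exclude padding tokens from the mean.
        mask = np.asarray(attention_mask, dtype=np.float32)[:, None]
        return (hidden * mask).sum(axis=0) / np.clip(mask.sum(), 1e-9, None)
    return hidden.mean(axis=0)


def encode_embedding(vec, encoding_format="float"):
    """Return the embedding as a list of floats, or as base64-encoded
    float32 bytes when encoding_format='base64' (per the OpenAI spec)."""
    vec = np.asarray(vec, dtype=np.float32)
    if encoding_format == "base64":
        return base64.b64encode(vec.tobytes()).decode("ascii")
    return vec.tolist()
```

A client receiving the base64 form recovers the vector with `np.frombuffer(base64.b64decode(s), dtype=np.float32)`, which is how the official OpenAI Python client decodes it.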

Before opening a PR, I'd like to confirm:

  • Is this feature direction aligned with the project? (vs. focusing on the existing /pooling endpoint)
  • Any concerns about the PyTorch engine hidden states pipeline changes?

Happy to open a PR if the direction is approved.
