
[Feature]: Add CLI to launch FastEmbed as an Embedding API service (OpenAI-compatible) #571

@magicbrighter

Description


What feature would you like to request?

Feature Request: Add CLI to launch FastEmbed as an Embedding API service (OpenAI-compatible)

Title

CLI to start FastEmbed as a local HTTP service providing a /v1/embeddings endpoint


Background / Motivation

Currently, FastEmbed is a Python library: users need to write Python code to create embeddings.
In many production setups:

  • Teams want to run FastEmbed as a self-contained service (like OpenAI embeddings API), locally or on an internal server.
  • Many existing apps/tools (LangChain, LlamaIndex, etc.) expect an OpenAI-style /v1/embeddings REST API, so they can work without code changes.
  • This would also make FastEmbed easier to integrate into non-Python environments.

Proposed Solution

Add a CLI command (e.g. fastembed serve) to start an HTTP server.
The server would host a REST API compatible with OpenAI's embeddings format:

CLI Example

fastembed serve \
  --model-name BAAI/bge-small-zh-v1.5 \
  --device cuda \
  --model-path /path/to/local/model \
  --port 8080 \
  --host 0.0.0.0

Optional flags:

  • --model-path : Specify a local ONNX/quantized model file path (avoids re-downloading)
  • --model-name : HuggingFace model name (falls back to remote download if no path is given)
  • --device : cpu / cuda
  • --port : Listening port
  • --host : Bind address
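
As a sketch, the serve command could parse these flags and hand off to uvicorn. Everything below is hypothetical (none of it exists in FastEmbed today); --model-path and --device handling are omitted for brevity, and the request handler mirrors the fuller sketch under Possible Implementation below:

import argparse

import uvicorn
from fastapi import FastAPI
from fastembed import TextEmbedding

def main() -> None:
    # Hypothetical entrypoint for "fastembed serve"
    parser = argparse.ArgumentParser(prog="fastembed serve")
    parser.add_argument("--model-name", default="BAAI/bge-small-zh-v1.5")
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8080)
    args = parser.parse_args()

    app = FastAPI()
    embedder = TextEmbedding(model_name=args.model_name)

    @app.post("/v1/embeddings")
    async def create_embeddings(request: dict) -> dict:
        # Minimal OpenAI-shaped handler; see "Possible Implementation" below
        texts = request.get("input", [])
        return {
            "object": "list",
            "data": [
                {"object": "embedding", "index": i, "embedding": v.tolist()}
                for i, v in enumerate(embedder.embed(texts))
            ],
            "model": args.model_name,
            "usage": {"prompt_tokens": 0, "total_tokens": 0},
        }

    uvicorn.run(app, host=args.host, port=args.port)

if __name__ == "__main__":
    main()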

API Specification (POST /v1/embeddings)

Request:

{
  "input": [
    "Artificial intelligence is the future",
    "FastEmbed makes embeddings blazing fast"
  ]
}

Response:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [-0.0246, -0.0536, -0.0010, ...]
    },
    {
      "object": "embedding",
      "index": 1,
      "embedding": [0.0123, -0.0456, 0.0231, ...]
    }
  ],
  "model": "BAAI/bge-small-zh-v1.5",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}
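
Once the service is running, any HTTP client could call it. A sketch, assuming the service is on localhost:8080 as in the CLI example above (note that OpenAI clients also send a "model" field in the request body, which the server could ignore or validate):

import requests

# Call the hypothetical local service with the two inputs from the spec above
resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={"input": ["Artificial intelligence is the future",
                    "FastEmbed makes embeddings blazing fast"]},
)
resp.raise_for_status()
print(len(resp.json()["data"]))  # 2 embeddings back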

Benefits

  • Drop-in replacement for OpenAI embeddings: developers can point existing code at a local FastEmbed service by changing only OPENAI_API_BASE (see the snippet after this list).
  • Works in multi-language environments (Java, Go, JS, etc.) via HTTP.
  • No cloud dependency: the model runs locally, which respects data privacy.
  • Easy deployment:
    • Local dev via CLI
    • Docker container for production use (docker run ... fastembed serve)
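
To illustrate the drop-in point, here is a sketch using the official openai Python client aimed at the local service (the base_url value and the dummy key are assumptions; note that openai SDK v1+ uses base_url / OPENAI_BASE_URL rather than the legacy OPENAI_API_BASE variable):

from openai import OpenAI

# Assumes the hypothetical FastEmbed service is running on localhost:8080;
# the key is unused by the local service but required by the client.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.embeddings.create(
    model="BAAI/bge-small-zh-v1.5",
    input=["FastEmbed makes embeddings blazing fast"],
)
print(len(resp.data[0].embedding))  # embedding dimension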

Alternatives

  • Users manually wrap FastEmbed in Flask/FastAPI (but this requires extra coding).
  • Run embeddings in Python only, which limits cross-language integration.

Additional Context

  • Similar approach in Ollama (ollama serve) and HuggingFace Inference Server.
  • Many users (including me) have internal projects that use /v1/embeddings; with this feature, switching from OpenAI to FastEmbed would require zero code changes.
  • Could later extend to /v1/rerank for ColBERT / BGE reranking models.

Possible Implementation
Use FastAPI / Uvicorn internally:

from fastapi import FastAPI
from pydantic import BaseModel
from fastembed import TextEmbedding

MODEL_NAME = "BAAI/bge-small-zh-v1.5"

app = FastAPI()
embedder = TextEmbedding(model_name=MODEL_NAME)

class EmbeddingRequest(BaseModel):
    # OpenAI's API accepts either a single string or a list of strings
    input: str | list[str]

@app.post("/v1/embeddings")
async def create_embeddings(request: EmbeddingRequest) -> dict:
    texts = [request.input] if isinstance(request.input, str) else request.input
    # embed() returns a generator of numpy arrays
    vectors = list(embedder.embed(texts))
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": vec.tolist()}
            for i, vec in enumerate(vectors)
        ],
        "model": MODEL_NAME,
        "usage": {"prompt_tokens": 0, "total_tokens": 0},
    }
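
To try this sketch locally, one could save it as e.g. server.py (the filename is arbitrary) and run it with the uvicorn CLI:

uvicorn server:app --host 0.0.0.0 --port 8080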

Is there any additional information you would like to provide?

No response
