What feature would you like to request?
Feature Request: Add CLI to launch FastEmbed as an Embedding API service (OpenAI-compatible)
Title
CLI to start FastEmbed as a local HTTP service providing a /v1/embeddings endpoint
Background / Motivation
Currently FastEmbed is a Python library — users need to write Python code to create embeddings.
In many production setups:
- Teams want to run FastEmbed as a self-contained service (like OpenAI embeddings API), locally or on an internal server.
- Many existing apps/tools (LangChain, LlamaIndex, etc.) expect an OpenAI-style /v1/embeddings REST API, so they could work with FastEmbed without code changes.
- This would also make FastEmbed easier to integrate into non-Python environments.
Proposed Solution
Add a CLI command (e.g. fastembed serve) to start an HTTP server.
The server would host a REST API compatible with OpenAI's embeddings format:
CLI Example
fastembed serve \
--model-name BAAI/bge-small-zh-v1.5 \
--device cuda \
--model-path /path/to/local/model \
--port 8080 \
  --host 0.0.0.0

Optional flags:
- --model-path: local ONNX/quantized model file path (avoids re-download)
- --model-name: Hugging Face model name (falls back to remote download if no path is given)
- --device: cpu / cuda
- --port: listening port
- --host: bind address
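A minimal sketch of how the proposed serve command could be wired up with argparse and Uvicorn is shown below. The fastembed_server.create_app factory and the exact flag handling are assumptions for illustration, not existing FastEmbed APIs:

import argparse
import uvicorn

def main() -> None:
    parser = argparse.ArgumentParser(prog="fastembed")
    subcommands = parser.add_subparsers(dest="command", required=True)

    serve = subcommands.add_parser("serve", help="Start an OpenAI-compatible embeddings server")
    serve.add_argument("--model-name", default="BAAI/bge-small-zh-v1.5")
    serve.add_argument("--model-path", default=None)
    serve.add_argument("--device", choices=["cpu", "cuda"], default="cpu")
    serve.add_argument("--host", default="0.0.0.0")
    serve.add_argument("--port", type=int, default=8080)

    args = parser.parse_args()
    if args.command == "serve":
        # create_app (hypothetical) would build the FastAPI app shown under "Possible Implementation" below
        from fastembed_server import create_app
        app = create_app(model_name=args.model_name, model_path=args.model_path, device=args.device)
        uvicorn.run(app, host=args.host, port=args.port)

if __name__ == "__main__":
    main()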
API Specification (POST /v1/embeddings)
Request:
{
"input": [
"Artificial intelligence is the future",
"FastEmbed makes embeddings blazing fast"
]
}

Response:
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [-0.0246, -0.0536, -0.0010, ...]
},
{
"object": "embedding",
"index": 1,
"embedding": [0.0123, -0.0456, 0.0231, ...]
}
],
"model": "BAAI/bge-small-zh-v1.5",
"usage": {
"prompt_tokens": 0,
"total_tokens": 0
}
}

Benefits
- Drop-in replacement for OpenAI embeddings: developers can point existing code at a local FastEmbed service just by changing OPENAI_API_BASE (see the sketch after this list).
- Works in multi-language environments (Java, Go, JS, etc.) via HTTP.
- No cloud dependency — model runs locally, respects data privacy.
- Easy deployment:
- Local dev via CLI
  - Docker container for production use (docker run ... fastembed serve)
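As a concrete example of the drop-in benefit, existing code built on the official OpenAI Python client could simply point its base URL at the local service. The port and model name below are taken from the CLI example above; the service itself is hypothetical until this feature exists:

from openai import OpenAI

# base_url points at the local FastEmbed server; the API key is required by the client but unused
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.embeddings.create(
    model="BAAI/bge-small-zh-v1.5",
    input=[
        "Artificial intelligence is the future",
        "FastEmbed makes embeddings blazing fast",
    ],
)
print(len(response.data), len(response.data[0].embedding))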
Alternatives
- Users can manually wrap FastEmbed in Flask/FastAPI, but this requires extra coding.
- Keep embedding in Python only, which limits cross-language integration.
Additional Context
- Similar approach in Ollama (ollama serve) and the Hugging Face Inference Server.
- Many users (including me) have internal projects using /v1/embeddings; with this feature, switching from OpenAI to FastEmbed would require zero code changes.
- Could later be extended to /v1/reRank for ColBERT / BGE reranking models.
Possible Implementation
Use FastAPI / Uvicorn internally:
from fastapi import FastAPI
from fastembed import TextEmbedding

app = FastAPI()
embedder = TextEmbedding(model_name="BAAI/bge-small-zh-v1.5")

@app.post("/v1/embeddings")
async def create_embeddings(request: dict):
    # OpenAI's API accepts either a single string or a list of strings as "input"
    inputs = request.get("input", [])
    if isinstance(inputs, str):
        inputs = [inputs]
    vectors = list(embedder.embed(inputs))
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": vec.tolist()}
            for i, vec in enumerate(vectors)
        ],
        "model": "BAAI/bge-small-zh-v1.5",
        "usage": {"prompt_tokens": 0, "total_tokens": 0},
    }

Is there any additional information you would like to provide?
No response