## ✅ Implemented in v10.2.0!

Great news! External embedding API support (including TEI) is now available in v10.2.0! 🎉

### What's New

You can now use external OpenAI-compatible embedding APIs instead of local models, including TEI, vLLM, and Ollama.
### Configuration

```bash
# TEI example (as requested in this discussion)
export MCP_EXTERNAL_EMBEDDING_URL=http://localhost:8080/v1/embeddings
export MCP_EXTERNAL_EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5

# vLLM example
export MCP_EXTERNAL_EMBEDDING_URL=http://localhost:8890/v1/embeddings
export MCP_EXTERNAL_EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5

# Ollama example
export MCP_EXTERNAL_EMBEDDING_URL=http://localhost:11434/v1/embeddings
export MCP_EXTERNAL_EMBEDDING_MODEL=nomic-embed-text

# Optional: API key for authenticated endpoints
export MCP_EXTERNAL_EMBEDDING_API_KEY=sk-xxx
```

### Key Features
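All three backends above speak the same OpenAI-style `/v1/embeddings` wire format, which is why one set of variables covers TEI, vLLM, and Ollama alike. As a rough illustration (the helper below is hypothetical, not the service's internal code), a client only needs to assemble a URL, an optional bearer token, and a `model`/`input` payload from these variables:

```python
import os

def build_embedding_request(texts):
    """Assemble an OpenAI-compatible /v1/embeddings request from the
    MCP_EXTERNAL_EMBEDDING_* variables shown above (illustrative helper)."""
    url = os.environ["MCP_EXTERNAL_EMBEDDING_URL"]
    headers = {"Content-Type": "application/json"}
    api_key = os.environ.get("MCP_EXTERNAL_EMBEDDING_API_KEY")
    if api_key:
        # Only authenticated endpoints need the bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    body = {"model": os.environ["MCP_EXTERNAL_EMBEDDING_MODEL"], "input": texts}
    return url, headers, body
```

Any server implementing the OpenAI embeddings contract answers with a `data` array of `embedding` vectors, so the same request shape works against all three example endpoints unchanged.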
### Multi-User Deployment Benefits

This addresses the multi-user deployment concerns mentioned in this discussion.

### Important Note
### Documentation

A complete setup guide with examples for all supported backends is included.

### Get Started

```bash
# Install or upgrade
pip install --upgrade mcp-memory-service

# Or pin the exact version
pip install mcp-memory-service==10.2.0
```

### Credits

Special thanks to @isiahw1 for implementing this feature!

Release: https://github.com/doobidoo/mcp-memory-service/releases/tag/v10.2.0
## 🚀 Proposal: Optional TEI Backend for Multi-User Deployments

### Problem Statement

MCP Memory Service currently uses ONNX Runtime for embedding generation, which works well for single-user deployments.

However, this architecture may have limitations for high-concurrency server deployments.

### Proposed Solution

Integrate HuggingFace Text Embeddings Inference (TEI) as an optional embedding backend for users who need higher throughput.

### Key Benefits

### Architecture

**Backward Compatible Integration:**

**Deployment Flow:**
```mermaid
graph LR
    A[User Request] --> B{Backend Type?}
    B -->|onnx| C[ONNX In-Process]
    B -->|tei| D[TEI HTTP Server]
    C --> E[Embeddings]
    D --> E
```

### TEI Advantages
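The branch in the diagram amounts to a small factory keyed on the proposed `MCP_EMBEDDING_BACKEND` flag. The class and function names below are hypothetical sketches of that dispatch, with ONNX remaining the default so existing deployments are untouched:

```python
import os

class OnnxEmbedder:
    """Placeholder for the current in-process ONNX Runtime backend."""
    name = "onnx"

class TeiEmbedder:
    """Placeholder for the proposed TEI HTTP-server backend."""
    name = "tei"

def make_embedder(backend=None):
    # Default to "onnx" so deployments without the new flag behave as today.
    backend = backend or os.environ.get("MCP_EMBEDDING_BACKEND", "onnx")
    registry = {"onnx": OnnxEmbedder, "tei": TeiEmbedder}
    if backend not in registry:
        raise ValueError(f"unknown embedding backend: {backend!r}")
    return registry[backend]()
```

Because both backends sit behind one factory, callers never branch on backend type themselves, which is what keeps the integration backward compatible.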
### Implementation Plan

#### Phase 1: Validation (2 weeks)
- Validate TEI with the current models (`all-MiniLM-L6-v2`, `paraphrase-multilingual-mpnet-base-v2`)

#### Phase 2: Optional Backend (4 weeks)
- `TEIEmbeddingClient` class
- `MCP_EMBEDDING_BACKEND` config flag

#### Phase 3: Production Hardening (2 weeks)
Total Effort: 8 weeks (1 developer, part-time)
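To make the Phase 2 deliverable concrete, here is a minimal sketch of what the proposed `TEIEmbeddingClient` could look like, assuming TEI's native `/embed` route (an `inputs` field in, a list of float vectors out). Everything here is proposal-stage illustration, not an implementation:

```python
import json
import urllib.request

class TEIEmbeddingClient:
    """Sketch of an HTTP client for a TEI server (proposal stage, not final)."""

    def __init__(self, base_url="http://localhost:8080"):
        self.base_url = base_url.rstrip("/")

    def _payload(self, texts):
        # TEI's /embed route accepts {"inputs": <string or list of strings>}.
        return json.dumps({"inputs": texts}).encode()

    def embed(self, texts):
        req = urllib.request.Request(
            f"{self.base_url}/embed",
            data=self._payload(texts),
            headers={"Content-Type": "application/json"},
        )
        # The server replies with a JSON list of embedding vectors.
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())
```

Keeping payload construction separate from the network call would let Phase 1 validation unit-test the client without a running TEI server.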
### Open Questions for Community

### Related Context
- `MCP_EMBEDDING_MODEL` env variable (PR "fix: Use EMBEDDING_MODEL_NAME from config instead of hardcoded value" #276; issue "Bug: MCP_EMBEDDING_MODEL environment variable ignored in server.py eager/lazy init" #275)

### Decision Criteria
**Recommended to proceed if:**

**Not recommended if:**

### Next Steps

**Call to Action:**

**References:**