Blazing-fast embedding and cross-encoder reranker service optimized for Retrieval-Augmented Generation (RAG), KG‑RAG (Knowledge-Graph RAG), and Graph‑RAG workflows. Self-host it on a small GPU — no per-request fees, no token limits, and fully model‑swappable.
- Run embeddings + cross-encoder reranker on your infra
- Plug into RAG pipelines (text RAG, KG‑RAG, Graph‑RAG) for high-quality context retrieval
- Production-ready: batching, FP16, warmup, configurable via `.env`
- Text RAG: query → vector DB retrieve (top_k) → cross-encoder rerank → pass top results to LLM prompt.
- KG‑RAG: embed node text/properties, index nodes + edges, retrieve candidate nodes/subgraph, rerank node texts, then expand the subgraph and include the top nodes in the prompt (see the sketch after this list).
- Graph‑RAG: use graph-aware retrieval (graph embeddings or neighbor walk), fetch subgraph, rerank node/document texts, fuse into LLM context.
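A minimal sketch of that KG‑RAG expand-and-rerank step, assuming a `networkx` graph whose nodes carry a `text` attribute, the `requests` library, and a service at `localhost:8000` (the graph, helper name, and URL are illustrative assumptions, not part of the service):

```python
import networkx as nx
import requests

BASE_URL = "http://localhost:8000"  # assumed host/port

def expand_and_rerank(G: nx.Graph, query: str, seed_nodes: list, top_n: int = 10):
    # Expand the seed nodes to their 1-hop neighborhood (the "expand subgraph" step).
    candidates = set(seed_nodes)
    for node in seed_nodes:
        candidates.update(G.neighbors(node))
    candidates = list(candidates)

    # Rerank the candidate node texts against the query with the cross-encoder.
    texts = [G.nodes[n].get("text", "") for n in candidates]
    resp = requests.post(
        f"{BASE_URL}/api/v1/ce/reranker",
        json={"query": query, "documents": texts,
              "returnDocuments": False, "topN": top_n},
    )
    resp.raise_for_status()

    # Map each docIndex back to a graph node; include the top nodes in the LLM prompt.
    return [(candidates[r["docIndex"]], r["score"]) for r in resp.json()["results"]]
```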
The basic text‑RAG pipeline with this service:

1. `POST /api/v1/embedding` → get embeddings for documents/nodes.
2. Index the embeddings into a vector store (FAISS, PostgreSQL with pgvector, etc.).
3. On user query: retrieve candidates from the vector store.
4. `POST /api/v1/ce/reranker` with `query` + `documents` → get the final top‑N.
5. Send the top‑N to your LLM as context for generation (an end‑to‑end sketch follows).
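An end‑to‑end sketch of these five steps using FAISS as the vector store; the base URL and the `embed` helper are assumptions, while the request/response fields follow the API reference below:

```python
import faiss
import numpy as np
import requests

BASE_URL = "http://localhost:8000"  # assumed host/port

def embed(texts):
    resp = requests.post(f"{BASE_URL}/api/v1/embedding", json={"texts": texts})
    resp.raise_for_status()
    return np.array(resp.json()["embeddings"], dtype="float32")

# Steps 1-2: embed the corpus and index it.
docs = ["doc1 text", "doc2 text", "doc3 text"]
doc_vecs = embed(docs)
faiss.normalize_L2(doc_vecs)                  # cosine similarity via inner product
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# Step 3: retrieve top_k candidates for a query.
query = "Find causes of X"
q_vec = embed([query])
faiss.normalize_L2(q_vec)
_, ids = index.search(q_vec, 3)
candidates = [docs[i] for i in ids[0]]

# Step 4: rerank the candidates with the cross-encoder.
rr = requests.post(
    f"{BASE_URL}/api/v1/ce/reranker",
    json={"query": query, "documents": candidates,
          "returnDocuments": False, "topN": 2},
)
rr.raise_for_status()

# Step 5: assemble the top-N texts into the LLM prompt context.
context = "\n\n".join(candidates[r["docIndex"]] for r in rr.json()["results"])
```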
`POST /api/v1/embedding`

Request:

```json
{ "texts": ["doc1 text", "doc2 text"] }
```

Response:

```json
{ "embeddings": [[...], [...]], "dimensions": 1024 }
```
`POST /api/v1/ce/reranker`

Request:

```json
{
  "query": "Find causes of X",
  "documents": ["candidate A", "candidate B", "..."],
  "returnDocuments": false,
  "topN": 10
}
```

Response:

```json
{
  "results": [
    {
      "docIndex": 17,
      "docText": "",
      "score": 0.99951171875
    }, ...
  ],
  "query": "Find causes of X"
}
```
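With `returnDocuments: false` the response carries indices but no document bodies, so map each `docIndex` back to the candidate list you sent. A minimal sketch, assuming the usual localhost URL:

```python
import requests

candidates = ["candidate A", "candidate B"]  # whatever was sent as "documents"
payload = {"query": "Find causes of X", "documents": candidates,
           "returnDocuments": False, "topN": 2}
resp = requests.post("http://localhost:8000/api/v1/ce/reranker", json=payload)
resp.raise_for_status()

# Map each docIndex back to the original candidate text, best first.
top_docs = [candidates[r["docIndex"]] for r in resp.json()["results"]]
```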
```env
PORT = 8000
MAX_TOKEN_LIMIT_PER_TEXT = 500
EMBEDDING_MODEL_NAME = thenlper/gte-large
MAX_EMBEDDING_TEXTS_PER_REQUEST = 100
MAX_EMBEDDING_BATCH_REQUEST_DELAY = 5
MAX_EMBEDDING_BATCH_SIZE = 50
CROSS_ENCODER_MODEL_NAME = cross-encoder/ms-marco-MiniLM-L6-v2
MAX_CE_RE_RANKER_PAIRS = 200
MAX_CE_RE_RANKER_BATCH_SIZE = 100
MAX_CE_RE_RANKER_BATCH_REQUEST_DELAY = 5
```
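A minimal sketch of how such settings could be read at startup, assuming `python-dotenv` (the project's actual config loading may differ):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read .env from the working directory
PORT = int(os.getenv("PORT", "8000"))
EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME", "thenlper/gte-large")
MAX_EMBEDDING_BATCH_SIZE = int(os.getenv("MAX_EMBEDDING_BATCH_SIZE", "50"))
```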
- Embeddings: 20 texts × 100 tokens ≈ 200 ms; 100 texts × 400 tokens ≈ 700 ms
- Reranker: 100 docs × 300 tokens ≲ 300 ms
- Throughput: 100 req/s (≈ 6,000 req/min), observed
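A rough way to reproduce the first embedding figure against a running instance (assumed URL; timings will vary with GPU, model, and batch settings):

```python
import time
import requests

# 20 texts of roughly 100 tokens each, mirroring the first figure above.
texts = ["lorem ipsum dolor sit amet " * 20] * 20

start = time.perf_counter()
resp = requests.post("http://localhost:8000/api/v1/embedding", json={"texts": texts})
resp.raise_for_status()
print(f"embedding latency: {(time.perf_counter() - start) * 1000:.0f} ms")
```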
```bash
git clone https://github.com/afrid/embedhub.git
cd embedhub
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port ${PORT:-8000} --workers 1
```
- Clean modular layout (controllers, services, implementations) for easy model swap and performance tuning.
- Use it as an SDK-style component inside a RAG pipeline: embed → index → retrieve → rerank → generate.