Description
disclaimer: the following was produced by Claude Code while investigating slow CPU embedding performance.
I'm not sure it's entirely correct, but CPU embedding performance on my machine is unusably slow.
Problem
When the indexing model server runs on CPU (gpu=none), embedding performance degrades catastrophically due to unbounded concurrent requests. Each batch of 8 texts takes ~110 seconds instead of the expected ~2-5 seconds — a ~30-50x slowdown caused purely by thread contention.
With GPU enabled, the same batches complete in ~0.5 seconds.
Root Cause
In backend/model_server/encoders.py, the embed_text function uses asyncio.get_event_loop().run_in_executor(None, ...) (line 139), which submits work to Python's default ThreadPoolExecutor with no concurrency limit.
The background worker (lightweight mode, concurrency=20) sends multiple doc processing tasks in parallel, each making embedding requests. The model server happily runs all of them concurrently. Since PyTorch is configured with Torch Threads: 6 (matching host CPUs), each concurrent model.encode() call spawns 6 threads internally. With 7 concurrent requests, that's 42 threads fighting over 6 CPU cores, causing massive contention.
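To make the thread math concrete, here is a small self-contained sketch; the request count and thread count are hard-coded from the values reported in the logs below, not measured by the snippet itself:

```python
import os

import torch

# Illustrative arithmetic only; 7 concurrent requests and 6 torch threads are the
# values observed in the model server logs on this 6-core host.
torch_threads = torch.get_num_threads()  # reported as "Torch Threads: 6"
concurrent_requests = 7                  # concurrent embedding request IDs in the logs
cores = os.cpu_count()                   # 6 CPU cores, no GPU

print(f"{concurrent_requests * torch_threads} busy threads vs {cores} cores")
# -> 42 threads competing for 6 cores, consistent with ~537% CPU and ~110s batches
```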
Observed in logs:
- 7 concurrent embedding request IDs active simultaneously
- Container at ~537% CPU (fully saturated)
- Each batch: ~108-120 seconds (vs ~0.5s on GPU, or ~2-5s expected on CPU with proper serialization)
Suggested Fix
Add an asyncio.Semaphore to limit concurrent embedding calls on the model server. On CPU, serializing requests (semaphore=1) would actually be faster since it eliminates thread contention:
```python
_EMBED_SEMAPHORE: asyncio.Semaphore | None = None


def _get_embed_semaphore() -> asyncio.Semaphore:
    global _EMBED_SEMAPHORE
    if _EMBED_SEMAPHORE is None:
        # On CPU, serialize to avoid thread contention. On GPU, allow some concurrency.
        max_concurrent = 1 if gpu_type == "none" else 4
        _EMBED_SEMAPHORE = asyncio.Semaphore(max_concurrent)
    return _EMBED_SEMAPHORE


async def embed_text(...):
    async with _get_embed_semaphore():
        embeddings_vectors = await asyncio.get_event_loop().run_in_executor(...)
```

This would reduce CPU embedding time from ~110s to ~2-5s per batch (7 serialized batches ≈ 15-35s total vs 110s+ each).
Relevant Code
- backend/model_server/encoders.py:139 - run_in_executor(None, ...) with no concurrency limit
- backend/onyx/configs/app_configs.py:451 - CELERY_WORKER_BACKGROUND_CONCURRENCY defaults to 20
- backend/onyx/configs/model_configs.py:42 - BATCH_SIZE_ENCODE_CHUNKS defaults to 8
Environment
- 6 CPU cores, no GPU
- nomic-ai/nomic-embed-text-v1 local model
- Lightweight background worker mode (single worker, concurrency=20)