CPU embedding is ~200x slower than expected due to unbounded concurrent requests #8396

@tfriedel

Description

Disclaimer: the following was put together by Claude Code while investigating slow CPU embedding performance.
I'm not sure every detail is right, but CPU embedding performance on my machine is unusably slow.

Problem

When the indexing model server runs on CPU (gpu=none), embedding performance degrades catastrophically due to unbounded concurrent requests. Each batch of 8 texts takes ~110 seconds instead of the expected ~2-5 seconds — a ~30-50x slowdown caused purely by thread contention.

With GPU enabled, the same batches complete in ~0.5 seconds.

Root Cause

In backend/model_server/encoders.py, the embed_text function uses asyncio.get_event_loop().run_in_executor(None, ...) (line 139), which submits work to Python's default ThreadPoolExecutor with no concurrency limit.
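For illustration, the pattern looks roughly like this (a simplified sketch, not the exact Onyx source: the real embed_text takes more parameters, and the model loading shown here just stands in for however the server holds its loaded model):

import asyncio
from sentence_transformers import SentenceTransformer

# Assumed for the sketch: the model server keeps one loaded model instance around.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

async def embed_text(texts: list[str]) -> list[list[float]]:
    loop = asyncio.get_event_loop()
    # executor=None means Python's shared default ThreadPoolExecutor: every incoming
    # request is handed off immediately, so nothing bounds how many model.encode()
    # calls run at the same time.
    vectors = await loop.run_in_executor(None, lambda: model.encode(texts))
    return vectors.tolist()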

The background worker (lightweight mode, concurrency=20) sends multiple doc processing tasks in parallel, each making embedding requests. The model server happily runs all of them concurrently. Since PyTorch is configured with Torch Threads: 6 (matching host CPUs), each concurrent model.encode() call spawns 6 threads internally. With 7 concurrent requests, that's 42 threads fighting over 6 CPU cores, causing massive contention.
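The contention can be reproduced outside of Onyx. The following is a hypothetical standalone script (not Onyx code; model name and batch size are taken from the environment below, and it assumes the model loads via sentence_transformers) that times the same encode work run serially versus from seven concurrent threads:

# Hypothetical reproduction: 7 encode calls run one after another vs. all at once.
import time
from concurrent.futures import ThreadPoolExecutor

import torch
from sentence_transformers import SentenceTransformer

# nomic models ship custom modeling code, hence trust_remote_code=True.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
texts = ["some document chunk"] * 8  # mirrors BATCH_SIZE_ENCODE_CHUNKS=8

print("torch intra-op threads:", torch.get_num_threads())  # 6 on this host

def encode_once() -> None:
    model.encode(texts)

# Serial baseline: one encode at a time, each using all 6 torch threads.
start = time.monotonic()
for _ in range(7):
    encode_once()
print("serial:", round(time.monotonic() - start, 1), "s")

# 7 concurrent encodes: 7 x 6 torch threads fight over the same 6 cores.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=7) as pool:
    list(pool.map(lambda _: encode_once(), range(7)))
print("7 concurrent:", round(time.monotonic() - start, 1), "s")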

Observed in logs:

  • 7 concurrent embedding request IDs active simultaneously
  • Container at ~537% CPU (fully saturated)
  • Each batch: ~108-120 seconds (vs ~0.5s on GPU, or ~2-5s expected on CPU with proper serialization)

Suggested Fix

Add an asyncio.Semaphore to limit concurrent embedding calls on the model server. On CPU, serializing requests (semaphore=1) would actually be faster since it eliminates thread contention:

import asyncio

_EMBED_SEMAPHORE: asyncio.Semaphore | None = None

def _get_embed_semaphore() -> asyncio.Semaphore:
    """Lazily create one process-wide semaphore bounding concurrent encode calls."""
    global _EMBED_SEMAPHORE
    if _EMBED_SEMAPHORE is None:
        # On CPU, serialize to avoid thread contention. On GPU, allow some concurrency.
        # gpu_type is the model server's existing GPU setting ("none" when no GPU is configured).
        max_concurrent = 1 if gpu_type == "none" else 4
        _EMBED_SEMAPHORE = asyncio.Semaphore(max_concurrent)
    return _EMBED_SEMAPHORE

async def embed_text(...):
    async with _get_embed_semaphore():
        # At most max_concurrent model.encode() calls can be in flight at once.
        embeddings_vectors = await asyncio.get_event_loop().run_in_executor(...)
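A related option with the same effect (again a sketch, reusing the assumed gpu_type setting) is to pass a dedicated, bounded executor instead of None, so the event loop itself can never start more than a fixed number of encode calls:

import asyncio
from concurrent.futures import ThreadPoolExecutor

# A single worker on CPU serializes encodes; a few workers on GPU keep requests pipelined.
_EMBED_EXECUTOR = ThreadPoolExecutor(max_workers=1 if gpu_type == "none" else 4)

async def embed_text(...):
    embeddings_vectors = await asyncio.get_event_loop().run_in_executor(_EMBED_EXECUTOR, ...)

Either way, the important part is that the bound lives in the model server itself, so it holds no matter how many background workers send embedding requests in parallel.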

This would reduce CPU embedding time from ~110s to ~2-5s per batch: seven serialized batches would finish in roughly 15-35s total, versus ~110s or more when they all run concurrently.

Relevant Code

  • backend/model_server/encoders.py:139 - run_in_executor(None, ...) with no concurrency limit
  • backend/onyx/configs/app_configs.py:451 - CELERY_WORKER_BACKGROUND_CONCURRENCY defaults to 20
  • backend/onyx/configs/model_configs.py:42 - BATCH_SIZE_ENCODE_CHUNKS defaults to 8

Environment

  • 6 CPU cores, no GPU
  • nomic-ai/nomic-embed-text-v1 local model
  • Lightweight background worker mode (single worker, concurrency=20)
