CPU embedding is ~200x slower than expected due to unbounded concurrent requests #8396

@tfriedel

Description

Disclaimer: the following was put together by Claude Code while investigating slow CPU embedding performance.
I'm not sure every detail is right, but CPU embedding performance on my machine is unusably slow.

Problem

When the indexing model server runs on CPU (gpu=none), embedding performance degrades catastrophically due to unbounded concurrent requests. Each batch of 8 texts takes ~110 seconds instead of the expected ~2-5 seconds — a ~30-50x slowdown caused purely by thread contention.

With GPU enabled, the same batches complete in ~0.5 seconds.

Root Cause

In backend/model_server/encoders.py, the embed_text function uses asyncio.get_event_loop().run_in_executor(None, ...) (line 139), which submits work to Python's default ThreadPoolExecutor with no concurrency limit.
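For illustration, the pattern looks roughly like this (a simplified sketch, not the exact Onyx source: the real embed_text takes more parameters, and the model loading shown here just stands in for however the server holds its loaded model):

import asyncio
from sentence_transformers import SentenceTransformer

# Assumed for the sketch: the model server keeps one loaded model instance around.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

async def embed_text(texts: list[str]) -> list[list[float]]:
    loop = asyncio.get_event_loop()
    # executor=None means Python's shared default ThreadPoolExecutor: every incoming
    # request is handed off immediately, so nothing bounds how many model.encode()
    # calls run at the same time.
    vectors = await loop.run_in_executor(None, lambda: model.encode(texts))
    return vectors.tolist()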

The background worker (lightweight mode, concurrency=20) sends multiple doc processing tasks in parallel, each making embedding requests. The model server happily runs all of them concurrently. Since PyTorch is configured with Torch Threads: 6 (matching host CPUs), each concurrent model.encode() call spawns 6 threads internally. With 7 concurrent requests, that's 42 threads fighting over 6 CPU cores, causing massive contention.
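The contention can be reproduced outside of Onyx. The following is a hypothetical standalone script (not Onyx code; model name and batch size are taken from the environment below, and it assumes the model loads via sentence_transformers) that times the same encode work run serially versus from seven concurrent threads:

# Hypothetical reproduction: 7 encode calls run one after another vs. all at once.
import time
from concurrent.futures import ThreadPoolExecutor

import torch
from sentence_transformers import SentenceTransformer

# nomic models ship custom modeling code, hence trust_remote_code=True.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
texts = ["some document chunk"] * 8  # mirrors BATCH_SIZE_ENCODE_CHUNKS=8

print("torch intra-op threads:", torch.get_num_threads())  # 6 on this host

def encode_once() -> None:
    model.encode(texts)

# Serial baseline: one encode at a time, each using all 6 torch threads.
start = time.monotonic()
for _ in range(7):
    encode_once()
print("serial:", round(time.monotonic() - start, 1), "s")

# 7 concurrent encodes: 7 x 6 torch threads fight over the same 6 cores.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=7) as pool:
    list(pool.map(lambda _: encode_once(), range(7)))
print("7 concurrent:", round(time.monotonic() - start, 1), "s")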

Observed in logs:

  • 7 concurrent embedding request IDs active simultaneously
  • Container at ~537% CPU (fully saturated)
  • Each batch: ~108-120 seconds (vs ~0.5s on GPU, or ~2-5s expected on CPU with proper serialization)

Suggested Fix

Add an asyncio.Semaphore to limit concurrent embedding calls on the model server. On CPU, serializing requests (semaphore=1) would actually be faster since it eliminates thread contention:

import asyncio

_EMBED_SEMAPHORE: asyncio.Semaphore | None = None

def _get_embed_semaphore() -> asyncio.Semaphore:
    """Lazily create one process-wide semaphore bounding concurrent encode calls."""
    global _EMBED_SEMAPHORE
    if _EMBED_SEMAPHORE is None:
        # On CPU, serialize to avoid thread contention. On GPU, allow some concurrency.
        # gpu_type is the model server's existing GPU setting ("none" when no GPU is configured).
        max_concurrent = 1 if gpu_type == "none" else 4
        _EMBED_SEMAPHORE = asyncio.Semaphore(max_concurrent)
    return _EMBED_SEMAPHORE

async def embed_text(...):
    async with _get_embed_semaphore():
        # At most max_concurrent model.encode() calls can be in flight at once.
        embeddings_vectors = await asyncio.get_event_loop().run_in_executor(...)
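A related option with the same effect (again a sketch, reusing the assumed gpu_type setting) is to pass a dedicated, bounded executor instead of None, so the event loop itself can never start more than a fixed number of encode calls:

import asyncio
from concurrent.futures import ThreadPoolExecutor

# A single worker on CPU serializes encodes; a few workers on GPU keep requests pipelined.
_EMBED_EXECUTOR = ThreadPoolExecutor(max_workers=1 if gpu_type == "none" else 4)

async def embed_text(...):
    embeddings_vectors = await asyncio.get_event_loop().run_in_executor(_EMBED_EXECUTOR, ...)

Either way, the important part is that the bound lives in the model server itself, so it holds no matter how many background workers send embedding requests in parallel.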

This would reduce CPU embedding time from ~110s to ~2-5s per batch: seven serialized batches would finish in roughly 15-35s total, versus ~110s or more when they all run concurrently.

Relevant Code

  • backend/model_server/encoders.py:139 - run_in_executor(None, ...) with no concurrency limit
  • backend/onyx/configs/app_configs.py:451 - CELERY_WORKER_BACKGROUND_CONCURRENCY defaults to 20
  • backend/onyx/configs/model_configs.py:42 - BATCH_SIZE_ENCODE_CHUNKS defaults to 8

Environment

  • 6 CPU cores, no GPU
  • nomic-ai/nomic-embed-text-v1 local model
  • Lightweight background worker mode (single worker, concurrency=20)
