
Initial index population hung when batching.worker_limit is less than number of resource objects to index #1243

@gitHubCoCoder

Description

Long story short

When using kopf.index() with BatchingSettings(worker_limit=WORKER_LIMIT), operators can deadlock if WORKER_LIMIT is less than the total number of objects of the indexed resource kind (e.g. an index on Pod, a cluster with 100 Pods, and WORKER_LIMIT=50). Change-detection handlers never trigger, and the operator appears to hang after partially populating the index.
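
For context, here is a minimal reproduction sketch. It assumes a cluster with more Pods than the configured limit (e.g. ~100 Pods against a limit of 50); the handler names are illustrative, not part of kopf:

import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Assumption: the cluster has more Pods than this limit (e.g. 100 Pods).
    settings.batching.worker_limit = 50

@kopf.index('pods')
def pod_index(name, namespace, **_):
    # Initial population of this index hangs once the worker limit is exhausted.
    return {(namespace, name): True}

@kopf.on.create('pods')
def on_pod_create(name, **_):
    # Never reached while the operator is stuck in initial index population.
    pass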

Root Cause

This is a deadlock caused by kopf's operator readiness mechanism:

  1. Each resource type has its own Scheduler with the configured worker_limit.
  2. During startup, kopf spawns one async worker per object to perform the initial indexing.
  3. After indexing its object, each worker blocks waiting for global operator readiness (all resources and objects indexed).
  4. With worker_limit=1, only one worker can run per resource kind.
  5. Deadlock: if there are two objects of the same kind to index, Worker #1 blocks waiting for Worker #2 to complete indexing, but Worker #2 can't start because Worker #1 occupies the only available slot (a standalone sketch follows this list).
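
The following asyncio sketch reproduces the same pattern outside of kopf (it is not kopf code): a limited worker pool where each worker, after doing its unit of work, waits for a global "all indexed" event that only the still-queued workers can complete.

import asyncio

async def index_one(obj_id, pending, all_indexed):
    pending.discard(obj_id)          # this object is now "indexed"
    if not pending:
        all_indexed.set()            # the last object flips global readiness
    await all_indexed.wait()         # mirrors operator_indexed.wait_for(True)

async def main():
    worker_limit = 1                 # batching.worker_limit
    objects = {"pod-a", "pod-b"}     # two objects of the same resource kind
    pending = set(objects)
    all_indexed = asyncio.Event()
    slots = asyncio.Semaphore(worker_limit)   # stands in for the Scheduler's limit

    async def limited(obj_id):
        async with slots:            # the second worker cannot start while the first holds the slot
            await index_one(obj_id, pending, all_indexed)

    # Never completes: pod-a's worker waits for an event only pod-b's worker can set.
    await asyncio.wait_for(asyncio.gather(*(limited(o) for o in objects)), timeout=3)

asyncio.run(main())

With worker_limit greater than or equal to the number of objects, the same sketch completes immediately.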

Code Location (kopf internals)

The blocking occurs at kopf/_core/reactor/processing.py:106:

await operator_indexed.wait_for(True) # Blocks until ALL objects are indexed

But the Scheduler prevents spawning new workers at kopf/_cogs/aiokits/aiotasks.py:347-349:

def _can_spawn(self) -> bool:
    return (not self._pending_coros.empty() and
            (self._limit is None or len(self._running_tasks) < self._limit))
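
For concreteness, this is how that predicate evaluates in the deadlocked state described above (illustrative values, not actual kopf state):

# Illustrative state: worker_limit=1, Worker #1 running (blocked on readiness),
# Worker #2 still queued in _pending_coros.
pending_coros_empty = False           # Worker #2 is queued
limit = 1                             # self._limit
running_tasks = {"worker-1"}          # Worker #1 occupies the only slot
can_spawn = (not pending_coros_empty) and (limit is None or len(running_tasks) < limit)
print(can_spawn)                      # False -> Worker #2 is never spawned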

Kopf version

1.40.0

Kubernetes version

1.27.11

Python version

No response

Code

Logs


Additional information

A related fix is being attempted in #1218.
