[BUG] Bug Report: Minion Instability and Resource Exhaustion under High Load #68703

@dwoz

Summary

The Salt Minion becomes unstable, unresponsive, or crashes due to resource exhaustion (specifically of file descriptors) when processing a high volume of concurrent jobs (e.g., state.apply). The queuing mechanisms (job_queue and state_queue) lack sufficient flow control, allowing the Minion to attempt to spawn more processes than the operating system's file descriptor limit (ulimit -n) can sustain, leading to OSError: [Errno 24] Too many open files and subsequent processing failures.
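
For context, the limit in question is the per-process RLIMIT_NOFILE. A quick way to read it from Python (a diagnostic aside, not Salt code):

    import resource

    # Soft and hard caps on open file descriptors for the current process.
    # EMFILE (OSError: [Errno 24]) is raised once the soft limit is reached.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"RLIMIT_NOFILE: soft={soft} hard={hard}")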

Affected Components

  • salt/minion.py (Main loop, Process Manager, Queue Processing)
  • salt/utils/files.py (File locking)

Reproduction Steps

  1. Configure a Minion with process_count_max: -1 (unlimited) or a value high enough that, given per-process file descriptor usage, the system's descriptor limit can be exceeded.
  2. Submit a high volume of jobs asynchronously (e.g., 1000 state.apply jobs).
    for i in {1..1000}; do salt-call --local state.apply test.sleep async=True & done
  3. Observed Behavior (a descriptor-monitoring sketch follows this list):
    • The Minion log fills with OSError: [Errno 24] Too many open files.
    • The Minion enters a loop in which it repeatedly fails to read its own internal status files (no file descriptors remain), leaving it unable to track running jobs and prone to crashing.
    • State lock files (state_queue.lock) are left stale if the process terminates unexpectedly, requiring manual intervention to recover.
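
To confirm that the Minion is approaching the descriptor limit while the jobs run, one option is to watch its /proc fd directory. A minimal sketch, assuming a Linux host; the PID-file path is an assumption and varies by packaging:

    import os
    import sys
    import time

    # Hypothetical PID-file path; adjust for your distribution/packaging.
    PID_FILE = "/var/run/salt-minion.pid"

    with open(PID_FILE) as f:
        pid = int(f.read().strip())

    while True:
        try:
            # Each entry under /proc/<pid>/fd is one open descriptor in the
            # Minion process; reading it may require root or the same user.
            count = len(os.listdir(f"/proc/{pid}/fd"))
        except FileNotFoundError:
            sys.exit("salt-minion process has exited")
        print(f"salt-minion open file descriptors: {count}")
        time.sleep(1)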

Root Cause Analysis

  1. Unbounded Process Spawning: When process_count_max is disabled (-1), the Minion attempts to spawn a Python process for every incoming job immediately. Each process consumes multiple file descriptors (imports, logging, sockets, pipes). A large burst of jobs easily exceeds default system limits (e.g., 1024 or 4096 FDs).
  2. Insufficient Error Handling during Spawning: The process start routines do not explicitly handle OS resource limits. When the OS refuses to create a new process/thread, the exception propagates up, causing job failures or loop interruptions.
  3. Missing Resource Awareness: The system attempts to spawn processes until it hits the hard system limit. Once at the limit, critical operations like writing PID files or caching return data fail, leading to jobs that the Minion can no longer track or manage.
  4. Lack of Backpressure: Even after hitting OS limits, the queue processing logic immediately attempts to process the next job, keeping the system in a state of exhaustion. A sketch of resource-aware spawning with backoff follows this list.
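
This is not Salt's code; it is a minimal sketch, assuming Linux /proc semantics, of the safeguards the analysis above says are missing: check descriptor headroom before spawning (item 3), treat EMFILE/ENFILE/EAGAIN from the OS as transient (item 2), and make the queue consumer block instead of spinning (item 4). FD_HEADROOM, spawn_with_backpressure, and run_job are hypothetical names.

    import errno
    import multiprocessing
    import os
    import resource
    import time

    FD_HEADROOM = 64        # hypothetical safety margin of spare descriptors
    BACKOFF_SECONDS = 0.5   # hypothetical delay before retrying a spawn

    def fd_headroom():
        """Spare file descriptors before the soft RLIMIT_NOFILE is hit."""
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        in_use = len(os.listdir("/proc/self/fd"))  # Linux-specific
        return soft - in_use

    def spawn_with_backpressure(target, args):
        """Spawn a job process only when resources allow; back off on EMFILE."""
        while True:
            if fd_headroom() > FD_HEADROOM:
                proc = multiprocessing.Process(target=target, args=args)
                try:
                    proc.start()
                    return proc
                except OSError as exc:
                    # EMFILE/ENFILE/EAGAIN: the OS refused; do not fail the
                    # job, wait for running jobs to release resources instead.
                    if exc.errno not in (errno.EMFILE, errno.ENFILE, errno.EAGAIN):
                        raise
            time.sleep(BACKOFF_SECONDS)

    def run_job(jid):
        """Hypothetical stand-in for executing one queued Minion job."""
        time.sleep(1)

    if __name__ == "__main__":
        procs = [spawn_with_backpressure(run_job, (i,)) for i in range(1000)]
        for p in procs:
            p.join()

In the real Minion, the equivalent throttle would belong where jobs are dequeued in salt/minion.py, so that a hard process_count_max cap is not the only line of defense; the key point is that the consumer blocks until resources free up rather than spinning at the limit.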

Metadata

Labels

bug (broken, incorrect, or confusing behavior)
