Open
Labels: bug (broken, incorrect, or confusing behavior)
Description
Bug Report: Minion Instability and Resource Exhaustion under High Load
Summary
The Salt Minion experiences stability issues, unresponsiveness, or crashes due to resource exhaustion (specifically file descriptors) when processing a high volume of concurrent jobs (e.g., `state.apply`). The queuing mechanisms (`job_queue` and `state_queue`) lack sufficient flow control, allowing the Minion to attempt spawning processes beyond the operating system's file descriptor limit (`ulimit -n`), leading to `OSError: [Errno 24] Too many open files` and subsequent processing failures.
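For context, the limit in question can be read from inside the Minion's interpreter; a minimal stdlib sketch (not Salt code):

```python
import resource

# RLIMIT_NOFILE is the per-process file-descriptor ceiling that
# `ulimit -n` reports; exhausting the soft limit raises
# OSError: [Errno 24] Too many open files (EMFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft FD limit: {soft}, hard FD limit: {hard}")
```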
Affected Components
- `salt/minion.py` (main loop, Process Manager, queue processing)
- `salt/utils/files.py` (file locking)
Reproduction Steps
1. Configure a Minion with `process_count_max: -1` (unlimited) or a value higher than the system's file descriptor limit allows (relative to per-process usage).
2. Submit a high volume of jobs asynchronously (e.g., 1000 `state.apply` jobs):

   ```
   for i in {1..1000}; do salt-call --local state.apply test.sleep async=True & done
   ```

3. Observed Behavior:
   - The Minion log fills with `OSError: [Errno 24] Too many open files`.
   - The Minion enters a loop where it fails to read internal status files (due to lack of FDs), leading to potential crashes or the inability to track running jobs.
   - State lock files (`state_queue.lock`) are left stale if the process terminates unexpectedly, requiring manual intervention to recover.
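The EMFILE failure above can be reproduced outside Salt entirely: lower the soft FD limit, then allocate descriptors (pipes here stand in for the per-job pipes and sockets) until the OS refuses. A minimal sketch, assuming a POSIX system:

```python
import os
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Lower the soft limit so exhaustion is quick and harmless to reproduce.
low = min(64, soft)
resource.setrlimit(resource.RLIMIT_NOFILE, (low, hard))

fds, caught_errno = [], None
try:
    while True:
        # Each pipe consumes two descriptors, mimicking the pipes and
        # sockets every spawned job process holds open.
        fds.extend(os.pipe())
except OSError as exc:
    caught_errno = exc.errno  # 24 == errno.EMFILE, "Too many open files"
finally:
    for fd in fds:
        os.close(fd)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))  # restore

print(f"exhausted after {len(fds)} descriptors, errno={caught_errno}")
```

This is the same error path the Minion hits, except the Minion consumes the descriptors indirectly via process spawning rather than explicit `os.pipe()` calls.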
Root Cause Analysis
- Unbounded Process Spawning: When `process_count_max` is disabled (`-1`), the Minion attempts to spawn a Python process for every incoming job immediately. Each process consumes multiple file descriptors (imports, logging, sockets, pipes). A large burst of jobs easily exceeds default system limits (e.g., 1024 or 4096 FDs).
- Insufficient Error Handling during Spawning: The process start routines do not explicitly handle OS resource limits. When the OS refuses to create a new process/thread, the exception propagates up, causing job failures or loop interruptions.
- Missing Resource Awareness: The system attempts to spawn processes until it hits the hard system limit. Once at the limit, critical operations like writing PID files or caching return data fail, leading to jobs that the Minion can no longer track or manage.
- Lack of Backpressure: Even after hitting OS limits, the queue processing logic immediately attempts to process the next job, keeping the system in a state of exhaustion.
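A mitigation consistent with the "Unbounded Process Spawning" and "Lack of Backpressure" points above (all names here are hypothetical, not Salt's actual API) would be to derive the effective concurrency cap from the FD budget instead of trusting `process_count_max` alone, and to make the queue loop block on a semaphore so it applies backpressure rather than retrying into exhaustion. A sketch, assuming roughly 16 FDs per job:

```python
import resource
import threading

def effective_process_cap(process_count_max, fds_per_job=16, fd_reserve=64):
    """Hypothetical helper: bound concurrency by the FD budget.

    Even when process_count_max is -1 ("unlimited"), never plan to run
    more concurrent jobs than the soft RLIMIT_NOFILE can sustain, given
    an assumed per-job FD cost and a reserve for the Minion itself.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    fd_cap = max(1, (soft - fd_reserve) // fds_per_job)
    if process_count_max is None or process_count_max < 0:
        return fd_cap
    return min(process_count_max, fd_cap)

class SpawnGate:
    """Hypothetical backpressure gate: the queue loop blocks here when
    all slots are taken, instead of spawning past the cap into EMFILE."""

    def __init__(self, cap):
        self._slots = threading.BoundedSemaphore(cap)

    def __enter__(self):
        self._slots.acquire()  # blocks until a running job releases a slot
        return self

    def __exit__(self, *exc):
        self._slots.release()

cap = effective_process_cap(process_count_max=-1)
print(f"effective concurrency cap: {cap}")
```

The design choice is that jobs wait in the queue (cheap) instead of failing mid-spawn (expensive and untrackable, per the root causes above).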