Description
I'm submitting a ...
- bug report
- feature request
- question about the decisions made in the repository
Describe the bug. What is the current behavior?
When a system under load runs low on the disk space available to running processes (e.g. on the /var partition), cheroot might receive an OSError when handling an incoming request. Normal logging that tries to write to a file might fail because of insufficient space.
In handling this OSError, the cheroot request thread handler will try to log an error message, which will also fail. This kills the thread before proper shutdown. From then on, any incoming request creates an open socket that is never closed.
This leads to file descriptor exhaustion (depending on the max open files setting), which triggers new errors, which trigger new error logging, and so on.
If at any point disk space becomes available, the log files quickly fill up with all those errors, maintaining the pathological condition until the service is stopped.
What is the motivation / use case for changing the behavior?
- Improving resiliency and recovery in resource-constrained environments
- avoid throwing oil on a fire
To Reproduce
Steps to reproduce the behavior:
- Set up a simple dummy cheroot server, e.g. responding "Hello" to any request; set up logging to a /var/log file.
- Simulate a significant baseload of requests to the cheroot server.
- Create a disk-space-constrained environment by removing all available space on the /var partition.
- After some time (e.g. minutes), add back available space to the /var partition. Watch the log files fill up the partition.
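For the first step, a minimal sketch of such a dummy server could look like the following. The log path, port, logger name, and the `hello_app` and `main` names are assumptions made for illustration; only `cheroot.wsgi.Server` comes from cheroot itself.

```python
import logging

# Hypothetical logger name, chosen for this reproduction only.
log = logging.getLogger("cheroot-repro")


def hello_app(environ, start_response):
    """Minimal WSGI app that answers "Hello" to any request."""
    body = b"Hello"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    # One log line per request, so a full partition is hit quickly under load.
    log.info("handled %s", environ.get("PATH_INFO"))
    return [body]


def main(log_path="/var/log/cheroot-repro.log", port=8080):
    """Serve hello_app with cheroot, logging to a file under /var/log.

    log_path and port are assumptions; adjust them to the test machine.
    """
    logging.basicConfig(filename=log_path, level=logging.INFO)
    # Imported here so that hello_app stays importable without cheroot.
    from cheroot import wsgi

    server = wsgi.Server(("127.0.0.1", port), hello_app)
    try:
        server.start()
    except KeyboardInterrupt:
        server.stop()
```

Calling `main()` starts the server; any HTTP load generator can then provide the baseload while the /var partition is filled.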
Expected behavior
- fail fast when normal functional behavior cannot be preserved
- proper cleanup of connections despite logging errors; no file descriptor exhaustion
Details
We use cheroot for a few of our webservers in wazo-platform, and this has been encountered in the wild in deployments of our systems.
Environment
- Cheroot version: 8.5.2
- Python version: 3.9
- OS: Debian bullseye
Additional context
See our internal JIRA ticket in the wazo project: https://wazo-dev.atlassian.net/browse/WAZO-3846 .
Suggestions
- We worked around the issue by overriding the serve function of cheroot.wsgi.Server to simply exit on OSError: https://github.com/wazo-platform/xivo-lib-python/blob/wazo-25.12/xivo/wsgi.py .
- Given an OSError that indicates unavailable disk space, logging should be expected to fail. An in-memory queue logging backend might preserve some logging ability, but that moves the resource pressure to memory, so failing fast is likely preferable.
- Logging failures should not prevent proper shutdown of connections and cleanup of existing resources; proper resource cleanup should be prioritized over logging.
- Reaching the file descriptor limit should lead to a fail-fast strategy; process supervision tooling can then restart the service automatically.
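The last two suggestions can be illustrated with a short, self-contained sketch. It is not cheroot's code and not the workaround linked above: `handle_connection`, `safe_log`, `is_resource_exhaustion`, and the choice of errno values are all hypothetical names and assumptions. The idea is that the connection is closed in a `finally` block regardless of whether logging succeeds, and that the caller is told when the error looks like resource exhaustion so it can fail fast.

```python
import errno
import logging

# Hypothetical logger name for this sketch.
logger = logging.getLogger("cleanup-sketch")

# errno values treated as resource exhaustion (an assumption of this sketch):
# no space left on device, too many open files (per process / system-wide).
FATAL_ERRNOS = frozenset({errno.ENOSPC, errno.EMFILE, errno.ENFILE})


def is_resource_exhaustion(exc):
    """Return True if exc is an OSError signalling exhausted disk or fds."""
    return isinstance(exc, OSError) and exc.errno in FATAL_ERRNOS


def safe_log(message):
    """Log on a best-effort basis; a failing log call never propagates."""
    try:
        logger.error(message)
        return True
    except Exception:
        return False


def handle_connection(conn, process):
    """Run process(conn) and always close conn, even if logging fails.

    Returns True when the caller should fail fast (resource exhaustion).
    """
    fatal = False
    try:
        process(conn)
    except OSError as exc:
        fatal = is_resource_exhaustion(exc)
        safe_log("error while handling request: %r" % (exc,))
    finally:
        # Cleanup takes priority over logging: release the descriptor.
        try:
            conn.close()
        except OSError:
            pass
    return fatal
```

A serving loop built on this could stop accepting connections and exit with a non-zero status as soon as `handle_connection` returns True, leaving the restart to the process supervisor.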