Constrained disk space causes a vicious loop leading to file descriptor exhaustion

❓ **I'm submitting a ...**
- [X] 🐞 bug report
- [ ] 🐣 feature request
- [ ] ❓ question about the decisions made in the repository

🐞 **Describe the bug. What is the current behavior?**

When a system under load runs low on the disk space available to running processes (e.g. on the `/var` partition), cheroot might receive an `OSError` while handling an incoming request. Normal logging that tries to write to a file might also fail because of insufficient space.
While handling this `OSError`, the cheroot request thread handler tries to log an error message, which fails as well. This kills the thread before it can shut down properly. Any incoming request then creates a socket that is left open, waiting to be closed.
This leads to file descriptor exhaustion (bounded by the max open files setting), which triggers new errors, which trigger new error logging, and so on.

If disk space becomes available at any point, log files quickly fill up with all those errors, maintaining the pathological condition until the service is stopped.
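The leak mechanism described above can be illustrated with a small, self-contained sketch. It does not use cheroot's actual code: `failing_log`, `handle_leaky`, `handle_safe` and `run` are hypothetical names, and the "request failure" is simulated. It only shows how an exception raised while logging inside an error handler can skip the socket cleanup, and how a `finally` block avoids that.

```python
import errno
import socket


def failing_log(message):
    """Hypothetical logger that behaves as if the log partition were full."""
    raise OSError(errno.ENOSPC, "No space left on device")


def handle_leaky(conn, log=failing_log):
    """Logs before cleanup: if logging raises, close() is never reached."""
    try:
        raise OSError(errno.EIO, "simulated request failure")
    except OSError as exc:
        log(f"request failed: {exc}")  # raises again, skipping the close below
    conn.close()


def handle_safe(conn, log=failing_log):
    """Cleans up in `finally`, so a logging failure cannot leak the socket."""
    try:
        try:
            raise OSError(errno.EIO, "simulated request failure")
        except OSError as exc:
            log(f"request failed: {exc}")
    finally:
        conn.close()


def run(handler):
    """Run a handler on one end of a socket pair; report whether it closed it."""
    conn, peer = socket.socketpair()
    try:
        handler(conn)
    except OSError:
        pass  # in a real server, the worker thread would die here
    closed = conn.fileno() == -1  # fileno() is -1 once a socket is closed
    conn.close()
    peer.close()
    return closed
```

Under these assumptions, `run(handle_leaky)` reports that the socket was left open, while `run(handle_safe)` reports that it was closed even though logging failed.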

❓ **What is the motivation / use case for changing the behavior?**

- Improving resiliency and recovery in resource-constrained environments
- Avoid throwing oil on a fire

💡 **To Reproduce**

Steps to reproduce the behavior:
1. Set up a simple dummy cheroot server, e.g. one responding "Hello" to any request; set up logging to a file under `/var/log`.
2. Simulate a significant base load of requests to the cheroot server.
3. Create a disk-space-constrained environment by removing all available space on the `/var` partition.
4. After some time (e.g. a few minutes), add available space back to the `/var` partition. Watch the log files fill up the partition.
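A dummy server for step 1 could look like the sketch below. The log path, port, logger name and function names are placeholders chosen for illustration, not taken from our deployment; `cheroot.wsgi.Server` is imported lazily so that the WSGI app itself can be exercised without cheroot installed.

```python
import logging


def configure_logging(path="/var/log/cheroot-repro.log"):
    """Send log records to a file on the partition that will be filled up."""
    logging.basicConfig(filename=path, level=logging.INFO)


def hello_app(environ, start_response):
    """WSGI application that answers "Hello" to every request and logs it."""
    logging.getLogger("repro").info("request for %s", environ.get("PATH_INFO"))
    body = b"Hello"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def serve(host="127.0.0.1", port=8080):
    """Start a cheroot WSGI server for the app; blocks until interrupted."""
    from cheroot import wsgi  # lazy import: only needed when serving

    configure_logging()
    server = wsgi.Server((host, port), hello_app)
    try:
        server.start()
    except KeyboardInterrupt:
        server.stop()
```

Calling `serve()` starts the server; any HTTP load generator can then provide the base load for step 2.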

💡 **Expected behavior**

- fail fast when normal functional behavior cannot be preserved
- proper cleanup of connections despite logging errors; no file descriptor exhaustion

📋 **Details**

We use cheroot for a few of our web servers in wazo-platform, and this has been encountered in the wild in deployments of our systems.


📋 **Environment**

- Cheroot version: 8.5.2
- Python version: 3.9
- OS: Debian bullseye

📋 **Additional context**

See our internal JIRA ticket in the wazo project: https://wazo-dev.atlassian.net/browse/WAZO-3846 .

**suggestions**

- We worked around the issue by overriding the `serve` function of `cheroot.wsgi.Server` to simply exit on OSError: https://github.com/wazo-platform/xivo-lib-python/blob/wazo-25.12/xivo/wsgi.py .

- Given an `OSError` that indicates unavailable disk space, logging should be expected to fail. An in-memory queue logging backend might preserve some logging ability, but that moves the resource pressure to memory, so failing fast is likely preferable.
- Logging failures should not prevent proper shutdown of connections and cleanup of existing resources; resource cleanup should be prioritized over logging.
- Reaching the file descriptor limit should lead to a fail-fast strategy; process supervision tooling can then restart the service automatically.
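The ordering suggested above (cleanup first, best-effort logging second, fail fast on resource exhaustion) might be sketched as follows. This is not cheroot's API: `FATAL_ERRNOS`, `is_fatal_resource_error`, `close_quietly` and `handle_error` are hypothetical names, and the choice of which errno values count as fatal is an assumption.

```python
import errno

# Assumption: these errno values are treated as unrecoverable for the process
# (disk full, per-process fd limit reached, system-wide fd limit reached).
FATAL_ERRNOS = frozenset({errno.ENOSPC, errno.EMFILE, errno.ENFILE})


def is_fatal_resource_error(exc):
    """Return True if exc is an OSError signalling resource exhaustion."""
    return isinstance(exc, OSError) and exc.errno in FATAL_ERRNOS


def close_quietly(resources):
    """Close every resource; one failing close() does not stop the others."""
    failures = 0
    for resource in resources:
        try:
            resource.close()
        except OSError:
            failures += 1
    return failures


def handle_error(exc, open_resources, log, exit_process):
    """Clean up first, then log on a best-effort basis, then fail fast."""
    close_quietly(open_resources)
    try:
        log(f"unhandled error: {exc!r}")
    except OSError:
        pass  # logging is expected to fail when the disk is full
    if is_fatal_resource_error(exc):
        exit_process(1)
```

Here `log` and `exit_process` are injected callables, so the caller decides how to log and how to terminate the process so that a supervisor can restart the service.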

constrained disk space cause vicious loop leading to file descriptor exhaustion #756
