Raise default async discard threshold to ERROR #3880
Conversation
There are two basic scenarios where the ring buffer fills up. One is that an application is simply logging too much and the flushing process can't keep up, and in this case a discard threshold of WARN or INFO is probably sufficient to mitigate the problem. However, if flushing has stopped making progress altogether, e.g. due to a full or failed disk, then logging calls will block indefinitely. This can result in a production outage.

This change sets the default discard threshold to ERROR, in order to better mitigate the scenario where the disk fills up, fails, or is in the process of failing. With this threshold, logging should only block at the FATAL level, which would typically mean that the operation is already failing anyway.
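For reference, the behavior under discussion can already be opted into via Log4j's documented queue-full-policy properties; a sketch of the configuration this PR would effectively make the default (property names as documented for Log4j 2.x):

```properties
# Drop events instead of blocking the caller when the async queue is full.
log4j2.asyncQueueFullPolicy = Discard
# Events at this level and below (less severe) are discarded when the queue
# is full; more severe events still enqueue, which may block.
log4j2.discardThreshold = ERROR
```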
Force-pushed from 624c803 to 052f802
More controversially, I think that
If logging is a vital component of your application and it doesn't work, I think it is reasonable to signal that the application is down. Consider this scenario in a cluster environment, say, Kubernetes: the container reports itself as down in liveness probes due to logging buffer failures, the pod is taken down and re-spawned in a new environment with sufficient logging capacity. This is what you'd want, instead of losing all logging for an indefinite amount of time. I think this is a good default. If you indeed want it the other way around, it makes sense that you need to opt in via extra configuration, which is
@remkop, @ppkarwasz, WDYT?
This is a valid remark. I'd support a PR
Hi @vy,
I agree: a default that discards all log events would be risky. While it technically still preserves

I'd be hesitant to change the default from blocking. This "no events lost" behavior is explicitly part of our new threat model, where we guarantee reliability out of the box, while still allowing users to opt into more discard-friendly settings when resilience against DoS attacks is a priority. That said, I'm not opposed to:

But I would avoid replacing
See, this is what I thought too, but over the years we had so many outages caused by blocking log statements that we adopted these overridden defaults years ago, and I'm not aware of any problems they've caused. Additionally, the default logging behavior is synchronous, right? If you opt in to asynchronous logging, you already run the risk of a crash causing all of the buffered log messages to be lost (and the log statements leading up to the crash are probably the most important ones!). The current defaults seem like the worst of both worlds: you're exposed to both the risk of data loss and the risk of an outage caused by indefinite blocking. The principled reason to discard instead of block is that it's the only way to guarantee asynchronous behavior (which is probably what you think you're opting in to by using async logging in the first place).
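To make the tradeoff concrete, here is a minimal sketch (not Log4j's actual code; the real logic lives in its discarding queue-full policy and differs in detail) of the decision being debated: when the ring buffer is full, events at or below the discard threshold are dropped, and anything more severe falls back to enqueueing, which may block.

```java
// Illustrative sketch only, assuming Log4j-style severity ordering
// where a smaller ordinal means a more severe level.
public class DiscardPolicySketch {
    enum Level { FATAL, ERROR, WARN, INFO, DEBUG, TRACE }

    // When the ring buffer is full: true means "drop the event",
    // false means "fall back to enqueueing (which may block)".
    static boolean discardOnQueueFull(Level event, Level threshold) {
        return event.ordinal() >= threshold.ordinal();
    }

    public static void main(String[] args) {
        Level threshold = Level.ERROR; // the default this PR proposes
        System.out.println(discardOnQueueFull(Level.INFO, threshold));  // true: dropped
        System.out.println(discardOnQueueFull(Level.ERROR, threshold)); // true: dropped
        System.out.println(discardOnQueueFull(Level.FATAL, threshold)); // false: may block
    }
}
```

With a threshold of ERROR, only FATAL events can still block, which matches the PR description's claim that an operation logging at FATAL is typically already failing.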
Yeah, I can do that.
I agree with @ppkarwasz that changing the default behavior would be against the original intent of Log4j2 not to drop events, and it would be surprising to users. I would not oppose renaming the value used to configure the default behavior to a label that expresses the intention/behavior better.
@rschmitt, thanks so much for bringing this to our attention, but I will close this PR since all interested maintainers expressed their reluctance. That said, we all agree renaming |
@vy @ppkarwasz Just to be clear, this PR doesn't change the default policy to |
Good point! 💯 Most of the discussion so far has centered on @vy's comment #3880 (comment) about changing the default to

Regarding your PR specifically:
Does the status logger go through all the same async appender machinery? I thought it was just a simple console logger used by
Yeah, because that's how all other logging configuration works. I was also confused by this and had to carefully read the source code to make sure I wasn't crazy.
It is just a simple console logger. My point is that, if a queue-full event occurs and you change the default
Is that also true today? Are full-queue notifications recorded at a log level that is not emitted by default?
Unfortunately, yes! We do plan to lower the default level of the status logger to

logging-log4j2/log4j-core/src/main/java/org/apache/logging/log4j/core/appender/FileAppender.java, lines 97 to 99 in e84655e
We have several other cases like this. Until we can be confident that a user following the documented configuration won't see spurious warnings, we're holding off on lowering the threshold.
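As an aside for readers following along: the status-logger verbosity can be raised per configuration file via the `status` attribute on the root `Configuration` element, so internal warnings such as queue-full notifications become visible without waiting for a default change. A hedged sketch (the appender/logger bodies are placeholders, not a working config):

```xml
<!-- Raising status-logger verbosity so Log4j's internal warnings,
     including queue-full notifications, are printed to the console. -->
<Configuration status="WARN">
  <Appenders>...</Appenders>
  <Loggers>...</Loggers>
</Configuration>
```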
It sounds like this case should be an error, since the logger is no longer able to meet its contract to both (1) not block and (2) not drop events. This is in a whole other realm of impact from something like an unrecognized config property. |