Raise default async discard threshold to ERROR #3880

rschmitt · 2025-08-13T19:21:49Z

There are two basic scenarios where the ring buffer fills up. One is that an application is simply logging too much and the flushing process can't keep up, and in this case a discard threshold of WARN or INFO is probably sufficient to mitigate the problem. However, if flushing has stopped making progress altogether, e.g. due to a full or failed disk, then logging calls will block indefinitely. This can result in a production outage.

This change sets the default discard threshold to ERROR, in order to better mitigate the scenario where the disk fills up, fails, or is in the process of failing. With this threshold, logging should only block at the FATAL level, which would typically mean that the operation is already failing anyway.

There are two basic scenarios where the ring buffer fills up. One is that an application is simply logging too much and the flushing process can't keep up, and in this case a discard threshold of WARN or INFO is probably sufficient to mitigate the problem. However, if flushing has stopped making progress altogether, e.g. due to a full or failed disk, then logging calls will block indefinitely. This can result in a production outage. This change sets the default discard threshold to ERROR, in order to better mitigate the scenario where the disk fills up, fails, or is in the process of failing. With this threshold, logging should only block at the FATAL level, which would typically mean that the operation is already failing anyway.

github-actions · 2025-08-13T19:49:03Z

Job	Requested goals	Build Tool Version	Build Outcome
build-macos-latest	clean install	3.9.8	✅
build-ubuntu-latest	clean install	3.9.8	✅
build-windows-latest	clean install	3.9.8	✅

Generated by gradle/develocity-actions

rschmitt · 2025-08-15T02:06:45Z

More controversially, I think that asyncQueueFullPolicy should default to Discard for basically the same reasons. One weird thing about this is that the blocking queue-full policy is literally named DefaultAsyncQueueFullPolicy and is denoted by the property value Default. =\

vy · 2025-08-15T06:48:57Z

if flushing has stopped making progress altogether, e.g. due to a full or failed disk, then logging calls will block indefinitely. This can result in a production outage.

If logging is a vital component of your application, and it doesn't work, I think it is reasonable to signal that the application is down. Consider this scenario in a cluster, say, Kubernetes, environment: container signals down in liveliness probes due to logging buffer failures, pod will be taken down, and re-spawned in a new environment with sufficient logging capacity. This is what you'd want, instead of losing all logging for an indefinite amount of time. I think this is a good default. If you indeed want the other way around, it makes sense that you need to opt-in for extra configuration, which is log4j2.discardThreshold in this case.

@remkop, @ppkarwasz, WDYT?

More controversially, I think that asyncQueueFullPolicy should default to Discard for basically the same reasons. One weird thing about this is that the blocking queue-full policy is literally named DefaultAsyncQueueFullPolicy and is denoted by the property value Default. =\

This is a valid remark. I'd support a PR

renaming the default from Default to Discard, and
translating Default usages to Discard with a WARN'ing logged

ppkarwasz · 2025-08-15T15:19:30Z

Hi @vy,

I think this is a good default. If you indeed want the other way around, it makes sense that you need to opt-in for extra configuration, which is log4j2.discardThreshold in this case.

I agree: a default that discards all log events would be risky. While it technically still preserves FATAL events, those are rarely used in practice, especially since SLF4J doesn’t define a FATAL level.

More controversially, I think that asyncQueueFullPolicy should default to Discard for basically the same reasons. One weird thing about this is that the blocking queue-full policy is literally named DefaultAsyncQueueFullPolicy and is denoted by the property value Default. =\

This is a valid remark. I'd support a PR

renaming the default from Default to Discard, and

translating Default usages to Discard with a WARN'ing logged

I’d be hesitant to change the default from blocking (Default) to Discard, as that seems contrary to the original motivation behind the Log4j 2 project as initiated by @rgoers. My understanding has always been that one of the main differentiators between Logback and Log4j Core is that Logback drops messages by default (e.g., AsyncAppender discards on a full queue or messages are lost during reconfiguration), while Log4j Core does not, even during reconfiguration.

This “no events lost” behavior is explicitly part of our new threat model, where we guarantee reliability out-of-the-box, while still allowing users to opt into more discard-friendly settings when resilience against DoS attacks is a priority.

That said, I’m not opposed to:

Documenting different “profiles”: for example, an “audit mode” profile (fully reliable) and a “high-throughput” profile (discard under load), with the relevant configuration options spelled out.
Adding a Block synonym for Default, so users can more clearly express their intent without having to remember the historical naming.

But I would avoid replacing Default with Discard as the actual default.

rschmitt · 2025-08-17T00:06:57Z

If logging is a vital component of your application, and it doesn't work, I think it is reasonable to signal that the application is down.

See, this is what I thought too, but over the years we had so many outages caused by blocking log statements that we adopted these overridden defaults years ago, and I'm not aware of any problems that they've caused. Additionally, the default logging behavior is synchronous, right? If you opt in to asynchronous logging, you already run the risk of a crash causing all of the buffered log messages to be lost (and the log statements leading up to the crash are probably the most important ones!). The current defaults seem like the worst of both worlds: you're exposed to the risk of data loss and the risk of an outage caused by indefinite blocking. The principled reason to discard instead of blocking is that that's the only way to guarantee asynchronous behavior (which is probably what you think you're opting in to by using async logging in the first place).

rschmitt · 2025-08-17T00:12:17Z

That said, I’m not opposed to [...] adding a Block synonym for Default, so users can more clearly express their intent without having to remember the historical naming.

Yeah, I can do that.

remkop · 2025-08-17T02:26:52Z

I agree with @ppkarwasz that changing the default behavior would be against the original intent of Log4j2 to not drop events. It would be surprising to users to change the default behavior.

I would not oppose changing the name of the value used to configure the default behavior to a label that expresses the intention/behavior better.

vy · 2025-08-18T08:04:21Z

@rschmitt, thanks so much for bringing this to our attention, but I will close this PR since all interested maintainers expressed their reluctance. That said, we all agree renaming Default to a more self-documenting keyword (e.g., Block) is a good idea, which we better address in a separate PR.

rschmitt · 2025-08-18T17:23:53Z

@vy @ppkarwasz Just to be clear, this PR doesn't change the default policy to Discard, it changes the default discard threshold when Discard is configured (in other words, when the user has already opted in to dropping of log events). I don't think I've seen any comments on the actual proposal here.

ppkarwasz · 2025-08-18T20:31:42Z

@vy @ppkarwasz Just to be clear, this PR doesn't change the default policy to Discard, it changes the default discard threshold when Discard is configured (in other words, when the user has already opted in to dropping of log events). I don't think I've seen any comments on the actual proposal here.

Good point! 💯 Most of the discussion so far has centered on @vy’s comment #3880 (comment) about changing the default to Discard (a Log4j “heresy”) and we lost sight of your actual changes.

Regarding your PR specifically:

I’m still concerned about changing the default threshold to ERROR, since that would effectively stop all logging. The current default of INFO seems safer: it still lets WARN and ERROR through (i.e., prevents them from being discarded). With your change, when the queue is full, the status logger will emit a warning, but since its default level is ERROR, that warning won’t be shown. In practice, users could experience a complete logging blackout.
I also find the semantics of log4j2.discardThreshold a bit unintuitive. Naively, I’d expect it to mean “log events at this level or higher will be retained.” Instead, it only retains events that are strictly more severe than the configured level, which is easy to misinterpret.

rschmitt · 2025-08-18T20:48:09Z

With your change, when the queue is full, the status logger will emit a warning, but since its default level is ERROR, that warning won’t be shown.

Does the status logger go through all the same async appender machinery? I thought it was just a simple console logger used by log4j-api.

I also find the semantics of log4j2.discardThreshold a bit unintuitive. Naively, I’d expect it to mean “log events at this level or higher will be retained.”

Yeah, because that's how all other logging configuration works. I was also confused by this and had to carefully read source code to make sure I wasn't crazy.

ppkarwasz · 2025-08-18T20:58:11Z

Does the status logger go through all the same async appender machinery? I thought it was just a simple console logger used by log4j-api.

It is just a simple console logger. My point is that, if a queue-full event occurs and you change the default log4j2.discardThreshold to ERROR, then:

Users will not see any log events from the system.
They will not even see the status logger warning, unless they changed the default value of log4j2.statusLoggerLevel from ERROR to WARN.

rschmitt · 2025-08-18T21:17:56Z

They will not even see the status logger warning, unless they changed the default value of log4j2.statusLoggerLevel from ERROR to WARN.

Is that also true today? Are full queue notifications recorded at a log level that is not emitted by default?

ppkarwasz · 2025-08-18T21:50:46Z

Unfortunately, yes!

We do plan to lower the default level of the status logger to WARN (see e.g. #1592), but there are still a number of issues to resolve first. For example, the code below will always emit a warning if bufferedIo is false, because bufferSize is initialized to a non-zero value:

logging-log4j2/log4j-core/src/main/java/org/apache/logging/log4j/core/appender/FileAppender.java

Lines 97 to 99 in e84655e

    
           if (!bufferedIo && bufferSize > 0) { 
        
               LOGGER.warn("The bufferSize is set to {} but bufferedIo is false: {}", bufferSize, bufferedIo); 
        
           }

We have several other cases like this. Until we can be confident that a user following the documented configuration won’t see spurious warnings, we’re holding off on lowering the threshold.

rschmitt · 2025-08-18T23:06:33Z

It sounds like this case should be an error, since the logger is no longer able to meet its contract to both (1) not block and (2) not drop events. This is in a whole other realm of impact from something like an unrecognized config property.

github-project-automation bot added this to Log4j bug tracker Aug 13, 2025

github-project-automation bot moved this to To triage in Log4j bug tracker Aug 13, 2025

rschmitt force-pushed the discard-threshold branch from 624c803 to 052f802 Compare August 13, 2025 19:26

vy added api Affects the public API async Affects asynchronous loggers or appenders labels Aug 18, 2025

vy closed this Aug 18, 2025

github-project-automation bot moved this from To triage to Done in Log4j bug tracker Aug 18, 2025

Uh oh!

Raise default async discard threshold to ERROR #3880

Raise default async discard threshold to ERROR #3880

Uh oh!

Conversation

rschmitt commented Aug 13, 2025

Uh oh!

github-actions bot commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Generated by gradle/develocity-actions

Uh oh!

rschmitt commented Aug 15, 2025

Uh oh!

vy commented Aug 15, 2025

Uh oh!

ppkarwasz commented Aug 15, 2025

Uh oh!

rschmitt commented Aug 17, 2025

Uh oh!

rschmitt commented Aug 17, 2025

Uh oh!

remkop commented Aug 17, 2025

Uh oh!

vy commented Aug 18, 2025

Uh oh!

rschmitt commented Aug 18, 2025

Uh oh!

ppkarwasz commented Aug 18, 2025

Uh oh!

rschmitt commented Aug 18, 2025

Uh oh!

ppkarwasz commented Aug 18, 2025

Uh oh!

rschmitt commented Aug 18, 2025

Uh oh!

ppkarwasz commented Aug 18, 2025

Uh oh!

rschmitt commented Aug 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions bot commented Aug 13, 2025 •

edited

Loading