rschmitt
Contributor

There are two basic scenarios where the ring buffer fills up. One is that an application is simply logging too much and the flushing process can't keep up, and in this case a discard threshold of WARN or INFO is probably sufficient to mitigate the problem. However, if flushing has stopped making progress altogether, e.g. due to a full or failed disk, then logging calls will block indefinitely. This can result in a production outage.

This change sets the default discard threshold to ERROR, in order to better mitigate the scenario where the disk fills up, fails, or is in the process of failing. With this threshold, logging should only block at the FATAL level, which would typically mean that the operation is already failing anyway.
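For readers who want to reproduce the setup under discussion, here is a minimal sketch of opting in to the discarding policy and choosing a threshold through system properties. It assumes the property names `log4j2.asyncQueueFullPolicy` and `log4j2.discardThreshold` and that they are set before Log4j initializes; the class and method names are hypothetical and only illustrate the idea.

```java
// Sketch only: opts in to discarding on a full queue and picks a threshold.
// AsyncDiscardConfigSketch and applyDiscardSettings are illustrative names,
// not part of Log4j.
final class AsyncDiscardConfigSketch {

    // Assumption: these properties are read when Log4j starts up, so this
    // must run before the first logger is requested.
    static void applyDiscardSettings(String threshold) {
        System.setProperty("log4j2.asyncQueueFullPolicy", "Discard");
        System.setProperty("log4j2.discardThreshold", threshold);
    }

    public static void main(String[] args) {
        // The threshold proposed in this PR; the existing default is INFO.
        applyDiscardSettings("ERROR");
        System.out.println(System.getProperty("log4j2.asyncQueueFullPolicy"));
        System.out.println(System.getProperty("log4j2.discardThreshold"));
    }
}
```

The same values could equally be passed as `-D` JVM arguments; the programmatic form is used here only to keep the sketch self-contained.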


github-actions bot commented Aug 13, 2025

Job Requested goals Build Tool Version Build Outcome Build Scan®
build-macos-latest clean install 3.9.8 Build Scan PUBLISHED
build-ubuntu-latest clean install 3.9.8 Build Scan PUBLISHED
build-windows-latest clean install 3.9.8 Build Scan PUBLISHED
Generated by gradle/develocity-actions

@rschmitt
Contributor Author

More controversially, I think that asyncQueueFullPolicy should default to Discard for basically the same reasons. One weird thing about this is that the blocking queue-full policy is literally named DefaultAsyncQueueFullPolicy and is denoted by the property value Default. =\

@vy
Member

vy commented Aug 15, 2025

> if flushing has stopped making progress altogether, e.g. due to a full or failed disk, then logging calls will block indefinitely. This can result in a production outage.

If logging is a vital component of your application, and it doesn't work, I think it is reasonable to signal that the application is down. Consider this scenario in a clustered environment such as Kubernetes: the container reports itself as down in liveness probes due to logging buffer failures, the pod is taken down, and it is respawned in a new environment with sufficient logging capacity. This is what you'd want, instead of losing all logging for an indefinite amount of time. I think this is a good default. If you indeed want the other way around, it makes sense that you need to opt in through extra configuration, which is log4j2.discardThreshold in this case.

@remkop, @ppkarwasz, WDYT?

> More controversially, I think that asyncQueueFullPolicy should default to Discard for basically the same reasons. One weird thing about this is that the blocking queue-full policy is literally named DefaultAsyncQueueFullPolicy and is denoted by the property value Default. =\

This is a valid remark. I'd support a PR

  1. renaming the default from Default to Discard, and
  2. translating Default usages to Discard with a WARN'ing logged

@ppkarwasz
Contributor

Hi @vy,

> I think this is a good default. If you indeed want the other way around, it makes sense that you need to opt-in for extra configuration, which is log4j2.discardThreshold in this case.

I agree: a default that discards all log events would be risky. While it technically still preserves FATAL events, those are rarely used in practice, especially since SLF4J doesn’t define a FATAL level.

> > More controversially, I think that asyncQueueFullPolicy should default to Discard for basically the same reasons. One weird thing about this is that the blocking queue-full policy is literally named DefaultAsyncQueueFullPolicy and is denoted by the property value Default. =\
>
> This is a valid remark. I'd support a PR
>
>   1. renaming the default from Default to Discard, and
>   2. translating Default usages to Discard with a WARN'ing logged

I’d be hesitant to change the default from blocking (Default) to Discard, as that seems contrary to the original motivation behind the Log4j 2 project as initiated by @rgoers. My understanding has always been that one of the main differentiators between Logback and Log4j Core is that Logback drops messages by default (e.g., AsyncAppender discards on a full queue or messages are lost during reconfiguration), while Log4j Core does not, even during reconfiguration.

This “no events lost” behavior is explicitly part of our new threat model, where we guarantee reliability out-of-the-box, while still allowing users to opt into more discard-friendly settings when resilience against DoS attacks is a priority.

That said, I’m not opposed to:

  • Documenting different “profiles”: for example, an “audit mode” profile (fully reliable) and a “high-throughput” profile (discard under load), with the relevant configuration options spelled out.
  • Adding a Block synonym for Default, so users can more clearly express their intent without having to remember the historical naming.

But I would avoid replacing Default with Discard as the actual default.

@rschmitt
Contributor Author

> If logging is a vital component of your application, and it doesn't work, I think it is reasonable to signal that the application is down.

See, this is what I thought too, but over the years we had so many outages caused by blocking log statements that we adopted these overridden defaults years ago, and I'm not aware of any problems that they've caused. Additionally, the default logging behavior is synchronous, right? If you opt in to asynchronous logging, you already run the risk of a crash causing all of the buffered log messages to be lost (and the log statements leading up to the crash are probably the most important ones!). The current defaults seem like the worst of both worlds: you're exposed to the risk of data loss and the risk of an outage caused by indefinite blocking. The principled reason to discard instead of blocking is that that's the only way to guarantee asynchronous behavior (which is probably what you think you're opting in to by using async logging in the first place).
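The difference between blocking and discarding on a full buffer can be sketched with a plain bounded queue. This is an illustration only: it uses `java.util.concurrent.ArrayBlockingQueue` as a stand-in for Log4j's ring buffer, and the class and method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch only: a bounded queue standing in for the async logging buffer.
final class QueueFullSketch {

    // Discard-style enqueue: never waits. Returns false when the buffer is
    // full, meaning the event is dropped and the caller carries on.
    static boolean enqueueOrDiscard(BlockingQueue<String> buffer, String event) {
        return buffer.offer(event);
    }

    // Block-style enqueue, bounded by a timeout so the sketch terminates.
    // With an unbounded wait (BlockingQueue.put) and a consumer that has
    // stopped draining, the calling thread would be stuck indefinitely.
    static boolean enqueueOrWait(BlockingQueue<String> buffer, String event, long timeoutMillis) {
        try {
            return buffer.offer(event, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Capacity 2 and no consumer: models a flusher that makes no progress.
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(2);
        System.out.println(enqueueOrDiscard(buffer, "event-1"));
        System.out.println(enqueueOrDiscard(buffer, "event-2"));
        System.out.println(enqueueOrDiscard(buffer, "event-3"));
        System.out.println(enqueueOrWait(buffer, "event-4", 100));
    }
}
```

In this model, only the non-waiting `offer` keeps the caller's latency independent of the consumer, which is the property argued for above.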

@rschmitt
Contributor Author

> That said, I’m not opposed to [...] adding a Block synonym for Default, so users can more clearly express their intent without having to remember the historical naming.

Yeah, I can do that.

@remkop
Contributor

remkop commented Aug 17, 2025

I agree with @ppkarwasz that changing the default behavior would be against the original intent of Log4j2 to not drop events. It would be surprising to users to change the default behavior.

I would not oppose changing the name of the value used to configure the default behavior to a label that expresses the intention/behavior better.

@vy added the api (Affects the public API) and async (Affects asynchronous loggers or appenders) labels on Aug 18, 2025
@vy
Member

vy commented Aug 18, 2025

@rschmitt, thanks so much for bringing this to our attention, but I will close this PR since all interested maintainers expressed their reluctance. That said, we all agree that renaming Default to a more self-documenting keyword (e.g., Block) is a good idea, which is better addressed in a separate PR.

@vy closed this on Aug 18, 2025
@github-project-automation (bot) moved this from To triage to Done in Log4j bug tracker on Aug 18, 2025
@rschmitt
Contributor Author

@vy @ppkarwasz Just to be clear, this PR doesn't change the default policy to Discard, it changes the default discard threshold when Discard is configured (in other words, when the user has already opted in to dropping of log events). I don't think I've seen any comments on the actual proposal here.

@ppkarwasz
Contributor

> @vy @ppkarwasz Just to be clear, this PR doesn't change the default policy to Discard, it changes the default discard threshold when Discard is configured (in other words, when the user has already opted in to dropping of log events). I don't think I've seen any comments on the actual proposal here.

Good point! 💯 Most of the discussion so far has centered on @vy’s comment #3880 (comment) about changing the default to Discard (a Log4j “heresy”) and we lost sight of your actual changes.

Regarding your PR specifically:

  • I’m still concerned about changing the default threshold to ERROR, since that would effectively stop all logging. The current default of INFO seems safer: it still lets WARN and ERROR through (i.e., prevents them from being discarded). With your change, when the queue is full, the status logger will emit a warning, but since its default level is ERROR, that warning won’t be shown. In practice, users could experience a complete logging blackout.
  • I also find the semantics of log4j2.discardThreshold a bit unintuitive. Naively, I’d expect it to mean “log events at this level or higher will be retained.” Instead, it only retains events that are strictly more severe than the configured level, which is easy to misinterpret.
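The two readings of the threshold can be contrasted in a small sketch. This is a simplified model of the semantics described in this thread, not Log4j's implementation; the enum and method names are hypothetical.

```java
// Sketch only: models the two interpretations of a level threshold.
final class DiscardThresholdSketch {

    // Declared from most severe to least severe, so a smaller ordinal
    // means a more severe level.
    enum Severity { FATAL, ERROR, WARN, INFO, DEBUG, TRACE }

    // Discard-threshold reading described above: on a full queue, only
    // events strictly more severe than the threshold are retained.
    static boolean retainedByDiscardThreshold(Severity event, Severity threshold) {
        return event.ordinal() < threshold.ordinal();
    }

    // The naive reading, matching ordinary logger levels: events at the
    // configured level or more severe are let through.
    static boolean enabledByLoggerLevel(Severity event, Severity level) {
        return event.ordinal() <= level.ordinal();
    }

    public static void main(String[] args) {
        // Threshold ERROR: an ERROR event is dropped, a FATAL event is kept.
        System.out.println(retainedByDiscardThreshold(Severity.ERROR, Severity.ERROR));
        System.out.println(retainedByDiscardThreshold(Severity.FATAL, Severity.ERROR));
        // Logger level ERROR: an ERROR event is let through.
        System.out.println(enabledByLoggerLevel(Severity.ERROR, Severity.ERROR));
    }
}
```

Under this model, a threshold of INFO retains WARN and ERROR, while a threshold of ERROR retains only FATAL, which matches the behavior described in the PR summary.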

@rschmitt
Contributor Author

> With your change, when the queue is full, the status logger will emit a warning, but since its default level is ERROR, that warning won’t be shown.

Does the status logger go through all the same async appender machinery? I thought it was just a simple console logger used by log4j-api.

> I also find the semantics of log4j2.discardThreshold a bit unintuitive. Naively, I’d expect it to mean “log events at this level or higher will be retained.”

Yeah, because that's how all other logging configuration works. I was also confused by this and had to carefully read source code to make sure I wasn't crazy.

@ppkarwasz
Contributor

> Does the status logger go through all the same async appender machinery? I thought it was just a simple console logger used by log4j-api.

It is just a simple console logger. My point is that, if a queue-full event occurs and you change the default log4j2.discardThreshold to ERROR, then:

  • Users will not see any log events from the system.
  • They will not even see the status logger warning, unless they changed the default value of log4j2.statusLoggerLevel from ERROR to WARN.

@rschmitt
Contributor Author

> They will not even see the status logger warning, unless they changed the default value of log4j2.statusLoggerLevel from ERROR to WARN.

Is that also true today? Are full queue notifications recorded at a log level that is not emitted by default?

@ppkarwasz
Contributor

Unfortunately, yes!

We do plan to lower the default level of the status logger to WARN (see e.g. #1592), but there are still a number of issues to resolve first. For example, the code below will always emit a warning if bufferedIo is false, because bufferSize is initialized to a non-zero value:

if (!bufferedIo && bufferSize > 0) {
    LOGGER.warn("The bufferSize is set to {} but bufferedIo is false: {}", bufferSize, bufferedIo);
}

We have several other cases like this. Until we can be confident that a user following the documented configuration won’t see spurious warnings, we’re holding off on lowering the threshold.

@rschmitt
Contributor Author

It sounds like this case should be an error, since the logger is no longer able to meet its contract to both (1) not block and (2) not drop events. This is in a whole other realm of impact from something like an unrecognized config property.
