Skip to content

Conversation

@mjameswh
Copy link
Contributor

@mjameswh mjameswh commented Aug 27, 2025

What was changed

Considerably improved the Runtime's NativeLogCollector logic.

  • NativeLogCollector now starts its own interval timer to trigger log flushing.
  • NativeLogCollector now ensures that the number of buffered log entries can't exceed 2000.
  • NativeLogCollector now ensure that each message gets throttled by at least 100 ms. This is required to ensure correct global ordering of messages.
  • Added integration test to confirm that NativeLogCollector is indeed flushing as expected.

I designed this thing so that we can make it user-tunable in the future, but I don’t want to commit to that exact algorithm (and therefore its parameters) just yet. Arguably, it should even be possible for users to completely opt out of log buffering/reordering, e.g. if that's already done anyway by their downstream log collector. These will be considered in future work.

Why?

  • When Runtime is configured to forward logs from Core to the lang side, we use a lang-side buffer (NativeLogCollector) to reorder log entries coming from different sources (notably Workflow log APIs through Sinks, and Core) based on their absolute timestamp. This helps reduce incoherencies in log order.
  • Since 1.12.0, NativeLogCollector was relying on Core periodic flushes to perform its own flushing to the downstream logger. However, it turns out that Core won't flush unless there has been at least one log message in the mean time, which can be a rare occurrence if Core's log level is set to WARN or ERROR.
  • This has been causing considerable delays in propagation of workflow logs (fixes [Bug] Workflow logs not propagated unless telemetryOptions.logging.filter is DEBUG or TRACE #1750), as well as memory leak as the log buffer may reach disproportionate size.

@mjameswh mjameswh requested a review from a team as a code owner August 27, 2025 19:01
@mjameswh mjameswh requested a review from THardy98 August 27, 2025 19:02
@mjameswh mjameswh changed the title Fix worker logs not getting fliushed in Core log level set to ERROR Fix worker logs not getting flushed in Core log level set to ERROR Aug 27, 2025
Copy link
Contributor

@THardy98 THardy98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is an improvement over an unbounded buffer and has more forceful flushing mechanisms with the threshold flush and the unconditional flush.

I'm a bit concerned with hardcoded values that configure the buffer size and flush timers (minBufferTimeMs, flushPassIntervalMs, and maxBufferSize). The only potential issue I can see (aside from incorrectly ordered logs) is if the log emission rate is greater than the flushing rate. Not sure if that's a realistic concern.

In any case, it would be nice to benchmark the flushing functions. It would give a clear indicator at what log emission threshold these buffer/flush config values no longer work.

I generally think this is an improvement, depending on how strenuous the general logging case is.

}
}

private appendOne(entry: LogEntry): void {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit of a nit, but would prefer a better name here (i.e. appendOneAndFlush)
it's not clear unless you read the method that you are also flushing

Copy link
Contributor

@THardy98 THardy98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, given that we suspect the unconditional flush to handle when the timer flush is overwhelmed.

some nits above

@mjameswh
Copy link
Contributor Author

Split code, renamed methods, and rephrased comments to clarify the points raised by @THardy98.

@mjameswh
Copy link
Contributor Author

mjameswh commented Aug 27, 2025

I'm a bit concerned with hardcoded values that configure the buffer size and flush timers (minBufferTimeMs, flushPassIntervalMs, and maxBufferSize). The only potential issue I can see (aside from incorrectly ordered logs) is if the log emission rate is greater than the flushing rate. Not sure if that's a realistic concern.

Not a concern. That's what flushUnconditionally() is for. Renamed to flushExcess() for clarity.

@mjameswh mjameswh merged commit fc9de90 into main Aug 27, 2025
23 checks passed
@mjameswh mjameswh deleted the buffered-logs-flushing branch August 27, 2025 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Workflow logs not propagated unless telemetryOptions.logging.filter is DEBUG or TRACE

3 participants