Description
Environment
Fluent Bit version: 4.2
OS: Debian 12
Input plugin: forward (listening on TCP with TLS)
Clients: multiple external clients connect and send PackedForward / CompressedPackedForward msgpack batches over TLS, authenticating with client certificates (mTLS).
The forward input is configured with Threaded On.
What I did / How to reproduce
Start Fluent Bit with the provided config (below).
From many clients, send PackedForward/CompressedPackedForward msgpack batches over TLS (example client logs are included below). Each client keeps a single persistent connection and sends batches of events frequently.
Observe the CPU usage on the Fluent Bit server.
Fluent Bit configuration:
[SERVICE]
    Flush 10
    Log_Level info
    storage.path /var/lib/fluent-bit/storage
    storage.type filesystem

[INPUT]
    Name forward
    Listen 0.0.0.0
    Port 24225
    Tag wagent
    Threaded On
    Buffer_Chunk_Size 64M
    Buffer_Max_Size 2048M
    tls On
    tls.verify On
    tls.ca_file /etc/fluent-bit/tls/ca.crt
    tls.crt_file /etc/fluent-bit/tls/fluentbit-server.crt
    tls.key_file /etc/fluent-bit/tls/fluentbit-server.key

[OUTPUT]
    Name http
    Match wagent
    Host 172.20.18.15
    Port 8123
    workers 20
    URI /?query=INSERT%20INTO%20benchmark.network_events_tls%20FORMAT%20JSONEachRow
    Format json_stream
    json_date_key timestamp
    json_date_format iso8601
    HTTP_User default
    HTTP_Passwd 1qaz!QAZ
    compress gzip
Sample client-side (sender) log snippets
(Clients pack events, gzip-compress the inner msgpack stream, and send PackedForward [tag, bin(compressed), options]):
PackedForward (CompressedPackedForward) packet size: 22350 bytes
#############################################################
stream_buf (concatenated msgpack objects) len = 120205
compressed_entries_len = 22351
PackedForward (CompressedPackedForward) packet size: 22379 bytes
...
[i] TLS shutdown OK
Sent 200 events to 172.24.80.227:24211 (tag='wagent')
Observed behavior
Despite Threaded On being set on the forward input, Fluent Bit only ever uses one CPU core to process incoming forward traffic.
I verified that multiple clients are actively sending data concurrently.
The output plugin appears to create many worker threads (workers 20) and uses multiple cores for HTTP output, but input processing remains bound to a single core.
Increasing workers on the http output improves output concurrency but does not increase the number of cores used by the forward input.
This prevents Fluent Bit from scaling across CPU cores for high-connection-count / high-EPS ingestion via the forward plugin.
Expected behavior
Threaded On (or some other available setting) should allow the forward input to use multiple threads/cores to handle simultaneous client connections and decode msgpack concurrently, so that ingestion can scale across CPU cores.
What I tried / debugging steps
Confirmed that the Fluent Bit version is 4.2.
Confirmed that Threaded On is present in the config.
Observed that adding more [INPUT] forward sections on different ports causes other CPU cores to be used (i.e., each input uses its own core). That is not ideal: I expect a single forward input to scale across cores when multiple client connections exist.
Verified that clients send PackedForward msgpack (the CompressedPackedForward variant) over TLS and that Fluent Bit accepts and processes small batches successfully.
Checked the Fluent Bit logs (info level): no errors related to TLS or msgpack decoding in steady state.
Tried toggling Threaded On and tuning the buffers (Buffer_Chunk_Size, Buffer_Max_Size): no change in input core usage.
The http output plugin uses many workers and multiple cores (so multi-threading works for outputs), but the forward input remains single-core.
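For reference, the multi-listener workaround mentioned above can be sketched as follows. This is a minimal sketch, assuming the clients can be split between ports (for example, half of them pointed at the second listener); port 24226 is a hypothetical choice, and the remaining settings are copied from the configuration in this report. As far as I understand the Fluent Bit documentation, `threaded` runs an input instance in its own thread, which would be consistent with each forward input occupying one core.

```
# Hypothetical sketch: two forward listeners sharing the same tag.
# First listener: the original port from this report.
[INPUT]
    Name              forward
    Listen            0.0.0.0
    Port              24225
    Tag               wagent
    Threaded          On
    Buffer_Chunk_Size 64M
    Buffer_Max_Size   2048M
    tls               On
    tls.verify        On
    tls.ca_file       /etc/fluent-bit/tls/ca.crt
    tls.crt_file      /etc/fluent-bit/tls/fluentbit-server.crt
    tls.key_file      /etc/fluent-bit/tls/fluentbit-server.key

# Second listener: port 24226 is an assumed value; clients must be reconfigured to use it.
[INPUT]
    Name              forward
    Listen            0.0.0.0
    Port              24226
    Tag               wagent
    Threaded          On
    Buffer_Chunk_Size 64M
    Buffer_Max_Size   2048M
    tls               On
    tls.verify        On
    tls.ca_file       /etc/fluent-bit/tls/ca.crt
    tls.crt_file      /etc/fluent-bit/tls/fluentbit-server.crt
    tls.key_file      /etc/fluent-bit/tls/fluentbit-server.key
```

Both inputs emit the `wagent` tag, so the existing `[OUTPUT]` section with `Match wagent` picks up records from either listener without changes. The cost is that client-to-port distribution has to be managed outside Fluent Bit, which is the operational overhead this report is trying to avoid.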
Questions / Clarifications I need
Is Threaded On supposed to enable multi-core concurrency for the forward input when servicing many simultaneous client connections (i.e., processing multiple connections in parallel across cores)? Or is it limited to some other concurrency model?
If Threaded On is intended to enable multi-threaded per-connection handling, is any additional configuration (service-level or compile-time option) required in v4.2 for a TLS-enabled forward input?
Are there known limitations or bugs in v4.2 where the forward input remains single-threaded per process (e.g., TLS blocking in a single thread)? If so, is there a recommended workaround?
If the forward input is intentionally single-threaded per listener, what is the recommended best practice for scaling ingestion with a single Fluent Bit instance? (e.g., run multiple fluent-bit processes, use a load balancer, increase Buffer_Chunk_Size, use a UDP/HTTP collector, or use a different plugin?)
Could TLS wrapping cause the listener to be handled by a single blocking thread (synchronous OpenSSL calls), making Threaded On ineffective? If so, are there compile-time or runtime flags to enable non-blocking TLS or OpenSSL asynchronous I/O?
Additional context / notes
Adding multiple forward inputs on different ports does spread the load across cores, but it is operationally clunky. Ideally, a single forward input should handle many connections and scale.