forward input uses only one CPU core even with "Threaded On" (Fluent Bit v4.2, TLS PackedForward msgpack clients) #11285

@elham-azadi

Description

Environment

Fluent Bit version: 4.2

OS: Debian 12

Input plugin: forward (listening on TCP with TLS)

Multiple external clients are connecting and sending PackedForward / CompressedPackedForward msgpack batches over TLS (mTLS from clients).

Fluent Bit is configured with Threaded On on the forward input.

What I did / How to reproduce

Start Fluent Bit with the provided config (below).

From many clients, send PackedForward/CompressedPackedForward msgpack batches over TLS (example client logs included below). Each client keeps one connection; clients send batches of events frequently.

Observe the CPU usage on the Fluent Bit server.
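To make the "observe the CPU usage" step reproducible, per-thread CPU time can be sampled from /proc on Linux. The following is a minimal sketch, not part of Fluent Bit: the helper names `thread_cpu_ticks` and `busiest_threads` are hypothetical, and it assumes a Linux /proc filesystem where `utime` and `stime` are fields 14 and 15 of each thread's `stat` file.

```python
from pathlib import Path


def thread_cpu_ticks(pid, proc_root="/proc"):
    """Return {tid: (thread_name, utime + stime)} in clock ticks for one process."""
    result = {}
    for entry in (Path(proc_root) / str(pid) / "task").iterdir():
        try:
            stat = (entry / "stat").read_text()
        except OSError:
            continue  # the thread exited while we were scanning
        # Field 2 (comm) is parenthesised and may itself contain spaces.
        name = stat[stat.index("(") + 1:stat.rindex(")")]
        rest = stat[stat.rindex(")") + 1:].split()
        # rest[0] is field 3 (state), so utime (14) and stime (15) are rest[11], rest[12].
        result[int(entry.name)] = (name, int(rest[11]) + int(rest[12]))
    return result


def busiest_threads(before, after, top=5):
    """Rank threads by CPU ticks consumed between two snapshots."""
    deltas = [
        (ticks - before.get(tid, (name, 0))[1], tid, name)
        for tid, (name, ticks) in after.items()
    ]
    return sorted(deltas, reverse=True)[:top]


# Usage (assumption: `pid` is the PID of the running fluent-bit process):
#   first = thread_cpu_ticks(pid)
#   time.sleep(5)
#   print(busiest_threads(first, thread_cpu_ticks(pid)))
```

If a single thread accounts for nearly all the ticks consumed between two snapshots while clients are sending, that is consistent with the single-core behavior reported here.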

Fluent Bit configuration:

[SERVICE]
    Flush            10
    Log_Level        info
    storage.path     /var/lib/fluent-bit/storage
    storage.type     filesystem

[INPUT]
    Name   forward
    Listen 0.0.0.0
    Port   24225
    Tag    wagent
    Threaded On
    Buffer_Chunk_Size 64M
    Buffer_Max_Size 2048M
    tls         On
    tls.verify  On
    tls.ca_file /etc/fluent-bit/tls/ca.crt
    tls.crt_file /etc/fluent-bit/tls/fluentbit-server.crt
    tls.key_file /etc/fluent-bit/tls/fluentbit-server.key

[OUTPUT]
    Name             http
    Match            wagent
    Host             172.20.18.15
    Port             8123
    workers          20
    URI              /?query=INSERT%20INTO%20benchmark.network_events_tls%20FORMAT%20JSONEachRow
    Format           json_stream
    json_date_key    timestamp
    json_date_format iso8601
    HTTP_User        default
    HTTP_Passwd      1qaz!QAZ
    compress         gzip

Client (sending) sample log snippets
(Clients are packing events, gzip-compressing the inner msgpack stream and sending PackedForward [tag, bin(compressed), options]):

PackedForward (CompressedPackedForward) packet size: 22350 bytes
#############################################################
stream_buf (concatenated msgpack objects) len = 120205
compressed_entries_len = 22351
PackedForward (CompressedPackedForward) packet size: 22379 bytes
...
[i] TLS shutdown OK
Sent 200 events to 172.24.80.227:24211 (tag='wagent')
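
For reference, the frame layout the clients produce can be sketched as follows. This is not the actual client code: the encoder below is a hand-written subset of msgpack (enough for this frame only), the function names are hypothetical, and the `{"compressed": "gzip"}` option is what I understand the Forward protocol to expect for CompressedPackedForward. The TLS transport is omitted.

```python
import gzip
import struct


def pack(obj):
    """Encode a small msgpack subset: non-negative ints, str, bytes, list, dict."""
    if isinstance(obj, bool):
        raise TypeError("bool is not handled by this sketch")
    if isinstance(obj, int):
        if 0 <= obj <= 0x7F:
            return bytes([obj])                           # positive fixint
        if 0 <= obj <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", obj)       # uint32
        raise ValueError("integer out of range for this sketch")
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw         # fixstr
        if len(raw) <= 0xFFFF:
            return b"\xda" + struct.pack(">H", len(raw)) + raw  # str16
        raise ValueError("string too long for this sketch")
    if isinstance(obj, bytes):
        if len(obj) <= 0xFF:
            return b"\xc4" + bytes([len(obj)]) + obj      # bin8
        if len(obj) <= 0xFFFF:
            return b"\xc5" + struct.pack(">H", len(obj)) + obj  # bin16
        return b"\xc6" + struct.pack(">I", len(obj)) + obj      # bin32
    if isinstance(obj, list) and len(obj) <= 15:
        return bytes([0x90 | len(obj)]) + b"".join(pack(x) for x in obj)  # fixarray
    if isinstance(obj, dict) and len(obj) <= 15:
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + body            # fixmap
    raise TypeError(f"unsupported type for this sketch: {type(obj).__name__}")


def compressed_packed_forward(tag, events):
    """Build [tag, bin(gzip(concatenated [time, record] entries)), options]."""
    stream = b"".join(pack([ts, record]) for ts, record in events)
    return pack([tag, gzip.compress(stream), {"compressed": "gzip"}])
```

In the logs above, `stream_buf` corresponds to `stream` here and `compressed_entries_len` to the length of the gzip output, so each frame costs the server one gzip decompression plus msgpack decoding of the inner stream.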

Observed behavior

Despite Threaded On on the forward input, Fluent Bit only ever uses one CPU core to process incoming forward traffic.

I verified multiple clients are actively sending data concurrently.

Output plugin appears to create many worker threads (workers 20) and uses multiple cores for HTTP output, but input CPU remains single-core bound.

Increasing workers on http helps output concurrency but does not increase the number of cores used by the forward input.

This prevents Fluent Bit from scaling CPU-wise for high-connection / high-EPS ingestion via the forward plugin.

Expected behavior

Threaded On (or other available configuration) should allow the forward input to use multiple threads/cores to handle multiple simultaneous client connections and decode msgpack concurrently, so ingestion can scale across CPU cores.

What I tried / debugging steps

Confirmed Fluent Bit is v4.2.

Confirmed Threaded On is present in config.

Observed that adding additional [INPUT] forward sections on different ports causes other CPU cores to become used (i.e., multiple inputs each use their own core). That is not ideal: I expect a single forward input to scale across cores when multiple client connections exist.

Verified clients are sending PackedForward msgpack (CompressedPackedForward variant) over TLS and that Fluent Bit accepts and processes small batches successfully.

Checked Fluent Bit logs (info level): no errors related to TLS or msgpack decoding in steady state.

Tried toggling Threaded On and tuning buffers (Buffer_Chunk_Size, Buffer_Max_Size): no change in input core usage.

Output http plugin uses many workers and multiple cores (so multi-threading works on outputs), but forward input remains single-core.
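
The multi-port workaround mentioned above can be generated rather than hand-copied. This is only a sketch of what I tested manually: the helper name `render_forward_inputs` is hypothetical, the consecutive port numbering is an assumption, and the keys simply mirror the [INPUT] section from my config.

```python
# Template mirroring the single [INPUT] section from the issue's config;
# only the port changes between listeners.
FORWARD_INPUT = """[INPUT]
    Name   forward
    Listen 0.0.0.0
    Port   {port}
    Tag    wagent
    Threaded On
    Buffer_Chunk_Size 64M
    Buffer_Max_Size 2048M
    tls         On
    tls.verify  On
    tls.ca_file /etc/fluent-bit/tls/ca.crt
    tls.crt_file /etc/fluent-bit/tls/fluentbit-server.crt
    tls.key_file /etc/fluent-bit/tls/fluentbit-server.key
"""


def render_forward_inputs(base_port, count):
    """Render `count` forward inputs on consecutive ports starting at `base_port`."""
    return "\n".join(FORWARD_INPUT.format(port=base_port + i) for i in range(count))


# Usage: print(render_forward_inputs(24225, 4)) and paste the result into the
# config in place of the single [INPUT] section.
```

Clients then have to be spread across the ports (statically or behind a TCP load balancer), which is the operational overhead I would like to avoid.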

Questions / Clarifications I need

Is Threaded On supposed to enable multi-core concurrency for the forward input when servicing many simultaneous client connections (i.e., process multiple connections in parallel across cores)? Or is it limited to some other concurrency model?

If Threaded On is intended to allow per-connection multi-threaded handling, is there any additional configuration required (service or compile-time option) in v4.2 for TLS-enabled forward input?

Are there known limitations or bugs in v4.2 where the forward input remains single-threaded per process (e.g., TLS blocking in a single thread), and if so is there a recommended workaround?

If the forward input is intentionally single-threaded per listener, what is the recommended best practice to scale ingestion with a single Fluent Bit instance? (e.g., run multiple fluent-bit processes, use a load balancer, increase Buffer_Chunk_Size, use a UDP/HTTP collector, or use a different plugin?)

Could TLS wrapping cause the listener to be handled by a single blocking thread (OpenSSL sync calls), making Threaded On ineffective? If so, are there compile-time or runtime flags to turn on non-blocking TLS or OpenSSL asynchronous IO?

Additional context / notes
Adding multiple forward inputs on different ports spreads load across cores, but that is operationally clunky. Ideally one forward input should handle many connections and scale.
