
unknown msgpack type errors in fluentbit 4.0.12/4.1.1 #11078

@MichaelEischer

Description


Bug Report

Describe the bug

We see some rare, randomly occurring crashes of fluent-bit due to running out of memory. Judging from our metrics, fluent-bit normally uses a few dozen MB, but in some rare cases it starts printing the following warning several times:

[ warn] [msgpack2json] unknown msgpack type 1880530997

(The number is usually different for each occurrence.) A few seconds later, fluent-bit is killed by the kernel's OOM killer because its memory usage has grown to 1 GB, which is the limit configured for the pod. See below for configuration details.

According to our logs, fluent-bit 4.0.12 and 4.1.1 are both affected by this issue. With fluent-bit 4.0.5, we haven't encountered the issue so far.

Maybe related to #10729 .
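
For context on why the warning itself looks suspicious: a msgpack object's type is a small enum constant (nil, boolean, integer, float, str, array, map, bin, ext), and the warning presumably comes from a fallback branch in the msgpack-to-JSON conversion that is hit when none of those match. A value like 1880530997 is far outside that range, which suggests to us that the converter is reading corrupted or stale data rather than a legitimately unsupported type. Below is only a generic, minimal sketch of the usual msgpack-c consumption pattern (not fluent-bit's actual code) to illustrate where such a warning would come from:

/*
 * Generic sketch of a msgpack-c consumer: it switches on msgpack_object.type
 * and falls through to a default branch for anything it does not handle.
 * All valid msgpack_object_type values are tiny enum constants, so a reported
 * value like 1880530997 almost certainly means the object being read already
 * contains garbage. Link against the msgpack-c C library.
 */
#include <stdio.h>
#include <msgpack.h>

static void dump_types(const char *buf, size_t len)
{
    msgpack_unpacked result;
    size_t off = 0;

    msgpack_unpacked_init(&result);
    while (msgpack_unpack_next(&result, buf, len, &off) == MSGPACK_UNPACK_SUCCESS) {
        switch (result.data.type) {
        case MSGPACK_OBJECT_MAP:
            printf("map with %u entries\n", (unsigned) result.data.via.map.size);
            break;
        case MSGPACK_OBJECT_STR:
            printf("string of %u bytes\n", (unsigned) result.data.via.str.size);
            break;
        /* ... the remaining enum members would be handled here ... */
        default:
            fprintf(stderr, "unknown msgpack type %d\n", (int) result.data.type);
            break;
        }
    }
    msgpack_unpacked_destroy(&result);
}

int main(void)
{
    msgpack_sbuffer sbuf;
    msgpack_packer pk;

    /* pack {"log": "hello"} and walk the buffer again */
    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pk, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_map(&pk, 1);
    msgpack_pack_str(&pk, 3);
    msgpack_pack_str_body(&pk, "log", 3);
    msgpack_pack_str(&pk, 5);
    msgpack_pack_str_body(&pk, "hello", 5);

    dump_types(sbuf.data, sbuf.size);
    msgpack_sbuffer_destroy(&sbuf);
    return 0;
}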

To Reproduce

  • Example log message if applicable:
2025-10-28 12:42:59.919  {"log":"[2025/10/28 11:42:59.918841221] [ warn] [msgpack2json] unknown msgpack type 1880530997"}
2025-10-28 12:42:59.921  {"log":"[2025/10/28 11:42:59.919021700] [ warn] [msgpack2json] unknown msgpack type 1880530997"}
[omitted 12 occurrences of the log output]
2025-10-28 12:43:00.245  {"log":"[2025/10/28 11:43:00.244996949] [ warn] [msgpack2json] unknown msgpack type 1880530997"}
[a few seconds later the container runs out of memory and is restarted]
[`dmesg -t` output below, the timestamps are a bit out of sync unfortunately]
[Tue Oct 28 11:42:16 2025] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Tue Oct 28 11:42:16 2025] [2582969] 65535 2582969      255      160    28672        0          -998 pause
[Tue Oct 28 11:42:16 2025] [3270883]     0 3270883   968217   265106  2674688        0           985 fluent-bit
[Tue Oct 28 11:42:16 2025] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-befdddd83563f1b8859e3fd64974c2159c23b81e8bc067401ee75ad31dbdbc4a.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod60967bdb_6872_45d7_bbfc_e1b3e3e04e07.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod60967bdb_6872_45d7_bbfc_e1b3e3e04e07.slice/cri-containerd-befdddd83563f1b8859e3fd64974c2159c23b81e8bc067401ee75ad31dbdbc4a.scope,task=fluent-bit,pid=3270883,uid=0
[Tue Oct 28 11:42:16 2025] Memory cgroup out of memory: Killed process 3270883 (fluent-bit) total-vm:3872868kB, anon-rss:1044848kB, file-rss:15576kB, shmem-rss:0kB, UID:0 pgtables:2612kB oom_score_adj:985
[Tue Oct 28 11:42:16 2025] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod60967bdb_6872_45d7_bbfc_e1b3e3e04e07.slice/cri-containerd-befdddd83563f1b8859e3fd64974c2159c23b81e8bc067401ee75ad31dbdbc4a.scope are going to be killed due to memory.oom.group set
[Tue Oct 28 11:42:16 2025] Memory cgroup out of memory: Killed process 3270899 (flb-out-splunk.) total-vm:3872868kB, anon-rss:1044848kB, file-rss:15576kB, shmem-rss:0kB, UID:0 pgtables:2612kB oom_score_adj:985
  • Steps to reproduce the problem:

We send audit logs (basically JSON documents) to fluent-bit, which then forwards them to Splunk (see the sketch below for how records reach the http input). There are a few such OOM kills every day and they only show up randomly in production, which makes this hard to reproduce.
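
For reference, here is a minimal sketch of how a single record can be pushed into the http input. It assumes the input is listening on its default port 9880 and that posting to the /audit path is what gives the records their audit tag; the real producers in our environment are different services, and the JSON payload below is just a placeholder, since we have not yet isolated a document that triggers the warning.

/* Sketch only: POSTs one placeholder audit record to fluent-bit's http input.
 * Assumes the default listener port 9880 and that the URI path ("/audit")
 * becomes the record tag. Build with: cc post_audit.c -lcurl */
#include <curl/curl.h>

int main(void)
{
    CURL *curl;
    CURLcode res;
    struct curl_slist *headers = NULL;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    curl = curl_easy_init();
    if (!curl) {
        return 1;
    }

    headers = curl_slist_append(headers, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1:9880/audit");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS,
                     "{\"user\":\"alice\",\"action\":\"login\",\"result\":\"ok\"}");

    /* returns CURLE_OK when the request was delivered */
    res = curl_easy_perform(curl);

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return (res == CURLE_OK) ? 0 : 1;
}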

fluent-bit is configured as shown below. The pod just receives this configuration and has a memory usage limit of 1 GB (normal usage is around 80 MB).

[SERVICE]
    hc_errors_count 0
    hc_period 60
    hc_retry_failure_count 0
    health_check on
    http_listen 0.0.0.0
    http_port 2020
    http_server on
    log_level info
    scheduler.base 1
    scheduler.cap 60
    storage.backlog.mem_limit 5M
    storage.checksum off
    storage.max_chunks_up 128
    storage.metrics on
    storage.path /data/
    storage.sync normal

[INPUT]
    name http
    storage.type filesystem

[OUTPUT]
    match audit
    name null

[FILTER]
    add cluster some-name
    match *
    name modify

[OUTPUT]
    event_host some-name
    event_index some-name
    event_sourcetype auditlog
    host splunk-host
    match audit
    name splunk
    port 8088
    retry_limit no_limits
    splunk_send_raw off
    splunk_token ${SPLUNK_HEC_TOKEN}
    storage.total_limit_size 900M
    tls on
    tls.verify on
    tls.verify_hostname on

Expected behavior

Fluent-bit shouldn't suddenly allocate so much memory that it crashes.

Your Environment

  • Version used: 4.0.12 / 4.1.1
  • Configuration: see above
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.33.5
  • Server type and version: OpenStack VM
  • Operating System and version: Flatcar Linux 6.6.95-flatcar
  • Filters and plugins: null & splunk output

Additional context
