Bug Report
Describe the bug
We see some rare, randomly occurring crashes of fluent-bit due to running out of memory. Judging from our metrics, fluent-bit normally uses a few dozen MB, but in some rare cases it starts printing the following warning several times:
[ warn] [msgpack2json] unknown msgpack type 1880530997
(The number is usually different for each occurrence.) A few seconds later, fluent-bit is killed by the kernel's OOM killer because it now requires 1 GB of RAM, which is the memory limit configured for the pod. See below for configuration details.
According to our logs, fluent-bit 4.0.12 and 4.1.1 are both affected by this issue. With fluent-bit 4.0.5, we haven't encountered this issue so far.
Maybe related to #10729 .
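For context on where the warning might come from: fluent-bit converts buffered msgpack records to JSON before shipping them to Splunk, and the message looks like the default branch of that conversion hitting a type value it doesn't recognize. The sketch below is not fluent-bit's code, just a minimal msgpack-c program (the describe() helper is made up for illustration) showing the kind of type switch involved; the large, changing numbers in our logs look like whatever value happened to sit in the type field rather than any valid msgpack object type.

#include <stdio.h>
#include <msgpack.h>

/* Illustrative only, not fluent-bit's implementation: classify unpacked
 * msgpack objects by type. An msgpack_object whose type field holds a
 * garbage value (e.g. after memory corruption) falls through to the
 * default branch and prints that raw value, much like the warning above. */
static void describe(msgpack_object o)
{
    switch (o.type) {
    case MSGPACK_OBJECT_NIL:              puts("null");   break;
    case MSGPACK_OBJECT_BOOLEAN:          puts("bool");   break;
    case MSGPACK_OBJECT_POSITIVE_INTEGER: puts("uint");   break;
    case MSGPACK_OBJECT_NEGATIVE_INTEGER: puts("int");    break;
    case MSGPACK_OBJECT_FLOAT:            puts("float");  break;
    case MSGPACK_OBJECT_STR:              puts("string"); break;
    case MSGPACK_OBJECT_ARRAY:            puts("array");  break;
    case MSGPACK_OBJECT_MAP:              puts("map");    break;
    default:
        fprintf(stderr, "[msgpack2json] unknown msgpack type %d\n", (int) o.type);
        break;
    }
}

int main(void)
{
    msgpack_sbuffer sbuf;
    msgpack_packer pk;
    msgpack_unpacked result;
    size_t off = 0;

    /* Pack {"log": "audit event"} the way an input would buffer it. */
    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pk, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_map(&pk, 1);
    msgpack_pack_str(&pk, 3);
    msgpack_pack_str_body(&pk, "log", 3);
    msgpack_pack_str(&pk, 11);
    msgpack_pack_str_body(&pk, "audit event", 11);

    /* Unpack and classify each top-level object. */
    msgpack_unpacked_init(&result);
    while (msgpack_unpack_next(&result, sbuf.data, sbuf.size, &off) == MSGPACK_UNPACK_SUCCESS) {
        describe(result.data);
    }
    msgpack_unpacked_destroy(&result);
    msgpack_sbuffer_destroy(&sbuf);
    return 0;
}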
To Reproduce
- Example log message if applicable:
2025-10-28 12:42:59.919
{"log":"[2025/10/28 11:42:59.918841221] [ warn] [msgpack2json] unknown msgpack type 1880530997"}
2025-10-28 12:42:59.921
{"log":"[2025/10/28 11:42:59.919021700] [ warn] [msgpack2json] unknown msgpack type 1880530997"}
[omitted 12 occurrences of the log output]
2025-10-28 12:43:00.245
{"log":"[2025/10/28 11:43:00.244996949] [ warn] [msgpack2json] unknown msgpack type 1880530997"}
[a few seconds later the container runs out of memory and is restarted]
[`dmesg -t` output below, the timestamps are a bit out of sync unfortunately]
[Tue Oct 28 11:42:16 2025] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Tue Oct 28 11:42:16 2025] [2582969] 65535 2582969 255 160 28672 0 -998 pause
[Tue Oct 28 11:42:16 2025] [3270883] 0 3270883 968217 265106 2674688 0 985 fluent-bit
[Tue Oct 28 11:42:16 2025] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-befdddd83563f1b8859e3fd64974c2159c23b81e8bc067401ee75ad31dbdbc4a.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod60967bdb_6872_45d7_bbfc_e1b3e3e04e07.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod60967bdb_6872_45d7_bbfc_e1b3e3e04e07.slice/cri-containerd-befdddd83563f1b8859e3fd64974c2159c23b81e8bc067401ee75ad31dbdbc4a.scope,task=fluent-bit,pid=3270883,uid=0
[Tue Oct 28 11:42:16 2025] Memory cgroup out of memory: Killed process 3270883 (fluent-bit) total-vm:3872868kB, anon-rss:1044848kB, file-rss:15576kB, shmem-rss:0kB, UID:0 pgtables:2612kB oom_score_adj:985
[Tue Oct 28 11:42:16 2025] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod60967bdb_6872_45d7_bbfc_e1b3e3e04e07.slice/cri-containerd-befdddd83563f1b8859e3fd64974c2159c23b81e8bc067401ee75ad31dbdbc4a.scope are going to be killed due to memory.oom.group set
[Tue Oct 28 11:42:16 2025] Memory cgroup out of memory: Killed process 3270899 (flb-out-splunk.) total-vm:3872868kB, anon-rss:1044848kB, file-rss:15576kB, shmem-rss:0kB, UID:0 pgtables:2612kB oom_score_adj:985
- Steps to reproduce the problem:
We send audit logs (basically JSON documents) to fluent-bit, which then forwards them to Splunk. There are a few such OOM kills every day, and they only show up randomly in production, which makes the issue hard to reproduce.
fluent-bit is configured as shown below. The pod just receives that configuration and has a memory limit of 1 GB (normal usage is around 80 MB). A sketch of how events are sent to the http input follows the configuration.
[SERVICE]
    hc_errors_count 0
    hc_period 60
    hc_retry_failure_count 0
    health_check on
    http_listen 0.0.0.0
    http_port 2020
    http_server on
    log_level info
    scheduler.base 1
    scheduler.cap 60
    storage.backlog.mem_limit 5M
    storage.checksum off
    storage.max_chunks_up 128
    storage.metrics on
    storage.path /data/
    storage.sync normal

[INPUT]
    name http
    storage.type filesystem

[OUTPUT]
    match audit
    name null

[FILTER]
    add cluster some-name
    match *
    name modify

[OUTPUT]
    event_host some-name
    event_index some-name
    event_sourcetype auditlog
    host splunk-host
    match audit
    name splunk
    port 8088
    retry_limit no_limits
    splunk_send_raw off
    splunk_token ${SPLUNK_HEC_TOKEN}
    storage.total_limit_size 900M
    tls on
    tls.verify on
    tls.verify_hostname on
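For completeness, this is roughly how a single event reaches the http input. Since the input section sets no port, the in_http default (9880) should apply; the 127.0.0.1 host, the /audit request path (which, as far as we understand, becomes the record tag) and the payload are placeholders for illustration, not our actual producer.

#include <stdio.h>
#include <curl/curl.h>

/* Hypothetical reproduction helper: POST one JSON document to the
 * fluent-bit http input. Host, port and path are assumptions, see above. */
int main(void)
{
    CURL *curl;
    CURLcode res;
    struct curl_slist *headers = NULL;
    const char *payload = "{\"log\":\"example audit event\"}";

    curl_global_init(CURL_GLOBAL_DEFAULT);
    curl = curl_easy_init();
    if (!curl) {
        curl_global_cleanup();
        return 1;
    }

    headers = curl_slist_append(headers, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1:9880/audit");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, payload);

    res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return (res == CURLE_OK) ? 0 : 1;
}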
Expected behavior
Fluent-bit shouldn't suddenly start allocating so much memory that it crashes.
Your Environment
- Version used: 4.0.12 / 4.1.1
- Configuration: see above
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.33.5
- Server type and version: OpenStack VM
- Operating System and version: Flatcar Linux 6.6.95-flatcar
- Filters and plugins: null & splunk output
Additional context