
in_tail: Data loss on exit/restart due to unhandled buffer (partial lines) #11265

@jinyongchoi

Description

Bug Report

Describe the bug
When using the in_tail plugin with a database (DB) configured, data loss occurs if Fluent Bit is restarted while there is unprocessed data in the internal buffer. This typically happens when the log file ends with a partial line (no newline character) at the moment of shutdown.

The in_tail plugin advances the file offset in the database immediately upon reading data. If some of that data is still sitting in the buffer (e.g. an incomplete last line) when Fluent Bit exits, the buffer is discarded even though the stored offset already points past it, so those bytes are never re-read.
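
To make the failure mode concrete, below is a minimal conceptual sketch of the read path (not the actual plugin code; process_full_lines() is a hypothetical stand-in for the plugin's record emission):

/* read a chunk and advance the offset by everything read */
bytes = read(file->fd, file->buf_data + file->buf_len,
             file->buf_size - file->buf_len);
if (bytes > 0) {
    file->buf_len += bytes;
    file->offset  += bytes;       /* DB offset now covers all bytes read */
    process_full_lines(file);     /* hypothetical: emits up to the last '\n' */
    /* anything after the last '\n' stays in buf_data; if Fluent Bit
     * exits here, that remainder is freed while the stored offset
     * still points past it */
}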

To Reproduce

  1. Patch the plugin for extra logging and rebuild
    plugins/in_tail/tail_file.c
void flb_tail_file_remove(struct flb_tail_file *file)
...
/* before */
flb_plg_debug(ctx->ins, "inode=%"PRIu64" removing file name %s",
              file->inode, file->name);
/* after */
flb_plg_info(ctx->ins, "inode=%"PRIu64" removing file=%s, buf_len=%lu, offset=%"PRId64,
             file->inode, file->name, (unsigned long)file->buf_len, file->offset);
  2. Create the input log file
#!/usr/bin/env python3

from datetime import datetime

TIMESTAMP = datetime.now().strftime("%d/%b/%Y:%H:%M:%S +0000")
PATH = "/api/v1/" + "a" * 1000


def generate_large_log_line(line_number):
    log_line = (
        f"192.168.1.100 - - [{TIMESTAMP}] "
        f'"GET {PATH} HTTP/1.1" 200 {line_number} '
        f'"-" "Mozilla/5.0"\n'
    )

    return log_line


def main():
    output_file = "/tmp/testing.input"
    target_size = 2 * 1024 * 1024 * 1024
    line_size = 1024
    total_lines = target_size // line_size

    print(f"Starting log generation: {output_file}")
    print(f"Target size: {target_size / (1024**3):.2f} GB")
    print(f"Expected lines: {total_lines:,}")

    written_size = 0
    line_count = 0

    try:
        with open(output_file, "w") as f:
            while written_size < target_size:
                log_line = generate_large_log_line(line_count)
                f.write(log_line)

                written_size += len(log_line.encode("utf-8"))
                line_count += 1

                if line_count % 100000 == 0:
                    progress = (written_size / target_size) * 100
                    size_mb = written_size / (1024 * 1024)
                    print(
                        f"Progress: {progress:.1f}% - {size_mb:.1f} MB - {line_count:,} lines"
                    )

    except KeyboardInterrupt:
        print("\nLog generation interrupted")
    except Exception as e:
        print(f"Error occurred: {e}")

    final_size_gb = written_size / (1024**3)
    print("\nLog generation completed!")
    print(f"Generated size: {final_size_gb:.2f} GB")
    print(f"Generated lines: {line_count:,}")
    if line_count:  # avoid division by zero if interrupted before any line was written
        print(f"Average line size: {written_size / line_count:.0f} bytes")


if __name__ == "__main__":
    main()

  3. Run Fluent Bit
fluent-bit -v -c ./fluentbit.conf
  4. Wait about 3 seconds, then stop Fluent Bit
  5. Check the Fluent Bit log
[2025/12/08 13:46:41.22042158] [ info] [input:tail:input_log] inode=50630975 removing file=/tmp/testing.input, buf_len=291, offset=1933548765

Expected behavior
When Fluent Bit shuts down with data in its buffer, it should update (rewind) the offset in the database to point to the start of the unprocessed data. This ensures that on the next startup, the data is re-read and processed correctly.
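
A minimal sketch of what that could look like in flb_tail_file_remove, assuming the flb_tail_db_file_offset() helper in plugins/in_tail/tail_db.c persists file->offset (the exact helper and call order are assumptions, not a tested patch):

void flb_tail_file_remove(struct flb_tail_file *file)
{
    struct flb_tail_config *ctx = file->config;

    if (file->buf_len > 0 && ctx->db) {
        /* rewind the stored offset to the first unprocessed byte so the
         * partial line is re-read on the next start (assumes
         * flb_tail_db_file_offset() writes file->offset back to the DB) */
        file->offset -= file->buf_len;
        flb_tail_db_file_offset(file, ctx);
    }

    /* ... existing removal / cleanup logic follows ... */
}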

Your Environment

  • Version used:
    4.2.0
  • Configuration:
[SERVICE]
    flush 2
    grace 60
    log_level info
    log_file /tmp/testing/logs/testing.log
    parsers_file /tmp/testing/parsers.conf
    plugins_file /tmp/testing/plugins.conf
    http_server on
    http_listen 0.0.0.0
    http_port 22002

    storage.path /tmp/testing/storage
    storage.metrics on
    storage.max_chunks_up 512
    storage.sync full
    storage.checksum off
    storage.backlog.mem_limit 100M

[INPUT]
    Name tail
    Path /tmp/testing.input
    Exclude_Path *.gz,*.zip
    Tag testing
    Key message
    Offset_Key   log_offset

    Read_from_Head true
    Refresh_Interval 3
    Rotate_Wait 31557600

    Buffer_Chunk_Size 1MB
    Buffer_Max_Size 16MB
    Inotify_Watcher false

    storage.type filesystem
    storage.pause_on_chunks_overlimit true

    DB /tmp/testing/storage/testing.db
    DB.sync normal
    DB.locking false

    Alias input_log

[OUTPUT]
    Name file
    Match *
    File /tmp/testing.out
  • Operating System and version:
    Ubuntu 24.04

Additional context
Analysis of plugins/in_tail/tail_file.c:

  • flb_tail_file_remove (called on exit) destroys file->buf_data without checking for remaining content.
  • Since file->offset tracks the raw read position and is trusted as-is on restart, the discrepancy between "read" and "processed" leads to data loss (see the sketch below).
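
For illustration, the restart side of the problem looks roughly like this (hypothetical helper names; the real code lives in tail_file.c / tail_db.c):

/* restart path, conceptually: the stored offset is applied verbatim,
 * so bytes that were read but only buffered before shutdown are skipped */
offset = db_get_stored_offset(db, file->inode);  /* hypothetical helper */
lseek(file->fd, offset, SEEK_SET);               /* trusts "read" == "processed" */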
