diff --git a/.gitbook.yaml b/.gitbook.yaml index 5992b9e3c..43f895a5f 100644 --- a/.gitbook.yaml +++ b/.gitbook.yaml @@ -2,8 +2,8 @@ redirects: # Installation installation/upgrade_notes: ./installation/upgrade-notes.md installation/supported_platforms: ./installation/downloads.md - installation/docker.md: ./installation/downloads/docker.md - installation/windows.md: ./installation/downloads/windows.md + installation/docker: ./installation/downloads/docker.md + installation/windows: ./installation/downloads/windows.md # Inputs input/collectd: ./pipeline/inputs/ @@ -103,3 +103,4 @@ redirects: administration/configuring-fluent-bit/yaml/configuration-file: ./administration/configuring-fluent-bit/yaml.md administration/configuring-fluent-bit/unit-sizes: ./administration/configuring-fluent-bit.md administration/configuring-fluent-bit/multiline-parsing: ./pipeline/parsers/multiline-parsing.md + administration/buffering-and-storage: ./pipeline/buffering.md diff --git a/README.md b/README.md index 285aa0041..430661687 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,7 @@ description: High Performance Telemetry Agent for Logs, Metrics and Traces - Metrics support: Prometheus and OpenTelemetry compatible - Reliability and data integrity - [Backpressure](administration/backpressure.md) handling - - [Data buffering](administration/buffering-and-storage.md) in memory and file system + - [Data buffering](./pipeline/buffering.md) in memory and file system - Networking - Security: Built-in TLS/SSL support - Asynchronous I/O diff --git a/SUMMARY.md b/SUMMARY.md index dbda93541..cf9aeb637 100644 --- a/SUMMARY.md +++ b/SUMMARY.md @@ -56,7 +56,7 @@ * [Variables](administration/configuring-fluent-bit/classic-mode/variables.md) * [AWS credentials](administration/aws-credentials.md) * [Backpressure](administration/backpressure.md) -* [Buffering and storage](administration/buffering-and-storage.md) +* [Dead letter queue](administration/dead-letter-queue.md) * [Hot reload](administration/hot-reload.md) * [HTTP proxy](administration/http-proxy.md) * [Memory management](administration/memory-management.md) diff --git a/administration/backpressure.md b/administration/backpressure.md index cae433ac0..31d674cf0 100644 --- a/administration/backpressure.md +++ b/administration/backpressure.md @@ -2,35 +2,47 @@ -It's possible for logs or data to be ingested or created faster than the ability to flush it to some destinations. A common scenario is when reading from big log files, especially with a large backlog, and dispatching the logs to a backend over the network, which takes time to respond. This generates _backpressure_, leading to high memory consumption in the service. +It's possible for Fluent Bit to ingest or create data faster than it can flush that data to the intended destinations. This creates a condition known as _backpressure_. -To avoid backpressure, Fluent Bit implements a mechanism in the engine that restricts the amount of data an input plugin can ingest. Restriction is done through the configuration parameters `Mem_Buf_Limit` and `storage.Max_Chunks_Up`. +Fluent Bit can accommodate a certain amount of backpressure by [buffering](../pipeline/buffering.md) that data until it can be processed and routed. However, if Fluent Bit continues buffering new data to temporary storage faster than it can flush old data, that storage will eventually reach capacity. 
-As described in [Buffering and storage](../administration/buffering-and-storage.md) , Fluent Bit offers two modes for data handling: in-memory only (default) and in-memory and filesystem (optional). +Strategies for managing backpressure vary depending on the [buffering mode](../pipeline/buffering.md#buffering-modes) for each active input plugin. Because of this, choosing the right buffering mode is also a key part of managing backpressure. -The default `storage.type memory` buffer can be restricted with `Mem_Buf_Limit`. If memory reaches this limit and you reach a backpressure scenario, you won't be able to ingest more data until the data chunks that are in memory can be flushed. The input pauses and Fluent Bit [emits](https://github.com/fluent/fluent-bit/blob/v2.0.0/src/flb_input_chunk.c#L1334) a `[warn] [input] {input name or alias} paused (mem buf overlimit)` log message. +## Manage backpressure for memory-only buffering -Depending on the input plugin in use, this might cause incoming data to be discarded (for example, TCP input plugin). The tail plugin can handle pauses without data loss, storing its current file offset and resuming reading later. When buffer memory is available, the input resumes accepting logs. Fluent Bit [emits](https://github.com/fluent/fluent-bit/blob/v2.0.0/src/flb_input_chunk.c#L1277) a `[info] [input] {input name or alias} resume (mem buf overlimit)` message. +If one or more active input plugins use [memory-only buffering](../pipeline/buffering.md#memory-only-buffering), use the following settings to manage backpressure. -Mitigate the risk of data loss by configuring secondary storage on the filesystem using the `storage.type` of `filesystem` (as described in [Buffering and storage](../administration/buffering-and-storage.md)). Initially, logs will be buffered to both memory and the filesystem. When the `storage.max_chunks_up` limit is reached, all new data will be stored in the filesystem. Fluent Bit stops queueing new data in memory and buffers only to the filesystem. When `storage.type filesystem` is set, the `Mem_Buf_Limit` setting no longer has any effect. Instead, the `[SERVICE]` level `storage.max_chunks_up` setting controls the size of the memory buffer. +{% hint style="warning" %} +Some input plugins are prone to data loss after `mem_buf_limit` capacity is reached during memory-only buffering. If you need to avoid data loss, consider using [filesystem buffering](../pipeline/buffering.md#filesystem-buffering-hybrid) instead. +{% endhint %} -## `Mem_Buf_Limit` +### Set `mem_buf_limit` for input plugins -`Mem_Buf_Limit` applies only with the default `storage.type memory`. This option is disabled by default and can be applied to all input plugins. +For input plugins that use memory-only buffering, you can configure the `mem_buf_limit` setting to enforce a limit for how much data that plugin can buffer to memory. -As an example situation: +{% hint style="info" %} +This setting doesn't affect how much data can be buffered to memory by plugins that use filesystem buffering. +{% endhint %} -- `Mem_Buf_Limit` is set to `1MB`. +When the specified `mem_buf_limit` capacity is reached, Fluent Bit will stop buffering data from that source plugin until enough buffered chunks are flushed. Most plugins emit a log message that says `[warn] [input] paused (mem buf overlimit)` when buffering pauses. + +After more memory becomes available, Fluent Bit will resume buffering data from that source plugin. 
Most plugins emit a log message that says `[info] [input] resume (mem buf overlimit)` when buffering resumes. + +#### Behavior when capacity is reached + +The following example demonstrates what happens when an input plugin with memory-only buffering reaches its `mem_buf_limit` capacity: + +- The input plugin's `mem_buf_limit` is set to `1MB`. - The input plugin tries to append 700 KB. - The engine routes the data to an output plugin. -- The output plugin backend (HTTP Server) is down. +- The output plugin's backend is down, which means it won't accept the data. - Engine scheduler retries the flush after 10 seconds. - The input plugin tries to append 500 KB. -In this situation, the engine allows appending those 500 KB of data into the memory, with a total of 1.2 MB of data buffered. The limit is permissive and will allow a single write past the limit. When the limit is exceeded, the following actions are taken: +In this situation, the engine allows appending those 500 KB of data into the memory, with a total of 1.2 MB of data buffered. The limit is permissive and will allow a single write past the capacity of `mem_buf_limit`. When the limit is exceeded, Fluent Bit takes the following actions: -- Block local buffers for the input plugin (can't append more data). -- Notify the input plugin, invoking a `pause` callback. +- It blocks local buffers for the input plugin (can't append more data). +- It notifies the input plugin, invoking a `pause` callback. The engine protects itself and won't append more data coming from the input plugin in question. It's the responsibility of the plugin to keep state and decide what to do in a `paused` state. @@ -42,32 +54,30 @@ In a few seconds, if the scheduler was able to flush the initial 700 KB of - If the plugin is paused, it invokes a `resume` callback. - The input plugin can continue appending more data. -## `storage.max_chunks_up` +## Manage backpressure for filesystem buffering + +If one or more active input plugins use [filesystem buffering](../pipeline/buffering.md#filesystem-buffering-hybrid), use the following settings to manage backpressure. -The `[SERVICE]` level `storage.max_chunks_up` setting controls the size of the memory buffer. When `storage.type filesystem` is set, the `Mem_Buf_Limit` setting no longer has an effect. +### Set `storage.max_chunks_up` and `storage.backlog.mem_limit` in global settings -The setting behaves similar to the `Mem_Buf_Limit` scenario when the non-default `storage.pause_on_chunks_overlimit` is enabled. +In the [`service` section](../administration/configuring-fluent-bit/yaml/service-section.md) of your Fluent Bit configuration file, you can configure the `storage.max_chunks_up` and `storage.backlog.mem_limit` settings. Both settings dictate how much data can be buffered to memory by input plugins that use filesystem buffering, and are combined limits shared by all applicable input plugins. -When (default) `storage.pause_on_chunks_overlimit` is disabled, the input won't pause when the memory limit is reached. Instead, it switches to buffering logs only in the filesystem. Limit the disk spaced used for filesystem buffering with `storage.total_limit_size`. +{% hint style="info" %} +These settings don't affect how much data can be buffered to memory by plugins that use memory-only buffering. +{% endhint %} -See [Buffering and Storage](buffering-and-storage.md) docs for more information. 
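+The following sketch shows where these global settings live. The values are illustrative only, not recommendations, and `storage.path` is included because filesystem buffering requires it:
+
+{% tabs %}
+{% tab title="fluent-bit.yaml" %}
+
+```yaml
+service:
+  storage.path: /var/log/flb-storage/
+  storage.max_chunks_up: 128
+  storage.backlog.mem_limit: 5M
+```
+
+{% endtab %}
+{% tab title="fluent-bit.conf" %}
+
+```text
+[SERVICE]
+    storage.path /var/log/flb-storage/
+    storage.max_chunks_up 128
+    storage.backlog.mem_limit 5M
+```
+
+{% endtab %}
+{% endtabs %}
+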
+When either the specified `storage.max_chunks_up` or `storage.backlog.mem_limit` capacity is reached, all input plugins that use filesystem buffering will stop buffering data to memory until more memory becomes available. Whether these input plugins continue buffering data to the filesystem depends on each plugin's specified `storage.pause_on_chunks_overlimit` value. -## About pause and resume callbacks +### Set `storage.pause_on_chunks_overlimit` for input plugins -Each plugin is independent and not all of them implement `pause` and `resume` callbacks. These callbacks are a notification mechanism for the plugin. +For input plugins that use filesystem buffering, you can configure the `storage.pause_on_chunks_overlimit` setting to specify how each plugin should behave after the global `storage.max_chunks_up` or `storage.backlog.mem_limit` capacity is reached. -One example of a plugin that implements these callbacks and keeps state correctly is the [Tail Input](../pipeline/inputs/tail.md) plugin. When the `pause` callback triggers, it pauses its collectors and stops appending data. Upon `resume`, it resumes the collectors and continues ingesting data. Tail tracks the current file offset when it pauses, and resumes at the same position. If the file hasn't been deleted or moved, it can still be read. +If `storage.pause_on_chunks_overlimit` is set to `off` for an input plugin, the input plugin will stop buffering data to memory but continue buffering data to the filesystem. -With the default `storage.type memory` and `Mem_Buf_Limit`, the following log messages emit for `pause` and `resume`: +If `storage.pause_on_chunks_overlimit` is set to `on` for an input plugin, the input plugin will stop both memory buffering and filesystem buffering until more memory becomes available. -```text -[warn] [input] {input name or alias} paused (mem buf overlimit) -[info] [input] {input name or alias} resume (mem buf overlimit) -``` +### Set `storage.total_limit_size` for output plugins -With `storage.type filesystem` and `storage.max_chunks_up`, the following log messages emit for `pause` and `resume`: +Fluent Bit implements the concept of logical queues for buffered chunks. Based on its tag, a chunk can be routed to multiple destinations. Fluent Bit keeps an internal reference from where each chunk was created and where it needs to go. To limit the number of queued chunks, set the `storage.total_limit_size` for any active output plugins that route data ingested by input plugins that use filesystem buffering. -```text -[input] {input name or alias} paused (storage buf overlimit) -[input] {input name or alias} resume (storage buf overlimit) -``` +Network failures or latency in third-party services is common for output destinations. In some cases, a chunk is tagged for multiple destinations with varying response times, or one destination is generating more backpressure than others. If an output plugin reaches its configured `storage.total_limit_size` capacity, the oldest chunk from its queue will be discarded to make room for new data. diff --git a/administration/buffering-and-storage.md b/administration/buffering-and-storage.md deleted file mode 100644 index c5e3d8dfe..000000000 --- a/administration/buffering-and-storage.md +++ /dev/null @@ -1,370 +0,0 @@ -# Buffering and storage - - - -[Fluent Bit](https://fluentbit.io) collects, parses, filters, and ships logs to a central place. 
A critical piece of this workflow is the ability to do _buffering_: a mechanism to place processed data into a temporary location until is ready to be shipped. - -By default when Fluent Bit processes data, it uses Memory as a primary and temporary place to store the records. There are scenarios where it would be ideal to have a persistent buffering mechanism based in the filesystem to provide aggregation and data safety capabilities. - -Choosing the right configuration is critical and the behavior of the service can be conditioned based in the backpressure settings. Before jumping into the configuration it helps to understand the relationship between _chunks_, _memory_, _filesystem_, and _backpressure_. - -## Chunks, memory, filesystem, and backpressure - -Understanding chunks, buffering, and backpressure is critical for a proper configuration. - -### Backpressure - -See [Backpressure](https://docs.fluentbit.io/manual/administration/backpressure) for a full explanation. - -### Chunks - -When an input plugin source emits records, the engine groups the records together in a _chunk_. A chunk's size usually is around 2 MB. By configuration, the engine decides where to place this chunk. By default, all chunks are created only in memory. - -### Irrecoverable chunks - -There are two scenarios where Fluent Bit marks chunks as irrecoverable: - -- When Fluent Bit encounters a bad layout in a chunk. A bad layout is a chunk that doesn't conform to the expected format. [Chunk definition](https://github.com/fluent/fluent-bit/blob/master/CHUNKS.md) - -- When Fluent Bit encounters an incorrect or invalid chunk header size. - -In both scenarios Fluent Bit logs an error message and then discards the irrecoverable chunks. - -#### Buffering and memory - -As mentioned previously, chunks generated by the engine are placed in memory by default, but this is configurable. - -If memory is the only mechanism set for the input plugin, it will store as much data as possible in memory. This is the fastest mechanism with the least system overhead. However, if the service isn't able to deliver the records fast enough, Fluent Bit memory usage increases as it accumulates more data than it can deliver. - -In a high load environment with backpressure, having high memory usage risks getting killed by the kernel's OOM Killer. To work around this backpressure scenario, limit the amount of memory in records that an input plugin can register using the `mem_buf_limit` property. If a plugin has queued more than the `mem_buf_limit`, it won't be able to ingest more until that data can be delivered or flushed properly. In this scenario the input plugin in question is paused. When the input is paused, records won't be ingested until the plugin resumes. For some inputs, such as TCP and tail, pausing the input will almost certainly lead to log loss. For the tail input, Fluent Bit can save its current offset in the current file it's reading, and pick back up when the input resumes. - -Look for messages in the Fluent Bit log output like: - -```text -[input] tail.1 paused (mem buf overlimit) -[input] tail.1 resume (mem buf overlimit) -``` - -Using `mem_buf_limit` is good for certain scenarios and environments. It helps to control the memory usage of the service. However, if a file rotates while the plugin is paused, data can be lost since it won't be able to register new records. This can happen with any input source plugin. The goal of `mem_buf_limit` is memory control and survival of the service. 
- -For a full data safety guarantee, use filesystem buffering. - -Choose your preferred format for an example input definition: - -{% tabs %} -{% tab title="fluent-bit.yaml" %} - -```yaml -pipeline: - inputs: - - name: tcp - listen: 0.0.0.0 - port: 5170 - format: none - tag: tcp-logs - mem_buf_limit: 50MB -``` - -{% endtab %} -{% tab title="fluent-bit.conf" %} - -```text -[INPUT] - Name tcp - Listen 0.0.0.0 - Port 5170 - Format none - Tag tcp-logs - Mem_Buf_Limit 50MB -``` - -{% endtab %} -{% endtabs %} - -If this input uses more than 50 MB memory to buffer logs, you will get a warning like this in the Fluent Bit logs: - -```text -[input] tcp.1 paused (mem buf overlimit) -``` - -{% hint style="info" %} - -`mem_buf_Limit` applies only when `storage.type` is set to the default value of `memory`. - -{% endhint %} - -#### Filesystem buffering - -Filesystem buffering helps with backpressure and overall memory control. Enable it using `storage.type filesystem`. - -Memory and filesystem buffering mechanisms aren't mutually exclusive. Enabling filesystem buffering for your input plugin source can improve both performance and data safety. - -Enabling filesystem buffering changes the behavior of the engine. Upon chunk creation, the engine stores the content in memory and also maps a copy on disk through [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html). The newly created chunk is active in memory, backed up on disk, and called to be `up`, which means the chunk content is up in memory. - -Fluent Bit controls the number of chunks that are `up` in memory by using the filesystem buffering mechanism to deal with high memory usage and backpressure. - -By default, the engine allows a total of 128 chunks `up` in memory in total, considering all chunks. This value is controlled by the service property `storage.max_chunks_up`. The active chunks that are `up` are either ready for delivery (marked busy and locked), or are still receiving records. Any other remaining chunk is in a `down` state, which means that it's only in the filesystem and won't be `up` in memory unless it's ready to be delivered. Chunks are never much larger than 2 MB, so with the default `storage.max_chunks_up` value of 128, each input is limited to roughly 256 MB of memory. - -If the input plugin has enabled `storage.type` as `filesystem`, when reaching the `storage.max_chunks_up` threshold, instead of the plugin being paused, all new data will go to chunks that are `down` in the filesystem. This lets you control memory usage by the service and also provides a guarantee that the service won't lose any data. By default, the enforcement of the `storage.max_chunks_up` limit is best-effort. Fluent Bit can only append new data to chunks that are `up`. When the limit is reached chunks will be temporarily brought `up` in memory to ingest new data, and then put to a `down` state afterwards. In general, Fluent Bit works to keep the total number of `up` chunks at or under `storage.max_chunks_up`. - -If `storage.pause_on_chunks_overlimit` is enabled (default is off), the input plugin pauses upon exceeding `storage.max_chunks_up`. With this option, `storage.max_chunks_up` becomes a hard limit for the input. When the input is paused, records won't be ingested until the plugin resumes. For some inputs, such as TCP and tail, pausing the input will almost certainly lead to log loss. For the tail input, Fluent Bit can save its current offset in the current file it's reading, and pick back up when the input is resumed. 
- -Look for messages in the Fluent Bit log output like: - -```text -[input] tail.1 paused (storage buf overlimit) -[input] tail.1 resume (storage buf overlimit) -``` - -##### Limiting filesystem space for chunks - -Fluent Bit implements the concept of logical queues. Based on its tag, a chunk can be routed to multiple destinations. Fluent Bit keeps an internal reference from where a chunk was created and where it needs to go. - -It's common to find cases where multiple destinations with different response times exist for a chunk, or one of the destinations is generating backpressure. - -To limit the amount of filesystem chunks logically queueing, Fluent Bit v1.6 and later includes the `storage.total_limit_size` configuration property for output. This property limits the total size in bytes of chunks that can exist in the filesystem for a certain logical output destination. If one of the destinations reaches the configured `storage.total_limit_size`, the oldest chunk from its queue for that logical output destination will be discarded to make room for new data. - -## Configuration - -The storage layer configuration takes place in three sections: - -- Service -- Input -- Output - -The known Service section configures a global environment for the storage layer, the Input sections define which buffering mechanism to use, and the Output defines limits for the logical filesystem queues. - -### Service section configuration - -The Service section refers to the section defined in the main [configuration file](configuring-fluent-bit/classic-mode/configuration-file.md): - -| Key | Description | Default | -| :--- | :--- | :--- | -| `storage.path` | Set an optional location in the file system to store streams and chunks of data. If this parameter isn't set, Input plugins can only use in-memory buffering. | _none_ | -| `storage.sync` | Configure the synchronization mode used to store the data in the file system. Using `full` increases the reliability of the filesystem buffer and ensures that data is guaranteed to be synced to the filesystem even if Fluent Bit crashes. On Linux, `full` corresponds with the `MAP_SYNC` option for [memory mapped files](https://man7.org/linux/man-pages/man2/mmap.2.html). Accepted values: `normal`, `full`. | `normal` | -| `storage.checksum` | Enable the data integrity check when writing and reading data from the filesystem. The storage layer uses the CRC32 algorithm. Accepted values: `Off`, `On`. | `Off` | -| `storage.max_chunks_up` | If the input plugin has enabled `filesystem` storage type, this property sets the maximum number of chunks that can be `up` in memory. Use this setting to control memory usage when you enable `storage.type filesystem`. | `128` | -| `storage.backlog.mem_limit` | If `storage.path` is set, Fluent Bit looks for data chunks that weren't delivered and are still in the storage layer. These are called _backlog_ data. _Backlog chunks_ are filesystem chunks that were left over from a previous Fluent Bit run; chunks that couldn't be sent before exit that Fluent Bit will pick up when restarted. Fluent Bit will check the `storage.backlog.mem_limit` value against the current memory usage from all `up` chunks for the input. If the `up` chunks currently consume less memory than the limit, it will bring the _backlog_ chunks up into memory so they can be sent by outputs. | `5M` | -| `storage.backlog.flush_on_shutdown` | When enabled, Fluent Bit will attempt to flush all backlog filesystem chunks to their destination during the shutdown process. 
This can help ensure data delivery before Fluent Bit stops, but can increase shutdown time. Accepted values: `Off`, `On`. | `Off` | -| `storage.metrics` | If `http_server` option is enabled in the main `[SERVICE]` section, this option registers a new endpoint where internal metrics of the storage layer can be consumed. For more details refer to the [Monitoring](monitoring.md) section. | `off` | -| `storage.delete_irrecoverable_chunks` | When enabled, [irrecoverable chunks](./buffering-and-storage.md#irrecoverable-chunks) will be deleted during runtime, and any other irrecoverable chunk located in the configured storage path directory will be deleted when Fluent Bit starts. Accepted values: `Off`, `On`. | `Off` | -| `storage.keep.rejected` | When enabled, the dead-letter queue feature stores failed chunks that can't be delivered. Accepted values: `Off`, `On`. | `Off`| -| `storage.rejected.path` | When specified, the dead-letter queue is stored in a subdirectory (stream) under `storage.path`. The default value `rejected` is used at runtime if not set. | _none_ | - -### Dead letter queue (DLQ) - -The Dead Letter Queue (DLQ) feature preserves chunks that fail to be delivered to output destinations. Instead of losing this data, Fluent Bit copies the rejected chunks to a dedicated storage location for later analysis and troubleshooting. - -#### When dead letter queue is triggered - -Chunks are copied to the DLQ in the following failure scenarios: - -- **Permanent errors**: When an output plugin returns an unrecoverable error (`FLB_ERROR`). -- **Retry limit reached**: When a chunk exhausts all configured retry attempts. -- **Retries disabled**: When `retry_limit` is set to `no_retries` and a flush fails. -- **Scheduler failures**: When the retry scheduler can't schedule a retry (for example, due to resource constraints). - -#### Requirements - -The DLQ feature requires: - -- `storage.path` must be configured (filesystem storage must be enabled). -- `storage.keep.rejected` must be set to `On`. - -#### Dead letter queue file location and format - -Rejected chunks are stored in a subdirectory under `storage.path`. For example, with the following configuration: - -```yaml -service: - storage.path: /var/log/flb-storage/ - storage.keep.rejected: on - storage.rejected.path: rejected -``` - -Rejected chunks are stored at `/var/log/flb-storage/rejected/`. - -Each DLQ file is named using this format: - -```text -___.flb -``` - -For example: `kube_var_log_containers_test_400_http_0x7f8b4c.flb` - -The file contains the original chunk data in the internal format of Fluent Bit, preserving all records and metadata. - -#### Troubleshooting with dead letter queue - -The DLQ feature enables the following capabilities: - -- **Data preservation**: Invalid or rejected chunks are preserved instead of being permanently lost. -- **Root cause analysis**: Investigate why specific data failed to be delivered without impacting live processing. -- **Data recovery**: Replay or transform rejected chunks after fixing the underlying issue. -- **Debugging**: Analyze the exact content of problematic records. - -To examine DLQ chunks, you can use the storage metrics endpoint (when `storage.metrics` is enabled) or directly inspect the files in the rejected directory. - -{% hint style="info" %} -DLQ files remain on disk until manually removed. Monitor disk usage in the rejected directory and implement a cleanup policy for older files. 
-{% endhint %} - -A Service section will look like this: - -{% tabs %} -{% tab title="fluent-bit.yaml" %} - -```yaml -service: - flush: 1 - log_level: info - storage.path: /var/log/flb-storage/ - storage.sync: normal - storage.checksum: off - storage.backlog.mem_limit: 5M - storage.backlog.flush_on_shutdown: off - storage.keep.rejected: on - storage.rejected.path: rejected -``` - -{% endtab %} -{% tab title="fluent-bit.conf" %} - -```text -[SERVICE] - flush 1 - log_Level info - storage.path /var/log/flb-storage/ - storage.sync normal - storage.checksum off - storage.backlog.mem_limit 5M - storage.backlog.flush_on_shutdown off - storage.keep.rejected on - storage.rejected.path rejected -``` - -{% endtab %} -{% endtabs %} - -This configuration sets an optional buffering mechanism where the route to the data is `/var/log/flb-storage/`. It uses `normal` synchronization mode, without running a checksum and up to a maximum of 5 MB of memory when processing backlog data. Additionally, the dead letter queue is enabled, and rejected chunks are stored in `/var/log/flb-storage/rejected/`. - -### Input section configuration - -Optionally, any Input plugin can configure their storage preference. The following table describes the options available: - -| Key | Description | Default | -| :--- | :--- | :--- | -| `storage.type` | Specifies the buffering mechanism to use. Accepted values: `memory`, `filesystem`. | `memory` | -| `storage.pause_on_chunks_overlimit` | Specifies if the input plugin should pause (stop ingesting new data) when the `storage.max_chunks_up` value is reached. |`off` | - -The following example configures a service offering filesystem buffering capabilities and two input plugins being the first based in filesystem and the second with memory only. - -{% tabs %} -{% tab title="fluent-bit.yaml" %} - -```yaml -service: - flush: 1 - log_level: info - storage.path: /var/log/flb-storage/ - storage.sync: normal - storage.checksum: off - storage.max_chunks_up: 128 - storage.backlog.mem_limit: 5M - -pipeline: - inputs: - - name: cpu - storage.type: filesystem - - - name: mem - storage.type: memory -``` - -{% endtab %} -{% tab title="fluent-bit.conf" %} - -```text -[SERVICE] - flush 1 - log_Level info - storage.path /var/log/flb-storage/ - storage.sync normal - storage.checksum off - storage.max_chunks_up 128 - storage.backlog.mem_limit 5M - -[INPUT] - name cpu - storage.type filesystem - -[INPUT] - name mem - storage.type memory -``` - -{% endtab %} -{% endtabs %} - -### Output section configuration - -If certain chunks are filesystem `storage.type` based, it's possible to control the size of the logical queue for an output plugin. The following table describes the options available: - -| Key | Description | Default | -| :--- | :--- | :--- | -| `storage.total_limit_size` | Limit the maximum disk space size in bytes for buffering chunks in the filesystem for the current output logical destination. 
| _none_ | - -The following example creates records with CPU usage samples in the filesystem which are delivered to Google Stackdriver service while limiting the logical queue (buffering) to `5M`: - -{% tabs %} -{% tab title="fluent-bit.yaml" %} - -```yaml -service: - flush: 1 - log_level: info - storage.path: /var/log/flb-storage/ - storage.sync: normal - storage.checksum: off - storage.max_chunks_up: 128 - storage.backlog.mem_limit: 5M - -pipeline: - inputs: - - name: cpu - storage.type: filesystem - - outputs: - - name: stackdriver - match: '*' - storage.total_limit_size: 5M -``` - -{% endtab %} -{% tab title="fluent-bit.conf" %} - -```text -[SERVICE] - flush 1 - log_Level info - storage.path /var/log/flb-storage/ - storage.sync normal - storage.checksum off - storage.max_chunks_up 128 - storage.backlog.mem_limit 5M - -[INPUT] - name cpu - storage.type filesystem - -[OUTPUT] - name stackdriver - match * - storage.total_limit_size 5M -``` - -{% endtab %} -{% endtabs %} - -If Fluent Bit is offline because of a network issue, it will continue buffering CPU -samples, keeping a maximum of 5 MB of the newest data. diff --git a/administration/configuring-fluent-bit/yaml/service-section.md b/administration/configuring-fluent-bit/yaml/service-section.md index 3f024a730..0ace48b5a 100644 --- a/administration/configuring-fluent-bit/yaml/service-section.md +++ b/administration/configuring-fluent-bit/yaml/service-section.md @@ -31,20 +31,20 @@ The following storage-related keys can be set as children to the `storage` key: | Key | Description | Default Value | | --- | ----------- | ------------- | -| `storage.path` | Set a location in the file system to store streams and chunks of data. Required for filesystem buffering. | _none_ | -| `storage.sync` | Configure the synchronization mode used to store data in the file system. Accepted values: `normal` or `full`. | `normal` | -| `storage.checksum` | Enable data integrity check when writing and reading data from the filesystem. Accepted values: `off` or `on`. | `off` | -| `storage.max_chunks_up` | Set the maximum number of chunks that can be `up` in memory when using filesystem storage. | `128` | -| `storage.backlog.mem_limit` | Set the memory limit for backlog data chunks. | `5M` | -| `storage.backlog.flush_on_shutdown` | Attempt to flush all backlog chunks during shutdown. Accepted values: `off` or `on`. | `off` | -| `storage.metrics` | Enable storage layer metrics on the HTTP endpoint. Accepted values: `off` or `on`. | `off` | -| `storage.delete_irrecoverable_chunks` | Delete irrecoverable chunks during runtime and at startup. Accepted values: `off` or `on`. | `off` | -| `storage.keep.rejected` | Enable the Dead Letter Queue (DLQ) to preserve chunks that fail to be delivered. Accepted values: `off` or `on`. | `off` | -| `storage.rejected.path` | Subdirectory name under `storage.path` for storing rejected chunks. | `rejected` | +| `storage.path` | Sets a location to store streams and chunks of data. If this parameter isn't set, input plugins can't use filesystem buffering. | _none_ | +| `storage.sync` | Configures the synchronization mode used to store data in the file system. Using `full` increases the reliability of the filesystem buffer and ensures that data is guaranteed to be synced to the filesystem even if Fluent Bit crashes. On Linux, `full` corresponds with the `MAP_SYNC` option for [memory mapped files](https://man7.org/linux/man-pages/man2/mmap.2.html). Accepted values: `normal`, `full`. 
| `normal` |
+| `storage.checksum` | Enables data integrity check when writing and reading data from the filesystem. The storage layer uses the CRC32 algorithm. Accepted values: `off` or `on`. | `off` |
+| `storage.max_chunks_up` | Sets the maximum number of chunks that can be `up` in memory for input plugins that use filesystem storage. | `128` |
+| `storage.backlog.mem_limit` | Sets the memory limit for loading backlog chunks (undelivered filesystem chunks left over from a previous run) into memory for input plugins that use filesystem storage. | `5M` |
+| `storage.backlog.flush_on_shutdown` | If enabled, Fluent Bit attempts to flush all backlog filesystem chunks to their destination during the shutdown process. This can help ensure data delivery before Fluent Bit stops, but can also increase shutdown time. Accepted values: `off` or `on`. | `off` |
+| `storage.metrics` | If the `http_server` option is enabled in the main `service` section, this option registers a new endpoint where internal metrics of the storage layer can be consumed. For more details, see [Monitoring](../../monitoring.md). Accepted values: `off` or `on`. | `off` |
+| `storage.delete_irrecoverable_chunks` | If enabled, deletes irrecoverable chunks during runtime and at startup. Accepted values: `off` or `on`. | `off` |
+| `storage.keep.rejected` | If enabled, the [dead letter queue](../../dead-letter-queue.md) stores failed chunks that can't be delivered. Accepted values: `off` or `on`. | `off` |
+| `storage.rejected.path` | Sets the subdirectory name under `storage.path` for storing rejected chunks in the dead letter queue. | `rejected` |
 
-For scheduler and retry details, see [scheduling and retries](../../scheduling-and-retries.md#Scheduling-and-Retries).
+For storage and buffering details, see [Buffering](../../../pipeline/buffering.md) and [Backpressure](../../backpressure.md).
 
-For storage and buffering details, see [buffering and storage](../../buffering-and-storage.md).
+For scheduler and retry details, see [Scheduling and retries](../../scheduling-and-retries.md#Scheduling-and-Retries).
 
 ## Configuration example
 
diff --git a/administration/dead-letter-queue.md b/administration/dead-letter-queue.md
new file mode 100644
index 000000000..3fcad6960
--- /dev/null
+++ b/administration/dead-letter-queue.md
@@ -0,0 +1,89 @@
+# Dead letter queue
+
+The dead letter queue preserves [chunks](../pipeline/buffering.md#chunks) that Fluent Bit fails to deliver to output destinations. Instead of losing this data, Fluent Bit copies the rejected chunks to a dedicated storage location for future analysis and troubleshooting.
+
+To enable the dead letter queue, filesystem storage must be enabled by setting a value for [`storage.path`](./configuring-fluent-bit/yaml/service-section.md#storage-configuration), and [`storage.keep.rejected`](./configuring-fluent-bit/yaml/service-section.md#storage-configuration) must be set to `on`.
+
+Chunks are copied to the dead letter queue in the following failure scenarios:
+
+- **Permanent errors**: When an output plugin returns an unrecoverable error (`FLB_ERROR`).
+- **Retry limit reached**: When a chunk exhausts all configured retry attempts.
+- **Retries disabled**: When `retry_limit` is set to `no_retries` and a flush fails.
+- **Scheduler failures**: When the retry scheduler can't schedule a retry (for example, due to resource constraints).
+
+## Location
+
+Rejected chunks are stored in a subdirectory (named by `storage.rejected.path`) under the directory set by `storage.path`.
For example, with the following configuration, rejected chunks are stored at `/var/log/flb-storage/rejected/`: + +```yaml +service: + storage.path: /var/log/flb-storage/ + storage.keep.rejected: on + storage.rejected.path: rejected +``` + +## Format + +Each dead letter queue file is named using this format: + +```text +___.flb +``` + +For example: `kube_var_log_containers_test_400_http_0x7f8b4c.flb` + +The file contains the original chunk data in the internal format of Fluent Bit, preserving all records and metadata. + +## Troubleshooting with dead letter queue + +The dead letter queue feature enables the following capabilities: + +- **Data preservation**: Invalid or rejected chunks are preserved instead of being permanently lost. +- **Root cause analysis**: Investigate why specific data failed to be delivered without impacting live processing. +- **Data recovery**: Replay or transform rejected chunks after fixing the underlying issue. +- **Debugging**: Analyze the exact content of problematic records. + +To examine dead letter queue chunks, you can use the storage metrics endpoint (when `storage.metrics` is enabled) or directly inspect the files in the rejected directory. + +{% hint style="info" %} +Dead letter queue files remain on disk until manually removed. Monitor disk usage in the rejected directory and implement a cleanup policy for older files. +{% endhint %} + +A Service section will look like this: + +{% tabs %} +{% tab title="fluent-bit.yaml" %} + +```yaml +service: + flush: 1 + log_level: info + storage.path: /var/log/flb-storage/ + storage.sync: normal + storage.checksum: off + storage.backlog.mem_limit: 5M + storage.backlog.flush_on_shutdown: off + storage.keep.rejected: on + storage.rejected.path: rejected +``` + +{% endtab %} +{% tab title="fluent-bit.conf" %} + +```text +[SERVICE] + flush 1 + log_Level info + storage.path /var/log/flb-storage/ + storage.sync normal + storage.checksum off + storage.backlog.mem_limit 5M + storage.backlog.flush_on_shutdown off + storage.keep.rejected on + storage.rejected.path rejected +``` + +{% endtab %} +{% endtabs %} + +This configuration sets an optional buffering mechanism where the route to the data is `/var/log/flb-storage/`. It uses `normal` synchronization mode, without running a checksum and up to a maximum of 5 MB of memory when processing backlog data. Additionally, the dead letter queue is enabled, and rejected chunks are stored in `/var/log/flb-storage/rejected/`. diff --git a/administration/memory-management.md b/administration/memory-management.md index 690ee8f9b..6a4f60cfc 100644 --- a/administration/memory-management.md +++ b/administration/memory-management.md @@ -4,8 +4,6 @@ You might need to estimate how much memory Fluent Bit could be using in scenarios like containerized environments where memory limits are essential. -To make an estimate, in-use input plugins must set the `Mem_Buf_Limit`option. Learn more about it in [Backpressure](backpressure.md). - ## Estimating Input plugins append data independently. To make an estimation, impose a limit with the `Mem_Buf_Limit` option. If the limit was set to `10MB`, you can estimate that in the worst case, the output plugin likely could use `20MB`. @@ -14,6 +12,10 @@ Fluent Bit has an internal binary representation for the data being processed. W When imposing a limit of `10MB` for the input plugins, and a worst case scenario of the output plugin consuming `20MB`, you need to allocate a minimum (`30MB` x 1.2) = `36MB`. 
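+
+As a minimal sketch of how that estimate maps to a configuration (the `tail` input and its `path` value are placeholders for this example):
+
+```yaml
+pipeline:
+  inputs:
+    # Worst-case estimate for this input:
+    #   ~10MB held by the input buffer (mem_buf_limit)
+    #   ~20MB more while output plugins create their own copies
+    #   (10MB + 20MB) x 1.2 = 36MB total
+    - name: tail
+      path: /var/log/*.log
+      mem_buf_limit: 10MB
+```
+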
+{% hint style="info" %} +For more information about `Mem_Buf_Limit`, see [Backpressure](backpressure.md). +{% endhint %} + ## Glibc and memory fragmentation In intensive environments where memory allocations happen in the orders of magnitude, the default memory allocator provided by Glibc could lead to high fragmentation, reporting a high memory usage by the service. diff --git a/administration/scheduling-and-retries.md b/administration/scheduling-and-retries.md index ea9672baf..b800ba731 100644 --- a/administration/scheduling-and-retries.md +++ b/administration/scheduling-and-retries.md @@ -96,7 +96,7 @@ The scheduler provides a configuration option called `Retry_Limit`, which can be | `Retry_Limit` | `no_retries` | When set, retries are disabled and scheduler doesn't try to send data to the destination if it failed the first time. | {% hint style="info" %} -When a chunk exhausts all retry attempts or retries are disabled, the data is discarded by default. To preserve rejected data for later analysis, enable the [Dead Letter Queue (DLQ)](buffering-and-storage.md#dead-letter-queue-dlq) feature by setting `storage.keep.rejected` to `on` in the Service section. +When a chunk exhausts all retry attempts or retries are disabled, the data is discarded by default. To preserve rejected data for later analysis, enable the [Dead Letter Queue (DLQ)](./dead-letter-queue.md) feature by setting `storage.keep.rejected` to `on` in the Service section. {% endhint %} ### Retry example @@ -141,4 +141,4 @@ pipeline: ``` {% endtab %} -{% endtabs %} \ No newline at end of file +{% endtabs %} diff --git a/administration/troubleshooting.md b/administration/troubleshooting.md index 1c7a0bc1e..f38f14fb1 100644 --- a/administration/troubleshooting.md +++ b/administration/troubleshooting.md @@ -58,7 +58,7 @@ For example, a file named `kube_var_log_containers_test_400_http_0x7f8b4c.flb` i DLQ files remain on disk until manually removed. Monitor disk usage and implement a cleanup policy. {% endhint %} -For more details on DLQ configuration, see [Buffering and Storage](buffering-and-storage.md#dead-letter-queue-dlq). +For more details on DLQ configuration, see [Dead letter queue](./dead-letter-queue.md). ## Tap diff --git a/pipeline/buffering.md b/pipeline/buffering.md index bcebbf79b..68e26daf1 100644 --- a/pipeline/buffering.md +++ b/pipeline/buffering.md @@ -4,9 +4,9 @@ description: Performance and data safety # Buffering -When Fluent Bit processes data, it uses the system memory (heap) as a primary and temporary place to store the record logs before they get delivered. The records are processed in this private memory area. + -Buffering is the ability to temporarily store incoming data before that data is processed and delivered. Buffering in memory is the fastest mechanism, but there are scenarios requiring special strategies to deal with [backpressure](../administration/backpressure.md), data safety, or to reduce memory consumption by the service in constrained environments. +After Fluent Bit ingests data, it temporarily stores that data in the system memory (heap) before processing and routing that data to its destination. This process is known as _buffering_. ```mermaid graph LR @@ -22,16 +22,174 @@ graph LR style D stroke:darkred,stroke-width:2px; ``` -Network failures or latency in third party service is common. When data can't be delivered fast enough and new data to process arrives, the system can face backpressure. 
+{% hint style="info" %} +Buffered data uses the Fluent Bit internal binary representation, which isn't raw text. This buffered data is immutable. +{% endhint %} -Fluent Bit buffering strategies are designed to solve problems associated with backpressure and general delivery failures. Fluent Bit offers a primary buffering mechanism in memory and an optional secondary one using the file system. With this hybrid solution you can accommodate any use case safely and keep a high performance while processing your data. +## Chunks -These mechanisms aren't mutually exclusive. When data is ready to be processed or delivered it's always be in memory. Other data in the queue might be in the file system until is ready to be processed and moved up to memory. +When an input plugin emits records, the engine groups records together into a _chunk_. Each chunk has an average size of 2 MB. The active [buffering mode](#buffering-modes) determines where these chunks are stored. -The `buffer` phase contains the data in an immutable state, meaning that no other filter can be applied. +Chunks that are stored simultaneously in memory and in filesystem storage are known as `up` chunks. Chunks that are stored only in filesystem storage are known as `down` chunks. A `down` chunk becomes an `up` chunk when a copy of the `down` chunk is written to memory. -Buffered data uses the Fluent Bit internal binary representation, which isn't raw text. +After an `up` chunk is processed and routed, the associated buffered data both in memory and in the filesystem is flushed. -To avoid data loss in case of system failures, Fluent Bit offers a buffering mechanism in the file system that acts as a backup system. +### Irrecoverable chunks -To learn more about the buffering configuration in Fluent Bit, see [Buffering and Storage](../administration/buffering-and-storage.md). +Fluent Bit marks a chunk as irrecoverable in the following scenarios: + +- When Fluent Bit encounters a bad layout in a chunk. A bad layout is a chunk that doesn't conform to the [expected format](https://github.com/fluent/fluent-bit/blob/master/CHUNKS.md). +- When Fluent Bit encounters an incorrect or invalid chunk header size. + +After marking a chunk as irrecoverable, Fluent Bit logs an error message and then discards the irrecoverable chunk. + +## Buffering modes + +Fluent Bit offers two modes for storing buffered data. Both modes store buffered data in memory, but filesystem buffering is a hybrid method that stores an additional copy of buffered data in the filesystem. + +You can set the buffering mode for each active [input plugin](#per-input-settings). + +### Memory-only buffering + +When memory-only buffering is enabled, Fluent Bit stores buffered data in memory until it's ready to process and route that data to its intended destinations. After Fluent Bit processes and routes the data, it flushes that data from memory. + +This buffering method is faster than filesystem buffering, and uses less system overhead, but is more prone to data loss. + +### Filesystem buffering (hybrid) + +When filesystem buffering is enabled, Fluent Bit stores each chunk of buffered data in the filesystem through [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html). If Fluent Bit has enough space in memory, an identical chunk of that buffered data is also written to memory. If Fluent Bit doesn't have enough space in memory, the chunk of buffered data remains only in the filesystem until there is enough space to write an identical chunk to memory. 
After Fluent Bit processes and routes the data, it flushes that data from memory and from the filesystem. + +This buffering method is less efficient than memory-only buffering, and uses more system overhead, but is less prone to data loss. + +## Configuration settings + +{% hint style="info" %} +For information about different strategies for managing backpressure, see [Backpressure](../administration/backpressure.md). +{% endhint %} + +Use the information in this section to configure buffering settings in Fluent Bit. Global settings configure the storage layer, per-input settings define which buffering mechanism to use, and per-output settings define limits for the logical filesystem queues. + +### Global settings + +In the [`service` section](../administration/configuring-fluent-bit/yaml/service-section.md) of Fluent Bit configuration files, several settings related to buffering are stored in the [`storage` key](../administration/configuring-fluent-bit/yaml/service-section.md#storage-configuration). These are global settings that affect all input and output plugins. + +### Per-input settings + +You can configure buffering settings for any input plugin by using these configuration parameters: + +| Key | Description | Default | +| :--- | :--- | :--- | +| `storage.type` | Specifies the buffering mechanism to use for this input plugin. To enable filesystem buffering, a global [`storage.path`](../administration/configuring-fluent-bit/yaml/service-section.md#storage-configuration) value must be set in the `service` section of your configuration file. Accepted values: `memory`, `filesystem`. | `memory` | +| `mem_buf_limit` | If memory-only buffering is enabled, sets a limit for how much buffered data the plugin can write to memory. After this limit is reached, the plugin will pause until more memory becomes available. This value must follow [unit size](../administration/configuring-fluent-bit.md#unit-sizes) specifications. If unspecified, no limit is enforced. | `0` | +| `storage.pause_on_chunks_overlimit` | If filesystem buffering is enabled, specifies how the input plugin should behave after the global `storage.max_chunks_up` limit is reached. When set to `off`, the plugin will stop buffering data to memory but continue buffering data to the filesystem. When set to `on`, the plugin will stop both memory buffering and filesystem buffering until more memory becomes available. Possible values: `on`, `off`. 
| `off` | + +The following configuration example sets global settings in `service` to support filesystem buffering, then configures one input plugin with filesystem buffering and one input plugin with memory-only buffering: + +{% tabs %} +{% tab title="fluent-bit.yaml" %} + +```yaml +service: + flush: 1 + log_level: info + storage.path: /var/log/flb-storage/ + storage.sync: normal + storage.checksum: off + storage.max_chunks_up: 128 + storage.backlog.mem_limit: 5M + +pipeline: + inputs: + - name: cpu + storage.type: filesystem + + - name: mem + storage.type: memory +``` + +{% endtab %} +{% tab title="fluent-bit.conf" %} + +```text +[SERVICE] + flush 1 + log_Level info + storage.path /var/log/flb-storage/ + storage.sync normal + storage.checksum off + storage.max_chunks_up 128 + storage.backlog.mem_limit 5M + +[INPUT] + name cpu + storage.type filesystem + +[INPUT] + name mem + storage.type memory +``` + +{% endtab %} +{% endtabs %} + +### Per-output settings + +If any active input plugins use filesystem buffering, you can limit how many chunks are buffered to the filesystem based on the output plugin where Fluent Bit intends to route that chunk. To do so, use this configuration parameter: + +| Key | Description | Default | +| :--- | :--- | :--- | +| `storage.total_limit_size` | Sets the size of the queue for this output plugin. This queue is the number of chunks buffered to the filesystem with this output as the intended destination. If the output plugin reaches its `storage.total_limit_size` capacity, the oldest chunk from its queue will be discarded to make room for new data. This value must follow [unit size](../administration/configuring-fluent-bit.md#unit-sizes) specifications. | _none_ | + +The following configuration example creates records with CPU usage samples in the filesystem which are delivered to Google Stackdriver service while limiting the logical queue to `5M`: + +{% tabs %} +{% tab title="fluent-bit.yaml" %} + +```yaml +service: + flush: 1 + log_level: info + storage.path: /var/log/flb-storage/ + storage.sync: normal + storage.checksum: off + storage.max_chunks_up: 128 + storage.backlog.mem_limit: 5M + +pipeline: + inputs: + - name: cpu + storage.type: filesystem + + outputs: + - name: stackdriver + match: '*' + storage.total_limit_size: 5M +``` + +{% endtab %} +{% tab title="fluent-bit.conf" %} + +```text +[SERVICE] + flush 1 + log_Level info + storage.path /var/log/flb-storage/ + storage.sync normal + storage.checksum off + storage.max_chunks_up 128 + storage.backlog.mem_limit 5M + +[INPUT] + name cpu + storage.type filesystem + +[OUTPUT] + name stackdriver + match * + storage.total_limit_size 5M +``` + +{% endtab %} +{% endtabs %} + +In this example, if Fluent Bit is offline because of a network issue, it will continue buffering CPU samples, keeping a maximum of 5 MB of the newest data. diff --git a/pipeline/inputs/tail.md b/pipeline/inputs/tail.md index 652b4c8fa..f0ad5efc9 100644 --- a/pipeline/inputs/tail.md +++ b/pipeline/inputs/tail.md @@ -76,7 +76,7 @@ Although Fluent Bit has a soft limit of 2 MB for chunks, input plugins like If Fluent Bit isn't configured to use filesystem buffering, it needs mechanisms to protect against high memory consumption during backpressure scenarios (for example, when destination endpoints are down or network issues occur). The `mem_buf_limit` option restricts how much memory in chunks an input plugin can use. -When filesystem buffering is enabled, memory management works differently. 
For more details, see [Buffering and Storage](../../administration/buffering-and-storage.md). +When filesystem buffering is enabled, memory management works differently. For more details, see [Buffering](../../pipeline/buffering.md) and [Backpressure](../../administration/backpressure.md). ## Database file