Commit cf29d3b

Authored by cosmo0920, alexakreizinger, and esmerel
in_tail: Add descriptions for encoding parameters on in tail (#1870)
* in_tail: Add a description and note for Unicode.Encoding parameter

Signed-off-by: Hiroshi Hatake <[email protected]>

* Update pipeline/inputs/tail.md

Co-authored-by: Alexa Kreizinger <[email protected]>
Signed-off-by: Hiroshi Hatake <[email protected]>

* in_tail: Add generic.encoding parameter descriptions

Also I added the reason why we need to support these parameters and how to use them.

Signed-off-by: Hiroshi Hatake <[email protected]>

* Suppress lint warnings

Signed-off-by: Hiroshi Hatake <[email protected]>

* Apply suggestions from code review

This should correct the severe vale errors and most of the suggestions, as well as matching current style.

Signed-off-by: Lynette Miles <[email protected]>

---------

Signed-off-by: Hiroshi Hatake <[email protected]>
Signed-off-by: Hiroshi Hatake <[email protected]>
Signed-off-by: Lynette Miles <[email protected]>
Co-authored-by: Alexa Kreizinger <[email protected]>
Co-authored-by: Lynette Miles <[email protected]>
1 parent ee2a27e commit cf29d3b

1 file changed: +94 -0 lines changed

pipeline/inputs/tail.md

Lines changed: 94 additions & 0 deletions
@@ -39,6 +39,7 @@ The plugin supports the following configuration parameters:
| `file_cache_advise` | Set the `posix_fadvise` in `POSIX_FADV_DONTNEED` mode. This reduces the usage of the kernel file cache. This option is ignored if not running on Linux. | `on` |
| `threaded` | Indicates whether to run this input in its own [thread](../../administration/multithreading.md#inputs). | `false` |
| `Unicode.Encoding` | Set the Unicode character encoding of the file data. This parameter requests two-byte aligned chunk and buffer sizes. If data is not aligned for two bytes, Fluent Bit will use two-byte alignment automatically to avoid character breakages on consuming boundaries. Supported values: `UTF-16LE`, `UTF-16BE`, and `auto`. | `none` |
| `Generic.Encoding` | Set the non-Unicode encoding of the file data. Supported values: `ShiftJIS`, `UHC`, `GBK`, `GB18030`, `Big5`, `Win866`, `Win874`, `Win1250`, `Win1251`, `Win1252`, `Win1253`, `Win1254`, `Win1255`, and `Win1256`. | `none` |

## Buffers and memory management

@@ -85,6 +86,13 @@ The `Unicode.Encoding` parameter is dependent on the simdutf library, which is i
Additionally, the `auto` setting for `Unicode.Encoding` isn't supported in all cases, and can make mistakes when it tries to guess the correct encoding. For best results, use either the `UTF-16LE` or `UTF-16BE` setting if you know the encoding type of the target file.
{% endhint %}

{% hint style="info" %}
The `Unicode.Encoding` parameter is dependent on the `simdutf` library, which is itself dependent on C++ version 11 or later. In environments that use earlier versions of C++, the `Unicode.Encoding` parameter will fail.

Additionally, the `auto` setting for `Unicode.Encoding` isn't supported in all cases, and can make mistakes when it tries to guess the correct encoding. For best results, use either the `UTF-16LE` or `UTF-16BE` setting if you know the encoding type of the target file.
{% endhint %}

## Monitor a large number of files

To monitor a large number of files, you can increase the `inotify` settings in your Linux environment by modifying the following `sysctl` parameters:
@@ -465,3 +473,89 @@ While file rotation is handled, there are risks of potential log loss when using
- Final note: the `Path` patterns can't match the rotated files. Otherwise, the rotated file would be read again and lead to duplicate records.

{% endhint %}

## Character encoding conversion

This feature allows Fluent Bit to convert logs from various character encodings into the standard UTF-8 format.
This is crucial for processing logs from systems, especially Windows, that use legacy or non-UTF-8 encodings.
Proper conversion ensures that your log data is correctly parsed, indexed, and searchable.

### When to use this feature

You should use this feature if your log files or messages aren't in UTF-8 and you are seeing garbled or incorrectly rendered characters.
This is common in environments that use:

- Modern Windows applications that log in UTF-16.

- Legacy Windows systems with applications that use traditional code pages (for example, ShiftJIS, GBK, Win1252).

### Configuration parameters

To enable encoding conversion, you will use one of the following two parameters within an input plugin configuration.

1. `Unicode.Encoding`

   Use this parameter for high-performance conversion of UTF-16 encoded logs to UTF-8. This method utilizes modern processor features (SIMD instructions) to accelerate the conversion process, making it highly efficient.

   - Use Case: Ideal for logs coming from modern Windows environments that default to UTF-16.
   - Supported Values:
     - `UTF-16LE` (Little-Endian)
     - `UTF-16BE` (Big-Endian)

1. `Generic.Encoding`

   Use this parameter to convert from a wide variety of other character encodings, particularly legacy Windows code pages.

   - Use Case: Essential for logs from older systems or applications configured for specific regions, common in East Asia and Eastern Europe.
   - Supported Values: You can use any of the names or aliases listed below.

### East Asian encodings

- `ShiftJIS` (Aliases: `SJIS`, `CP932`, `Windows-31J`)
- `GB18030`
- `GBK`: (Alias: `CP936`)
- `UHC` (Unified Hangul Code): (Aliases: `CP949` and `Windows-949`)
- `Big5`: (Alias: `CP950`)

### Windows (ANSI) encodings

- `Win1250` (Central European): (Alias: `CP1250`)
- `Win1251` (Cyrillic): (Alias: `CP1251`)
- `Win1252` (Western European / Latin): (Alias: `CP1252`)
- `Win1253` (Greek): (Alias: `CP1253`)
- `Win1254` (Turkish): (Alias: `CP1254`)
- `Win1255` (Hebrew): (Alias: `CP1255`)
- `Win1256` (Arabic): (Alias: `CP1256`)

### DOS (OEM) encodings

- `Win866` (Cyrillic - DOS): (Alias: `CP866`)
- `Win874` (Thai): (Alias: `CP874`)

### Configuration example

Here is an example of how to use `Generic.Encoding` with the Tail input plugin to read a log file encoded in ShiftJIS.

{% tabs %}
{% tab title="fluent-bit.yaml" %}

```yaml
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      generic.encoding: ShiftJIS
```

{% endtab %}
{% tab title="fluent-bit.conf" %}

```text
[INPUT]
    Name              tail
    Path              C:\path\to\your\sjis.log
    Generic.Encoding  ShiftJIS
```

{% endtab %}
{% endtabs %}
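
For UTF-16 sources, a comparable configuration with `Unicode.Encoding` might look like the following sketch. The path is a hypothetical placeholder, and the lowercase key `unicode.encoding` is an assumption that follows the same YAML naming pattern as `generic.encoding` in the preceding example. `UTF-16LE` is one of the supported values listed in the parameter table.

```yaml
pipeline:
  inputs:
    - name: tail
      # Hypothetical path to a UTF-16LE encoded log file.
      path: /path/to/utf16le.log
      # Assumed YAML spelling of the Unicode.Encoding parameter.
      unicode.encoding: UTF-16LE
```

Setting the byte order explicitly avoids relying on `auto`, which can guess the encoding incorrectly.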