WIP: UTF8 encoding support for in_tail and in_syslog #1668

nigels-com · 2019-10-19T11:00:36Z

This is a proof-of-concept integration of tutf8e, the "Tiny UTF-8 Encoder for C",
into the fluent-bit modify filter. The test here feeds ISO-8859-2 through the tail input plugin
and uses the modify filter's UTF8 operation to do the encoding via tutf8e.

@edsiper @bluebike

The more interesting part is in plugins/filter_modify/modify.c

In relation to #1180

https://github.com/nigels-com/tutf8e

$ cat test.utf8 
A quick brown fox jumps over the lazy dog
Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu.

$ iconv -o test.iso-8859-2 --from-code=utf8 --to-code=iso-8859-2 test.utf8 
$ cat test.iso-8859-2 
A quick brown fox jumps over the lazy dog
Nech� ji� h���n� saxofony ��bl� rozezvu�� s�� �d�sn�mi t�ny waltzu, tanga a quickstepu.

$ cat test.cfg
[INPUT]
    Name        tail
    Path        test.iso-8859-2

[FILTER]
    Name modify
    Match *
    Utf8 log iso-8859-2

[OUTPUT]
    Name   stdout
    Match  *

$ bin/fluent-bit -c test.cfg 
Fluent Bit v1.4.0
Copyright (C) Treasure Data

[2019/10/19 20:54:17] [ info] [storage] initializing...
[2019/10/19 20:54:17] [ info] [storage] in-memory
[2019/10/19 20:54:17] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2019/10/19 20:54:17] [ info] [engine] started (pid=14642)
[2019/10/19 20:54:17] [ info] [sp] stream processor started
[0] tail.0: [1571482457.695458890, {"log"=>"A quick brown fox jumps over the lazy dog"}]
[1] tail.0: [1571482457.695461931, {"log"=>"Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu."}]

nigels-com · 2019-10-20T10:52:20Z

I did another rev on the API and the filter_modify logic.

It is a two-pass approach. First, determine the length of the required output buffer.
If the output size is unchanged, the value is already UTF8, so there is no need to encode.
If the output size is small, encode to a stack-allocated buffer.
If the output size is large, malloc/free enough heap memory.

        msgpack_pack_map(packer, map->via.map.size);
        for (i = 0; i < map->via.map.size; i++) {
            msgpack_pack_object(packer, map->via.map.ptr[i].key);

            /* Do UTF8 encoding for this value? */
            if (map->via.map.ptr[i].val.type == MSGPACK_OBJECT_STR &&
                kv_key_matches_str_rule_key(&map->via.map.ptr[i], rule)) {
                size_t size = 0;
                if (!tutf8e_buffer_length_iso_8859_2(map->via.map.ptr[i].val.via.str.ptr, map->via.map.ptr[i].val.via.str.size, &size) && size)
                {
                    const size_t TUTF8_DEFAULT_BUFFER = 256;

                    /* Already UTF8 encoded? */
                    if (size == map->via.map.ptr[i].val.via.str.size) {
                    }
                    /* Small enough for encoding to stack? */
                    else if (size<=TUTF8_DEFAULT_BUFFER)
                    {
                        size = TUTF8_DEFAULT_BUFFER;
                        char buffer[TUTF8_DEFAULT_BUFFER];
                        if (!tutf8e_buffer_encode_iso_8859_2(buffer, &size, map->via.map.ptr[i].val.via.str.ptr, map->via.map.ptr[i].val.via.str.size))
                        {
                            helper_pack_string(packer, buffer, size);
                            ret = FLB_FILTER_MODIFIED;
                            continue;
                        }
                    }
                    /* malloc/free the encoded copy */
                    else {
                        char *buffer = (char *) flb_malloc(size);
                        if (buffer && !tutf8e_buffer_encode_iso_8859_2(buffer, &size, map->via.map.ptr[i].val.via.str.ptr, map->via.map.ptr[i].val.via.str.size))
                        {
                            helper_pack_string(packer, buffer, size);
                            free(buffer);
                            ret = FLB_FILTER_MODIFIED;
                            continue;
                        }                        
                        free(buffer);
                    }
                }
            }
            msgpack_pack_object(packer, map->via.map.ptr[i].val);
        }
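
A minimal, self-contained sketch of the two-pass idea described above. It assumes ISO-8859-1 rather than ISO-8859-2 so that no mapping table is needed. The names `latin1_utf8_length`, `latin1_utf8_encode`, `encode_and_emit`, and `sink_emit` are hypothetical and are not the tutf8e or fluent-bit API. In this sketch "size unchanged" means the input is pure ASCII, which is byte-identical in UTF-8.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_BUFFER 256

/* Pass 1: bytes needed to hold `in` (ISO-8859-1) re-encoded as UTF-8. */
size_t latin1_utf8_length(const char *in, size_t len)
{
    size_t i, out = 0;
    for (i = 0; i < len; i++) {
        out += ((unsigned char) in[i] < 0x80) ? 1 : 2;
    }
    return out;
}

/* Pass 2: write the UTF-8 form into `out`, which must hold
 * latin1_utf8_length(in, len) bytes. Returns the bytes written. */
size_t latin1_utf8_encode(char *out, const char *in, size_t len)
{
    size_t i, o = 0;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char) in[i];
        if (c < 0x80) {
            out[o++] = (char) c;
        }
        else {
            out[o++] = (char) (0xC0 | (c >> 6));
            out[o++] = (char) (0x80 | (c & 0x3F));
        }
    }
    return o;
}

/* Hypothetical output callback, standing in for the msgpack packer. */
typedef void (*emit_fn)(const char *data, size_t len, void *user);

/* Returns 1 if re-encoded, 0 if passed through unchanged,
 * -1 if the heap allocation failed (nothing is emitted). */
int encode_and_emit(const char *in, size_t len, emit_fn emit, void *user)
{
    char stack_buf[SMALL_BUFFER];
    char *buf = stack_buf;
    size_t need = latin1_utf8_length(in, len);

    if (need == len) {               /* pure ASCII: nothing to encode */
        emit(in, len, user);
        return 0;
    }
    if (need > SMALL_BUFFER) {       /* too big for the stack buffer */
        buf = malloc(need);
        if (buf == NULL) {
            return -1;
        }
    }
    latin1_utf8_encode(buf, in, len);
    emit(buf, need, user);
    if (buf != stack_buf) {
        free(buf);
    }
    return 1;
}

/* Example sink: keeps a (truncated) copy of whatever was emitted. */
struct sink { char data[64]; size_t len; };

void sink_emit(const char *data, size_t len, void *user)
{
    struct sink *s = user;
    s->len = len < sizeof(s->data) ? len : sizeof(s->data);
    memcpy(s->data, data, s->len);
}
```

Calling `encode_and_emit("t\xF6r", 3, sink_emit, &s)` would take the stack-buffer path and hand four bytes to the sink. A single release point for the heap buffer keeps the allocate/free pairing easy to audit.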

bluebike · 2019-10-20T10:57:00Z

Proof-of-concept: ok.
In real life, encoding should be done before parsing, because the onigmo parser is configured to accept only UTF-8. Onigmo supports different encodings, but there is no configuration option for that. Also, after running the parser, messages are converted to msgpack, where UTF-8 is "assumed".

So re-encoding should be done before parsing or in flb_parser(?).
Doing that in flb_parser doesn't always work, because some inputs don't call it (it's optional).

nigels-com · 2019-10-20T12:11:15Z

> In real life encoding should be done before parsing

Oh! Drat! Never mind then...

nigels-com · 2019-10-20T12:41:32Z

It does seem correct that UTF8 encoding should happen upstream of the parser, in the input implementation.
That's a broader, more intricate job indeed, touching all the input plugin code paths.

nigels-com · 2019-10-21T12:54:50Z

I refactored this POC branch to move the UTF8 encoding into src/flb_encode.c as a wrapper for message packing of strings with an optional UTF8 encoding step. It produces the same output as before for the tail input plugin, without needing a filter.

cmake -DFLB_ENCODE=No

[0] tail.0: [1571661916.044150883, {"log"=>"A quick brown fox jumps over the lazy dog"}]
[1] tail.0: [1571661916.044157391, {"log"=>"Nech� ji� h���n� saxofony ��bl� rozezvu�� s�� �d�sn�mi t�ny waltzu, tanga a quickstepu."}]

cmake -DFLB_ENCODE=Yes

[0] tail.0: [1571662021.344263571, {"log"=>"A quick brown fox jumps over the lazy dog"}]
[1] tail.0: [1571662021.344278390, {"log"=>"Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu."}]

with

$ cat test.cfg
[INPUT]
    Name        tail
    Path        test.iso-8859-2

[OUTPUT]
    Name   stdout
    Match  *

nigels-com · 2019-10-22T12:32:30Z

Updated with in_tail configurable encoding.

$ cat test.cfg 
[INPUT]
    Name        tail
    Path        test.iso-8859-2
    Encoding    iso-8859-2

[OUTPUT]
    Name   stdout
    Match  *

$ cat test2.cfg 
[INPUT]
    Name        tail
    Path        test.iso-8859-1
    Encoding    iso-8859-1

[OUTPUT]
    Name   stdout
    Match  *

Producing output:

[1] tail.0: [1571747508.059319590, {"log"=>"Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu."}]

[1] tail.0: [1571747516.613772304, {"log"=>"Albert osti fagotin ja töräytti puhkuvan melodian."}]

nigels-com · 2019-10-25T11:38:59Z

I think this branch is ready for more serious consideration and seems functionally complete for UTF8 encoding of iso_8859_* and windows_125* input in in_tail and in_syslog. I'd be happy to get some feedback, especially from real-world testing.

edsiper · 2019-11-01T10:48:47Z

@nigels-com is the tutf8e lib, as a standalone, in good shape to be included in master under lib/ ?

nigels-com · 2019-11-01T22:39:30Z

@edsiper I'm happy with the general shape and scope of tutf8e, as it is. I feel like the test coverage could be expanded, and the documentation could use some more fleshing out. Would you like a separate pull request (with a simple, tidy history) for integrating that?

edsiper · 2019-11-01T22:45:43Z

@nigels-com a simple PR with that inclusion under lib/ and proper options in the main CMakeLists.txt should be enough :)

nigels-com · 2019-11-01T22:58:02Z

@edsiper Sure thing. Too easy.

nigels-com · 2019-11-04T22:53:17Z

Rebasing onto mainline.

nigels-com · 2019-11-16T23:55:15Z

@edsiper What's the next step here?

edsiper · 2019-11-26T16:19:48Z

just a minor change request to merge this: would you please adjust the following commit message ?

from

filter-modify: ..

to

filter_modify: ...

nigels-com · 2019-11-27T02:59:03Z

@edsiper Yes, done.

bluebike · 2019-11-27T12:15:20Z

Why is decoding done at msgpack generation?
I think that if the onigmo parser is used, it wouldn't like non-UTF-8 data here.
UTF-8 encoding should be done before:

fluent-bit/plugins/in_syslog/syslog_prot.c

Lines 100 to 102 in e97dc53

    
        /* Process the string */
        ret = flb_parser_do(ctx->parser, p, len,
                            &out_buf, &out_size, &out_time);

Also:
buffer is allocated with flb_malloc but freed with regular free.

fluent-bit/src/flb_encoder.c

Line 56 in e97dc53

char *buffer = (char *) flb_malloc(size);

(sorry for the late comment)

nigels-com · 2019-11-28T09:56:23Z

Yes @bluebike I certainly see your point. It was about a month ago, but we did seem to agree that encoding should be upstream of the parser, and I do recall thinking that Onigmo could/should be stripped down to only UTF-8 support, if that's all we intend to use.

flb_free is an easy-enough fix, but I'll need some time to consider if wrapping flb_parser_do will work as nicely as wrapping msgpack_pack_str. For example, what do we do if/when decoding fails, just fall back to the raw input data?
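
One way the fallback question above could play out, as a minimal sketch only: encode first, hand the UTF-8 result to the parser, and pass the raw bytes through if encoding fails. Everything here is hypothetical (`encode_fn`, `parse_fn`, `parse_utf8_first`, and the two demo callbacks); it does not reproduce the `flb_parser_do` signature or the tutf8e API, and falling back to raw input is just one of the possible policies.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback types standing in for the encoder and parser.
 * Both return 0 on success. */
typedef int (*encode_fn)(char *out, size_t *out_len,
                         const char *in, size_t len);
typedef int (*parse_fn)(const char *buf, size_t len, void *user);

/* Re-encode `in` to UTF-8 and hand the result to `parse`. If encoding
 * or allocation fails, hand the raw bytes to `parse` instead. */
int parse_utf8_first(encode_fn encode, parse_fn parse,
                     const char *in, size_t len, void *user)
{
    /* An 8-bit code page maps each byte to a BMP code point, so the
     * UTF-8 form needs at most 3 bytes per input byte. */
    size_t out_len = len * 3;
    char *out = NULL;
    int ret;

    if (len > 0) {
        out = malloc(out_len);
    }
    if (out != NULL && encode(out, &out_len, in, len) == 0) {
        ret = parse(out, out_len, user);   /* parser sees UTF-8 */
    }
    else {
        ret = parse(in, len, user);        /* fallback: raw input */
    }
    free(out);
    return ret;
}

/* Demo encoder: ISO-8859-1 to UTF-8. It rejects bytes 0x80..0x9F
 * purely so that the failure path can be exercised. */
int demo_encode(char *out, size_t *out_len, const char *in, size_t len)
{
    size_t i, o = 0;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char) in[i];
        if (c < 0x80) {
            out[o++] = (char) c;
        }
        else if (c < 0xA0) {
            return -1;
        }
        else {
            out[o++] = (char) (0xC0 | (c >> 6));
            out[o++] = (char) (0x80 | (c & 0x3F));
        }
    }
    *out_len = o;
    return 0;
}

/* Demo parser: records how many bytes it was given. */
int demo_parse(const char *buf, size_t len, void *user)
{
    (void) buf;
    *(size_t *) user = len;
    return 0;
}
```

With the demo callbacks, a three-byte input containing 0xF6 reaches the parser as four UTF-8 bytes, while an input containing 0x85 reaches it unchanged. Whether silently passing raw bytes to a UTF-8-only parser is acceptable is exactly the open design question.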

nigels-com · 2019-11-28T11:10:36Z

@bluebike Indeed the story of tail is a bit complicated - one path via the parser, other paths without a parser. The implication is that we need both flb_msgpack_encode_utf8 and flb_parser_do_encode_utf8.

I'll take a fresh look over the weekend, but I pushed an initial "work-in-progress" that does pass my simple test:

$ bin/fluent-bit -c test.cfg 
Fluent Bit v1.4.0
...
[0] tail.0: [1574939401.608484519, {"log"=>"A quick brown fox jumps over the lazy dog"}]
[1] tail.0: [1574939401.608490066, {"log"=>"Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu."}]

bluebike · 2019-12-18T23:08:26Z

(((ping)))

nigels-com · 2019-12-19T08:09:47Z

Yeah, I think this will need an overhaul based on the discussion so far.

moloch90 · 2020-01-23T13:37:51Z

Hello,
I am building a POC for my project. I'd like to know whether your solution works with td-agent-bit, and how I can install it there?

Thanks!

edsiper · 2020-05-05T21:38:38Z

@nigels-com @bluebike

what needs to be done to simplify the implementation?

nigels-com · 2020-05-12T11:37:11Z

This branch had gotten stale.
I did manage to rebase it and update for the current master.

$ cat test.utf8
A quick brown fox jumps over the lazy dog
Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu.

$ iconv -o test.iso-8859-2 --from-code=utf8 --to-code=iso-8859-2 test.utf8

$ cat test.cfg 
[INPUT]
    Name        tail
    Path        test.iso-8859-2
    Encoding    iso-8859-2

[OUTPUT]
    Name   stdout
    Match  *

$ bin/fluent-bit -c test.cfg 
Fluent Bit v1.5.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/05/12 21:31:51] [ info] [storage] version=1.0.3, initializing...
[2020/05/12 21:31:51] [ info] [storage] in-memory
[2020/05/12 21:31:51] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/05/12 21:31:51] [ info] [engine] started (pid=21646)
[2020/05/12 21:31:51] [ info] [sp] stream processor started
[0] tail.0: [1589283111.393193879, {"log"=>"A quick brown fox jumps over the lazy dog"}]
[1] tail.0: [1589283111.393202477, {"log"=>"Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu."}]

I'll circle back and reconsider the data-flow considerations, to do the UTF-8 conversion upstream of parsing.

bluebike · 2020-05-15T09:57:09Z

@nigels-com well... have been waiting some time.
So should we go with (my) #1180 even if it doesn't work with windows?

nigels-com · 2020-05-15T11:07:52Z

> So should we go with (my) #1180 even if it doesn't work with windows?

One possible way forward is to rework #1180 to use #1703 rather than iconv.
Any objections or concerns with that @bluebike ?

bluebike · 2020-05-24T22:19:18Z

> One possible way forward is to rework #1180 to use #1703 rather than iconv.

Ok... I can look for that... (sigh)

edsiper · 2020-06-30T18:20:55Z

ping

nigels-com · 2020-07-01T11:14:22Z

Conversation has moved over to #2287 concerning the tutf8e API and if it can be and should be more tolerant about invalid input characters. I would expect to refactor this to be more aligned to #1180 (but perhaps more complete).
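
To make "more tolerant about invalid input characters" concrete, here is a minimal sketch of one possible policy: substitute U+FFFD for bytes the source code page leaves undefined, instead of failing. It uses Windows-1252 as the example. The names `cp1252_to_utf8`, `utf8_put`, and `invalid_policy` are hypothetical and are not the tutf8e API or what #2287 proposes.

```c
#include <stddef.h>
#include <stdint.h>

/* Windows-1252 bytes 0x80..0x9F as Unicode code points; 0 marks the
 * five bytes that the code page leaves undefined. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Write code point `cp` as UTF-8; returns the bytes written (1..3). */
size_t utf8_put(char *out, uint16_t cp)
{
    if (cp < 0x80) {
        out[0] = (char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char) (0xC0 | (cp >> 6));
        out[1] = (char) (0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char) (0xE0 | (cp >> 12));
    out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char) (0x80 | (cp & 0x3F));
    return 3;
}

enum invalid_policy { INVALID_FAIL, INVALID_REPLACE };

/* `out` must have room for 3 * len bytes. Returns 0 on success, or -1
 * if an undefined byte is met under INVALID_FAIL. */
int cp1252_to_utf8(char *out, size_t *out_len,
                   const char *in, size_t len, enum invalid_policy policy)
{
    size_t i, o = 0;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char) in[i];
        uint16_t cp = c;               /* ASCII and 0xA0..0xFF map 1:1 */
        if (c >= 0x80 && c < 0xA0) {
            cp = cp1252_high[c - 0x80];
            if (cp == 0) {
                if (policy == INVALID_FAIL) {
                    return -1;
                }
                cp = 0xFFFD;           /* U+FFFD REPLACEMENT CHARACTER */
            }
        }
        o += utf8_put(out + o, cp);
    }
    *out_len = o;
    return 0;
}
```

Under `INVALID_REPLACE` a log line with a stray 0x81 byte still produces valid UTF-8, at the cost of losing the original byte value. Under `INVALID_FAIL` the caller decides what to do with the record.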

bluebike · 2020-07-02T22:37:55Z

pong.
Yes, I'm trying to make tutf8e more usable (invalid char handling + non-changing decoding) in #2287;
after that I'll try to glue that to #1180, which is basically a small job.
(But I've been in a bit of a hurry at work...)

nigels-com · 2021-12-10T23:04:32Z

Seems like PR #2287 didn't make it in. Closing for now.

nigels-com force-pushed the filter-modify-utf8-encode branch from 77ba0f2 to aefdd27 Compare October 20, 2019 10:11

nigels-com force-pushed the filter-modify-utf8-encode branch from dfa92e2 to 90d35e9 Compare October 22, 2019 12:29

nigels-com changed the title ~~Proof-of-concept: filter_modify support for UTF8 encoding string values~~ Proof-of-concept: in_tail support for string UTF8 encoding Oct 22, 2019

nigels-com changed the title ~~Proof-of-concept: in_tail support for string UTF8 encoding~~ Proof-of-concept: in_tail and in_syslog support for string UTF8 encoding Oct 22, 2019

nigels-com changed the title ~~Proof-of-concept: in_tail and in_syslog support for string UTF8 encoding~~ UTF8 encoding support for in_tail and in_syslog Oct 25, 2019

nigels-com mentioned this pull request Oct 25, 2019

flb_iconv: charset decoding/encoding #1180

Closed

nigels-com force-pushed the filter-modify-utf8-encode branch from 84fc1be to 63bc0c8 Compare October 30, 2019 21:47

nigels-com mentioned this pull request Nov 1, 2019

lib: tutf8e: A tiny UTF-8 encoder for C #1703

Merged

nigels-com force-pushed the filter-modify-utf8-encode branch from 63bc0c8 to 5cefe93 Compare November 4, 2019 22:53

nigels-com mentioned this pull request Nov 6, 2019

Document UTF-8 encoding input plugin support fluent/fluent-bit-docs#235

Closed

nigels-com force-pushed the filter-modify-utf8-encode branch from 5cefe93 to e97dc53 Compare November 27, 2019 02:58

nigels-com changed the title ~~UTF8 encoding support for in_tail and in_syslog~~ WIP: UTF8 encoding support for in_tail and in_syslog Nov 28, 2019

edsiper self-assigned this May 5, 2020

edsiper added the waiting-for-user Waiting for more information, tests or requested changes label May 5, 2020

nigels-com added 7 commits May 12, 2020 20:59

filter_modify: Proof-of-concept integration of UTF8 encoding for stri…

e2b9449

…ng values Signed-off-by: Nigel Stewart <[email protected]>

in_tail: Encoding parameter for input plugin for conversion to UTF8

ea255c2

Signed-off-by: Nigel Stewart <[email protected]>

in_syslog: Encoding parameter for input plugin for conversion to UTF8

ea06fa5

Signed-off-by: Nigel Stewart <[email protected]>

build: FLB_UTF8_ENCODER to enable UTF8 encoding, FLB_HAVE_UTF8_ENCODE…

2fff129

…R preprocessor Signed-off-by: Nigel Stewart <[email protected]>

in_tail: flb_parser_do_encode_utf8 for UTF8 encoding and parsing in o…

f5d0371

…ne step Signed-off-by: Nigel Stewart <[email protected]>

in_tail: fixups for rebasing onto master

85b3f91

Signed-off-by: Nigel Stewart <[email protected]>

in_tail: config_map entry for encoding

7a16331

Signed-off-by: Nigel Stewart <[email protected]>

nigels-com force-pushed the filter-modify-utf8-encode branch from fb4b9ef to 7a16331 Compare May 12, 2020 11:30

nigels-com requested review from edsiper, fujimotos and koleini as code owners May 12, 2020 11:30

nigels-com closed this Dec 10, 2021
