Skip to content

Conversation

nigels-com
Copy link
Contributor

@nigels-com nigels-com commented Oct 19, 2019

This is a proof-of-concept integration of tutf8e "Tiny UTF-8 Encoder for C"
into fluent-bit modify filter. The test here is feeding iso-8859-2 via the tail input plugin
and using the modify filter UTF8 operation to do encoding via tutf8e.

@edsiper @bluebike

The more interesting part is in plugins/filter_modify/modify.c

In relation to #1180

https://github.com/nigels-com/tutf8e

$ cat test.utf8 
A quick brown fox jumps over the lazy dog
Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu.

$ iconv -o test.iso-8859-2 --from-code=utf8 --to-code=iso-8859-2 test.utf8 
$ cat test.iso-8859-2 
A quick brown fox jumps over the lazy dog
Nech� ji� h���n� saxofony ��bl� rozezvu�� s�� �d�sn�mi t�ny waltzu, tanga a quickstepu.

$ cat test.cfg
[INPUT]
    Name        tail
    Path        test.iso-8859-2

[FILTER]
    Name modify
    Match *
    Utf8 log iso-8859-2

[OUTPUT]
    Name   stdout
    Match  *

$ bin/fluent-bit -c test.cfg 
Fluent Bit v1.4.0
Copyright (C) Treasure Data

[2019/10/19 20:54:17] [ info] [storage] initializing...
[2019/10/19 20:54:17] [ info] [storage] in-memory
[2019/10/19 20:54:17] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2019/10/19 20:54:17] [ info] [engine] started (pid=14642)
[2019/10/19 20:54:17] [ info] [sp] stream processor started
[0] tail.0: [1571482457.695458890, {"log"=>"A quick brown fox jumps over the lazy dog"}]
[1] tail.0: [1571482457.695461931, {"log"=>"Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu."}]

@nigels-com nigels-com force-pushed the filter-modify-utf8-encode branch from 77ba0f2 to aefdd27 Compare October 20, 2019 10:11
@nigels-com
Copy link
Contributor Author

I did another rev on the API and the filter_modify logic.

It is a two-pass approach. First determine the length of the required output buffer.
If the output size is unchanged, it's already UTF8 - no need to encode.
If the output size is small, encode to stack-allocated buffer.
If the otuput size is large, malloc/free enough heap memory.

        msgpack_pack_map(packer, map->via.map.size);
        for (i = 0; i < map->via.map.size; i++) {
            msgpack_pack_object(packer, map->via.map.ptr[i].key);

            /* Do UTF8 encoding for this value? */
            if (map->via.map.ptr[i].val.type == MSGPACK_OBJECT_STR &&
                kv_key_matches_str_rule_key(&map->via.map.ptr[i], rule)) {
                size_t size = 0;
                if (!tutf8e_buffer_length_iso_8859_2(map->via.map.ptr[i].val.via.str.ptr, map->via.map.ptr[i].val.via.str.size, &size) && size)
                {
                    const size_t TUTF8_DEFAULT_BUFFER = 256;

                    /* Already UTF8 encoded? */
                    if (size == map->via.map.ptr[i].val.via.str.size) {
                    }
                    /* Small enough for encoding to stack? */
                    else if (size<=TUTF8_DEFAULT_BUFFER)
                    {
                        size = TUTF8_DEFAULT_BUFFER;
                        char buffer[TUTF8_DEFAULT_BUFFER];
                        if (!tutf8e_buffer_encode_iso_8859_2(buffer, &size, map->via.map.ptr[i].val.via.str.ptr, map->via.map.ptr[i].val.via.str.size))
                        {
                            helper_pack_string(packer, buffer, size);
                            ret = FLB_FILTER_MODIFIED;
                            continue;
                        }
                    }
                    /* malloc/free the encoded copy */
                    else {
                        char *buffer = (char *) flb_malloc(size);
                        if (buffer && !tutf8e_buffer_encode_iso_8859_2(buffer, &size, map->via.map.ptr[i].val.via.str.ptr, map->via.map.ptr[i].val.via.str.size))
                        {
                            helper_pack_string(packer, buffer, size);
                            free(buffer);
                            ret = FLB_FILTER_MODIFIED;
                            continue;
                        }                        
                        free(buffer);
                    }
                }
            }
            msgpack_pack_object(packer, map->via.map.ptr[i].val);
        }

@bluebike
Copy link
Contributor

Proof-of-concept: ok.
In real life encoding should be done before parsing, because onigmo parser is configured to accept only UTF-8. Onigmo supports different encodings...there is no configuration option for that. Also after running parser, messages are converted to msgpack... we UTF-8 is "assumed".

So re-encoding should be done before parsing or in flb_parser(?).
Doing that flb_parser doesn't work always because some inputs doesn't call it (optinally).

@nigels-com
Copy link
Contributor Author

In real life encoding should be done before parsing

Oh! Drat! Never mind then...

@nigels-com
Copy link
Contributor Author

It does seem correct utf8 encoding should happen upstream of the parser, in the input implementation.
That's a broader, more intricate job indeed concerning all the input plugin code paths.

@nigels-com
Copy link
Contributor Author

nigels-com commented Oct 21, 2019

I refactored this POC branch to move the UTF8 encoding into src/flb_encode.c as a wrapper for message packing of strings with an optional UTF8 encoding step. It produces the same output as previously for tail input plugin, without the use or need of filter.

cmake -DFLB_ENCODE=No

[0] tail.0: [1571661916.044150883, {"log"=>"A quick brown fox jumps over the lazy dog"}]
[1] tail.0: [1571661916.044157391, {"log"=>"Nech� ji� h���n� saxofony ��bl� rozezvu�� s�� �d�sn�mi t�ny waltzu, tanga a quickstepu."}]

cmake -DFLB_ENCODE=Yes

[0] tail.0: [1571662021.344263571, {"log"=>"A quick brown fox jumps over the lazy dog"}]
[1] tail.0: [1571662021.344278390, {"log"=>"Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu."}]

with

$ cat test.cfg
[INPUT]
    Name        tail
    Path        test.iso-8859-2

[OUTPUT]
    Name   stdout
    Match  *

@nigels-com nigels-com force-pushed the filter-modify-utf8-encode branch from dfa92e2 to 90d35e9 Compare October 22, 2019 12:29
@nigels-com nigels-com changed the title Proof-of-concept: filter_modify support for UTF8 encoding string values Proof-of-concept: in_tail support for string UTF8 encoding Oct 22, 2019
@nigels-com
Copy link
Contributor Author

Updated with in_tail configurable encoding.

$ cat test.cfg 
[INPUT]
    Name        tail
    Path        test.iso-8859-2
    Encoding    iso-8859-2

[OUTPUT]
    Name   stdout
    Match  *

$ cat test2.cfg 
[INPUT]
    Name        tail
    Path        test.iso-8859-1
    Encoding    iso-8859-1

[OUTPUT]
    Name   stdout
    Match  *

Producing output:

[1] tail.0: [1571747508.059319590, {"log"=>"Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu."}]

[1] tail.0: [1571747516.613772304, {"log"=>"Albert osti fagotin ja töräytti puhkuvan melodian."}]

@nigels-com nigels-com changed the title Proof-of-concept: in_tail support for string UTF8 encoding Proof-of-concept: in_tail and in_syslog support for string UTF8 encoding Oct 22, 2019
@nigels-com nigels-com changed the title Proof-of-concept: in_tail and in_syslog support for string UTF8 encoding UTF8 encoding support for in_tail and in_syslog Oct 25, 2019
@nigels-com
Copy link
Contributor Author

I think this branch is ready for more serious consideration and seems functionally complete for utf8 encoding for iso_8859_* and windows_125* in_tail and in_syslog. I'd be happy to get some feedback, especially for real-world testing.

@edsiper
Copy link
Member

edsiper commented Nov 1, 2019

@nigels-com is it the tutf8e lib as a standalone in a good shape to be included in master under lib/ ?

@nigels-com
Copy link
Contributor Author

@edsiper I'm happy with the general shape and scope of tutf8e, as it is. I feel like the test coverage could be expanded, and the documentation could use some fleshing out some more. Would you like a separate (simple tidy history) pull request for integrating that?

@edsiper
Copy link
Member

edsiper commented Nov 1, 2019

@nigels-com a simple PR with that inclusion under lib/ and proper options in the main CMakeLists.txt should be enough :)

@nigels-com
Copy link
Contributor Author

@edsiper Sure thing. Too easy.

@nigels-com
Copy link
Contributor Author

Rebasing onto mainline.

@nigels-com
Copy link
Contributor Author

@edsiper What's the next step here?

@edsiper
Copy link
Member

edsiper commented Nov 26, 2019

just a minor change request to merge this: would you please adjust the following commit message ?

from

filter-modify: ..

to

filter_modify: ...

@nigels-com nigels-com force-pushed the filter-modify-utf8-encode branch from 5cefe93 to e97dc53 Compare November 27, 2019 02:58
@nigels-com
Copy link
Contributor Author

@edsiper Yes, done.

@bluebike
Copy link
Contributor

Why decoding is done msgpack generation???
I think that if onigmo parser is used.. it wouldn't like non-utf8 data here.
UTF-8 Encoding should be done before:

/* Process the string */
ret = flb_parser_do(ctx->parser, p, len,
&out_buf, &out_size, &out_time);

Also:
buffer is allocated with flb_mallloc but freed with regular free.

char *buffer = (char *) flb_malloc(size);

(sorry of late commenting)

@nigels-com
Copy link
Contributor Author

nigels-com commented Nov 28, 2019

Yes @bluebike I certainly see your point. It was about a month ago, but we did seem to agree that encoding should be upstream of the parser, and I do recall thinking that Omnigo could/should be stripped down to only UTF-8 support, if that all we intend to use.

flb_free is an easy-enough fix, but I'll need some time to consider if wrapping flb_parser_do will work as nicely as wrapping msgpack_pack_str. For example, what do we do if/when decoding fails, just fall back to the raw input data?

@nigels-com nigels-com changed the title UTF8 encoding support for in_tail and in_syslog WIP: UTF8 encoding support for in_tail and in_syslog Nov 28, 2019
@nigels-com
Copy link
Contributor Author

@bluebike Indeed the story of tail is a bit complicated - one path via the parser, other paths without a parser. The implication is that we need both flb_msgpack_encode_utf8 and flb_parser_do_encode_utf8.

I'll take a fresh look over the weekend, but I pushed an initial "work-in-progress" that does pass my simple test:

$ bin/fluent-bit -c test.cfg 
Fluent Bit v1.4.0
...
[0] tail.0: [1574939401.608484519, {"log"=>"A quick brown fox jumps over the lazy dog"}]
[1] tail.0: [1574939401.608490066, {"log"=>"Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu."}]

@bluebike
Copy link
Contributor

(((ping)))

@nigels-com
Copy link
Contributor Author

Yeah, I think this will need an overhaul based on the discussion so far.

@moloch90
Copy link

Hello,
I realize POC for my project, I want to know if your solution works on the td-agent-bit, and how can i installed on it?

Thank!

@edsiper
Copy link
Member

edsiper commented May 5, 2020

@nigels-com @bluebike

what needs to be done to simplify the implementation?

@edsiper edsiper self-assigned this May 5, 2020
@edsiper edsiper added the waiting-for-user Waiting for more information, tests or requested changes label May 5, 2020
@nigels-com nigels-com force-pushed the filter-modify-utf8-encode branch from fb4b9ef to 7a16331 Compare May 12, 2020 11:30
@nigels-com
Copy link
Contributor Author

nigels-com commented May 12, 2020

This branch had gotten stale.
I did manage to rebase it and update for the current master.

$ cat test.utf8
A quick brown fox jumps over the lazy dog
Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu.

$ iconv -o test.iso-8859-2 --from-code=utf8 --to-code=iso-8859-2 test.utf8

$ cat test.cfg 
[INPUT]
    Name        tail
    Path        test.iso-8859-2
    Encoding    iso-8859-2

[OUTPUT]
    Name   stdout
    Match  *

$ bin/fluent-bit -c test.cfg 
Fluent Bit v1.5.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/05/12 21:31:51] [ info] [storage] version=1.0.3, initializing...
[2020/05/12 21:31:51] [ info] [storage] in-memory
[2020/05/12 21:31:51] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/05/12 21:31:51] [ info] [engine] started (pid=21646)
[2020/05/12 21:31:51] [ info] [sp] stream processor started
[0] tail.0: [1589283111.393193879, {"log"=>"A quick brown fox jumps over the lazy dog"}]
[1] tail.0: [1589283111.393202477, {"log"=>"Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu."}]

I'll circle back and reconsider the data-flow considerations, to do the UTF-8 conversion upstream of parsing.

@bluebike
Copy link
Contributor

@nigels-com well... have been waiting some time.
So should we go with (my) #1180 even if it doesn't work with windows?

@nigels-com
Copy link
Contributor Author

So should we go with (my) #1180 even if it doesn't work with windows?

One possible way forward is to rework #1180 to use #1703 rather than iconv.
Any objections or concerns with that @bluebike ?

@bluebike
Copy link
Contributor

One possible way forward is to rework #1180 to use #1703 rather than iconv.

Ok... I can look for that... (sigh)

@edsiper
Copy link
Member

edsiper commented Jun 30, 2020

ping

@nigels-com
Copy link
Contributor Author

Conversation has moved over to #2287 concerning the tutf8e API and if it can be and should be more tolerant about invalid input characters. I would expect to refactor this to be more aligned to #1180 (but perhaps more complete).

@bluebike
Copy link
Contributor

bluebike commented Jul 2, 2020

pong.
Yes I'm trying to make tutf8e more usable (invalid char handlng + non-changing decoding) in #2287,
after that I'll basically try to glue that to #1180, which is basically small job.
(but I had had a little hurry at work... ).

@nigels-com
Copy link
Contributor Author

Seems like PR #2287 didn't make it in. Closing for now.

@nigels-com nigels-com closed this Dec 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting-for-user Waiting for more information, tests or requested changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants