WIP: UTF8 encoding support for in_tail and in_syslog #1668
77ba0f2 to aefdd27
I did another rev on the API and the filter_modify logic. It is a two-pass approach. First determine the length of the required output buffer.
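The two-pass idea described above can be sketched roughly as follows. This is only an illustration, not the tutf8e API or the code in this branch: the function names `latin1_utf8_length` and `latin1_utf8_encode` are hypothetical, and ISO-8859-1 is assumed as the input encoding because its bytes map directly to U+0000..U+00FF and need no lookup table (other iso_8859_* and windows_125* code pages would need a per-encoding table).

```c
#include <assert.h>
#include <stddef.h>

/* Pass 1 (hypothetical helper): count the UTF-8 bytes needed to encode
 * an ISO-8859-1 input. Bytes below 0x80 stay one byte; all others
 * become a two-byte sequence. */
size_t latin1_utf8_length(const unsigned char *in, size_t in_len)
{
    size_t out_len = 0;

    assert(in != NULL || in_len == 0);
    for (size_t i = 0; i < in_len; i++) {
        out_len += (in[i] < 0x80) ? 1 : 2;
    }
    return out_len;
}

/* Pass 2 (hypothetical helper): encode into a caller-supplied buffer.
 * Returns 1 on success and stores the byte count in *written;
 * returns 0 without writing anything if the buffer is too small. */
int latin1_utf8_encode(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap,
                       size_t *written)
{
    size_t need = latin1_utf8_length(in, in_len);
    size_t o = 0;

    if (need > out_cap) {
        return 0;
    }
    for (size_t i = 0; i < in_len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[o++] = c;                                   /* ASCII: copy as-is */
        } else {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte */
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    *written = o;
    return 1;
}
```

A caller would run the first pass, allocate a buffer of exactly that size, and then run the second pass, so the output buffer never has to be grown or over-allocated.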
|
Proof-of-concept: ok. So re-encoding should be done before parsing or in flb_parser(?). |
Oh! Drat! Never mind then... |
It does seem correct that UTF-8 encoding should happen upstream of the parser, in the input implementation. |
I refactored this POC branch to move the UTF8 encoding into cmake -DFLB_ENCODE=No
cmake -DFLB_ENCODE=Yes
with
|
dfa92e2 to 90d35e9
Updated with
Producing output:
|
I think this branch is ready for more serious consideration and seems functionally complete for UTF-8 encoding of iso_8859_* and windows_125* input in in_tail and in_syslog. I'd be happy to get some feedback, especially from real-world testing. |
84fc1be to 63bc0c8
@nigels-com is the tutf8e lib, as a standalone, in good enough shape to be included in master under lib/ ? |
@edsiper I'm happy with the general shape and scope of tutf8e, as it is. I feel like the test coverage could be expanded, and the documentation could use some more fleshing out. Would you like a separate (simple, tidy history) pull request for integrating that? |
@nigels-com a simple PR with that inclusion under lib/ and proper options in the main CMakeLists.txt should be enough :) |
@edsiper Sure thing. Too easy. |
Rebasing onto mainline. |
63bc0c8 to 5cefe93
@edsiper What's the next step here? |
just a minor change request to merge this: would you please adjust the following commit message ? from
to
|
5cefe93 to e97dc53
@edsiper Yes, done. |
Why is decoding done at msgpack generation??? fluent-bit/plugins/in_syslog/syslog_prot.c Lines 100 to 102 in e97dc53
Also: Line 56 in e97dc53
(sorry for the late comment) |
Yes @bluebike, I certainly see your point. It was about a month ago, but we did seem to agree that encoding should be upstream of the parser, and I do recall thinking that Onigmo could/should be stripped down to only UTF-8 support, if that is all we intend to use.
|
@bluebike Indeed the story of I'll take a fresh look over the weekend, but I pushed an initial "work-in-progress" that does pass my simple test:
|
(((ping))) |
Yeah, I think this will need an overhaul based on the discussion so far. |
Hello, Thank! |
what needs to be done to simplify the implementation? |
…ng values Signed-off-by: Nigel Stewart <[email protected]>
…R preprocessor Signed-off-by: Nigel Stewart <[email protected]>
…ne step Signed-off-by: Nigel Stewart <[email protected]>
fb4b9ef to 7a16331
This branch had gotten stale.
I'll circle back and reconsider the data-flow considerations, to do the UTF-8 conversion upstream of parsing. |
@nigels-com well... have been waiting some time. |
ping |
Seems like PR #2287 didn't make it in. Closing for now. |
This is a proof-of-concept integration of tutf8e ("Tiny UTF-8 Encoder for C") into the fluent-bit modify filter. The test here is feeding iso-8859-2 via the tail input plugin and using the modify filter UTF8 operation to do encoding via tutf8e.
@edsiper @bluebike
The more interesting part is in plugins/filter_modify/modify.c
In relation to #1180
https://github.com/nigels-com/tutf8e