lib/tutf8e: refactoring for flb_encoding (WIP) #2287

bluebike · 2020-06-23T14:43:17Z

WIP: in process of enable encoding first few changes to tutf8e encoding library.

Summary

separate encoding* functions removed.
length_functions return TUTF8E_SAME if no changes needed in string
multiple modes modes for handling invalid chars:
- KEEP: (default) just encode byte to utf8
- IGNORE: skip character
- FAIL: fail if invalid chars
- REPLACEMENT: replace invalid character with unicode 0xFFFD
- QUESTION: relace invalid characters with '?' char.
tests functions made for invalid char handling.

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

Example configuration file for the change
Debug log output from testing the change

Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

Documentation required for this feature

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

bluebike · 2020-06-23T14:45:01Z

@nigels-com would like to see some comments of my changes...
Mainly I tried to make this handle invalid chars + get info is string doesn't need changes (optimization).

nigels-com · 2020-06-24T02:29:42Z

I'll take a closer look.

Seems like a significant rework, some further discussion of the goals and rationale would be helpful, I think.

nigels-com · 2020-06-24T02:32:02Z

My first question from a fluent-bit perspective is if there is a realistic need for all the modes?
The expectation is to configure this per input? Which mode is the intended default?
From an embedded perspective, I'd like to only link the code/mode that is needed for a particular application.

bluebike · 2020-06-24T09:18:58Z

@nigels-com Probably overkill having different modes.
But different users of fluent-bit could have different requirements.
Having different modes doesn't really add that much code I think, mostly test code.

Anyway. Just failing with bad chars is not enough (for me).
Also way to see that is encoding really needed is important for me, because in our case probably 99.9% of all messages are just plain ASCII and don't really need encoding/decoding.

nigels-com · 2020-06-24T10:48:14Z

@bluebike You're saying that the advantage of REPLACEMENT or QUESTION modes are that they'll flag message streams that are not actually encoded as expected, but at least the ASCII parts will succeed. Do you feel that KEEP and IGNORE are also important modes?

nigels-com · 2020-06-24T11:18:46Z

I'm not convinced the TUTF8E_SAME mode is worth the additional complexity. Have you been able to measure any actual throughput, memory or CPU improvements?

nigels-com · 2020-06-24T11:32:07Z

As an aside, I filed some issues over on the tutf8e repo, as reminders to myself.

nigels-com/tutf8e#8
nigels-com/tutf8e#6

lib/tutf8e/src/tutf8e.c

* separate _encoding_* functions removed. * length_functions return TUTF8E_SAME if no changes needed in string * multiple modes modes for handlig invalid chars: KEEP(default),IGNORE,FAIL,REPLACEMENT,QUESTION Signed-off-by: Jukka Pihl <[email protected]>

bluebike · 2020-06-24T17:57:29Z

Ok.. .did some changes...

*olen used as giving output buffer length + returning length of output utf8 string.
some comments fixed.
removed extra strlen
added test for too short output buffer returning TUTF8E_TOOLONG

nigels-com · 2020-07-03T22:47:58Z

@bluebike Please take a look at PR #2326 to see if this arrangement strikes a good balance of features and "tinyness".

Rather than the various iconv modes, it adds support for a UTF-8 proxy for each invalid input character.
If the proxy is NULL, the API will continue erroring for invalid input.

bluebike · 2020-07-04T22:58:43Z

(this goes nowhere)

bluebike requested review from edsiper, fujimotos and koleini as code owners June 23, 2020 14:43

bluebike changed the title ~~lib/tutf8e: refactoring for flb_encoding~~ lib/tutf8e: refactoring for flb_encoding (WIP) Jun 23, 2020

bluebike force-pushed the flb_encoding_tutf8e branch from 6a84fd0 to d40af29 Compare June 23, 2020 14:51

nigels-com mentioned this pull request Jun 24, 2020

Support replacement string rather than erroring for invalid input? nigels-com/tutf8e#8

Closed

nigels-com reviewed Jun 24, 2020

View reviewed changes

lib/tutf8e/src/tutf8e.c Show resolved Hide resolved

nigels-com reviewed Jun 24, 2020

View reviewed changes

lib/tutf8e/src/tutf8e.c Outdated Show resolved Hide resolved

nigels-com reviewed Jun 24, 2020

View reviewed changes

lib/tutf8e/src/tutf8e.c Outdated Show resolved Hide resolved

bluebike force-pushed the flb_encoding_tutf8e branch from d40af29 to 127111f Compare June 24, 2020 17:54

This was referenced Jul 1, 2020

WIP: UTF8 encoding support for in_tail and in_syslog #1668

Closed

lib: tutf8e: Updated API to support conversion of invalid input input #2326

Merged

bluebike closed this Jul 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

lib/tutf8e: refactoring for flb_encoding (WIP) #2287

lib/tutf8e: refactoring for flb_encoding (WIP) #2287

Uh oh!

bluebike commented Jun 23, 2020

Uh oh!

bluebike commented Jun 23, 2020

Uh oh!

nigels-com commented Jun 24, 2020

Uh oh!

nigels-com commented Jun 24, 2020

Uh oh!

bluebike commented Jun 24, 2020 •

edited

Loading

Uh oh!

nigels-com commented Jun 24, 2020

Uh oh!

nigels-com commented Jun 24, 2020

Uh oh!

nigels-com commented Jun 24, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bluebike commented Jun 24, 2020

Uh oh!

nigels-com commented Jul 3, 2020

Uh oh!

bluebike commented Jul 4, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lib/tutf8e: refactoring for flb_encoding (WIP) #2287

lib/tutf8e: refactoring for flb_encoding (WIP) #2287

Uh oh!

Conversation

bluebike commented Jun 23, 2020

Uh oh!

bluebike commented Jun 23, 2020

Uh oh!

nigels-com commented Jun 24, 2020

Uh oh!

nigels-com commented Jun 24, 2020

Uh oh!

bluebike commented Jun 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nigels-com commented Jun 24, 2020

Uh oh!

nigels-com commented Jun 24, 2020

Uh oh!

nigels-com commented Jun 24, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bluebike commented Jun 24, 2020

Uh oh!

nigels-com commented Jul 3, 2020

Uh oh!

bluebike commented Jul 4, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bluebike commented Jun 24, 2020 •

edited

Loading