Skip to content

Conversation

bluebike
Copy link
Contributor

WIP: in process of enable encoding first few changes to tutf8e encoding library.


Summary

  • separate encoding* functions removed.
  • length_functions return TUTF8E_SAME if no changes needed in string
  • multiple modes modes for handling invalid chars:
    • KEEP: (default) just encode byte to utf8
    • IGNORE: skip character
    • FAIL: fail if invalid chars
    • REPLACEMENT: replace invalid character with unicode 0xFFFD
    • QUESTION: relace invalid characters with '?' char.
  • tests functions made for invalid char handling.

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

  • Documentation required for this feature

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@bluebike
Copy link
Contributor Author

@nigels-com would like to see some comments of my changes...
Mainly I tried to make this handle invalid chars + get info is string doesn't need changes (optimization).

@bluebike bluebike changed the title lib/tutf8e: refactoring for flb_encoding lib/tutf8e: refactoring for flb_encoding (WIP) Jun 23, 2020
@bluebike bluebike force-pushed the flb_encoding_tutf8e branch from 6a84fd0 to d40af29 Compare June 23, 2020 14:51
@nigels-com
Copy link
Contributor

I'll take a closer look.

Seems like a significant rework, some further discussion of the goals and rationale would be helpful, I think.

@nigels-com
Copy link
Contributor

My first question from a fluent-bit perspective is if there is a realistic need for all the modes?
The expectation is to configure this per input? Which mode is the intended default?
From an embedded perspective, I'd like to only link the code/mode that is needed for a particular application.

@bluebike
Copy link
Contributor Author

bluebike commented Jun 24, 2020

@nigels-com Probably overkill having different modes.
But different users of fluent-bit could have different requirements.
Having different modes doesn't really add that much code I think, mostly test code.

Anyway. Just failing with bad chars is not enough (for me).
Also way to see that is encoding really needed is important for me, because in our case probably 99.9% of all messages are just plain ASCII and don't really need encoding/decoding.

@nigels-com
Copy link
Contributor

@bluebike You're saying that the advantage of REPLACEMENT or QUESTION modes are that they'll flag message streams that are not actually encoded as expected, but at least the ASCII parts will succeed. Do you feel that KEEP and IGNORE are also important modes?

@nigels-com
Copy link
Contributor

I'm not convinced the TUTF8E_SAME mode is worth the additional complexity. Have you been able to measure any actual throughput, memory or CPU improvements?

@nigels-com
Copy link
Contributor

As an aside, I filed some issues over on the tutf8e repo, as reminders to myself.

nigels-com/tutf8e#8
nigels-com/tutf8e#6

* separate _encoding_* functions removed.
* length_functions return TUTF8E_SAME if no changes needed in string
* multiple modes modes for handlig invalid chars: KEEP(default),IGNORE,FAIL,REPLACEMENT,QUESTION

Signed-off-by: Jukka Pihl <[email protected]>
@bluebike bluebike force-pushed the flb_encoding_tutf8e branch from d40af29 to 127111f Compare June 24, 2020 17:54
@bluebike
Copy link
Contributor Author

Ok.. .did some changes...

  • *olen used as giving output buffer length + returning length of output utf8 string.
  • some comments fixed.
  • removed extra strlen
  • added test for too short output buffer returning TUTF8E_TOOLONG

@nigels-com
Copy link
Contributor

@bluebike Please take a look at PR #2326 to see if this arrangement strikes a good balance of features and "tinyness".

Rather than the various iconv modes, it adds support for a UTF-8 proxy for each invalid input character.
If the proxy is NULL, the API will continue erroring for invalid input.

@bluebike
Copy link
Contributor Author

bluebike commented Jul 4, 2020

(this goes nowhere)

@bluebike bluebike closed this Jul 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants