flb_iconv: charset decoding/encoding #1180
Made Iconv detection in CMakeLists.txt work with CMake versions older than 3.11.0. |
Looks interesting. If I understand correctly, the idea is to convert from supported encodings to UTF-8 for transport? |
fluent-bit (basically onigmo, json, msgpack) assumes everything is in UTF-8. I have "legacy" inputs (tail, syslog) whose data is encoded in Latin1 (iso-8859-1) (+ letters like öäåÖÄÅ, which are much used in Finnish/Swedish). So their data should be converted to UTF-8. Of course, the reverse feature could be added to some outputs (where non-UTF-8 output is used). |
I have the same problem and this PR works for me too. Please merge it! |
Would using the Apache iconv version be an option? https://github.com/apache/apr-iconv @bluebike would you please validate whether the Apache version satisfies the requirement? |
@edsiper ok. I'll check apr-iconv tomorrow. |
I had a look at apr-iconv. https://apr.apache.org/ So first I must ask: why? I assume the question is about the Windows port (!?). So let's assume that we have apr-iconv. Problems/Questions/Notes/Tasks:
Opinions? |
I think the http://www.apache.org/LICENSE.txt |
This is not originally a GNU library, and many systems have their own "native" non-GNU implementations. |
Does this concern iso-8859-1 specifically, or encoding conversion generally? I think we could easily add support for converting 8-bit encodings such as iso-8859-1 with far less development effort than integrating a broader solution such as https://en.wikipedia.org/wiki/ISO/IEC_8859-1 Also for related discussion: https://stackoverflow.com/questions/4059775/convert-iso-8859-1-strings-to-utf-8-in-c-c
|
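The point about 8-bit encodings being cheap to convert can be illustrated for ISO-8859-1 specifically, where every byte value equals its Unicode code point. The following is a minimal sketch only, not code from this PR or from any library; the function name `latin1_to_utf8` and its return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of fluent-bit or any library).
 * Converts ISO-8859-1 bytes to UTF-8 in a single pass.
 * Returns the number of bytes written, or (size_t) -1 if dst is too small.
 */
size_t latin1_to_utf8(const char *src, size_t src_len,
                      char *dst, size_t dst_cap)
{
    size_t o = 0;

    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < src_len; i++) {
        uint8_t c = (uint8_t) src[i];

        if (c < 0x80) {
            /* ASCII is identical in Latin-1 and UTF-8 */
            if (o + 1 > dst_cap) {
                return (size_t) -1;
            }
            dst[o++] = (char) c;
        }
        else {
            /* U+0080..U+00FF always encode as two UTF-8 bytes */
            if (o + 2 > dst_cap) {
                return (size_t) -1;
            }
            dst[o++] = (char) (0xC0 | (c >> 6));
            dst[o++] = (char) (0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

A caller that sizes `dst` at twice the input length never hits the error path, since no Latin-1 byte expands to more than two UTF-8 bytes.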
Seems to me we could generate compact tables or logic for 8-bit encodings In Python:
|
Aside from the back-end implementation of the conversions, a thought: Could the conversion be formulated as a filter plugin, leaving the various inputs and outputs agnostic about text encoding? |
Proof-of-concept looks plausible using Python to generate C encoders to UTF-8. https://en.wikipedia.org/wiki/ISO/IEC_8859-1
https://en.wikipedia.org/wiki/Windows-1252
|
I made a library from scratch. It seems to work. |
@nigels-com I have thought about an even simpler table-based conversion for 8-bit charsets. It takes a little work to get apr-iconv working (cmake + other code path); I'll try that first. |
Did some testing/coding/implementation with apr-iconv.
Basically almost all Unix implementations have a working iconv, so this could be made Windows-only, but I don't have much experience with that. |
@bluebike Any suggestions for making A tiny UTF-8 encoder for C meet your needs? |
that would be great, I am happy to merge that new tiny lib if it does the right job |
@nigels-com Your library ( https://github.com/nigels-com/tutf8e ) has nice code generation.
API
The API could be like:
// flags (maybe overkill)
#define TUTF8E_KEEP_BAD 0x01 /* keep bad input characters */
#define TUTF8E_SKIP_BAD 0x02 /* skip bad input characters */
#define TUTF8E_USE_REPLACEMENT 0x04 /* use unicode replacement character for bad chars */
tutf8e_t tutf8e_find(char *encoding, uint32 flags);
int tutf8e_length_utf8(tutf8e_t enc, const char *ibuf, size_t ileft, size_t *length);
int tutf8e_encode_utf8(tutf8e_t enc, const uint16_t *table, const char *ibuf, size_t ilen, char *obuf, size_t olen);
Encoding could be stateless if we assume that we are handling only 8-bit encodings. For handling "bad characters" there could be way like in
Simple Optimization
Also we could have a simple optimization: if input and output are same
Fate of iconv
In about all unix/linux systems (even with musl libc) iconv is available by platform. Maybe we could have more like. Option
|
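To make the "bad character" handling discussed above concrete, here is a rough table-driven sketch for Windows-1252, which leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined. This is not the tutf8e API; `cp1252_to_utf8`, `enum bad_policy`, and the table name are hypothetical names chosen for this illustration, and only the "skip" and "replacement character" policies are shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy for undefined input bytes (illustration only) */
enum bad_policy { BAD_SKIP, BAD_REPLACE };

/* Windows-1252 code points for bytes 0x80..0x9F; 0 marks an undefined byte */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Encode one BMP code point as UTF-8; returns the byte count (1..3) */
static size_t put_utf8(uint16_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char) (0xC0 | (cp >> 6));
        out[1] = (char) (0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char) (0xE0 | (cp >> 12));
    out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char) (0x80 | (cp & 0x3F));
    return 3;
}

/* Returns bytes written, or (size_t) -1 if dst is too small */
size_t cp1252_to_utf8(const char *src, size_t src_len,
                      char *dst, size_t dst_cap, enum bad_policy policy)
{
    size_t o = 0;

    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < src_len; i++) {
        uint8_t c = (uint8_t) src[i];
        uint16_t cp = c;   /* outside 0x80..0x9F the byte is the code point */
        char tmp[3];
        size_t n;

        if (c >= 0x80 && c <= 0x9F) {
            cp = cp1252_high[c - 0x80];
            if (cp == 0) {
                if (policy == BAD_SKIP) {
                    continue;          /* drop the undefined byte */
                }
                cp = 0xFFFD;           /* Unicode replacement character */
            }
        }

        n = put_utf8(cp, tmp);
        if (o + n > dst_cap) {
            return (size_t) -1;
        }
        for (size_t k = 0; k < n; k++) {
            dst[o++] = tmp[k];
        }
    }
    return o;
}
```

The conversion keeps no state between calls, which matches the observation that 8-bit encodings can be handled statelessly. A "keep bad characters" policy is omitted here because passing an undefined byte through unchanged would produce invalid UTF-8.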
If the tiny lib provides support for what is needed, I would prefer not to offer and maintain iconv support. |
@bluebike Thanks for your thoughts on this. I was indeed looking at perhaps aligning

I don't speak for @edsiper but there are a few things about iconv that don't align well with fluent-bit, in my opinion. One goal of fluent-bit is to be small, in contrast to iconv trying to be comprehensive. My opinion is that fluent-bit ought to support utf-8 from various sources of information, but conversion to other encodings is not needed. Another consideration is that fluent-bit ought to provide the same output for the same input, regardless of the platform or iconv version or availability. To have good test coverage it can be helpful to design for testability - narrow scope, comprehensive test coverage. For maintainers there is a consideration of minimising the element of surprise and consequential bug reports and consequential pull requests. And there is also a consideration of license entanglement - a statically linked binary is very nice for dropping into any-old-container, VM or bare metal and expecting it to "just work" every time.

As for the "simple optimisation" where the input string is already utf8 and there is no need for an allocation and copy: I see the appeal of that. At the code level this seems like a trade-off between scanning once (assuming the output buffer is sufficiently large) and scanning twice (an initial scan to determine the output length and/or if a conversion is needed). For short strings perhaps the one-pass approach is better (rarely need to retry with a larger output buffer) and for long strings perhaps the two-pass approach is better (bounded time/space since we never need to retry, allocate only exactly what is needed). My assumption for fluent-bit is that we're dealing with lots of small strings and these will have to be copied into messagepack datastructures, so we can afford a generous "scratch space" for a single-pass approach. But the downside is this additional copy from input to output that could be avoided in the two-pass approach.

Certainly

If there is some discussion or clarity about these trade-offs from a real-world point of view, that would be helpful. I lean towards minimising mallocs and assume that there is already a copy into the messagepack datastructure. The cost of utf-8 conversion is one extra scan of the string and one extra copy of the string, but at least that is O(n), bounded and adds no malloc/free traffic. (Is there a way to pass ownership of the output string to messagepack? In that case I'd lean the other way and also do the two-pass approach to minimise malloc/free/copy).

Error handling was also something I was thinking about. If it turns out that the input is not valid, what ought we do about that? Drop the whole string? Skip the bad character(s)? Use a specified substitution character?

To conclude, I'd like to know a little more precisely what fluent-bit needs for utf-8 encoding, rather than adventures in matching iconv feature-for-feature and bug-for-bug. |
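The two-pass variant described in this comment might look roughly like the sketch below, shown for ISO-8859-1 input only. The first pass computes the exact UTF-8 length, which also reveals whether the input is pure ASCII and needs no conversion; the second pass writes into an exactly sized allocation. The names `latin1_utf8_length` and `latin1_convert_two_pass` and the 0/1/-1 return convention are assumptions for illustration, not code from tutf8e or fluent-bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Pass 1: exact UTF-8 length of a Latin-1 buffer (hypothetical helper) */
size_t latin1_utf8_length(const char *src, size_t src_len)
{
    size_t need = src_len;

    for (size_t i = 0; i < src_len; i++) {
        if ((uint8_t) src[i] >= 0x80) {
            need++;                    /* non-ASCII bytes take two UTF-8 bytes */
        }
    }
    return need;
}

/*
 * Returns 0 if the input is already valid UTF-8 (pure ASCII; *out is NULL
 * and the caller can keep using src), 1 if *out holds a newly allocated
 * converted buffer of *out_len bytes, or -1 on allocation failure.
 */
int latin1_convert_two_pass(const char *src, size_t src_len,
                            char **out, size_t *out_len)
{
    size_t need;
    size_t o = 0;
    char *buf;

    assert(src != NULL && out != NULL && out_len != NULL);

    need = latin1_utf8_length(src, src_len);
    *out = NULL;
    *out_len = src_len;

    if (need == src_len) {
        return 0;                      /* nothing to convert, no malloc, no copy */
    }

    buf = malloc(need);                /* exact size, so no retry is needed */
    if (buf == NULL) {
        return -1;
    }

    /* Pass 2: encode into the exactly sized buffer */
    for (size_t i = 0; i < src_len; i++) {
        uint8_t c = (uint8_t) src[i];

        if (c < 0x80) {
            buf[o++] = (char) c;
        }
        else {
            buf[o++] = (char) (0xC0 | (c >> 6));
            buf[o++] = (char) (0x80 | (c & 0x3F));
        }
    }

    *out = buf;
    *out_len = need;
    return 1;
}
```

The extra scan costs O(n) per string, but plain ASCII input incurs no allocation at all. A single-pass design would instead write into a scratch buffer sized for the worst case and skip the first loop.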
I made some revisions to tutf8e: There are now three flavours of encoding to UTF8.
In reference to the message pack API, it looks like the buffer flavour fits best:
|
@edsiper @nigels-com If (and when) this is done without iconv, it's better to rename this to something else: flb_encoding? |
@bluebike Ah, I realise now the purpose of |
@bluebike flb_encoding should be fine. Should I merge @nigels-com's library in a specific branch so you can adapt the interface? |
I haven't had time to check @nigels-com's PR/library #1668. |
Make it possible to have encodings other than UTF-8 in fluent-bit by adding the flb_iconv library. Enabled by -D FLB_ICONV=Yes (default No). Requires libiconv (can be embedded in libc). Signed-off-by: Jukka Pihl <[email protected]>
With the "from_encoding" parameter you can define the input charset for the in_tail plugin. Requires FLB_ICONV (flb_iconv). Signed-off-by: Jukka Pihl <[email protected]>
Enable charset decoding with "from_encoding"/"encoding" in the syslog module. Useful when syslog is not using UTF-8 (example: latin1 / CP1252). Requires: FLB_ICONV. Signed-off-by: Jukka Pihl <[email protected]>
what's the status of this PR? |
since we have the tiny lib encoder in place, can we get rid of iconv ? cc: @bluebike @nigels-com |
ping |
PR #1668 had fallen off my radar, my apologies for that. |
pong! Well... That would not be that big a change to this.
#define FLB_ENCODING_ACCEPT_NOT_CHANGED 0x01
#define FLB_ENCODING_SUCCESS 0
#define FLB_ENCODING_NO_CHANGED 1
struct flb_encoding flb_encoding_open(const char *from_encoding, const char *to_encoding);
struct flb_encoding flb_encoding_open_to_utf8(const char *from_encoding);
struct flb_encoding flb_encoding_open_from_utf8(const char *to_encoding);
int flb_encoding_execute(struct flb_encoding *encoding,
                         const char *input, size_t inputlen,
                         char **output, size_t *outputlen,
                         int flags);
void flb_encoding_close(struct flb_encoding *encoding);
In configuration.
Of course modules could be plugins... but maybe overkill. Later add other implementations like @nigels-com #1668 ??? |
@bluebike Is iconv non-negotiable from your point of view? Could you expand on the reasoning for that? |
@nigels-com iconv is not a "mandatory" module in itself, but I need a way to decode non-UTF-8 charsets. I have been waiting for your PR for a rather long time. In the Unix world, iconv(3) is just the standard/easy way to encode/decode charsets. |
@bluebike Which other encodings do you think are important for fluent bit to support? |
@nigels-com I need just Latin-1 and cp-1252 (Windows Latin-1, used by MySQL). |
@nigels-com @edsiper Ok ... |
Related to #1703 (tutf8e encoding library).
yes... I can make those changes myself. |
PR #2420 replaces this. |
Current fluent-bit basically understands only UTF-8 encoding.
Some inputs need other charset encodings like Latin1/CP1252.
This PR adds iconv support to core, in_tail, and in_syslog.
Features: