Skip to content

Conversation

bluebike
Copy link
Contributor

@bluebike bluebike commented Mar 8, 2019

Current fluent-bit understands basically only UTF-8 encoding.
Some inputs needs other charset encodings like Latin1/CP1252
this PR adds support for iconv to core, in_tail, in_syslog

Features:

  • configuration: FLB_ICONV option to compile iconv support
  • core: flb_iconv lib.
  • in_tail: support for "from_encoding" / "encoding" configuration
  • in_syslogl: support for "from_encoding" / "encoding" configuration

@bluebike bluebike force-pushed the flb_iconv branch 2 times, most recently from b91b3df to a2725d7 Compare May 15, 2019 14:07
@bluebike
Copy link
Contributor Author

Made Iconv detection in CMakeLists.txt to work with working older than 3.11.0 cmake versions.

@bluebike bluebike force-pushed the flb_iconv branch 2 times, most recently from f5de3a6 to b5f5f12 Compare August 12, 2019 08:32
@bluebike bluebike marked this pull request as ready for review September 8, 2019 19:36
@nigels-com
Copy link
Contributor

Looks interesting.

If I understand correctly, the idea is to convert from supported encodings to UTF-8 for transport?

@bluebike
Copy link
Contributor Author

fluent-bit (basically onigmo, json, msgpack) assume everything to be in UTF-8.

I have "legacy" inputs (tail, syslog) that data encoded in Latin1 (iso-8859-1) (+ letters like öäåÖÄÅ, which are much used in finnish/swedish). So their data should converted to UTF-8.
Fluentd has similiar feature. For me this basically showstopper.

Off course reverse feature could be added to some outputs (where non UTF-8) output is used.

@belaytzev
Copy link

I have the same problem and this PR work for me too. Please merge it!

@edsiper
Copy link
Member

edsiper commented Oct 2, 2019

would be using apache iconv version an option ?

https://github.com/apache/apr-iconv

@bluebike would you please validate if the apache version satisfies the requirement ?

@bluebike
Copy link
Contributor Author

bluebike commented Oct 2, 2019

@edsiper ok. I’ll apr-iconv check that tomorrow.

@bluebike
Copy link
Contributor Author

bluebike commented Oct 3, 2019

I had a look on apr-iconv. https://apr.apache.org/

So first I must ask: why?

I assume that question of about Windows-port (!?).

So let's assume that we have apr-iconv. Problems/Questions/Notes/Tasks:

  1. apr-iconv is depended on apr (apr pools are used by apr-iconv)
  2. cmake must be configured to find apr and apr-iconv.
    If apr was used by other modules, detection should be generic (TASK)
  3. api is kind of compatible iconv. Not too much work (TASK)
  4. Are platform-iconv and apr-iconv mutually exclusive?
    Can it be possible to configure both and in runtime choose
    like: from_encoding apr:cp1252 utf8 or from_encoding platform:cp1252 utf8
  5. Configuration variables FLB_ICONV_PLATFORM, FLB_ICONV_APR?
    (FLB_APR for finding apr location?.. if not default)
    Also if just FLB_ICONV defined, then probably platform iconv is tried to find first
    the apr-iconv
  6. I have thought also possibility of iconv-simple (FLB_ICONV_SIMPLE).
    Simple charset conversion library which work only from eight-bit-charsets (latin*)
    to UTF8. If would be configured with simple configuration files... where just unicode
    of 0-255 bytes values are given. Usage would be then something like
    from_encoding simple:/etc/fluent-bit/charsets/KOI-8.txt
    That would not need any external libraries... just some code.

Opinions?

@nigels-com
Copy link
Contributor

I think the apr-iconv Apache license is more amenable to static linking than the LGPL GNU iconv.
Unfortunately the cmake for apr-iconv looks like it's Windows-only.

http://www.apache.org/LICENSE.txt
https://www.gnu.org/software/libiconv/

@bluebike
Copy link
Contributor Author

bluebike commented Oct 9, 2019

This is not originally GNU library and many systems have own "native" non-gnu implementations.
If you believe this: https://en.wikipedia.org/wiki/Iconv

@nigels-com
Copy link
Contributor

Does this concern iso-8859-1 specifically, or encoding conversion generally?

I think we could easily add support for converting 8-bit encodings such as iso-8859-1 with far less development effort than integrating a broader solution such as iconv. Is there a small set of 8-bit encodings that would be broadly useful enough?

https://en.wikipedia.org/wiki/ISO/IEC_8859-1

Also for related discussion: https://stackoverflow.com/questions/4059775/convert-iso-8859-1-strings-to-utf-8-in-c-c

unsigned char *in, *out;
while (*in)
    if (*in<128) *out++=*in++;
    else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;

@nigels-com
Copy link
Contributor

Seems to me we could generate compact tables or logic for 8-bit encodings In Python:

>>> for i in range(0,255): print('%d %s'%(i, bytes([i]).decode('iso-8859-1').encode('utf-8')))
..
97 b'a'
98 b'b'
99 b'c'
100 b'd'
101 b'e'
102 b'f'
103 b'g'
..
218 b'\xc3\x9a'
219 b'\xc3\x9b'
220 b'\xc3\x9c'
221 b'\xc3\x9d'
222 b'\xc3\x9e'
223 b'\xc3\x9f'

@nigels-com
Copy link
Contributor

Aside from the back-end implementation of the conversions, a thought:

Could the conversion be formulated as a filter plugin, leaving the various inputs and outputs agnostic about text encoding?

@nigels-com
Copy link
Contributor

nigels-com commented Oct 12, 2019

Proof-of-concept looks plausible using Python to generate C encoders to UTF-8.

https://en.wikipedia.org/wiki/ISO/IEC_8859-1

Encoding: iso-8859-1
[[0, 127], [129, 255]]
  unsigned char *o = output;
  unsigned char *end = output + size;
  for (const unsigned char *i = input; *i; ++i) {
    if (*i<=127) {
      if (end-o < 1) goto bufferTooSmall;
      *o++ = *i;
      continue;
    }
    if (*i<129) goto undef;
    /* if (*i<=255) */ {
      if (end-o < 2) goto bufferTooSmall;
      *o++ = 0xc0 | (*i>>6);
      *o++ = 0x80 | (*i&0x3f);
      continue;
    }
    goto undef;
  }

https://en.wikipedia.org/wiki/Windows-1252

Encoding: Windows-1252
[[0, 127], [130, 130], [132, 132], [134, 135], [137, 137], [139, 139], [142, 142], [146, 146], [148, 148], [150, 151], [153, 153], [155, 155], [158, 158], [160, 255]]
  unsigned char *o = output;
  unsigned char *end = output + size;
  for (const unsigned char *i = input; *i; ++i) {
    if (*i<=127) {
      if (end-o < 1) goto bufferTooSmall;
      *o++ = *i;
      continue;
    }
    if (*i<130) goto undef;
    if (*i==130) {
      if (end-o < 3) goto bufferTooSmall;
      *o++ = 0xe0 | (0x201a>>12);
      *o++ = 0x80 | ((0x201a>>6)&0x3f);
      *o++ = 0x80 | (0x201a&0x3f);
      continue;
    }
    if (*i<132) goto undef;
    if (*i==132) {
      if (end-o < 3) goto bufferTooSmall;
      *o++ = 0xe0 | (0x201e>>12);
      *o++ = 0x80 | ((0x201e>>6)&0x3f);
      *o++ = 0x80 | (0x201e&0x3f);
      continue;
    }
    if (*i<134) goto undef;
    if (*i<=135) {
      if (end-o < 3) goto bufferTooSmall;
      uint16_t v = 0x2020 + *i - 134u;
      *o++ = 0xe0 | (v>>12);
      *o++ = 0x80 | ((v>>6)&0x3f);
      *o++ = 0x80 | (v&0x3f);
      continue;
    }
    if (*i<137) goto undef;
    if (*i==137) {
      if (end-o < 3) goto bufferTooSmall;
      *o++ = 0xe0 | (0x2030>>12);
      *o++ = 0x80 | ((0x2030>>6)&0x3f);
      *o++ = 0x80 | (0x2030&0x3f);
      continue;
    }
    if (*i<139) goto undef;
    if (*i==139) {
      if (end-o < 3) goto bufferTooSmall;
      *o++ = 0xe0 | (0x2039>>12);
      *o++ = 0x80 | ((0x2039>>6)&0x3f);
      *o++ = 0x80 | (0x2039&0x3f);
      continue;
    }
    if (*i<142) goto undef;
    if (*i==142) {
      if (end-o < 2) goto bufferTooSmall;
      *o++ = 0xc0 | (0x017d>>6);
      *o++ = 0x80 | (0x017d&0x3f);
      continue;
    }
    if (*i<146) goto undef;
    if (*i==146) {
      if (end-o < 3) goto bufferTooSmall;
      *o++ = 0xe0 | (0x2019>>12);
      *o++ = 0x80 | ((0x2019>>6)&0x3f);
      *o++ = 0x80 | (0x2019&0x3f);
      continue;
    }
    if (*i<148) goto undef;
    if (*i==148) {
      if (end-o < 3) goto bufferTooSmall;
      *o++ = 0xe0 | (0x201d>>12);
      *o++ = 0x80 | ((0x201d>>6)&0x3f);
      *o++ = 0x80 | (0x201d&0x3f);
      continue;
    }
    if (*i<150) goto undef;
    if (*i<=151) {
      if (end-o < 3) goto bufferTooSmall;
      uint16_t v = 0x2013 + *i - 150u;
      *o++ = 0xe0 | (v>>12);
      *o++ = 0x80 | ((v>>6)&0x3f);
      *o++ = 0x80 | (v&0x3f);
      continue;
    }
    if (*i<153) goto undef;
    if (*i==153) {
      if (end-o < 3) goto bufferTooSmall;
      *o++ = 0xe0 | (0x2122>>12);
      *o++ = 0x80 | ((0x2122>>6)&0x3f);
      *o++ = 0x80 | (0x2122&0x3f);
      continue;
    }
    if (*i<155) goto undef;
    if (*i==155) {
      if (end-o < 3) goto bufferTooSmall;
      *o++ = 0xe0 | (0x203a>>12);
      *o++ = 0x80 | ((0x203a>>6)&0x3f);
      *o++ = 0x80 | (0x203a&0x3f);
      continue;
    }
    if (*i<158) goto undef;
    if (*i==158) {
      if (end-o < 2) goto bufferTooSmall;
      *o++ = 0xc0 | (0x017e>>6);
      *o++ = 0x80 | (0x017e&0x3f);
      continue;
    }
    if (*i<160) goto undef;
    /* if (*i<=255) */ {
      if (end-o < 2) goto bufferTooSmall;
      *o++ = 0xc0 | (*i>>6);
      *o++ = 0x80 | (*i&0x3f);
      continue;
    }
    goto undef;
  }

@nigels-com
Copy link
Contributor

I made a library from scratch. It seems to work.
Does this seem suitable for the task?

https://github.com/nigels-com/tutf8e

@bluebike
Copy link
Contributor Author

@nigels-com I have tought some even simpler table based conversion for 8-bit charsets.
Right, Latin1 is of course dead simple .

There is a little bit work to get apr-iconv work (cmake + other code path), I'll try that first.

@bluebike
Copy link
Contributor Author

Did some testing/coding/implementation apr-iconv .

  • cmake needed some work (found examples for apr,apr-util but really for apr-iconv)
  • API Documentation is quite poor. Error codes (apr_status_t) in different not very well defined. Handling edge cases (bad characters, not enough room in output buffer, input is partial) is not very clear and needs some source code spelunking.

Basically almost all unix implementations have working iconv, so this could be made Windows only, but I don't have much experience of that.

@nigels-com
Copy link
Contributor

@bluebike Any suggestions for making A tiny UTF-8 encoder for C meet your needs?

https://github.com/nigels-com/tutf8e

@edsiper
Copy link
Member

edsiper commented Oct 17, 2019

that would be great, I am happy to merge that new tiny lib if it does the right job

@bluebike
Copy link
Contributor Author

@nigels-com Your library ( https://github.com/nigels-com/tutf8e ) has nice code generation.
What is missing is like iconv_open to find conversion (table) by symbol.

API

API could be like

// flags  (maybe overkill)
#define TUTF8E_KEEP_BAD           0x01  /* keep bad input characters */
#define TUTF8E_SKIP_BAD           0x02 /* skip bad input characters */
#define TUTF8E_USE_REPLACEMENT    0x04  /* use unicode replacement character for bad chars */

tutf8e_t tutf8e_find(char *encoding, uint32 flags);
int tutf8e_length_utf8(tutf8e_t enc, const char *ibuf, size_t ileft, size_t *length);
int tutf8e_encode_utf8(tutf8e_t enc, const uint16_t *table, const char *ibuf, size_t ilen, char *obuf, size_t olen);

Encoding could be stateless If we assume that we handling only 8-bit encodings.

For handling "bad characters" there could be way like in
( https://www.gnu.org/savannah-checkouts/gnu/libiconv/documentation/libiconv-1.15/iconv_open.3.html )
//TRANSLIT or //IGNORE to control output.

Simple Optimization

Also we could have simple optimization: if input and output are same tutf8e_length_utf8(…) could return special value ENO.... (?).
We could avoid output memory allocation.
In typical log-files I have been handling about 99% strings would be just ascii (ord < 128),
where most latin1/cp1252 == utf8.

Fate of iconv

In aboult all unix/linux systems (event with musl libc) iconv is available by platform.
So I cannot see that problem there.

Maybe we could have more like. Option FLB_ENCODING (default: Yes) which would
contain this tutf8e library by default (with well known latin-encodings?).
FLB_ENCODING_ICONV (default: No) would enable iconv if platform has it.

from_encoding ISO-8859-1 configuration would first look tutf8e encoding,
if no match and and iconv is enabled iconv is tried.
With prefix iconv:ISO-8859-1 you could force to use iconv.

@edsiper
Copy link
Member

edsiper commented Oct 18, 2019

If the tiny lib provides support for what is needed, I would prefer not to offer and maintain iconv support.

@nigels-com
Copy link
Contributor

@bluebike Thanks for your thoughts on this. I was indeed looking at perhaps aligning tutf8e with the iconv API and for the most part I'm feeling shy of doing that. For one it results in a multitude of scenarios to test, few of which seem relevant to fluent-bit. Secondly it positions tutf8e as an alternative iconv implementation, which seems overblown considering the requirements for fluent-bit. Thirdly it increases the complexity of the interface making it more work to develop and more work to use.

I don't speak for @edsiper but there are a few things about iconv don't align well with fluent-bit, in my opinion. One goal of fluent-bit is to be small, in contrast to iconv trying to be comprehensive. My opinion is that fluent-bit ought to support utf-8 from various sources of information, but conversion to other encodings is not needed. Another consideration is that fluent-bit ought to provide the same output for the same input, regardless of the platform or iconv version or availability. To have good test coverage it can be helpful to design for testability - narrow scope, comprehensive test coverage. For maintainers there is a consideration of minimising the element of surprise and consequential bug reports and consequential pull requests. And there is also a consideration of license entanglement - a statically linked binary is very nice for dropping into any-old-container, VM or bare metal and expecting it to "just work" every time.

As for the "simple optimisation" where the input string is already utf8 and there is no need for an allocation and copy. I see the appeal of that. At the code level this seems like a trade-off between scanning once (assuming the output buffer is sufficiently large) and scanning twice (an initial scan to determine the output length and/or if a conversion is needed. For short strings perhaps the one-pass approach is better (rarely need to retry with larger output buffer) and long strings perhaps the two-pass approach is better (bounded time/space since we never need to retry, allocate only exactly what is needed). My assumption for fluent-bit is that we're dealing with lots of small strings and these will have to be copied into messagepack datastructures, so we can afford a generous "scratch space" for a single-pass approach. But the downside is this additional copy from input to output that could be avoided in the two-pass approach.

Certainly tutf8e can support a "malloc if needed" two-pass API and there is some codegen already that I was exploring this: codegen.py

If there is some discussion or clarity about these trade-offs from a real-world point of view, that would be helpful. I lean towards minimising mallocs and assume that there is already a copy into messagepack datastructure. Cost of utf-8 conversion is one extra scan of the string and one extra copy of the string, but at least that is O(n), bounded and adds no malloc/free traffic.

(Is there a way to pass ownership of the output string to messagepack? In that case I'd lean the other way and also do the two-pass approach to minimise malloc/free/copy).

Error handling was also something I was thinking about. If it turns out that the input is not valid, what ought we do about that? Drop the whole string? Skip the bad character(s)? Use a specified substitution character? tutf8e could probably generate a rich API that provides all of these options, but if there is a strong precedent of how fluentd deals with that, perhaps we only need to implement (and test) that particular path.

To conclude I'd like to know a little more precisely what fluent-bit needs for utf-8 encoding, rather than adventures in matching iconv feature-for-feature and bug-for-bug.

@nigels-com
Copy link
Contributor

I made some revisions to tutf8e:

There are now three flavours of encoding to UTF8.
tutf8e_string_ for NUL-terminated and tutf8e_buffer_ for known-length buffers.

/* Encode NUL-terminated string to UTF8 */
extern int tutf8e_string_encode_iso_8859_1  (char *output, size_t olen, const char *input);

/* Encode NUL-terminated string to UTF8, realloc as necessary */
extern char * tutf8e_string_encode_iso_8859_1_realloc(char *input);

/* Encode buffer to UTF8 */
extern int tutf8e_buffer_encode_iso_8859_1  (char *output, size_t *olen, const char *input, size_t ilen);

In reference to the message pack API, it looks like the buffer flavour fits best:

  static int msgpack_pack_str(msgpack_packer* pk, size_t l);
  static int msgpack_pack_str_body(msgpack_packer* pk, const void* b, size_t l);

@bluebike
Copy link
Contributor Author

@edsiper @nigels-com If (and when) this done without iconv, it's better to rename this to someghing else: flb_encoding?
Better make new pull request and close this one (so big change).

@nigels-com
Copy link
Contributor

@bluebike Ah, I realise now the purpose of tutf8e_find that you suggested in the API. I'll tackle that in my next iteration.

@nigels-com
Copy link
Contributor

@bluebike If you'd consider giving #1668 some real-world testing, I'd appreciate that. As good as iconv, or a full refund. (fingers crossed!)

@edsiper
Copy link
Member

edsiper commented Oct 28, 2019

@bluebike flb_encodimg should be fine.

should I merge @nigels-com library in a specific branch so you can adapt the interface?

@bluebike
Copy link
Contributor Author

I haven't have time to check @nigels-com PR/library #1668 .
I'll look that in next few days...
Sorry about delay of answer.

Make possible to have other encodings that UTF-8 in fluent-bit,
by creating object flb_iconv-library.

Enabled by  -D FLB_ICONV=Yes (default No)

Requires libiconv-library (can be embedded to libc)

Signed-off-by: Jukka Pihl <[email protected]>
With "from_encoding" parameter you can define
input charset for in_tail plugin.

Requires FLB_ICONV (flb_iconv)

Signed-off-by: Jukka Pihl <[email protected]>
Enable charset deocding with "from_encoding"/"encoding"
with syslog module.  Useful syslog is not using UTF-8
(example: latin1 / CP1252)

Requires: FLB_ICONV

Signed-off-by: Jukka Pihl <[email protected]>
@edsiper
Copy link
Member

edsiper commented May 5, 2020

what's the status of this PR?

@edsiper
Copy link
Member

edsiper commented May 5, 2020

since we have the tiny lib encoder in place, can we get rid of iconv ?

cc: @bluebike @nigels-com

@edsiper edsiper added the waiting-for-user Waiting for more information, tests or requested changes label May 5, 2020
@edsiper
Copy link
Member

edsiper commented May 11, 2020

ping

@nigels-com
Copy link
Contributor

PR #1668 had fallen off my radar, my apologies for that.

@bluebike
Copy link
Contributor Author

bluebike commented May 15, 2020

pong!

Well..
One way is that I'll make generic flb_encoding module...
which could have many implementations (or none) and originally just flb_encoding_iconv.
That would just basically give interface for inputs (and outputs, filters) to interface encoding.

That would not be that big change to this.
Important is that to have just standard interface... and configuration.
Standard methods would like.

#define FLB_ENCODING_ACCEPT_NOT_CHANGED  0x01

#define FLB_ENCODING_SUCCESS   0
#define FLB_ENCODING_NO_CHANGED 1

struct  flb_encoding  flb_encoding_open(const char *from_encoding, const char *to_encoding);
struct  flb_encoding  flb_encoding_open_to_utf8(const char *from_encoding);
struct  flb_encoding  flb_encoding_open_from_utf8(const char *to_encoding);
int flb_encoding_execute(struct flb_encoding *encoding,
    const char *input, size_t inputlen,
    char **output, size_t  *outputlen,
    int flags);
void flb_encoding_close(struct flb_encoding *encoding);

In configuration.

FLB_ENCODING          have encoding main module
FLB_ENCODING_ICONV    have iconv 

of course modules could be plugins.. but maybe ovekill.

Later add other implementations like @nigels-com #1668 ???

@nigels-com
Copy link
Contributor

@bluebike Is iconv non-negotiable from your point of view?

Could you expand on the reasoning for that?

@bluebike
Copy link
Contributor Author

bluebike commented May 15, 2020

@nigels-com iconv not "mandatory" module itself, but need way to decode non UTF-8 charsets. I have been waiting your PR rather long time.

In unix world iconv(3) is just the standard/easy way to do encoding/decoding charsets.
8-bit charsets (Latin-X) are easy to do convert from, but other encodings could be error prone. In I think that iconv should do decent job there.

@nigels-com
Copy link
Contributor

@bluebike Which other encodings do you think are important for fluent bit to support?

@bluebike
Copy link
Contributor Author

bluebike commented May 15, 2020

@nigels-com I need just Latin-1 and cp-1252 (windows latin-1, used my mysql).
Other people can have other needs, maybe even non-8-bit encodings (CJK stuff?)
Important is that we get something useful working that is not hindrance in future.
I'm a little bit confused: what should I explain here?

@bluebike
Copy link
Contributor Author

@nigels-com @edsiper Ok ...
I'll look possibility of using #1703 as encoding (or decoding) library.

@bluebike
Copy link
Contributor Author

bluebike commented May 25, 2020

@nigels-com

Related to #1703 (tutf8e encoding library).
I think it needs (in my opinion) few updates to it at first.

  • Optional ways to handle invalid characters. Currently it just fails.
    Most cases other strategy is is better, strategies and length-functions should also calculate size of output based on this strategy:
    • KEEP: just utf8 encode original byte (maybe default)
    • ABORT: well... we just abort... no encoding (ignore whole message)
    • IGNORE: ignore/skip bad charateres.
    • REPLACEMENT: convert bad chars to 0xFFFD (unicode replacament character)
    • QUESTION: convert bad chars to '?'
  • Shortcut for unchanged source string. In our case (99%) and many other most messages would contain only valid ASCII, which converts directly to utf8. So having tutf8e_string_lengthcould return success results like TUTF8E_SAME=0, TUTF8E_CHANGED=1)..
  • I'll probably do new PR for flb_encoding

yes... I can make those changes myself.

@bluebike
Copy link
Contributor Author

PR #2420 replaces this.
Let's look iconv support as extension to that if needed.

@bluebike bluebike closed this Aug 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting-for-user Waiting for more information, tests or requested changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants