unescape: Decode \u escaped characters for surrogate pairs correctly #9799

Merged
edsiper merged 5 commits into master from cosmo0920-decode-slash-u-escaped-characters
Mar 29, 2025

Conversation

@cosmo0920
Contributor

@cosmo0920 cosmo0920 commented Jan 6, 2025

Currently, we ignore surrogate pairs when decoding \u escapes in Unicode representations.
To handle them, we need to decode the high and low surrogates together as a pair.
Note that this representation is also encoded as \uXXXX when creating JSON.
The unescaping takes effect when creating msgpack.

Closes #9712.
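
For illustration, the pair handling described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `combine_surrogates` and `encode_utf8` are hypothetical helper names, and the sketch assumes the two \uXXXX values have already been parsed into integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: merge a UTF-16 surrogate pair into one code point. */
static uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);   /* high (lead) surrogate */
    assert(lo >= 0xDC00 && lo <= 0xDFFF);   /* low (trail) surrogate */
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

/* Hypothetical helper: write cp as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written. Assumes cp <= 0x10FFFF. */
static int encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    /* Supplementary planes: the case surrogate pairs exist for. */
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}
```

With the test input used below, `\ud83e\udd17` combines to U+1F917, which encodes as the UTF-8 bytes F0 9F A4 97.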


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
$ bin/fluent-bit -i stdin -o stdout

and send {"text": "\ud83e\udd17"} in the same terminal.

  • Debug log output from testing the change
Fluent Bit v4.0.0
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___  _____ 
|  ___| |                | |   | ___ (_) |           /   ||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| || |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| ||  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/ 


[2025/01/06 18:17:45] [ info] Configuration:
[2025/01/06 18:17:45] [ info]  flush time     | 1.000000 seconds
[2025/01/06 18:17:45] [ info]  grace          | 5 seconds
[2025/01/06 18:17:45] [ info]  daemon         | 0
[2025/01/06 18:17:45] [ info] ___________
[2025/01/06 18:17:45] [ info]  inputs:
[2025/01/06 18:17:45] [ info]      stdin
[2025/01/06 18:17:45] [ info] ___________
[2025/01/06 18:17:45] [ info]  filters:
[2025/01/06 18:17:45] [ info] ___________
[2025/01/06 18:17:45] [ info]  outputs:
[2025/01/06 18:17:45] [ info]      stdout.0
[2025/01/06 18:17:45] [ info] ___________
[2025/01/06 18:17:45] [ info]  collectors:
[2025/01/06 18:17:45] [ info] [fluent bit] version=4.0.0, commit=09214ebc7b, pid=74663
[2025/01/06 18:17:45] [debug] [engine] coroutine stack size: 36864 bytes (36.0K)
[2025/01/06 18:17:45] [ info] [storage] ver=1.2.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/01/06 18:17:45] [ info] [simd    ] NEON
[2025/01/06 18:17:45] [ info] [cmetrics] version=0.9.9
[2025/01/06 18:17:45] [ info] [ctraces ] version=0.5.7
[2025/01/06 18:17:45] [ info] [input:stdin:stdin.0] initializing
[2025/01/06 18:17:45] [ info] [input:stdin:stdin.0] storage_strategy='memory' (memory only)
[2025/01/06 18:17:45] [debug] [stdin:stdin.0] created event channels: read=25 write=26
[2025/01/06 18:17:45] [debug] [input:stdin:stdin.0] buf_size=16000
[2025/01/06 18:17:45] [debug] [stdout:stdout.0] created event channels: read=28 write=29
[2025/01/06 18:17:45] [ info] [sp] stream processor started
[2025/01/06 18:17:45] [ info] [output:stdout:stdout.0] worker #0 started
{"text": "\ud83e\udd17"}
[2025/01/06 18:17:47] [debug] [task] created task=0x6000032ec000 id=0 OK
[2025/01/06 18:17:47] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1736155066.742995000, {}], {"text"=>"🤗"}]
[2025/01/06 18:17:47] [debug] [out flush] cb_destroy coro_id=0
[2025/01/06 18:17:47] [debug] [task] destroy task=0x6000032ec000 (task_id=0)
^C[2025/01/06 18:17:48] [engine] caught signal (SIGINT)
[2025/01/06 18:17:48] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2025/01/06 18:17:48] [ info] [output:stdout:stdout.0] thread worker #0 stopped
  • Attached Valgrind output that shows no leaks or memory corruption was found
==80122== 
==80122== HEAP SUMMARY:
==80122==     in use at exit: 0 bytes in 0 blocks
==80122==   total heap usage: 2,999 allocs, 2,999 frees, 1,341,761 bytes allocated
==80122== 
==80122== All heap blocks were freed -- no leaks are possible
==80122== 
==80122== For lists of detected and suppressed errors, rerun with: -s
==80122== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from 27a363e to b758796 Compare January 6, 2025 10:10
@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from b758796 to dfe15aa Compare January 6, 2025 10:45
@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from e02c52c to b4c023e Compare January 6, 2025 10:58
@vit-zikmund
Contributor

vit-zikmund commented Jan 6, 2025

Thanks for following up on this so quickly, @cosmo0920!
I see you're struggling with the error cases there. Since the unescaping function is expected to return the number of processed bytes, wouldn't it be better to stick to that contract and, in the error cases, set the returned ch to the replacement character (ch = L'\uFFFD'), as I suggested in the footnote of my issue comment?

On the other hand, staying strict and rejecting that sequence is likely much better for the user, who won't suddenly find magic replacements in their data.
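
A minimal sketch of the two options weighed above, to make the trade-off concrete. Everything here is an assumption for illustration: `decode_u_escape`, `read_escape`, `hex4`, and the `strict` flag are hypothetical names, not Fluent Bit's actual unescape API. The function returns the number of processed bytes; on a lone or mismatched surrogate it either returns -1 (strict) or yields U+FFFD and consumes one escape (lenient).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFD

/* Parse exactly four hex digits; returns 0 on success, -1 otherwise. */
static int hex4(const char *s, uint32_t *out)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) {
        char c = s[i];
        v <<= 4;
        if (c >= '0' && c <= '9')      v |= (uint32_t) (c - '0');
        else if (c >= 'a' && c <= 'f') v |= (uint32_t) (c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') v |= (uint32_t) (c - 'A' + 10);
        else return -1;
    }
    *out = v;
    return 0;
}

/* Read one "\uXXXX" escape (6 bytes) starting at in. */
static int read_escape(const char *in, size_t len, uint32_t *unit)
{
    if (len < 6 || in[0] != '\\' || in[1] != 'u') {
        return -1;
    }
    return hex4(in + 2, unit);
}

/* Returns bytes consumed (6 or 12), or -1 when the input is not a valid
 * \u escape, or when strict is set and the surrogate sequence is invalid. */
static int decode_u_escape(const char *in, size_t len, int strict, uint32_t *ch)
{
    uint32_t hi, lo;

    assert(ch != NULL);
    if (read_escape(in, len, &hi) != 0) {
        return -1;
    }
    if (hi < 0xD800 || hi > 0xDFFF) {
        *ch = hi;                       /* plain BMP character */
        return 6;
    }
    if (hi <= 0xDBFF && read_escape(in + 6, len - 6, &lo) == 0 &&
        lo >= 0xDC00 && lo <= 0xDFFF) {
        *ch = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        return 12;
    }
    if (strict) {
        return -1;                      /* reject lone/mismatched surrogate */
    }
    *ch = REPLACEMENT_CHAR;             /* lenient: substitute U+FFFD */
    return 6;
}
```

The lenient branch keeps the "number of processed bytes" contract intact at the cost of silently altering data, while the strict branch surfaces the bad input to the caller.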

@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from b4c023e to daee871 Compare January 8, 2025 04:12
cosmo0920 and others added 4 commits January 14, 2025 14:43
@vit-zikmund 's suggestion is very helpful to get working for handling
surrogate pairs.

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Co-authored-by: Vit Zikmund <vit.zikmund@themama.ai>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from 8febcae to 991691e Compare January 14, 2025 06:17
@cosmo0920 cosmo0920 force-pushed the cosmo0920-decode-slash-u-escaped-characters branch from 991691e to 7930bcd Compare January 14, 2025 06:18
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
@cosmo0920
Contributor Author

I've now managed to get a green result for the OSS-Fuzz task. 💪

@jvystrcil-mama-ai

@cosmo0920 I ran into this issue again today. Is this planned to be reviewed and merged soon?

@edsiper edsiper merged commit ab685d4 into master Mar 29, 2025
52 checks passed
@edsiper edsiper deleted the cosmo0920-decode-slash-u-escaped-characters branch March 29, 2025 16:56
@jvystrcil-mama-ai

@cosmo0920 @edsiper @vit-zikmund thank you all for the fix 👏

Development

Successfully merging this pull request may close these issues.

Incorrect parsing of escaped characters from higher unicode planes in a JSON string

4 participants