utils: Add utf8_is_valid function #80779

rruuaanng · 2024-11-02T11:12:11Z

Add 'utf8_check' function to check if a given string is utf8 encoded.

josuah

Thank you for this addition! It sounds useful to sometimes validate a string but not decode it, i.e. input validation.

Some suggestions were proposed above as feedback. Let me know if these suggestions make sense.

Maybe the solution is to completely decode the UTF-8 string, as then it would check for extra things:

And all other bizarre situations contained here:
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

lib/utils/utf8.c

josuah · 2024-11-02T12:58:52Z

lib/utils/utf8.c

If we skip nbyte at once, this does not check if the UTF-8 string bytes skipped are valid (i.e. starts with 0b10xxxxxx).

It seems that it doesn't continue through the entire string. Instead, it returns immediately when the conditions are not met. When the conditions are met, it only performs a single match.

IMO this comment is still valid.
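To make the concern concrete, here is a minimal sketch of what checking the skipped bytes could look like. The helper name `utf8_continuation_ok` and its parameters are hypothetical and not taken from the PR; it only verifies that every byte after the lead byte has the form 0b10xxxxxx and that the sequence fits in the buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the PR): returns true if the nbyte - 1 bytes
 * that follow the lead byte at str[i] all look like 0b10xxxxxx and the whole
 * sequence fits inside the len bytes of str.
 */
static bool utf8_continuation_ok(const uint8_t *str, size_t len, size_t i, size_t nbyte)
{
	if (i >= len || nbyte < 1 || nbyte > len - i) {
		return false; /* sequence would run past the end of the buffer */
	}

	for (size_t k = 1; k < nbyte; k++) {
		if ((str[i + k] & 0xC0) != 0x80) {
			return false; /* not a continuation byte */
		}
	}

	return true;
}
```

With a check like this in place, skipping nbyte bytes at once no longer lets an invalid trailing byte slip through.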

include/zephyr/sys/util.h

rruuaanng · 2024-11-02T13:34:37Z

Please ignore CI, which is related to function signature. I'll change the signature and function.

lib/utils/utf8.c

jukkar · 2024-11-02T16:32:25Z

lib/utils/utf8.c

This should be const as we are not modifying the string.

It will trigger a ci warning, that is STATIC_CONST.

Usually you should never cast away const pointer as that would indicate that you could modify the thing the pointer points to. Usually compiler should warn about this, and I recall some static checkers will.
So it is hard to understand why

const unsigned char *buf = (const unsigned char *)str;

would give a warning.

Unresolving this. #80779 (comment) claims it's answered here, but I disagree. Accessing str directly should be possible without the additional local buf and should be possible to use const.

Please provide additional information about the warning you see

I won't add const, jukkar has already tested it. This will make my PR fail CI.

@rruuaanng I am once again going to ask you not to resolve comments.

In the implementation you don't need the buf at all, you can just use str directly. Are you seeing issues with that approach?

printf("%d %d %d", 0xe4, (const char)'\xe4', (const unsigned char)'\xe4'); output: 228 -28 228
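As a small illustration of the two points being argued here, the sketch below shows that the cast to unsigned char can keep the const qualifier, and that reading the byte through plain char may give a negative value on platforms where char is signed. All function names are hypothetical and exist only for this example.

```c
#include <stdbool.h>

/* Hypothetical helper: is the first byte a 3-byte lead (0b1110xxxx)?
 * The cast changes signedness only; constness is preserved, so nothing
 * is "cast away".
 */
static bool is_three_byte_lead(const char *str)
{
	const unsigned char *buf = (const unsigned char *)str;

	return buf[0] >= 0xE0 && buf[0] <= 0xEF;
}

/* Reads the byte as plain char: -28 for 0xE4 where char is signed,
 * 228 where char is unsigned.
 */
static int first_byte_plain(const char *str)
{
	return str[0];
}

/* Reads the byte as unsigned char: 228 for 0xE4 on either kind of platform. */
static int first_byte_unsigned(const char *str)
{
	return (unsigned char)str[0];
}
```

Range comparisons against constants such as 0xE0 only behave as intended on the unsigned view, which is the motivation given for the conversion.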

I need to reiterate that you should read the above review carefully :)

Edit
I have said that when I did not use a method and optimization it was for a reason, not because I did not discover it.

Can you share the log of the build warning/error you get? #80779 (comment) is just a screenshot of your editor, which doesn't really say much :)

The other utf8 functions work fine with const char *, so I'm curious to see what the issue here is

In fact, other implementations have problems. They do not check the extended bytes or the abnormal bytes. The above case has explained why I want to convert it to unsigned.

I won't add const, jukkar has already tested it. This will make my PR fail CI.

I have not said anything about testing or accepting this, please don't claim that.

What I have said is that you should not cast away constness of a variable. I don't get why you talk about static variable in this case.

include/zephyr/sys/util.h

lib/utils/utf8.c

rruuaanng · 2024-11-06T11:58:43Z

I have pushed the original changes. If CI has any related issues, we can check it.

rettichschnidi · 2024-11-06T14:48:33Z

lib/utils/utf8.c

This else is redundant, please move code to the left:

Suggested change

	} else {
		if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE &&
		    str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
			nbyte = 2;
		} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE &&
			   str[i] >= SEQUENCE_MIN_LEN_3_BYTE) {
			nbyte = 3;
		} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE &&
			   str[i] >= SEQUENCE_MIN_LEN_4_BYTE) {
			nbyte = 4;
		} else {
			return false;
		}
	}

	}
	if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE && str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
		nbyte = 2;
	} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE && str[i] >= SEQUENCE_MIN_LEN_3_BYTE) {
		nbyte = 3;
	} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE && str[i] >= SEQUENCE_MIN_LEN_4_BYTE) {
		nbyte = 4;
	} else {
		return false;
	}

It was over 100 columns and I had to wrap. So, I need to resolve this comment.

Last time: Do not resolve other people's comments!

I do not follow your logic. Moving code to the left makes the 100 column "issue" less, not worse. Please explain.

I changed it to

str[i] <= SEQUENCE_MAX_LEN_4_BYTE && str[i] >= SEQUENCE_MIN_LEN_4_BYTE)

and CI then reported that the line exceeded 100 columns.

Edit

If you want me to give you an explanation, I will change it and you can look at the warnings CI throws. And it's just a look, it's not important. We should focus on the functionality of the decoder, not other.

If you want me to give you an explanation, I will change it and you can look at the warnings CI throws.

Please post a link to where the CI is failing and I'll have a look.

And it's just a look, it's not important. We should focus on the functionality of the decoder, not other.

One thing does not rule out the other. We can have both.

Okay, I will revise it today. Please look forward to it ;)

rruuaanng · 2024-11-08T00:59:12Z

CI thrown

zephyr/CMakeFiles/zephyr.dir/lib/utils/utf8.c.obj -c /__w/zephyr/zephyr/lib/utils/utf8.c
/__w/zephyr/zephyr/lib/utils/utf8.c:91:15: error: call to undeclared function 'strnlen'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
        size_t len = strnlen(str, maxlen);
                     ^
/__w/zephyr/zephyr/lib/utils/utf8.c:91:15: note: did you mean 'strlen'?
/usr/include/string.h:407:15: note: 'strlen' declared here
extern size_t strlen (const char *__s)
              ^
1 error generated.
ninja: build stopped: subcommand failed.

Thalley · 2024-11-08T08:43:14Z

Interesting. It would appear that the libc implementation used for that test does not implement strnlen. The platform seems to be the native_sim. @aescolar can we not use strnlen on native_sim?

aescolar · 2024-11-08T09:03:04Z

Yes, you can certainly use it. But as it is an extension to the standard C library you need to "ask" for its prototype to be included.
You can just add

#undef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L

at the beginning of the file that uses strnlen (before any header inclusion).
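For reference, a minimal sketch of a source file laid out this way is shown below. The wrapper name `bounded_len` is hypothetical; the only points being illustrated are that the feature-test macro comes before every #include and that strnlen then has a visible prototype.

```c
/* The feature-test macro must come before any header inclusion, otherwise
 * the libc headers have already decided which prototypes to expose.
 */
#undef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L

#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: length of str, never reading more than maxlen bytes. */
static size_t bounded_len(const char *str, size_t maxlen)
{
	return strnlen(str, maxlen);
}
```

Unlike MIN(strlen(str), maxlen), strnlen stops reading at maxlen even when the buffer holds no terminator.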

Thalley · 2024-11-08T09:29:07Z

lib/utils/utf8.c

Given the challenges with strnlen, could we just do the following? Or would that be considered insecure?

Suggested change

size_t len = strnlen(str, maxlen);

size_t len = MIN(strlen(str), maxlen);

I would consider that insecure. Why not do the check when looping?

Why not do the check when looping?

Can you elaborate?

while (i < maxlen) {
	if (str[i] == '\0') {
		break;
	}
	if (str[i] <= ASCII_CHAR && str[i] > '\0') {
		i++;
		continue;
	}
	// ...
}

I'm sorry to tell you that the NUL character (0x00) is also a valid UTF-8 character. If you do this, the check will be cut short.

The check would already be cut off, as strlen is simply looking for a \0 byte, no?

UTF-8 is an encoding that is used to represent multibyte character sets in a way that is backward-compatible with single-byte character sets. Another advantage of UTF-8 is that it ensures there are no NULL bytes in the data, with the exception of an actual NULL byte.

From https://www.oreilly.com/library/view/secure-programming-cookbook/0596003943/ch03s12.html#:~:text=Discussion,of%20an%20actual%20NULL%20byte.

Agreed.

[0x30, 0x31, 0x00, 0x30, 0xff] should be considered a valid UTF8 string as only the first 2 characters (0x30, 0x31) and the NULL terminator should be considered.

For example

char str[100];
str[0] = 0x30; /* or '0' */
str[1] = 0x31; /* or '1' */
str[2] = 0x00; /* or '\0' */

is a valid UTF8 string, even if octets [3..99] may be invalid, since the string stops at the NULL terminator.

I like the suggestion from @pdgendt as it also optimizes the check by avoiding the call to strlen.
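A rough sketch of how the loop-based check could look end to end is given below. The function name, signature and bitmask tests are assumptions made for this example, not the PR's code. It performs structural checks only (lead byte form, continuation bytes, truncation by maxlen) and stops at the first NUL byte, so bytes after the terminator are never inspected.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name and signature. Validates the NUL-terminated prefix of
 * str, reading at most maxlen bytes, without calling strlen or strnlen.
 */
static bool utf8_prefix_is_wellformed(const char *str, size_t maxlen)
{
	const uint8_t *s = (const uint8_t *)str;
	size_t i = 0;

	while (i < maxlen && s[i] != '\0') {
		size_t nbyte;

		if (s[i] < 0x80) {
			nbyte = 1; /* ASCII */
		} else if ((s[i] & 0xE0) == 0xC0) {
			nbyte = 2;
		} else if ((s[i] & 0xF0) == 0xE0) {
			nbyte = 3;
		} else if ((s[i] & 0xF8) == 0xF0) {
			nbyte = 4;
		} else {
			return false; /* stray continuation byte or invalid lead */
		}

		if (nbyte > maxlen - i) {
			return false; /* sequence cut off by maxlen */
		}

		for (size_t k = 1; k < nbyte; k++) {
			/* A NUL inside a sequence fails this test as well. */
			if ((s[i + k] & 0xC0) != 0x80) {
				return false;
			}
		}

		i += nbyte;
	}

	return true;
}
```

Under these assumptions the buffer [0x30, 0x31, 0x00, 0x30, 0xff] is accepted with maxlen 5, because the loop ends at the terminator, and an invalid byte near the start is rejected without scanning the rest of the string.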

Yes. So we don't need to check the empty character into looping.

lib/utils/utf8.c

tests/unit/util/main.c

include/zephyr/sys/util.h

Thalley · 2024-11-08T09:49:37Z

tests/unit/util/main.c

We should have tests where maxlen is larger and smaller than the provided string as well

Yes, but we only need to add one. That is to test its behavior.

I believe we should have all 3 cases:

maxlen smaller than strlen(str)

maxlen equal to strlen(str) (covered already)

maxlen larger than strlen(str)

Otherwise you aren't covering all checks in the code, right?

If we use strnlen, I don't think we need to test when maxlen is greater than len (and when maxlen less than len).

Edit
By the way, I would like to reply here about the readability of the test strings. Most of the UTF-8 strings in the tests do not have a readable representation. They are only used to test the decoder's handling of boundary values (where a string does contain readable characters, I have changed it).

rruuaanng · 2024-11-11T12:52:47Z

I added

#undef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L

before #include <string.h>. Hope it works.

lib/utils/utf8.c

rruuaanng · 2024-11-11T13:01:04Z

If it passes CI, I hope you guys can approve it (it’s just a small feature, but it received so many reviews, which surprised me.)

lib/utils/utf8.c

Thalley · 2024-11-11T14:51:29Z

Removing myself as reviewer. I trust the other reviewers to do remaining reviews, but I can't spend more time on this when the author keeps resolving my comments without proper resolutions, but looks like we are nearly there

pdgendt · 2024-11-11T15:59:55Z

lib/utils/utf8.c

I'm sorry, but I fail to see the need to have strnlen or strlen here. Worst case scenario is that we loop the entire string twice, without any benefit. If the first two characters aren't valid utf8, we could already stop after 2 steps.

Good proposal. Maybe I can use len as a parameter. I won't deal with the exception caused by the wrong len. WDYT?

Just use the maxlen argument directly in the loop? See my comment.

I will modify it later and change i < len to i < maxlen.

Edit

And remove strnlen ;)

If the first two characters aren't valid utf8, we could already stop after 2 steps.

if (str[i] <= ASCII_CHAR && str[i] >= '\0') {
	i++;
	continue;
} else {
	if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE && str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
		nbyte = 2;
	} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE && str[i] >= SEQUENCE_MIN_LEN_3_BYTE) {
		nbyte = 3;
	} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE && str[i] >= SEQUENCE_MIN_LEN_4_BYTE) {
		nbyte = 4;
	} else {
		return false;
	}
}

I almost forgot, my implementation includes this function, it is in the recognition. It will exit the function when it matches non-UTF8. The same is true for the first character.

And, I have another question. You mentioned that you can put maxlen directly into the loop instead of using strnlen, which seems to be no different from directly using len as a function parameter. In fact, it can be len directly instead of maxlen. Right?
@pdgendt

rruuaanng · 2024-11-15T12:48:38Z

This PR has been reviewed by multiple people. However, no perfect and specific solution has been discussed, and they are usually scattered. This makes it inconvenient for me to implement them, and I hope we can summarize and list them (do not accept requests for changes in code style, it is useless). I think if there are no exceptions in terms of functionality, it has met the requirements for merging, which can make it available to users as soon as possible. And find bugs in practice. For minimal performance optimization, I don't think we need it, and this can be distributed by others after release.

rruuaanng · 2024-11-15T12:52:58Z

I list the following points that I learned from the discussion:

Test items should be annotated to explain their purpose
maxlen is not necessary, you can just provide the len parameter and let the user pass the string length. If the length is abnormal, the user will be responsible for any abnormal behavior.

Edit

And, thank you for your review and time! My changes may not be suitable for everyone, because everyone has a different perspective on this matter. In fact, we can't force uniformity, but we can summarize a solution that everyone can agree on.

lib/utils/utf8.c

rruuaanng · 2024-11-15T14:34:30Z

@pdgendt I've changed them. Please review again! :)

andyross

Some nitpickery, but one design issue I think needs to be addressed: this code doesn't inspect the actual code point itself, so what does "valid" mean here?

Specifically, this will return "true" for the sequence { 0xf0, 0x80, 0x81, 0x81 }, which is a structurally "correct" but non-canonical (overlong) and very surprising encoding of the ASCII character "A".

Most serious applications view this as bad, as it very often escapes string validation and escaping code and causes security bugs. See the relevant section in the Wikipedia page: https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings

Now, I'm not personally a security nut and won't take a strong position on the way this is "supposed" to work. But absolutely if we don't do the "Standard Security Best Practices" thing we should call that fact out very loudly in the docs for the function.

-1 just to make sure someone adds the docs. Feel free to override and merge if you can't find me to remove it.
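If the stricter behaviour is wanted, one possible shape for it is sketched below. The helper `utf8_scalar_ok` is hypothetical and assumes the sequence length and continuation bytes were already checked; it decodes the code point and rejects overlong forms (for example F0 80 81 81 for "A"), UTF-16 surrogates, and values above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: s points at a structurally valid sequence of nbyte
 * bytes (1..4). Returns true only if it is the shortest encoding of a
 * Unicode scalar value.
 */
static bool utf8_scalar_ok(const uint8_t *s, size_t nbyte)
{
	/* Smallest code point that legitimately needs nbyte bytes. */
	static const uint32_t min_cp[] = {0, 0x0, 0x80, 0x800, 0x10000};
	uint32_t cp;

	switch (nbyte) {
	case 1:
		cp = s[0];
		break;
	case 2:
		cp = s[0] & 0x1F;
		break;
	case 3:
		cp = s[0] & 0x0F;
		break;
	case 4:
		cp = s[0] & 0x07;
		break;
	default:
		return false;
	}

	/* Each continuation byte contributes its low six bits. */
	for (size_t k = 1; k < nbyte; k++) {
		cp = (cp << 6) | (s[k] & 0x3F);
	}

	if (cp < min_cp[nbyte]) {
		return false; /* overlong encoding */
	}
	if (cp >= 0xD800 && cp <= 0xDFFF) {
		return false; /* UTF-16 surrogate */
	}
	if (cp > 0x10FFFF) {
		return false; /* beyond the Unicode range */
	}

	return true;
}
```

Whether the function performs these checks or only the structural ones, the choice is something the function's documentation would need to state.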

andyross · 2024-11-15T17:01:48Z

lib/utils/utf8.c

Style: while (str[i] != '\0') { is IMHO clearer and four lines shorter.

Edit: while (i < len && str[i] != '\0') { that is. Which brings up another nitpick: this will return true if the null-terminated prefix of the passed string is valid utf-8, it doesn't necessarily validate the whole string.

Is a buffer with trailing garbage "valid" or not? Documentation should specify.

Very good, I will change it. But I can't find where it is documented, and I'm not that good at writing documentation, if possible, can you help me write documentation for it?

Doxygen comments go above the function declaration in the relevant header. You can copy the format of existing ones (see include/zephyr/kernel.h for lots and lots and lots of examples).

lib/utils/utf8.c

andyross · 2024-11-15T17:09:49Z

lib/utils/utf8.c

Is there a reason for using comparisons here? UTF-8 sequences are canonically described as being selected by bitmasks (e.g. "0b1110xxxx" indicates a three byte sequence), and optimized code dealing with it usually exploits things like specialized instructions to count the leading one bits. Doing it by mapping those to range comparisons isn't wrong, it just looks weird to my eyes and seems likely to be needlessly large.

FWIW, my immediate thought here would be to try something like (completely untested!):

nbytes = MIN(4, __builtin_clz(~str[i]));

The question is very high-level. The reason why I don't use the mask is that I hope it can be more intuitive in implementation, just like judging whether a character is ascii.

Like this

c <= 0x7f && c >= 0

I dunno, we're an RTOS here. I think "intuitive" needs necessarily to take a back seat to code size where there is a significant delta (the expression above[1] should be 4-5 instructions long on most architectures).

[1] If it actually works as written, which I haven't validated!

I like what Andy is proposing here. If it feels unintuitive, we could always add a comment describing what the following line is doing.
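For what it's worth, a commented variant of the count-leading-ones idea might look like the sketch below. It is not the one-liner proposed above: the helper name `utf8_lead_len` and its return convention are assumptions, and it relies on the GCC/Clang builtin `__builtin_clz`. One detail to watch is that applying ~ to a char promotes it to int first, so the byte has to be moved into the top bits of a 32-bit word before counting.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the sequence length implied by a lead byte
 * (1 to 4), or 0 if the byte cannot start a sequence.
 */
static size_t utf8_lead_len(uint8_t lead)
{
	/* Shift the byte to the top of a 32-bit word and invert it, so the
	 * leading 1 bits of the byte become leading 0 bits of the word. The
	 * low 24 bits are all ones after the inversion, so the argument is
	 * never zero (for which __builtin_clz is undefined).
	 */
	unsigned int ones = (unsigned int)__builtin_clz(~((uint32_t)lead << 24));

	if (ones == 0) {
		return 1; /* 0xxxxxxx: ASCII */
	}
	if (ones == 1 || ones > 4) {
		return 0; /* 10xxxxxx continuation byte, or 11111xxx */
	}

	return ones; /* 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4 */
}
```

The extra branches cost a few instructions compared with a bare clamp, in exchange for distinguishing ASCII and rejecting bytes that cannot be a lead.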

I will revise it soon, thanks for your review!

Add 'utf8_is_valid' function to check if a given string is utf8 encoded. Signed-off-by: James Roy <[email protected]>

github-actions · 2025-01-18T00:31:31Z

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

rruuaanng force-pushed the utf8-check branch 3 times, most recently from 285b4b0 to ed28e46 Compare November 2, 2024 11:41

rruuaanng added the area: Utilities label Nov 2, 2024

rruuaanng marked this pull request as ready for review November 2, 2024 12:15

rruuaanng requested a review from Thalley November 2, 2024 12:16

zephyrbot added area: Testsuite Testsuite area: Base OS Base OS Library (lib/os) labels Nov 2, 2024

zephyrbot requested review from aaronemassey, andyross, asemjonovs, dcpleung, jeremybettis, nashif, peter-mitsis and yperess November 2, 2024 12:16

zephyrbot assigned andyross and nashif Nov 2, 2024

rruuaanng force-pushed the utf8-check branch from ed28e46 to 020bf9c Compare November 2, 2024 12:31

josuah requested changes Nov 2, 2024

rruuaanng force-pushed the utf8-check branch 4 times, most recently from 30419f6 to d68c1b4 Compare November 2, 2024 15:38

jukkar reviewed Nov 2, 2024

rettichschnidi requested changes Nov 2, 2024

rruuaanng force-pushed the utf8-check branch 2 times, most recently from fbec8b3 to bf92c6a Compare November 3, 2024 02:22

josuah reviewed Nov 3, 2024