@rruuaanng
Contributor

Add 'utf8_check' function to check if a given string is utf8 encoded.

@rruuaanng rruuaanng force-pushed the utf8-check branch 3 times, most recently from 285b4b0 to ed28e46 Compare November 2, 2024 11:41
@rruuaanng rruuaanng marked this pull request as ready for review November 2, 2024 12:15
@rruuaanng rruuaanng requested a review from Thalley November 2, 2024 12:16
@zephyrbot zephyrbot added area: Testsuite Testsuite area: Base OS Base OS Library (lib/os) labels Nov 2, 2024
Contributor

@josuah josuah left a comment

Thank you for this addition! It sounds useful to sometimes validate a string but not decode it, i.e. input validation.

Some suggestions were proposed above as feedback. Let me know if these suggestions make sense.

Maybe the solution is to completely decode the UTF-8 string, as then it would check for extra things:

And all other bizarre situations contained here:
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

lib/utils/utf8.c Outdated
Contributor

If we skip nbyte bytes at once, this does not check that the skipped bytes are valid UTF-8 continuation bytes (i.e. that each starts with 0b10xxxxxx).
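
To illustrate the point about skipped bytes, here is a minimal sketch of a continuation-byte check. The helper names `is_continuation` and `tail_is_valid` are hypothetical and do not come from the PR; the sketch assumes the caller has already worked out `nbyte` from the lead byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A UTF-8 continuation byte has the bit pattern 0b10xxxxxx. */
static bool is_continuation(uint8_t b)
{
	return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: seq[0] is a lead byte announcing an nbyte-long
 * sequence. Check that the sequence fits in len bytes and that every
 * byte after the lead byte is a continuation byte.
 */
static bool tail_is_valid(const uint8_t *seq, size_t len, size_t nbyte)
{
	if (nbyte > len) {
		return false; /* sequence is truncated */
	}
	for (size_t k = 1; k < nbyte; k++) {
		if (!is_continuation(seq[k])) {
			return false;
		}
	}
	return true;
}
```

With a check like this, advancing by `nbyte` only happens after each skipped byte has been inspected.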

Contributor Author

It seems that it doesn't continue through the entire string. Instead, it returns immediately when the conditions are not met. When the conditions are met, it only performs a single match.

Contributor

IMO this comment is still valid.

@rruuaanng
Contributor Author

Please ignore CI, which is related to function signature. I'll change the signature and function.

@rruuaanng rruuaanng force-pushed the utf8-check branch 4 times, most recently from 30419f6 to d68c1b4 Compare November 2, 2024 15:38
lib/utils/utf8.c Outdated
Member

This should be const as we are not modifying the string.

Contributor Author

It will trigger a CI warning, namely STATIC_CONST.

Member

Usually you should never cast away the constness of a pointer, as that would indicate that you could modify the thing the pointer points to. The compiler should usually warn about this, and I recall some static checkers do too.
So it is hard to understand why

const unsigned char *buf = (const unsigned char *)str;

would give a warning.

Contributor

Unresolving this. #80779 (comment) claims it's answered here, but I disagree. Accessing str directly should be possible without the additional local buf and should be possible to use const.

Please provide additional information about the warning you see

Contributor Author

I won't add const, jukkar has already tested it. This will make my PR fail CI.

Contributor

@rruuaanng I am once again going to ask you not to resolve comments.

In the implementation you don't need the buf at all, you can just use str directly. Are you seeing issues with that approach?

Contributor Author

@rruuaanng rruuaanng Nov 5, 2024

printf("%d %d %d", 0xe4, (const char)'\xe4', (const unsigned char)'\xe4');

output:
228 -28 228

I need to reiterate that you should read the above review carefully :)

Edit
I have said that when I did not use a method and optimization it was for a reason, not because I did not discover it.
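
The printf output above comes from a platform where plain `char` is signed: there `'\xe4'` is -28, so a comparison such as `str[i] >= 0x80` can never be true. A minimal sketch of how the two concerns in this thread can coexist follows: the parameter stays `const char *`, and the bytes are read through a `const unsigned char *`. The function name `is_non_ascii` is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical helper: true when the first byte of s has its high bit set. */
static bool is_non_ascii(const char *s)
{
	/* The cast changes signedness only; constness is preserved. */
	const unsigned char *buf = (const unsigned char *)s;

	/* buf[0] is in 0..255 regardless of whether plain char is signed. */
	return buf[0] >= 0x80;
}
```

Casting from `const char *` to `const unsigned char *` does not discard a qualifier, so it should not trigger a cast-away-const diagnostic.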

Contributor

Can you share the log of the build warning/error you get? #80779 (comment) is just a screenshot of your editor, which doesn't really say much :)

The other utf8 functions work fine with const char *, so I'm curious to see what the issue here is

Contributor Author

In fact, other implementations have problems. They do not check the extended bytes or the abnormal bytes. The above case has explained why I want to convert it to unsigned.

Member

I won't add const, jukkar has already tested it. This will make my PR fail CI.

I have not said anything about testing or accepting this, please don't claim that.

What I have said is that you should not cast away constness of a variable. I don't get why you talk about static variable in this case.

@rruuaanng rruuaanng force-pushed the utf8-check branch 2 times, most recently from fbec8b3 to bf92c6a Compare November 3, 2024 02:22
@rruuaanng
Contributor Author

I have pushed the original changes. If CI has any related issues, we can check it.

lib/utils/utf8.c Outdated
Comment on lines 97 to 114
Contributor

@rettichschnidi rettichschnidi Nov 6, 2024

This else is redundant, please move code to the left:

Suggested change
-		} else {
-			if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE &&
-			    str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
-				nbyte = 2;
-			} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE &&
-				   str[i] >= SEQUENCE_MIN_LEN_3_BYTE) {
-				nbyte = 3;
-			} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE &&
-				   str[i] >= SEQUENCE_MIN_LEN_4_BYTE) {
-				nbyte = 4;
-			} else {
-				return false;
-			}
-		}
+		}
+		if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE && str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
+			nbyte = 2;
+		} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE && str[i] >= SEQUENCE_MIN_LEN_3_BYTE) {
+			nbyte = 3;
+		} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE && str[i] >= SEQUENCE_MIN_LEN_4_BYTE) {
+			nbyte = 4;
+		} else {
+			return false;
+		}

Contributor Author

It was over 100 columns and I had to wrap. So, I need to resolve this comment.

Contributor

Last time: Do not resolve other people's comments!

I do not follow your logic. Moving code to the left makes the 100-column "issue" smaller, not worse. Please explain.

Contributor Author

@rruuaanng rruuaanng Nov 15, 2024

I changed it to

str[i] <= SEQUENCE_MAX_LEN_4_BYTE && str[i] >= SEQUENCE_MIN_LEN_4_BYTE) 

and it made CI report that the line exceeded 100 columns.

Edit

If you want me to give you an explanation, I will change it and you can look at the warnings CI throws. And it's just a look, it's not important. We should focus on the functionality of the decoder, not other.

Contributor

If you want me to give you an explanation, I will change it and you can look at the warnings CI throws.

Please post a link to where the CI is failing and I'll have a look.

And it's just a look, it's not important. We should focus on the functionality of the decoder, not other.

One thing does not rule out the other. We can have both.

Contributor Author

Okay, I will revise it today. Please look forward to it ;)

@rruuaanng
Contributor Author

CI thrown

zephyr/CMakeFiles/zephyr.dir/lib/utils/utf8.c.obj -c /__w/zephyr/zephyr/lib/utils/utf8.c
/__w/zephyr/zephyr/lib/utils/utf8.c:91:15: error: call to undeclared function 'strnlen'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
        size_t len = strnlen(str, maxlen);
                     ^
/__w/zephyr/zephyr/lib/utils/utf8.c:91:15: note: did you mean 'strlen'?
/usr/include/string.h:407:15: note: 'strlen' declared here
extern size_t strlen (const char *__s)
              ^
1 error generated.
ninja: build stopped: subcommand failed.

@Thalley
Contributor

Thalley commented Nov 8, 2024

CI thrown

zephyr/CMakeFiles/zephyr.dir/lib/utils/utf8.c.obj -c /__w/zephyr/zephyr/lib/utils/utf8.c
/__w/zephyr/zephyr/lib/utils/utf8.c:91:15: error: call to undeclared function 'strnlen'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
        size_t len = strnlen(str, maxlen);
                     ^
/__w/zephyr/zephyr/lib/utils/utf8.c:91:15: note: did you mean 'strlen'?
/usr/include/string.h:407:15: note: 'strlen' declared here
extern size_t strlen (const char *__s)
              ^
1 error generated.
ninja: build stopped: subcommand failed.

Interesting. It would appear that the libc implementation used for that test does not implement strnlen. The platform seems to be the native_sim. @aescolar can we not use strnlen on native_sim?

@aescolar
Member

aescolar commented Nov 8, 2024

Interesting. It would appear that the libc implementation used for that test does not implement strnlen. The platform seems to be the native_sim. @aescolar can we not use strnlen on native_sim?

Yes, you can certainly use it. But as it is an extension to the standard C library you need to "ask" for its prototype to be included.
You can just add

#undef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L

at the beginning of the file that uses strnlen (before any header inclusion).
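
A minimal sketch of what such a file might look like, assuming a libc that only declares `strnlen` when POSIX.1-2008 features are requested. The wrapper name `bounded_len` is hypothetical and only there to give the example something to compile.

```c
/* Feature-test macros must be set before any header is included. */
#undef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L

#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: length of str, reading at most maxlen bytes. */
static size_t bounded_len(const char *str, size_t maxlen)
{
	return strnlen(str, maxlen);
}
```

Unlike `MIN(strlen(str), maxlen)`, `strnlen` stops at `maxlen` even when the buffer has no terminator.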

lib/utils/utf8.c Outdated
Contributor

Given the challenges with strnlen, could we just do the following? Or would that be considered insecure?

Suggested change
size_t len = strnlen(str, maxlen);
size_t len = MIN(strlen(str), maxlen);

Contributor

I would consider that insecure. Why not do the check when looping?

Contributor

Why not do the check when looping?

Can you elaborate?

Contributor

@pdgendt pdgendt Nov 8, 2024

	while (i < maxlen) {
		if (str[i] == '\0') {
			break;
		}

		if (str[i] <= ASCII_CHAR && str[i] > '\0') {
			i++;
			continue;
		}

		// ...
	}

Contributor Author

@rruuaanng rruuaanng Nov 8, 2024

I'm sorry to tell you that the null character (0x00) is also a valid UTF-8 character. If you do this, the check will be cut off.

Contributor

The check would already be cut off, as strlen is simply looking for a \0 byte, no?

Contributor

UTF-8 is an encoding that is used to represent multibyte character sets in a way that is backward-compatible with single-byte character sets. Another advantage of UTF-8 is that it ensures there are no NULL bytes in the data, with the exception of an actual NULL byte.

From https://www.oreilly.com/library/view/secure-programming-cookbook/0596003943/ch03s12.html#:~:text=Discussion,of%20an%20actual%20NULL%20byte.

Contributor

Agreed.

[0x30, 0x31, 0x00, 0x30, 0xff] should be considered a valid UTF8 string as only the first 2 characters (0x30, 0x31) and the NULL terminator should be considered.

For example

char str[100];
str[0] = 0x30; /* or '0' */
str[1] = 0x31; /* or '1' */ 
str[2] = 0x00; /* or '\0' */

is a valid UTF8 string, even if octets [3..99] may be invalid, since the string stops at the NULL terminator.

I like the suggestion from @pdgendt as it also optimizes the check by avoiding the call to strlen.
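
As a rough sketch of the loop-bounded approach discussed above (not the PR's actual code), the validator below stops at the first '\0' or at `maxlen`, whichever comes first, and never calls `strlen` or `strnlen`. The name `utf8_prefix_is_valid` is hypothetical, and lead bytes are classified with bitmasks rather than the PR's `SEQUENCE_*` range constants, whose values are not shown in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical name. Returns true when the bytes of str up to the first
 * '\0' (or up to maxlen bytes if no '\0' is found) are structurally valid
 * UTF-8. Overlong encodings and surrogates are NOT rejected here.
 */
static bool utf8_prefix_is_valid(const char *str, size_t maxlen)
{
	const uint8_t *s = (const uint8_t *)str;
	size_t i = 0;

	while (i < maxlen && s[i] != '\0') {
		size_t nbyte;

		if (s[i] < 0x80) {
			nbyte = 1;
		} else if ((s[i] & 0xE0) == 0xC0) {
			nbyte = 2;
		} else if ((s[i] & 0xF0) == 0xE0) {
			nbyte = 3;
		} else if ((s[i] & 0xF8) == 0xF0) {
			nbyte = 4;
		} else {
			return false; /* stray continuation or invalid lead byte */
		}

		if (nbyte > maxlen - i) {
			return false; /* sequence would run past maxlen */
		}
		/* A '\0' inside a sequence fails this test, so we never read past it. */
		for (size_t k = 1; k < nbyte; k++) {
			if ((s[i + k] & 0xC0) != 0x80) {
				return false;
			}
		}
		i += nbyte;
	}
	return true;
}
```

Under these assumptions the buffer [0x30, 0x31, 0x00, 0x30, 0xff] is accepted, because everything after the terminator is ignored, and a multi-byte sequence cut short by `maxlen` is rejected.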

Contributor Author

@rruuaanng rruuaanng Nov 8, 2024

Yes. So we don't need to check for the null character in the loop.

Contributor

We should have tests where maxlen is larger and smaller than the provided string as well

Contributor Author

@rruuaanng rruuaanng Nov 8, 2024

Yes, but we only need to add one. That is to test its behavior.

Contributor

I believe we should have all 3 cases:

  1. maxlen smaller than strlen(str)
  2. maxlen equal to strlen(str) (covered already)
  3. maxlen larger than strlen(str)

Otherwise you aren't covering all checks in the code, right?

Contributor Author

@rruuaanng rruuaanng Nov 11, 2024

If we use strnlen, I don't think we need to test the cases where maxlen is greater than len or less than len.

Edit
By the way, I would like to reply here about the readability of the test strings. Most of the UTF-8 strings in the tests do not have a readable representation. They are only used to test the decoder's recognition of boundary values (I've changed the ones that do have readable characters).

@rruuaanng
Contributor Author

Interesting. It would appear that the libc implementation used for that test does not implement strnlen. The platform seems to be the native_sim. @aescolar can we not use strnlen on native_sim?

Yes, you can certainly use it. But as it is an extension to the standard C library you need to "ask" for its prototype to be included. You can just add

#undef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L

at the beginning of the file that uses strnlen (before any header inclusion).

I added

#undef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L

before #include <string.h>. Hope it works.

@rruuaanng
Contributor Author

If it passes CI, I hope you guys can approve it (it’s just a small feature, but it received so many reviews, which surprised me.)

@Thalley Thalley self-requested a review November 11, 2024 14:49
@Thalley
Contributor

Thalley commented Nov 11, 2024

Removing myself as reviewer. I trust the other reviewers to do the remaining reviews, but I can't spend more time on this when the author keeps resolving my comments without proper resolutions. It looks like we are nearly there, though.

lib/utils/utf8.c Outdated
Contributor

I'm sorry, but I fail to see the need to have strnlen or strlen here. Worst case scenario is that we loop the entire string twice, without any benefit. If the first two characters aren't valid utf8, we could already stop after 2 steps.

Contributor Author

Good proposal. Maybe I can use len as a parameter. I won't deal with the exception caused by the wrong len. WDYT?

Contributor

Just use the maxlen argument directly in the loop? See my comment.

Contributor Author

@rruuaanng rruuaanng Nov 15, 2024

I will modify it later and change i < len to i < maxlen.

Edit

And remove strnlen ;)

Contributor Author

If the first two characters aren't valid utf8, we could already stop after 2 steps.

		if (str[i] <= ASCII_CHAR && str[i] >= '\0') {
			i++;
			continue;
		} else {
			if (str[i] <= SEQUENCE_MAX_LEN_2_BYTE
			 && str[i] >= SEQUENCE_MIN_LEN_2_BYTE) {
				nbyte = 2;
			} else if (str[i] <= SEQUENCE_MAX_LEN_3_BYTE
					&& str[i] >= SEQUENCE_MIN_LEN_3_BYTE) {
				nbyte = 3;
			} else if (str[i] <= SEQUENCE_MAX_LEN_4_BYTE
					&& str[i] >= SEQUENCE_MIN_LEN_4_BYTE) {
				nbyte = 4;
			} else {
				return false;
			}
		}

I almost forgot: my implementation already does this, in the lead-byte recognition shown above. It exits the function as soon as it encounters a non-UTF-8 byte, and the same is true for the first character.

Contributor Author

And, I have another question. You mentioned that you can put maxlen directly into the loop instead of using strnlen, which seems to be no different from directly using len as a function parameter. In fact, it can be len directly instead of maxlen. Right?
@pdgendt

@rruuaanng
Contributor Author

This PR has been reviewed by multiple people. However, no perfect and specific solution has been discussed, and they are usually scattered. This makes it inconvenient for me to implement them, and I hope we can summarize and list them (do not accept requests for changes in code style, it is useless). I think if there are no exceptions in terms of functionality, it has met the requirements for merging, which can make it available to users as soon as possible. And find bugs in practice. For minimal performance optimization, I don't think we need it, and this can be distributed by others after release.

@rruuaanng
Contributor Author

rruuaanng commented Nov 15, 2024

I list the following points that I learned from the discussion:

  1. Test items should be annotated to explain their purpose
  2. maxlen is not necessary, you can just provide the len parameter and let the user pass the string length. If the length is abnormal, the user will be responsible for any abnormal behavior.

Edit

And, thank you for your review and time! My changes may not be suitable for everyone, because everyone has a different perspective on this matter. In fact, we can't force uniformity, but we can summarize a solution that everyone can agree on.

@rruuaanng
Contributor Author

@pdgendt I've changed them. Please review again! :)

Contributor

@andyross andyross left a comment

Some nitpickery, but one design issue I think needs to be addressed: this code doesn't inspect the actual code point itself, so what does "valid" mean here?

Specifically, this will return "true" for the sequence { 0xf0, 0x80, 0x81, 0x81 }, which is a "correct" but non-canonical and very surprising encoding of the ASCII character "A".

Most serious applications view this as bad, as it very often escapes string validation and escaping code and causes security bugs. See the relevant section in the Wikipedia page: https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings

Now, I'm not personally a security nut and won't take a strong position on the way this is "supposed" to work. But absolutely if we don't do the "Standard Security Best Practices" thing we should call that fact out very loudly in the docs for the function.

-1 just to make sure someone adds the docs. Feel free to override and merge if you can't find me to remove it.
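
For readers wondering what rejecting overlong encodings could involve, here is a minimal sketch under the assumption that the sequence has already passed the structural checks (lead byte and continuation bytes). The helper name `is_canonical` is hypothetical. It decodes the code point and compares it against the smallest value that genuinely needs that many bytes; for example, F0 80 81 81 decodes to U+0041 ("A"), which fits in one byte, so the four-byte form is overlong.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one structurally valid sequence of nbyte
 * bytes (1..4) and report whether it is the shortest possible form and
 * a legal Unicode scalar value.
 */
static bool is_canonical(const uint8_t *seq, size_t nbyte)
{
	/* Smallest code point that needs 1, 2, 3, or 4 bytes. */
	static const uint32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
	/* Payload bits carried by the lead byte for each length. */
	static const uint8_t lead_mask[] = {0x7F, 0x1F, 0x0F, 0x07};
	uint32_t cp;

	if (nbyte < 1 || nbyte > 4) {
		return false;
	}

	cp = seq[0] & lead_mask[nbyte - 1];
	for (size_t k = 1; k < nbyte; k++) {
		cp = (cp << 6) | (seq[k] & 0x3F);
	}

	if (cp < min_cp[nbyte - 1]) {
		return false; /* overlong encoding */
	}
	if (cp >= 0xD800 && cp <= 0xDFFF) {
		return false; /* UTF-16 surrogate, not a scalar value */
	}
	return cp <= 0x10FFFF;
}
```

Whether a validator should apply this stricter rule is the design question raised above; if it does not, the function documentation should say so.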

lib/utils/utf8.c Outdated
Contributor

@andyross andyross Nov 15, 2024

Style: "while (str[i] != '\0') {" is IMHO clearer and four lines shorter.

Edit: while (i < len && str[i] != '\0') { that is. Which brings up another nitpick: this will return true if the null-terminated prefix of the passed string is valid utf-8, it doesn't necessarily validate the whole string.

Is a buffer with trailing garbage "valid" or not? Documentation should specify.

Contributor Author

Very good, I will change it. But I can't find where it is documented, and I'm not that good at writing documentation, if possible, can you help me write documentation for it?

Contributor

Doxygen comments go above the function declaration in the relevant header. You can copy the format of existing ones (see include/zephyr/kernel.h for lots and lots and lots of examples).

lib/utils/utf8.c Outdated
Contributor

@andyross andyross Nov 15, 2024

Is there a reason for using comparisons here? UTF-8 sequences are canonically described as being selected by bitmasks (e.g. "0b1110xxxx" indicates a three byte sequence), and optimized code dealing with it usually exploits things like specialized instructions to count the leading one bits. Doing it by mapping those to range comparisons isn't wrong, it just looks weird to my eyes and seems likely to be needlessly large.

FWIW, my immediate thought here would be to try something like (completely untested!):

nbytes = MIN(4, __builtin_clz(~str[i]));
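
The one-liner above is flagged as untested, and as far as I can tell it needs some adjustment: `~str[i]` is computed after promotion to `int`, so the upper 24 bits become ones and `__builtin_clz` would return 0. A minimal sketch of the same idea follows, assuming a GCC- or Clang-compatible compiler and a 32-bit `unsigned int`. The helper name `leading_ones` is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of leading one bits in a byte.
 * 0 => ASCII, 1 => continuation byte, 2..4 => lead byte of a sequence
 * of that length, 5 or more => never valid in UTF-8.
 */
static unsigned int leading_ones(uint8_t b)
{
	/* Invert within 8 bits, then move the byte to the top of a 32-bit word. */
	uint32_t inverted = (uint32_t)(uint8_t)~b << 24;

	/* __builtin_clz(0) is undefined, and b == 0xFF inverts to 0. */
	if (inverted == 0) {
		return 8;
	}
	return (unsigned int)__builtin_clz(inverted);
}
```

Note that results of 0 and 1 still need separate handling: 0 is a one-byte ASCII character and 1 is a continuation byte that cannot start a sequence, so a plain `MIN(4, ...)` would not be enough on its own.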

Contributor Author

@rruuaanng rruuaanng Nov 16, 2024

The question is very high-level. The reason why I don't use the mask is that I hope it can be more intuitive in implementation, just like judging whether a character is ascii.

Like this

c <= 0x7f && c >= 0

Contributor

I dunno, we're an RTOS here. I think "intuitive" needs necessarily to take a back seat to code size where there is a significant delta (the expression above[1] should be 4-5 instructions long on most architectures).

[1] If it actually works as written, which I haven't validated!

Member

I like what Andy is proposing here. If it feels unintuitive, we could always add a comment describing what the following line is doing.

Contributor Author

I will revise it soon, thanks for your review!

Add 'utf8_is_valid' function to check if a given string is utf8 encoded.

Signed-off-by: James Roy <[email protected]>
@github-actions

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Jan 18, 2025
@github-actions github-actions bot closed this Feb 1, 2025