Fix issue with partial UTF-8 string #6317
@larryliu0820 do we want to move it to runner?
@kirklandsign has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
bool utf8_check_validity(const char* str, size_t length) {
  for (size_t i = 0; i < length; ++i) {
    uint8_t byte = static_cast<uint8_t>(str[i]);
    if (byte >= 0x80) { // Non-ASCII byte
      if (i + 1 >= length) { // Incomplete sequence
        return false;
      }
      uint8_t next_byte = static_cast<uint8_t>(str[i + 1]);
      // The loop's ++i already consumes the lead byte, so each branch below
      // skips only the continuation bytes. Skipping the full sequence length
      // here would step over the byte that follows the sequence.
      if ((byte & 0xE0) == 0xC0 &&
          (next_byte & 0xC0) == 0x80) { // 2-byte sequence
        i += 1;
      } else if (
          (byte & 0xF0) == 0xE0 && (next_byte & 0xC0) == 0x80 &&
          (i + 2 < length) &&
          (static_cast<uint8_t>(str[i + 2]) & 0xC0) ==
              0x80) { // 3-byte sequence
        i += 2;
      } else if (
          (byte & 0xF8) == 0xF0 && (next_byte & 0xC0) == 0x80 &&
          (i + 2 < length) &&
          (static_cast<uint8_t>(str[i + 2]) & 0xC0) == 0x80 &&
          (i + 3 < length) &&
          (static_cast<uint8_t>(str[i + 3]) & 0xC0) ==
              0x80) { // 4-byte sequence
        i += 3;
      } else {
        return false; // Invalid sequence
      }
    }
  }
  return true; // All bytes were valid
}
Should this util be used by runner as well?
I was wondering whether we want to bring this to the runner. It only causes a problem for Java, though. The C++ runner and iOS seem to treat the string as uint8_t bytes and can print a partial string, so they don't hit this issue.
@kirklandsign merged this pull request in 6b2a082.
Passing a string that is not valid UTF-8 across JNI causes a JNI exception.
Alternative 1 (this): wait until we have complete UTF-8 tokens.
Alternative 2 (?): Fix this in the runner layer.
Alternative 3 (no): Change the API to use a uint8_t array. If we want to display text in the app in real time, this is still an issue.
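As a rough illustration of Alternative 1, the sketch below buffers token pieces and only releases text once the accumulated bytes form complete UTF-8 sequences. The names `is_complete_utf8` and `Utf8TokenBuffer` are hypothetical and do not come from this PR. The check is structural only (lead byte plus the right number of continuation bytes), like the validity helper in the diff, so this is a sketch under those assumptions rather than the merged implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the PR): returns true when every sequence in
// `s` has a valid lead byte followed by the expected continuation bytes.
// It does not reject overlong encodings or surrogate code points.
bool is_complete_utf8(const std::string& s) {
  size_t i = 0;
  while (i < s.size()) {
    uint8_t b = static_cast<uint8_t>(s[i]);
    size_t len = 0;
    if (b < 0x80) {
      len = 1; // ASCII
    } else if ((b & 0xE0) == 0xC0) {
      len = 2;
    } else if ((b & 0xF0) == 0xE0) {
      len = 3;
    } else if ((b & 0xF8) == 0xF0) {
      len = 4;
    } else {
      return false; // Stray continuation byte or invalid lead byte
    }
    if (i + len > s.size()) {
      return false; // Sequence is cut off at the end of the buffer
    }
    for (size_t k = 1; k < len; ++k) {
      if ((static_cast<uint8_t>(s[i + k]) & 0xC0) != 0x80) {
        return false; // Expected a continuation byte
      }
    }
    i += len;
  }
  return true;
}

// Hypothetical buffer (not from the PR): accumulates token pieces and hands
// back text only when the pending bytes are complete UTF-8.
class Utf8TokenBuffer {
 public:
  // Appends `piece`. Returns the buffered text if it is now complete UTF-8,
  // otherwise returns an empty string and keeps the bytes for the next call.
  std::string push(const std::string& piece) {
    pending_ += piece;
    if (!is_complete_utf8(pending_)) {
      return std::string();
    }
    std::string out;
    out.swap(pending_);
    return out;
  }

 private:
  std::string pending_;
};
```

A caller on the JNI side would pass each generated piece through `push` and forward only non-empty results to Java. One trade-off of this design: if the model ever emits bytes that are invalid rather than merely incomplete, the buffer never flushes, so a real implementation would likely need a size cap or a fallback.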