@kirklandsign
Contributor

@kirklandsign kirklandsign commented Oct 17, 2024

Passing a string that is not valid UTF-8 (for example, a partial multi-byte sequence) across JNI causes a JNI exception.

Alternative 1 (this PR): wait until the accumulated tokens form a complete UTF-8 string before passing it on.
Alternative 2 (?): fix this in the runner layer.
Alternative 3 (no): change the API to use a uint8_t array. If we want to display the text in the app in real time, though, this is still an issue.
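
For illustration, Alternative 1 could be sketched roughly as below. This is a minimal sketch, not the code in this PR: `is_complete_utf8` and `Utf8TokenBuffer` are hypothetical names, and the check is structural only (it does not reject overlong encodings or surrogate code points).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper: true if `s` contains only complete UTF-8 sequences
// (a lead byte followed by the expected number of continuation bytes).
bool is_complete_utf8(const std::string& s) {
  size_t i = 0;
  while (i < s.size()) {
    const uint8_t lead = static_cast<uint8_t>(s[i]);
    size_t len = 0;
    if (lead < 0x80) {
      len = 1; // ASCII
    } else if ((lead & 0xE0) == 0xC0) {
      len = 2;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4;
    } else {
      return false; // stray continuation byte or invalid lead byte
    }
    if (i + len > s.size()) {
      return false; // sequence is truncated
    }
    for (size_t k = 1; k < len; ++k) {
      if ((static_cast<uint8_t>(s[i + k]) & 0xC0) != 0x80) {
        return false; // expected a continuation byte
      }
    }
    i += len;
  }
  return true;
}

// Hypothetical buffer: accumulates token pieces and releases them only once
// the accumulated bytes form complete UTF-8.
class Utf8TokenBuffer {
 public:
  std::optional<std::string> push(const std::string& piece) {
    pending_ += piece;
    if (!is_complete_utf8(pending_)) {
      return std::nullopt; // keep waiting for the remaining bytes
    }
    std::string out;
    out.swap(pending_); // hand over the text and clear the buffer
    return out;
  }

 private:
  std::string pending_;
};

// Usage: a 3-byte character (U+4F60) arriving split across two tokens.
void example_usage() {
  Utf8TokenBuffer buf;
  assert(!buf.push("\xE4\xBD").has_value()); // first two bytes: held back
  std::optional<std::string> out = buf.push("\xA0"); // last byte arrives
  assert(out.has_value() && *out == "\xE4\xBD\xA0");
}
```

One caveat with this sketch: if the model ever emits bytes that are genuinely invalid rather than merely incomplete, the buffer never flushes, so a real implementation would probably need a size cap or a fallback.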

@pytorch-bot

pytorch-bot bot commented Oct 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6317

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f7f6eaf with merge base 2c43190:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Oct 17, 2024
@kirklandsign changed the title from "Debug UTF-16 issue" to "Fix issue with partial UTF-8 string" Oct 18, 2024
@kirklandsign
Contributor Author

@larryliu0820 do we want to move it to the runner?

@facebook-github-bot
Contributor

@kirklandsign has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines +35 to +66
bool utf8_check_validity(const char* str, size_t length) {
  for (size_t i = 0; i < length; ++i) {
    uint8_t byte = static_cast<uint8_t>(str[i]);
    if (byte >= 0x80) { // Non-ASCII byte
      if (i + 1 >= length) { // Incomplete sequence
        return false;
      }
      uint8_t next_byte = static_cast<uint8_t>(str[i + 1]);
      // Advance past the continuation bytes only; the loop's ++i steps past
      // the lead byte. Adding the full sequence length here would skip one
      // byte too many and reject valid input such as two adjacent 2-byte
      // characters.
      if ((byte & 0xE0) == 0xC0 &&
          (next_byte & 0xC0) == 0x80) { // 2-byte sequence
        i += 1;
      } else if (
          (byte & 0xF0) == 0xE0 && (next_byte & 0xC0) == 0x80 &&
          (i + 2 < length) &&
          (static_cast<uint8_t>(str[i + 2]) & 0xC0) ==
              0x80) { // 3-byte sequence
        i += 2;
      } else if (
          (byte & 0xF8) == 0xF0 && (next_byte & 0xC0) == 0x80 &&
          (i + 2 < length) &&
          (static_cast<uint8_t>(str[i + 2]) & 0xC0) == 0x80 &&
          (i + 3 < length) &&
          (static_cast<uint8_t>(str[i + 3]) & 0xC0) ==
              0x80) { // 4-byte sequence
        i += 3;
      } else {
        return false; // Invalid sequence
      }
    }
  }
  return true; // All bytes were valid
}
Contributor

Should this util be used by the runner as well?

Contributor Author

I was wondering whether we want to bring this to the runner. It only causes a problem for Java, though. The C++ runner and iOS seem to treat the string as uint8_t bytes and can print a partial string, so they don't hit this issue.
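
To illustrate why a byte-oriented consumer is unaffected, here is a small sketch. It assumes a generic `std::ostream` as a stand-in for wherever the C++ runner writes its output, and `stream_pieces` is a hypothetical name. The pieces are written as raw bytes with no decoding step, so a character split across two tokens is complete again once both pieces have been written.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: writes each token piece to `sink` as raw bytes,
// without validating or decoding them.
void stream_pieces(const std::vector<std::string>& pieces, std::ostream& sink) {
  for (const std::string& piece : pieces) {
    sink.write(piece.data(), static_cast<std::streamsize>(piece.size()));
  }
}

// Usage: a 3-byte character (U+4F60) split across two pieces.
void example_usage() {
  const std::vector<std::string> pieces = {
      std::string("\xE4\xBD"), std::string("\xA0")};
  std::ostringstream sink; // stand-in for stdout or a similar byte sink
  stream_pieces(pieces, sink);
  // Neither piece is valid UTF-8 on its own, but the byte stream is.
  assert(sink.str() == "\xE4\xBD\xA0");
}
```

The JNI path differs because each piece has to be converted to a Java string on its own, which is where an incomplete sequence becomes a problem.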

@facebook-github-bot
Contributor

@kirklandsign merged this pull request in 6b2a082.

@kirklandsign kirklandsign deleted the debug-u16 branch October 18, 2024 17:42