-
Couldn't load subscription status.
- Fork 46
feat: Throw exception when non-utf8 characters are in a data.frame #16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️ |
b891555 to
253e2d7
Compare
253e2d7 to
0a48fd4
Compare
|
Thanks. Are there any other locations where we need to care about this? Thinking mainly about parameter binding via |
0a48fd4 to
2286bda
Compare
|
Coming back to this. What are the costs of this preemptive check? I assume that the code must read the entire string? What are the consequences of avoiding this check? Example for constructing a broken string: bork <- iconv("börk", to = "latin1")
Encoding(bork) <- "UTF-8"
bork
#> [1] "b\xf6rk"Created on 2024-03-11 with reprex v2.1.0 |
|
There's a faster way: x <- "Est\xe2ncia"
Encoding(x)
#> [1] "unknown"
x <- "Estância"
Encoding(x)
#> [1] "UTF-8"
x <- enc2utf8("Estância")
Encoding(x)
#> [1] "UTF-8"
Encoding(iconv("Est\xe2ncia", from = "latin1", to = "UTF-8"))
#> [1] "UTF-8"Created on 2024-03-21 with reprex v2.1.0 This shows different results when ran line by line in the RStudio IDE: the second example also yields "unknown". I propose to do the fast check first, and only use utf8proc if that returns "unknown". |
|
I agree, I think the fast check first is better. Otherwise we have to copy the whole string. Will work on this now |
|
I actually wonder now if the issue is in the function > enc2utf8("Est\xe2ncia")
# [1] "Est\xe2ncia"
> iconv(enc2utf8("Est\xe2ncia"), from="latin1", to="UTF-8")
# [1] "Estância"
> iconv(enc2utf8("Est\xe2ncia"), from="UTF-8", to="UTF-8")
# [1] NA
> iconv(enc2utf8("hello"), from="UTF-8", to="UTF-8")
# [1] "hello"
> iconv(enc2utf8("hello"), from="latin1", to="UTF-8")
# [1] "hello"Now I just check that Encoding(x) is valid for every value of a varcher column. This seems like a lot of overhead though, so open to other ideas. EDIT: changes for readability |
The duckdb engine assumes all strings are valid utf-8. In the R-client, we forgot to check if strings were in fact utf8. Here we check them when the scanning the df when we register it.
Closes #12.