feat: Throw exception when non-utf8 characters are in a data.frame#16
feat: Throw exception when non-utf8 characters are in a data.frame#16Tmonster wants to merge 24 commits intoduckdb:mainfrom
Conversation
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️ |
b891555 to
253e2d7
Compare
253e2d7 to
0a48fd4
Compare
|
Thanks. Are there any other locations where we need to care about this? Thinking mainly about parameter binding via |
0a48fd4 to
2286bda
Compare
|
Coming back to this. What are the costs of this preemptive check? I assume that the code must read the entire string? What are the consequences of avoiding this check? Example for constructing a broken string: bork <- iconv("börk", to = "latin1")
Encoding(bork) <- "UTF-8"
bork
#> [1] "b\xf6rk"Created on 2024-03-11 with reprex v2.1.0 |
|
There's a faster way: x <- "Est\xe2ncia"
Encoding(x)
#> [1] "unknown"
x <- "Estância"
Encoding(x)
#> [1] "UTF-8"
x <- enc2utf8("Estância")
Encoding(x)
#> [1] "UTF-8"
Encoding(iconv("Est\xe2ncia", from = "latin1", to = "UTF-8"))
#> [1] "UTF-8"Created on 2024-03-21 with reprex v2.1.0 This shows different results when ran line by line in the RStudio IDE: the second example also yields "unknown". I propose to do the fast check first, and only use utf8proc if that returns "unknown". |
|
I agree, I think the fast check first is better. Otherwise we have to copy the whole string. Will work on this now |
|
I actually wonder now if the issue is in the function > enc2utf8("Est\xe2ncia")
# [1] "Est\xe2ncia"
> iconv(enc2utf8("Est\xe2ncia"), from="latin1", to="UTF-8")
# [1] "Estância"
> iconv(enc2utf8("Est\xe2ncia"), from="UTF-8", to="UTF-8")
# [1] NA
> iconv(enc2utf8("hello"), from="UTF-8", to="UTF-8")
# [1] "hello"
> iconv(enc2utf8("hello"), from="latin1", to="UTF-8")
# [1] "hello"Now I just check that Encoding(x) is valid for every value of a varcher column. This seems like a lot of overhead though, so open to other ideas. EDIT: changes for readability |
The duckdb engine assumes all strings are valid utf-8. In the R-client, we forgot to check if strings were in fact utf8. Here we check them when the scanning the df when we register it.
Closes #12.