-
Notifications
You must be signed in to change notification settings - Fork 1k
unique and duplicated warn on other encodings than UTF8 #7379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7379 +/- ##
=======================================
Coverage 99.12% 99.12%
=======================================
Files 85 85
Lines 16637 16640 +3
=======================================
+ Hits 16492 16495 +3
Misses 145 145 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Let's see revdeps. If none affected then I would keep it like this. |
|
This may bite fread() users who rely on the default value of the
encoding= argument:
``` r
fwrite(list(x = c('a møøse', 'bit my sister once')), 'foo.txt')
fread('foo.txt', sep='\t') |> setkey(x) |> unique() |> _$x |> str()
# chr [1:2] "a møøse" "bit my sister once"
# Warning message:
# In unique.data.table(setkey(fread("foo.txt", sep = "\t"), x)) :
# Mixed encodings detected. Strings were coerced to UTF-8 before
# unique(x).
```
A similar problem may happen to users of old versions of R or old
versions of Windows where non-ASCII string literals would result in
CE_NATIVE strings instead of CE_UTF8.
Should the warning recommend enc2utf8()?
Could someone please share the original context from R-Forge #5758?
|
Basically, it links to https://stackoverflow.com/questions/24085906/unique-data-table-do-not-handle-keys-properly |
|
Late warning about unexpected consequences is better than no warning, so still despite that I see value in it. |
|
Is lower performance meant to be the unexpected consequence? Otherwise
we seem to do the same thing as what R does in identical().
Given how most systems already speak UTF-8 as the native encoding,
changing the default fread(encoding=) argument to "UTF-8" should be
mostly harmless. Only people who read invalid UTF-8 or use a non-UTF-8
system will notice a difference.
|
|
Good points! I've added I have also filed this now as breaking change. |
|
The warning is not caused by mixed encodings. The previous example
sorts a vector of CE_NATIVE strings, which is actually safe to perform
without conversion to UTF-8, but the warning still appears. It would be
more precise to say "Detected strings not marked as ASCII or UTF-8 and
converted them to UTF-8 for detection of duplicates. Please convert
your character columns using enc2utf8() to avoid the on-the-fly
conversion."
Could you please explain why is the result unexpected when calling
duplicated() or unique() without converting strings to UTF-8 first?
After all, identical(enc2utf8('ø'), iconv('ø', to='latin1')) is TRUE,
even if those are different CHARSXPs under the hood.
|
|
No obvious timing issues in HEAD=warn_encodings Generated via commit 2f49a9a Download link for the artifact containing the test results: ↓ atime-results.zip
|
|
I guess what Jan meant by unexptected consequences is that |
|
Simply longer is not a problem, I thought that something that was giving T for == can now start to return F |

Closes #469
Not exactly what Arun suggested but seems like the best option since we encode to UTF8 in
forderv. Is a warning too much here?