Commit 4b8028d

add warning for encodings other than utf8 in unique and duplicated
1 parent 67129f0 commit 4b8028d

3 files changed: +13 -0 lines changed


NEWS.md

Lines changed: 2 additions & 0 deletions
@@ -357,6 +357,8 @@
 
 7. In rare situations a data.table object may lose its internal attribute that holds a self-reference. New helper function `.selfref.ok()` tests just that. It is only intended for technical use cases. See manual for examples.
 
+8. `unique()` and `duplicated()` warn now if columns with encodings other than UTF-8 are present, since these are converted to UTF-8 for comparison, which may lead to unexpected results, [#469](https://github.com/Rdatatable/data.table/issues/469). Thanks to @arunsrinivasan for the request and @ben-schwen for the implementation.
+
 ## data.table [v1.17.8](https://github.com/Rdatatable/data.table/milestone/41) (6 July 2025)
 
 1. Internal functions used to signal errors are now marked as non-returning, silencing a compiler warning about potentially unchecked allocation failure. Thanks to Prof. Brian D. Ripley for the report and @aitap for the fix, [#7070](https://github.com/Rdatatable/data.table/pull/7070).

R/duplicated.R

Lines changed: 5 additions & 0 deletions
@@ -13,6 +13,8 @@ duplicated.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_
     if (fromLast) f = cumsum(uniqlengths(f, nrow(x)))
   } else {
     o = forderv(x, by=query$by, sort=FALSE, retGrp=TRUE)
+    if (isTRUE(as.logical(attr(o, "anynotutf8", exact=TRUE))))
+      warningf("Mixed encodings detected. Strings were coerced to UTF-8 before duplicated(x).")
     if (attr(o, 'maxgrpn', exact=TRUE) == 1L) return(rep.int(FALSE, nrow(x)))
     f = attr(o, "starts", exact=TRUE)
     if (fromLast) f = cumsum(uniqlengths(f, nrow(x)))
@@ -31,6 +33,9 @@ unique.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_alon
   if (nrow(x) <= 1L) return(copy(x)) # unique(x)[, col := val] should not alter x, #5932
   if (!length(by)) by = NULL #4594
   o = forderv(x, by=by, sort=FALSE, retGrp=TRUE)
+  if (isTRUE(as.logical(attr(o, "anynotutf8", exact=TRUE)))) {
+    warningf("Mixed encodings detected. Strings were coerced to UTF-8 before unique(x).")
+  }
   if (!is.null(cols)) {
     x = .shallow(x, c(by, cols), retain.key=TRUE)
   }
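
Note on the guard used in both hunks: the following is a minimal base-R sketch of why `isTRUE(as.logical(attr(...)))` is a safe check even when the attribute is absent. The objects `o_flagged`, `o_clean`, `o_missing` and the helper `needs_warning()` are hypothetical stand-ins built by hand; only the attribute name `anynotutf8` comes from the diff, and the type that `forderv()` actually stores there is an assumption.

```r
# Hand-built stand-ins for the object returned by forderv(); hypothetical.
# Only the attribute name "anynotutf8" is taken from the diff above.
o_flagged <- structure(integer(0), anynotutf8 = 1L)
o_clean   <- structure(integer(0), anynotutf8 = 0L)
o_missing <- integer(0)  # attribute not set at all

# Hypothetical helper wrapping the same expression as the commit.
needs_warning <- function(o) {
  # attr() returns NULL when the attribute is absent; as.logical(NULL) is
  # logical(0), and isTRUE(logical(0)) is FALSE, so a missing attribute
  # neither errors nor triggers the warning.
  isTRUE(as.logical(attr(o, "anynotutf8", exact = TRUE)))
}

stopifnot(
  needs_warning(o_flagged),
  !needs_warning(o_clean),
  !needs_warning(o_missing)
)
```

A bare `if (attr(o, "anynotutf8"))` would fail with "argument is of length zero" in the missing-attribute case, which is presumably why the more defensive form was chosen.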

inst/tests/tests.Rraw

Lines changed: 6 additions & 0 deletions
@@ -21688,3 +21688,9 @@ d3 = unserialize(serialize(d2, NULL))
 test(2340.05, .selfref.ok(d3), FALSE)
 setDT(d3)
 test(2340.06, .selfref.ok(d3), TRUE)
+
+# warn about different encodings in unique and duplicated, #469
+dt = data.table(x=c(iconv("\u00E9","UTF-8","latin1"), "\u00E9"))
+test(2341.1, unique(dt), data.table(x="\u00E9"), warning="Mixed encodings.*")
+test(2341.2, duplicated(dt), c(FALSE, TRUE), warning="Mixed encodings.*")
+test(2341.3, unique(dt[c(2L,2L)]), data.table(x="\u00E9"))
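
Note on the test fixture: the sketch below uses base R only (no data.table) to show what the two strings in `dt$x` look like underneath, and one way a caller might normalise encodings before calling `unique()` or `duplicated()`. The variable names `x` and `x_utf8` are illustrative. Whether normalising up front avoids the new warning is an assumption, although test 2341.3 suggests that all-UTF-8 input does not warn.

```r
# Same construction as the fixture: the character "é" once marked latin1
# and once marked UTF-8.
x <- c(iconv("\u00E9", "UTF-8", "latin1"), "\u00E9")

Encoding(x)           # declared encoding of each element
lapply(x, charToRaw)  # the underlying bytes differ between the two elements

# Normalising explicitly makes the conversion a deliberate step rather
# than something that happens implicitly during comparison.
x_utf8 <- enc2utf8(x)

stopifnot(
  all(Encoding(x_utf8) == "UTF-8"),
  identical(charToRaw(x_utf8[1]), charToRaw(x_utf8[2]))
)
```

With a data.table, the analogous step would be something like `dt[, x := enc2utf8(x)]` before deduplicating, assuming the column is a character vector.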
