2 changes: 2 additions & 0 deletions NEWS.md
@@ -357,6 +357,8 @@

7. In rare situations a data.table object may lose its internal attribute that holds a self-reference. New helper function `.selfref.ok()` tests just that. It is only intended for technical use cases. See manual for examples.

8. `unique()` and `duplicated()` now warn when columns contain strings in encodings other than UTF-8, because such strings are converted to UTF-8 for comparison, which may lead to unexpected results, [#469](https://github.com/Rdatatable/data.table/issues/469). Thanks to @arunsrinivasan for the request and @ben-schwen for the implementation.

## data.table [v1.17.8](https://github.com/Rdatatable/data.table/milestone/41) (6 July 2025)

1. Internal functions used to signal errors are now marked as non-returning, silencing a compiler warning about potentially unchecked allocation failure. Thanks to Prof. Brian D. Ripley for the report and @aitap for the fix, [#7070](https://github.com/Rdatatable/data.table/pull/7070).
5 changes: 5 additions & 0 deletions R/duplicated.R
@@ -13,6 +13,8 @@ duplicated.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_
    if (fromLast) f = cumsum(uniqlengths(f, nrow(x)))
  } else {
    o = forderv(x, by=query$by, sort=FALSE, retGrp=TRUE)
    if (isTRUE(as.logical(attr(o, "anynotutf8", exact=TRUE))))
      warningf("Mixed encodings detected. Strings were coerced to UTF-8 before duplicated(x).")
    if (attr(o, 'maxgrpn', exact=TRUE) == 1L) return(rep.int(FALSE, nrow(x)))
    f = attr(o, "starts", exact=TRUE)
    if (fromLast) f = cumsum(uniqlengths(f, nrow(x)))
@@ -31,6 +33,9 @@ unique.data.table = function(x, incomparables=FALSE, fromLast=FALSE, by=seq_alon
  if (nrow(x) <= 1L) return(copy(x)) # unique(x)[, col := val] should not alter x, #5932
  if (!length(by)) by = NULL #4594
  o = forderv(x, by=by, sort=FALSE, retGrp=TRUE)
  if (isTRUE(as.logical(attr(o, "anynotutf8", exact=TRUE)))) {
    warningf("Mixed encodings detected. Strings were coerced to UTF-8 before unique(x).")
  }
  if (!is.null(cols)) {
    x = .shallow(x, c(by, cols), retain.key=TRUE)
  }
6 changes: 6 additions & 0 deletions inst/tests/tests.Rraw
@@ -21688,3 +21688,9 @@ d3 = unserialize(serialize(d2, NULL))
test(2340.05, .selfref.ok(d3), FALSE)
setDT(d3)
test(2340.06, .selfref.ok(d3), TRUE)

# warn about different encodings in unique and duplicated, #469
dt = data.table(x=c(iconv("\u00E9","UTF-8","latin1"), "\u00E9"))
test(2341.1, unique(dt), data.table(x="\u00E9"), warning="Mixed encodings.*")
test(2341.2, duplicated(dt), c(FALSE, TRUE), warning="Mixed encodings.*")
test(2341.3, unique(dt[c(2L,2L)]), data.table(x="\u00E9"))
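
For readers of this change: tests 2341.1 to 2341.3 suggest the warning appears only when a non-UTF-8 string is present, so one possible way to avoid it is to normalise encodings before deduplicating. The sketch below uses base R only (`iconv`, `Encoding`, `enc2utf8`); the variable names are illustrative and not part of this PR, and whether the patched `unique()` stays silent on the normalised vector is an assumption drawn from test 2341.3, not something the snippet itself verifies.

```r
# Illustrative only: the same character stored under two declared encodings.
x_latin1 = iconv("\u00E9", "UTF-8", "latin1")  # marked "latin1"
x_utf8   = "\u00E9"                            # marked "UTF-8"
x = c(x_latin1, x_utf8)

# The per-element encoding marks differ ...
print(Encoding(x))

# ... yet base R translates before comparing, so the elements compare equal.
print(x[1L] == x[2L])

# enc2utf8() (base R) re-encodes every element as UTF-8, so a column built
# from x_norm no longer carries a non-UTF-8 string.
x_norm = enc2utf8(x)
print(Encoding(x_norm))

# Assumption: with this patch applied, unique(data.table(x = x_norm)) would be
# expected not to warn, mirroring test 2341.3.
```

Normalising once at ingestion keeps the conversion explicit in user code, rather than relying on the implicit UTF-8 coercion that the new warning reports.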