Skip to content

Commit 6f49bf1

Browse files
memrecycle(): match factor levels in UTF-8 (#6890)
* memrecycle(): match factor levels in UTF-8 Previously, by-reference sub-assignment to a factor column could fail to match strings with identical content if they had different encoding bits (even CE_NATIVE UTF-8 vs. CE_UTF8), causing duplicate levels. Fixes: #6886 * Fix tests - Correct encoding of source string in non-UTF-8 locale - Use nlevels(.) instead of length(levels(.)) * NEWS entry * Expand test Test assignment from a string as well as from a different factor. Co-authored-by: Michael Chirico <[email protected]>
1 parent 476de7e commit 6f49bf1

File tree

3 files changed

+13
-0
lines changed

3 files changed

+13
-0
lines changed

NEWS.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,8 @@
1818

1919
5. `as.data.table()` on `x` avoids an infinite loop if the output of the corresponding `as.data.frame()` method has the same class as the input, [#6874](https://github.com/Rdatatable/data.table/issues/6874). Concretely, we had `class(x) = c('foo', 'data.frame')` and `class(as.data.frame(x)) = c('foo', 'data.frame')`, so `as.data.frame.foo` wound up getting called repeatedly. Thanks @matschmitz for the report and @ben-schwen for the fix.
2020

21+
6. By-reference sub-assignments to factor columns now match the levels in UTF-8, preventing their duplication when the same level exists in different encodings, [#6886](https://github.com/Rdatatable/data.table/issues/6886). Thanks @iagogv3 for the report and @aitap for the fix.
22+
2123
## NOTES
2224

2325
1. Continued work to remove non-API C functions, [#6180](https://github.com/Rdatatable/data.table/issues/6180). Thanks Ivan Krylov for the PRs and for writing a clear and concise guide about the R API: https://aitap.codeberg.page/R-api/.

inst/tests/tests.Rraw

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21112,3 +21112,12 @@ test(2309.09, as.data.table(df, keep.rownames=TRUE), data.table(rn = c("a","b"),
2111221112
as.data.frame.no.reset = function(x) x
2111321113
DF = structure(list(a = 1:2), class = c("data.frame", "no.reset"), row.names = c(NA, -2L))
2111421114
test(2310.01, as.data.table(DF), data.table(a=1:2))
21115+
21116+
# memrecycle() did not consider string encodings for factor levels #6886
21117+
DT = data.table(factor(rep("\uf8", 3)))
21118+
# identical() to V1's only level but stored in a different CHARSXP
21119+
samelevel = iconv(levels(DT$V1), from = "UTF-8", to = "latin1")
21120+
DT[1, V1 := samelevel]
21121+
test(2311.1, nlevels(DT$V1), 1L) # used to be 2
21122+
DT[1, V1 := factor("a", levels = c("a", samelevel))]
21123+
test(2311.2, nlevels(DT$V1), 2L) # used to be 3

src/assign.c

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -823,8 +823,10 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
823823
SEXP targetLevels = PROTECT(getAttrib(target, R_LevelsSymbol)); protecti++;
824824
SEXP sourceLevels = source; // character source
825825
if (sourceIsFactor) { sourceLevels=PROTECT(getAttrib(source, R_LevelsSymbol)); protecti++; }
826+
sourceLevels = PROTECT(coerceUtf8IfNeeded(sourceLevels)); protecti++;
826827
if (!sourceIsFactor || !R_compute_identical(sourceLevels, targetLevels, 0)) { // !sourceIsFactor for test 2115.6
827828
const int nTargetLevels=length(targetLevels), nSourceLevels=length(sourceLevels);
829+
targetLevels = PROTECT(coerceUtf8IfNeeded(targetLevels)); protecti++;
828830
const SEXP *targetLevelsD=STRING_PTR_RO(targetLevels), *sourceLevelsD=STRING_PTR_RO(sourceLevels);
829831
SEXP newSource = PROTECT(allocVector(INTSXP, length(source))); protecti++;
830832
savetl_init();

0 commit comments

Comments
 (0)