Skip to content

Commit d76662c

Browse files
aitapben-schwen
andauthored
fwrite: pre-encode strings and factor levels (#6889)
* fwrite: pre-encode strings and factor levels Previously, fwrite() deferred encoding of the strings to fwriteR.c:getString,getCategString called from OpenMP threads. Calling translateChar[UTF8] to encode a string results in memory allocation unless it is already in the desired encoding, which is unsafe to perform on a non-main thread. Fixes: #6883 * tests: fix iconv() source encoding Since iconv() ignores the encoding bits, we must provide the correct from=... argument. The default from="" would only work with a UTF-8 locale. Instead, assume that "\uXX" strings are UTF-8-encoded. * Fix test Use test data that is more likely to reproduce the crash. Fix the number of threads, too. Co-Authored-By: Benjamin Schwendinger <[email protected]> * NEWS entry --------- Co-authored-by: Benjamin Schwendinger <[email protected]>
1 parent 6f49bf1 commit d76662c

File tree

3 files changed

+16
-0
lines changed

3 files changed

+16
-0
lines changed

NEWS.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,9 @@
2020

2121
6. By-reference sub-assignments to factor columns now match the levels in UTF-8, preventing their duplication when the same level exists in different encodings, [#6886](https://github.com/Rdatatable/data.table/issues/6886). Thanks @iagogv3 for the report and @aitap for the fix.
2222

23+
7. `fwrite()` now avoids a crash when translating strings into a different encoding, [#6883](https://github.com/Rdatatable/data.table/issues/6883). Thanks @filipemsc for the report and @aitap for the fix.
24+
25+
2326
## NOTES
2427

2528
1. Continued work to remove non-API C functions, [#6180](https://github.com/Rdatatable/data.table/issues/6180). Thanks Ivan Krylov for the PRs and for writing a clear and concise guide about the R API: https://aitap.codeberg.page/R-api/.

R/fwrite.R

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -111,6 +111,15 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
111111
}
112112
# nocov end
113113
file = enc2native(file) # CfwriteR cannot handle UTF-8 if that is not the native encoding, see #3078.
114+
# pre-encode any strings or factor levels to avoid translateChar trying to allocate from OpenMP threads
115+
if (encoding %chin% c("UTF-8", "native")) {
116+
enc = switch(encoding, "UTF-8" = enc2utf8, "native" = enc2native)
117+
x = lapply(x, function(x) {
118+
if (is.character(x)) x = enc(x)
119+
if (is.factor(x)) levels(x) = enc(levels(x))
120+
x
121+
})
122+
}
114123
.Call(CfwriteR, x, file, sep, sep2, eol, na, dec, quote, qmethod=="escape", append,
115124
row.names, col.names, logical01, scipen, dateTimeAs, buffMB, nThread,
116125
showProgress, is_gzip, compressLevel, bom, yaml, verbose, encoding)

inst/tests/tests.Rraw

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21121,3 +21121,7 @@ DT[1, V1 := samelevel]
2112121121
test(2311.1, nlevels(DT$V1), 1L) # used to be 2
2112221122
DT[1, V1 := factor("a", levels = c("a", samelevel))]
2112321123
test(2311.2, nlevels(DT$V1), 2L) # used to be 3
21124+
21125+
# avoid translateChar*() in OpenMP threads, #6883
21126+
DF = list(rep(iconv("\uf8", from = "UTF-8", to = "latin1"), 2e6))
21127+
test(2312, fwrite(DF, nullfile(), encoding = "UTF-8", nThread = 2L), NULL)

0 commit comments

Comments
 (0)