fwrite: pre-encode strings and factor levels (#6889)

aitap · ben-schwen · web-flow · commit d76662c5a572 · 2025-04-02T08:11:57.000+02:00
* fwrite: pre-encode strings and factor levels

  Previously, fwrite() deferred encoding of the strings to fwriteR.c:getString,getCategString called from OpenMP threads. Calling translateChar[UTF8] to encode a string results in memory allocation unless it is already in the desired encoding, which is unsafe to perform on a non-main thread.

  Fixes: #6883

* tests: fix iconv() source encoding

  Since iconv() ignores the encoding bits, we must provide the correct from=... argument. The default from="" would only work with a UTF-8 locale. Instead, assume that "\uXX" strings are UTF-8-encoded.

* Fix test

  Use test data that is more likely to reproduce the crash. Fix the number of threads, too.

  Co-Authored-By: Benjamin Schwendinger <benjaminschwe@gmail.com>

* NEWS entry

---------

Co-authored-by: Benjamin Schwendinger <benjaminschwe@gmail.com>
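
As an illustration of the pre-encoding idea described above, here is a minimal base-R sketch. The helper name `pre_encode` and the sample data are hypothetical and not part of the patch. The sketch only shows that translating on the main thread with `enc2utf8()` leaves both character vectors and factor levels marked as UTF-8, so a later consumer should have nothing left to translate.

```r
# Hypothetical helper (not part of data.table): translate character columns
# and factor levels up front, on the main thread.
pre_encode <- function(cols, enc = enc2utf8) {
  lapply(cols, function(col) {
    if (is.character(col)) col <- enc(col)
    if (is.factor(col)) levels(col) <- enc(levels(col))
    col
  })
}

# Build a Latin-1 string; from = "UTF-8" is given explicitly because
# iconv() does not consult the encoding mark of its input.
latin1 <- iconv("\uf8", from = "UTF-8", to = "latin1")

cols <- list(chr = rep(latin1, 3), fct = factor(rep(latin1, 3)))
out <- pre_encode(cols)

# Inspect the declared encodings after pre-encoding.
print(Encoding(out$chr))
print(Encoding(levels(out$fct)))
```

Passing `enc = enc2native` instead would correspond to the `encoding = "native"` case.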
diff --git a/NEWS.md b/NEWS.md
@@ -20,6 +20,9 @@
 
 6. By-reference sub-assignments to factor columns now match the levels in UTF-8, preventing their duplication when the same level exists in different encodings, [#6886](https://github.com/Rdatatable/data.table/issues/6886). Thanks @iagogv3 for the report and @aitap for the fix.
 
+7. `fwrite()` now avoids a crash when translating strings into a different encoding, [#6883](https://github.com/Rdatatable/data.table/issues/6883). Thanks @filipemsc for the report and @aitap for the fix.
+
+
 ## NOTES
 
 1. Continued work to remove non-API C functions, [#6180](https://github.com/Rdatatable/data.table/issues/6180). Thanks Ivan Krylov for the PRs and for writing a clear and concise guide about the R API: https://aitap.codeberg.page/R-api/.
diff --git a/R/fwrite.R b/R/fwrite.R
@@ -111,6 +111,15 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
   }
   # nocov end
   file = enc2native(file) # CfwriteR cannot handle UTF-8 if that is not the native encoding, see #3078.
+  # pre-encode any strings or factor levels to avoid translateChar trying to allocate from OpenMP threads
+  if (encoding %chin% c("UTF-8", "native")) {
+    enc = switch(encoding, "UTF-8" = enc2utf8, "native" = enc2native)
+    x = lapply(x, function(x) {
+      if (is.character(x)) x = enc(x)
+      if (is.factor(x)) levels(x) = enc(levels(x))
+      x
+    })
+  }
   .Call(CfwriteR, x, file, sep, sep2, eol, na, dec, quote, qmethod=="escape", append,
         row.names, col.names, logical01, scipen, dateTimeAs, buffMB, nThread,
         showProgress, is_gzip, compressLevel, bom, yaml, verbose, encoding)
diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
@@ -21121,3 +21121,7 @@ DT[1, V1 := samelevel]
 test(2311.1, nlevels(DT$V1), 1L) # used to be 2
 DT[1, V1 := factor("a", levels = c("a", samelevel))]
 test(2311.2, nlevels(DT$V1), 2L) # used to be 3
+
+# avoid translateChar*() in OpenMP threads, #6883
+DF = list(rep(iconv("\uf8", from = "UTF-8", to = "latin1"), 2e6))
+test(2312, fwrite(DF, nullfile(), encoding = "UTF-8", nThread = 2L), NULL)