Description
Hi,
I have a question about the internals of data.table that may also turn into a feature request. (Cross-posted on Stack Overflow.)
I have a large dataset with more than 1 billion observations, and I need to perform some string operations on it, which are slow.
My code is as simple as this:
DT[, var := some_function(var2)]
If I'm not mistaken, data.table uses multiple threads when it is called with by (maybe not always), and I'm trying to take advantage of that to parallelize this operation. To do so, I can create an interim grouping variable, such as
DT[, grouper := .I %/% 100]
and then run
DT[, var := some_function(var2), by = grouper]
I benchmarked this on a small sample of the data but, surprisingly, did not see a performance improvement. However, in another experiment, I found that a grouped operation did improve the speed, which left me confused.
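A comparison of this kind can be sketched roughly as follows. This is only an illustration: the toy data, the column name var_by, and the body of some_function are hypothetical stand-ins, not the real data or the real string operation.

```r
library(data.table)

# Hypothetical stand-in for the real (slow) string operation
some_function <- function(x) toupper(paste0(x, "_suffix"))

# Toy data: 1e5 rows instead of 1 billion (assumption for illustration)
set.seed(1)
DT <- data.table(var2 = sample(c("alpha", "beta", "gamma"), 1e5, replace = TRUE))

# Ungrouped: a single call over the whole column
t_plain <- system.time(DT[, var := some_function(var2)])

# Grouped: one call per block of roughly 100 rows
DT[, grouper := .I %/% 100]
t_grouped <- system.time(DT[, var_by := some_function(var2), by = grouper])

# Elapsed wall-clock seconds for each approach
print(c(plain = t_plain[["elapsed"]], grouped = t_grouped[["elapsed"]]))

# Both approaches should give the same values
stopifnot(identical(DT$var, DT$var_by))
```

The timings will depend on the machine and on what some_function actually does, so this only shows the shape of the comparison, not an expected result.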
So my questions are:
- Am I right that data.table uses multithreading when it's called with by?
- If so, is there a condition under which multithreading is enabled / disabled?
- Is there a way for the user to force data.table to use multithreading here?
FYI, I see that multithreading is enabled with half of my cores when I load data.table, so I guess there's no OpenMP issue.
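For reference, the thread setting can be inspected and changed along these lines. This is a minimal sketch using getDTthreads() and setDTthreads(); it only controls how many threads data.table is allowed to use, and says nothing about whether a given operation such as := with by actually runs in parallel.

```r
library(data.table)

# Print details about the OpenMP setup and the current thread count
getDTthreads(verbose = TRUE)

# Request all logical CPUs; the previous setting is returned invisibly
old <- setDTthreads(0)
n_all <- getDTthreads()
print(n_all)

# Restore the previous setting
setDTthreads(old)
```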