parallelize j during groupby #5200

@matthewgson

Hi,

I have a question about the internals of data.table that may also amount to a feature request (cross-posted on SO).
I have a large data set with more than 1 billion observations, and I need to perform some string operations on it, which is slow.

My code is as simple as this:

DT[, var := some_function(var2)]

If I'm not mistaken, data.table uses multithreading when it is called with by (though maybe not always), and I'm trying to use that to parallelize this operation. To do so, I can create an interim grouper variable, such as

DT[, grouper := .I %/% 100]  

and do

DT[, var := some_function(var2), by = grouper]
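
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two calls could be timed against each other. The data, the column contents, and the body of some_function are all assumptions made up for illustration (a trivial toupper stand-in, not the real string operation), so the timings say nothing about the 1-billion-row case.

```r
library(data.table)

# Hypothetical stand-in for the slow string operation (not from the original post).
some_function <- function(x) toupper(x)

# Small synthetic table; the real data is far larger.
set.seed(1)
n <- 1e5
DT <- data.table(var2 = sample(c("alpha", "beta", "gamma"), n, replace = TRUE))

# Ungrouped call, as in the first snippet.
t_plain <- system.time(DT[, var := some_function(var2)])

# Interim grouper: blocks of roughly 100 consecutive rows.
DT[, grouper := .I %/% 100]

# Grouped call, written to a second column so the results can be compared.
t_grouped <- system.time(DT[, var_by := some_function(var2), by = grouper])

# Elapsed wall-clock seconds for each variant.
print(c(plain = t_plain[["elapsed"]], grouped = t_grouped[["elapsed"]]))

# For an elementwise function both routes should give the same column.
stopifnot(identical(DT$var, DT$var_by))
```

Timing a single run like this is noisy; repeating each call several times, or varying the block size in the grouper, would give a fairer picture.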

I ran a benchmark on a small sample of the data and, surprisingly, saw no performance improvement. In yet another experiment, however, the grouped operation was faster, which left me confused.

So my questions are:

  • Am I right that data.table uses multithreading when it's called with by?
  • If so, under what conditions is multithreading enabled or disabled?
  • Is there a way for the user to force data.table to use multithreading here?

FYI, when I load data.table it reports that multithreading is enabled with half of my cores, so I assume there is no OpenMP issue.
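
The thread setting mentioned above can be inspected and changed with getDTthreads() and setDTthreads(). The sketch below only shows how many threads data.table is configured to use; it does not show whether a particular grouped call actually runs in parallel.

```r
library(data.table)

# Thread count data.table is currently configured to use.
n_threads <- getDTthreads()
print(n_threads)

# Verbose form also prints OpenMP-related details for diagnosis.
getDTthreads(verbose = TRUE)

# Request all logical CPUs (0 means "all"); the previous value is returned.
old <- setDTthreads(0)
print(getDTthreads())

# Restore the earlier setting.
setDTthreads(old)
```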
