parallelize j during groupby

Hi,

I have a question about the internals of `data.table` that may also turn into a feature request. (Cross-posted on SO.)
I have a large dataset with more than 1 billion observations, and I need to run some string operations on it, which is slow.

My code is as simple as this:

```r
DT[, var := some_function(var2)]
``` 

If I'm not mistaken, `data.table` uses multithreading when it is called with `by` (maybe not always), and I'm trying to take advantage of that to parallelize this operation. To do so, I can create an interim grouping variable, such as
```r
DT[, grouper := .I %/% 100]  
```
and do
```r
DT[, var := some_function(var2), by = grouper]
```

I benchmarked this on a small sample of the data and, surprisingly, saw no performance improvement. In another experiment, however, the grouped operation was faster, which left me confused.
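For reference, the kind of comparison I ran looks roughly like the sketch below. It is not my actual code: `some_function` here is a hypothetical stand-in (a cheap string transformation), the data is simulated, and the sample size is an arbitrary choice, so the timings are only illustrative.

```r
library(data.table)

# Hypothetical stand-in for the slow string operation; not my real function.
some_function <- function(x) toupper(paste0(x, "_suffix"))

set.seed(1)
n <- 1e5  # small simulated sample; the real data has over 1 billion rows
DT <- data.table(var2 = sample(letters, n, replace = TRUE))
DT[, grouper := .I %/% 100]  # chunks of roughly 100 rows each

# Ungrouped: a single call to some_function on the whole column
t_plain <- system.time(DT[, var := some_function(var2)])

# Grouped: one call to some_function per chunk
t_grouped <- system.time(DT[, var_by := some_function(var2), by = grouper])

# Elapsed wall-clock time in seconds for each approach
print(c(plain = t_plain[["elapsed"]], grouped = t_grouped[["elapsed"]]))

# Both approaches should produce identical results
stopifnot(identical(DT$var, DT$var_by))
```

Note that with a divisor of 100 the grouped version calls `some_function` once per 100 rows, so per-call overhead may well dominate for a function that is already vectorized.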

So my questions are:

- Am I right that `data.table` uses multithreading when it's called with `by`?
- If so, under what conditions is multithreading enabled or disabled?
- Is there a way for the user to force `data.table` to use multithreading here?


FYI, when I load `data.table`, the startup message says multithreading is enabled with half of my cores, so I assume there is no OpenMP issue.
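In case it helps with diagnosing, this is a minimal sketch of how the thread settings can be inspected and changed with `getDTthreads()` and `setDTthreads()`. The value `2L` is just an example; as far as I understand, these settings only affect the parts of `data.table` that are internally parallelized, not arbitrary R code in `j`.

```r
library(data.table)

# Report how many threads data.table will use, plus OpenMP details
getDTthreads(verbose = TRUE)

# Change the thread count; setDTthreads() invisibly returns the previous value
old <- setDTthreads(2L)  # 2L is an arbitrary example value
getDTthreads()

# Restore the previous setting
setDTthreads(old)
```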

parallelize j during groupby #5200
