Commit 213295c

address feedback on bench-vign improvements
1 parent 2556b05 commit 213295c

vignettes/datatable-benchmarking.Rmd

Lines changed: 39 additions & 16 deletions
@@ -4,16 +4,38 @@ date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
-    number_sections: false
vignette: >
  %\VignetteIndexEntry{Benchmarking data.table}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

-This document is meant to guide on measuring performance of `data.table`. Single place to document best practices and traps to avoid.
+***

-# fread: clear caches
+## General suggestions
+
+Let's assume you are measuring a particular process. It is blazingly fast: it takes only microseconds to evaluate.
+What does that mean, and how should you approach such a measurement?
+The smaller the time measurements are, the relatively bigger the call overhead is. Call overhead can be perceived as noise in a measurement, caused by method dispatch, package/class initialization, low-level object constructors, etc. As a result you may naturally want to repeat the timing many times and take the average to deal with the noise. This is a valid approach, but the magnitude of the timing is much more important. What will be the impact of an extra 5, or let's say 5000, microseconds if writing results to the target environment/format takes a minute? 1 second is 1 000 000 microseconds. Do microseconds, or even milliseconds, make any difference? There are cases where they do, for example when you call a function for every row; then you definitely should care about micro timings. The point is that in most users' benchmarks it won't make a difference. Most common R functions are vectorized, so you are not calling them for every row. If something is blazingly fast for your data and use case then perhaps you do not have to worry about performance and benchmarks. Unless you want to scale your process, in which case you should worry, because something that is blazingly fast today might not be that fast tomorrow, simply because your process will receive more data on input. Consequently you should confirm that your process will scale.
+There are multiple dimensions you should consider when examining how your process scales:
+- increasing the number of rows on input
+- cardinality of the data
+- skewness of the data: in most cases this should matter least
+- increasing the number of columns on input: this is mostly relevant when your input is a matrix; for data frames a variable number of columns should be avoided as it leads to an undefined schema. We suggest modelling your data into a predefined schema so that extra columns are represented (using *melt*/*unpivot*) as new groups of rows.
+- presence of NAs in the input
+- sortedness of the input
+
+To measure the *scaling factor* for input size you have to measure timings for at least three different sizes, let's say 1 million, 10 million and 100 million rows. Those three measurements will allow you to conclude how your process scales. Why three and not two? From two sizes you cannot yet conclude whether the process scales linearly or exponentially. In theory, based on that, you can estimate how many rows your process would need to receive on input for it to take, for example, a minute or an hour to finish.
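As an illustrative sketch only (not from the vignette), this is one way such a scaling check could look, assuming a simple grouped sum is the process being measured; the column names and sizes are arbitrary, and the largest size needs a few GB of RAM:

```r
library(data.table)
# Time the same grouped sum at three input sizes to see how it scales.
for (n in c(1e6, 1e7, 1e8)) {
  DT <- data.table(id = sample(1e4L, n, replace = TRUE), x = rnorm(n))
  elapsed <- system.time(DT[, .(s = sum(x)), by = id])[["elapsed"]]
  cat(sprintf("rows = %.0e  elapsed = %.2fs\n", n, elapsed))
}
```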
+Once we have scaled our input up to reduce the impact of call overhead, the next question that springs to mind is: should I repeat measurements multiple times? The answer is that it strongly depends on your use case, that is, your data processing workflow. If a process is called just once in your workflow, why should you care about its timing on the second, third... or 100th run? Things like the disk cache might cause subsequent runs to evaluate faster. Other optimizations might be triggered as well, such as memoizing results for a given input, or using indexes created on the first run. If your workflow does not call your process repeatedly, why should your benchmark? The main focus of benchmarks should be real use case scenarios.
+
+Do not forget to take extra care about the environment in which you are running the benchmark. It should be stripped of startup configurations, so consider running in `R --vanilla` mode. Any extra configuration should be well documented. Be sure to use recent releases of the tools you are benchmarking.
+You should also not forget about being polite: if you are about to publish benchmarking results against another library, reach out to the authors of that package to check with them that you are using their library correctly.
+
+***
+
+## Best practices
+
+### fread: clear caches

Ideally each `fread` call should be run in a fresh session, with the following commands preceding R execution. This clears the OS file cache in RAM and the HD cache.

@@ -26,7 +48,7 @@ sudo hdparm -t /dev/sda

When comparing `fread` to non-R solutions be aware that R requires values of character columns to be added to _R's global string cache_. This takes time when reading data, but later operations benefit since the character strings have already been cached. Consequently, as well as timing isolated tasks (such as `fread` alone), it is a good idea to benchmark a pipeline of tasks such as reading data, performing computations and producing the final output, and to report the total time of the pipeline.

-# subset: threshold for index optimization on compound queries
+### subset: threshold for index optimization on compound queries

Index optimization for compound filter queries will not be used when the cross product of the elements provided to filter on exceeds 1e4 elements.

@@ -49,7 +71,7 @@ DT[V1 %in% v & V2 %in% v & V3 %in% v & V4 %in% v, verbose=TRUE]
#...
```

-# index aware benchmarking
+### index aware benchmarking

For convenience `data.table` automatically builds an index on the fields you use to subset data. It adds some overhead to the first subset on particular fields but greatly reduces the time to query those columns in subsequent runs. When measuring speed, the best way is to measure index creation and the query using an index separately. Having such timings, it is easy to decide what the optimal strategy is for your use case.
To control usage of the index, use the following options:
@@ -70,32 +92,33 @@ options(datatable.optimize=3L)
`options(datatable.optimize=2L)` will turn off optimization of subsets completely, while `options(datatable.optimize=3L)` will switch it back on.
Those options affect many more optimizations and thus should not be used when only control of the index is needed. Read more in `?datatable.optimize`.
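As an illustrative sketch only (not from the vignette), this is how the first (index-building) and subsequent (index-using) runs could be timed separately, with auto-indexing left at its default; the data and the filtered value are arbitrary:

```r
library(data.table)
DT <- data.table(V1 = sample(1e5L, 1e7L, replace = TRUE), V2 = rnorm(1e7L))
system.time(DT[V1 == 10L])  # first run: includes building the index on V1
system.time(DT[V1 == 10L])  # subsequent run: reuses the existing index
```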

-# _by reference_ operations
+### _by reference_ operations

When benchmarking `set*` functions it makes sense to measure only the first run. These functions update the data.table by reference, so in subsequent runs they receive an already processed `data.table` on input.

Protecting your `data.table` from being updated by reference operations can be achieved using the `copy` or `data.table:::shallow` functions. Be aware that `copy` might be very expensive as it needs to duplicate the whole object. It is unlikely we want to include the duplication time in the timing of the actual task we are benchmarking.
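As an illustrative sketch only (not from the vignette), the duplication can be kept outside the measured expression; the data and the keyed column are arbitrary:

```r
library(data.table)
DT  <- data.table(a = sample(1e6L), b = rnorm(1e6L))
DT2 <- copy(DT)               # pay the duplication cost here, outside the timing
system.time(setkey(DT2, a))   # measure only the first, by-reference run
```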

-# try to benchmark atomic processes
+### try to benchmark atomic processes

If your benchmark is meant to be published it will be much more insightful if you split it up to measure the time of atomic processes. This way your readers can see how much time was spent on reading data from the source, cleaning, the actual transformation, and exporting the results (see the sketch after this paragraph).
Of course if your benchmark is meant to present a _full workflow_ then it makes perfect sense to present the total timing; still, splitting the timings may give good insight into the bottlenecks in such a workflow.
There are other cases where it might not be desired, for example when benchmarking _reading csv_ followed by _grouping_. R requires populating _R's global string cache_, which adds extra overhead when importing character data into an R session. On the other hand the _global string cache_ might speed up processes like _grouping_. In such cases, when comparing R to other languages, it might be useful to include the total timing.
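As an illustrative sketch only (not from the vignette), each atomic step can be timed on its own; the file names and the columns `g` and `x` are assumptions:

```r
library(data.table)
t_read  <- system.time(DT  <- fread("input.csv"))           # read from source
t_group <- system.time(ans <- DT[, .(s = sum(x)), by = g])   # transformation
t_write <- system.time(fwrite(ans, "output.csv"))            # export results
rbind(t_read, t_group, t_write)                              # report each step
```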

-# avoid class coercion
+### avoid class coercion

Unless class coercion is what you truly want to measure, you should prepare input objects of the expected class for every tool you are benchmarking.
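As an illustrative sketch only (not from the vignette), coercion can be done outside the timed expression, assuming an arbitrary data.frame input and a grouped mean as the measured task:

```r
library(data.table)
df <- data.frame(g = sample(letters, 1e6, replace = TRUE), x = runif(1e6))
DT <- as.data.table(df)                    # coercion done up front, not timed
system.time(DT[, .(m = mean(x)), by = g])  # only the aggregation is measured
```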

-# avoid `microbenchmark(..., times=100)`
+### avoid `microbenchmark(..., times=100)`

+Be sure to read the _General suggestions_ section at the top of this document, as it also covers this topic well.
Repeating a benchmark many times usually does not fit data processing tools well. Of course it makes perfect sense for more atomic calculations, but it does not represent the use case of common data processing tasks, which rather consist of batches of sequentially provided transformations, each run once.
Matt once said:

> I'm very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.

-This is very valid. The smaller time measurement is the relatively bigger noise is. Noise generated by method dispatch, package/class initialization, etc. Main focus of benchmark should be on real use case scenarios.
+This is very valid. The smaller the time measurement is, the relatively bigger the noise is. Noise is generated by method dispatch, package/class initialization, etc. The main focus of a benchmark should be real use case scenarios.
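As an illustrative sketch only (not from the vignette), this is the few-runs-on-larger-data approach the quote suggests; the data and the grouped sum are arbitrary:

```r
library(data.table)
DT <- data.table(g = sample(1e5L, 5e7L, replace = TRUE), x = rnorm(5e7L))
# a few single runs on data large enough to take seconds, instead of times = 100
sapply(1:3, function(i) system.time(DT[, .(s = sum(x)), by = g])[["elapsed"]])
```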

-# multithreaded processing
+### multithreaded processing

One of the main factors likely to impact timings is the number of threads available on your machine. In recent versions of `data.table` some of the functions have been parallelized.
You can control how many threads you want to use with `setDTthreads`.
@@ -107,7 +130,7 @@ getDTthreads() # check how many cores are currently used

Keep in mind that using the `parallel` R package together with `data.table` will force `data.table` to use only a single core. Thus it is recommended to verify core utilization in a resource monitoring tool, for example `htop`.

-# inside a loop prefer `set` instead of `:=`
+### inside a loop prefer `set` instead of `:=`

Unless you are utilizing an index when doing _sub-assign by reference_, you should prefer the `set` function, which does not impose the overhead of the `[.data.table` method call.

@@ -124,13 +147,13 @@ setindex(DT, a)
# }
```

-# inside a loop prefer `setDT` instead of `data.table()`
+### inside a loop prefer `setDT` instead of `data.table()`

As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
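As an illustrative sketch only (not from the vignette), the pattern could look as follows; the loop body and column names are arbitrary:

```r
library(data.table)
res <- vector("list", 100L)
for (i in seq_len(100L)) {
  l <- list(id = rep(i, 10L), x = rnorm(10L))  # plain list, cheap to build
  setDT(l)                                     # convert in place, low overhead
  res[[i]] <- l
}
ans <- rbindlist(res)
```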

-# lazy evaluation aware benchmarking
+### lazy evaluation aware benchmarking

-## let applications to optimize queries
+#### let applications optimize queries

In languages like Python, which do not support _lazy evaluation_, the following two filter queries would be processed exactly the same way.

@@ -145,6 +168,6 @@ DT[DT[[col]] == filter]

R has a _lazy evaluation_ feature which allows an application to investigate and optimize expressions before they get evaluated. In the above case, if we filter using `DT[[col]] == filter`, we are forcing the whole LHS to be materialized. This prevents `data.table` from optimizing the expression whenever possible and basically falls back to the base R `data.frame` way of doing a subset. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).

-## force applications to finish computation
+#### force applications to finish computation

There are multiple applications which try to be as lazy as possible. As a result you might find that when you run a query against such a solution it finishes instantly, but then printing the results takes much more time. That is because the query was not actually computed at the time it was called; it got computed when its results were required. Because of this you should ensure that the computation actually took place. This is not a trivial task; the ultimate way to ensure it is to dump the results to disk, but that adds the overhead of writing to disk, which is then included in the timing of the query we are benchmarking. An easy and cheap way to deal with it could be, for example, printing the dimensions of the result (useful in grouping benchmarks), or printing the first and last elements (useful in sorting benchmarks).
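As an illustrative sketch only (not from the vignette), a cheap check of the result can be included inside the timed expression so that a lazy backend cannot defer the work outside the measurement; `data.table` itself is eager, so the data and query here merely stand in for a lazy solution:

```r
library(data.table)
DT <- data.table(g = sample(1e4L, 1e7L, replace = TRUE), x = rnorm(1e7L))
system.time({
  ans <- DT[, .(s = sum(x)), by = g]
  print(dim(ans))   # forces and reports the result without writing to disk
})
```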
