vignettes/datatable-benchmarking.Rmd
@@ -4,16 +4,38 @@ date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Benchmarking data.table}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

This document is meant to guide on measuring the performance of `data.table`: a single place to document best practices and traps to avoid.

***

## General suggestions

Let's assume you are measuring a particular process. It is blazingly fast: it takes only microseconds to evaluate.
What does that mean, and how should you approach such measurements?
The smaller the time measurements are, the relatively bigger the call overhead is. Call overhead can be perceived as noise in the measurement caused by method dispatch, package/class initialization, low-level object constructors, etc. As a result you may naturally want to repeat the timing many times and take the average to deal with the noise. This is a valid approach, but the magnitude of the timing is much more important. What is the impact of an extra 5, or let's say 5000, microseconds if writing the results to the target environment/format takes a minute? One second is 1,000,000 microseconds. Do microseconds, or even milliseconds, make any difference? There are cases where they do, for example when you call a function for every row; then you definitely should care about micro timings. The point is that in most users' benchmarks they won't make a difference. Most common R functions are vectorized, so you are not calling them for every row. If something is blazingly fast for your data and use case, then perhaps you do not have to worry about performance and benchmarks. Unless you want to scale your process; then you should worry, because if something is blazingly fast today it might not be that fast tomorrow, simply because your process will receive more data on input. In consequence you should confirm that your process will scale.

There are multiple dimensions you should consider when examining how your process scales:

- increasing the number of rows on input
- cardinality of the data
- skewness of the data; in most cases this should have the least importance
- increasing the number of columns on input; this is mostly relevant when your input is a matrix. For data frames a variable number of columns should be avoided because it leads to an undefined schema. We suggest modelling your data into a predefined schema so that extra columns are represented (using *melt*/*unpivot*) as new groups of rows.
- presence of NAs in the input
- sortedness of the input

To measure the *scaling factor* for input size you have to measure timings for at least three different sizes, let's say 1 million, 10 million and 100 million rows. Those three measurements allow you to conclude how your process scales. Why three and not two? From two sizes you cannot yet tell whether the process scales linearly or exponentially. In theory, based on that, you can estimate how many rows you would need to receive on input for your process to take, for example, a minute or an hour to finish.
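
For illustration, here is a minimal sketch of measuring timings at three input sizes for a simple grouped aggregation; the column names and group counts are made up for the example, and the largest size needs a few GB of RAM.

```r
library(data.table)

sizes = c(1e6, 1e7, 1e8)  # 1 million, 10 million, 100 million rows
timings = sapply(sizes, function(n) {
  DT = data.table(id = sample(n %/% 100, n, replace = TRUE), v = rnorm(n))
  system.time(DT[, .(s = sum(v)), by = id])[["elapsed"]]
})
data.table(rows = sizes, elapsed = timings)  # inspect how elapsed time grows with size
```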

Once our input is scaled up enough to reduce the impact of call overhead, the next question that springs to mind is: should I repeat the measurement multiple times? The answer strongly depends on your use case, that is, your data processing workflow. If a process is called just once in your workflow, why should you bother about its timing on the second, third... or 100th run? Things like the disk cache might cause subsequent runs to evaluate faster. Other optimizations might be triggered as well, such as memoization of results for a given input, or use of indexes created on the first run. If your workflow does not repeatedly call your process, why should you do so in a benchmark? The main focus of benchmarks should be real use case scenarios.

Do not forget to take extra care of the environment in which you run the benchmark. It should be stripped of startup configuration, so consider running `R --vanilla`. Any extra configuration should be well documented. Be sure to use recent releases of the tools you are benchmarking.

You should also not forget about being polite: if you are about to publish benchmarking results against another library, reach out to its authors to check with them that you are using their library correctly.

***

## Best practices

### fread: clear caches

Ideally each `fread` call should be run in a fresh session with the following commands preceding R execution. This clears the OS file cache in RAM and the HD cache.

@@ -26,7 +48,7 @@ sudo hdparm -t /dev/sda

When comparing `fread` to non-R solutions be aware that R requires values of character columns to be added to _R's global string cache_. This takes time when reading data, but later operations benefit since the character strings have already been cached. Consequently, as well as timing isolated tasks (such as `fread` alone), it's a good idea to benchmark a pipeline of tasks such as reading data, computing operators and producing final output, and to report the total time of the pipeline.
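
For example, a minimal sketch of timing `fread` alone versus a small read-transform-write pipeline; the file name `input.csv`, the column `V1` and the aggregation are assumptions made for the example.

```r
library(data.table)

# isolated task: reading only
t_read = system.time(DT <- fread("input.csv"))

# pipeline: read, aggregate, write the final output
t_pipeline = system.time({
  DT  = fread("input.csv")
  agg = DT[, .(n = .N), by = V1]   # assumes a column named V1 exists
  fwrite(agg, "output.csv")
})

rbind(read_only = t_read, pipeline = t_pipeline)
```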

### subset: threshold for index optimization on compound queries

Index optimization for compound filter queries will not be used when the cross product of the elements provided to filter on exceeds 1e4 elements.

@@ -49,7 +71,7 @@ DT[V1 %in% v & V2 %in% v & V3 %in% v & V4 %in% v, verbose=TRUE]
#...
```
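
To check whether a compound filter stays under that 1e4 threshold, you can compute the cross product of the filter elements yourself; the vectors below are made up for the example.

```r
v1 = c("a", "b", "c")   # 3 elements
v2 = 1:10               # 10 elements
v3 = seq_len(500)       # 500 elements

# 3 * 10 * 500 = 15000 > 1e4, so a filter such as
# DT[V1 %in% v1 & V2 %in% v2 & V3 %in% v3]
# would not use index optimization
prod(length(v1), length(v2), length(v3))
```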

### index aware benchmarking

For convenience `data.table` automatically builds an index on the fields you use to subset data. It adds some overhead to the first subset on particular fields, but greatly reduces the time needed to query those columns in subsequent runs. The best way to measure speed is to measure index creation and the query that uses the index separately. With such timings it is easy to decide what the optimal strategy is for your use case.

`options(datatable.optimize=2L)` will turn off optimization of subsets completely, while `options(datatable.optimize=3L)` will switch it back on.
Those options affect many more optimizations, thus they should not be used when only control of the index is needed. Read more in `?datatable.optimize`.
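
A minimal sketch of timing index creation separately from an indexed subset; the column names, sizes and the filtered value are made up for the example.

```r
library(data.table)
DT = data.table(id = sample(1e5L, 1e7L, replace = TRUE), v = rnorm(1e7))

system.time(setindex(DT, id))   # index creation, paid once
system.time(DT[id == 42L])      # subset that can use the existing index
system.time(DT[id == 42L])      # repeat to see the steady-state timing
```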

### _by reference_ operations

When benchmarking `set*` functions it makes sense to measure only the first run. Those functions update the data.table by reference, so subsequent runs receive an already processed `data.table` on input.

Protecting your `data.table` from being updated by reference operations can be achieved using the `copy` or `data.table:::shallow` functions. Be aware that `copy` might be very expensive as it needs to duplicate the whole object. It is unlikely we want to include the duplication time in the timing of the actual task we are benchmarking.
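
For example, a minimal sketch that keeps the `copy` outside of the measured expression, so only the by-reference operation itself is timed; the column name is made up for the example.

```r
library(data.table)
DT = data.table(a = sample(1e7L))

DT2 = copy(DT)                # duplication happens outside the timing
system.time(setkey(DT2, a))   # only the set* operation is measured
```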

### try to benchmark atomic processes

If your benchmark is meant to be published it will be much more insightful if you split it up to measure the time of the atomic processes. This way your readers can see how much time was spent on reading data from source, cleaning, the actual transformation, and exporting the results.
Of course if your benchmark is meant to present a _full workflow_ then it makes perfect sense to present the total timing; still, splitting the timings may give good insight into the bottlenecks in such a workflow.
There are other cases where this might not be desired, for example when benchmarking _reading a csv_ followed by _grouping_. R requires populating _R's global string cache_, which adds extra overhead when importing character data into an R session. On the other hand, the _global string cache_ might speed up processes like _grouping_. In such cases, when comparing R to other languages, it might be useful to include the total timing.

### avoid class coercion

Unless this is what you truly want to measure, you should prepare the input objects for every tool you are benchmarking in the class it expects.
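
For example, a minimal sketch that prepares both a `data.frame` and a `data.table` version of the input before any timing starts, so class coercion is not included in the measurements; the data and aggregations are made up for the example.

```r
library(data.table)

DF = data.frame(id = sample(1e4L, 1e6L, replace = TRUE), v = runif(1e6))
DT = as.data.table(DF)   # coercion done before benchmarking

system.time(aggregate(v ~ id, DF, sum))      # the data.frame tool gets a data.frame
system.time(DT[, .(s = sum(v)), by = id])    # data.table gets a data.table
```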

### avoid `microbenchmark(..., times=100)`

Be sure to read the _General suggestions_ section at the top of this document, as it covers this topic as well.
Repeating a benchmark many times usually does not fit well for data processing tools. Of course it makes perfect sense for more atomic calculations, but it does not represent the common use case of data processing tasks, which usually consist of batches of sequentially provided transformations, each run once.
Matt once said:

> I'm very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.

This is very valid. The smaller the time measurement, the relatively bigger the noise: noise generated by method dispatch, package/class initialization, etc. The main focus of a benchmark should be real use case scenarios.
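
As an illustration of the quote above, a sketch contrasting many repetitions on tiny data with a few single runs on larger data; the data and the aggregation are made up for the example, and `microbenchmark` is an external package.

```r
library(data.table)
library(microbenchmark)

small = data.table(id = sample(100L, 1e4L, replace = TRUE), v = rnorm(1e4))
big   = data.table(id = sample(1e6L, 1e8L, replace = TRUE), v = rnorm(1e8))  # needs a few GB of RAM

# discouraged: 100 repetitions of a sub-millisecond call, dominated by call overhead
microbenchmark(small[, .(s = sum(v)), by = id], times = 100L)

# preferred: a handful of runs on data large enough to take seconds
for (i in 1:3) print(system.time(big[, .(s = sum(v)), by = id]))
```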

### multithreaded processing

One of the main factors that is likely to impact timings is the number of threads available on your machine. In recent versions of `data.table` some of the functions have been parallelized.
You can control how many threads you want to use with `setDTthreads`.

@@ -107,7 +130,7 @@ getDTthreads() # check how many cores are currently used

Keep in mind that using the `parallel` R package together with `data.table` will force `data.table` to use only a single core. It is therefore recommended to verify core utilization in a resource monitoring tool, for example `htop`.
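
A minimal sketch of measuring the same operation single-threaded and with all available threads; the data and the grouping are made up for the example.

```r
library(data.table)
DT = data.table(id = sample(1e5L, 1e8L, replace = TRUE), v = rnorm(1e8))  # needs a few GB of RAM

setDTthreads(1L)       # single-threaded
getDTthreads()
system.time(DT[, .(s = sum(v)), by = id])

setDTthreads(0L)       # 0 means use all available threads, see ?setDTthreads
getDTthreads()
system.time(DT[, .(s = sum(v)), by = id])
```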

### inside a loop prefer `set` instead of `:=`

Unless you are utilizing an index when doing _sub-assign by reference_, you should prefer the `set` function, which does not impose the overhead of the `[.data.table` method call.

@@ -124,13 +147,13 @@ setindex(DT, a)
# }
```

### inside a loop prefer `setDT` instead of `data.table()`

As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
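
A minimal sketch of the difference inside a loop; the computation is made up for the example.

```r
library(data.table)

res = vector("list", 1000L)
for (i in 1:1000) {
  x = rnorm(10)
  # preferred: build a plain list and convert it by reference
  res[[i]] = setDT(list(iter = rep(i, 10L), x = x, cum = cumsum(x)))
  # slower inside tight loops:
  # res[[i]] = data.table(iter = rep(i, 10L), x = x, cum = cumsum(x))
}
DT = rbindlist(res)
```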

### lazy evaluation aware benchmarking

#### let applications optimize queries

In languages like python, which do not support _lazy evaluation_, the following two filter queries would be processed in exactly the same way.

@@ -145,6 +168,6 @@ DT[DT[[col]] == filter]

R has a _lazy evaluation_ feature which allows an application to investigate and optimize expressions before they get evaluated. In the above case, if we filter using `DT[[col]] == filter` we force the whole LHS to be materialized. This prevents `data.table` from optimizing the expression whenever possible and basically falls back to the base R `data.frame` way of doing the subset. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).

#### force applications to finish computation

There are multiple applications which try to be as lazy as possible. As a result you might find that when you run a query against such a solution it finishes instantly, but then printing the results takes much more time. That is because the query was not actually computed at the time it was called, but only when its results were required. Because of this you should ensure that the computation actually took place. This is not a trivial task; the ultimate way to ensure it is to dump the results to disk, but that adds the overhead of writing to disk, which is then included in the timing of the query we are benchmarking. An easy and cheap way to deal with it can be, for example, printing the dimensions of the result (useful in grouping benchmarks), or printing the first and last elements (useful in sorting benchmarks).
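
A minimal sketch of such a cheap check, written around a hypothetical `lazy_query()` function standing in for whatever lazy tool is being benchmarked.

```r
system.time({
  res = lazy_query()                # hypothetical lazy query under benchmark
  print(dim(res))                   # cheap check that forces/confirms the result exists
  print(res[c(1L, nrow(res)), ])    # first and last row, useful for sorting benchmarks
})
```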