vignettes/datatable-benchmarking.Rmd (40 additions, 7 deletions)
@@ -106,7 +106,7 @@ There are another cases when it might not be desired, for example when benchmark
### avoid class coercion
-Unless this is what you truly want to measure you should prepare input objects for every tools you are benchmarking in expected class.
+Unless this is what you truly want to measure, you should prepare input objects for every tool you are benchmarking in its expected class, so the benchmark measures the timing of the actual computation rather than class coercion plus computation.
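A minimal sketch of this advice (the data and sizes below are illustrative, not taken from the vignette):

```r
library(data.table)
df <- data.frame(x = sample(letters, 1e6, replace = TRUE), y = rnorm(1e6))

# coercion timed together with the computation: measures both
system.time(as.data.table(df)[, .(mean_y = mean(y)), by = x])

# input prepared in the expected class up front: measures only the computation
dt <- as.data.table(df)
system.time(dt[, .(mean_y = mean(y)), by = x])
```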
### avoid `microbenchmark(..., times=100)`
@@ -118,6 +118,41 @@ Matt once said:
This is very valid. The smaller the time measurement is, the relatively bigger the noise is: noise generated by method dispatch, package/class initialization, etc. The main focus of a benchmark should be real use case scenarios.
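As a hedged illustration of that point, a single timing of a realistically sized query is usually more informative than 100 repetitions of a tiny one (the size below is made up):

```r
library(data.table)
N  <- 1e7                                               # illustrative size
dt <- data.table(id = sample(1e5, N, replace = TRUE), v = rnorm(N))

# one run at real scale; noise from dispatch and initialization is
# negligible relative to the measured time
system.time(dt[, .(s = sum(v)), by = id])
```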
One of the main factors likely to impact timings is the number of threads available on your machine. In recent versions of `data.table` some of the functions have been parallelized.
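For comparable timings across runs and machines it helps to record, and if needed pin, the thread count, for example:

```r
library(data.table)
getDTthreads(verbose = TRUE)  # report how many threads data.table will use
setDTthreads(4)               # pin the thread count so benchmark runs are comparable
```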
@@ -149,9 +184,9 @@ setindex(DT, a)
# }
```
-### inside a loop prefer `setDT` instead of `data.table()`
+### inside a loop prefer `setDT()` instead of `data.table()`
-As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
+As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or, even better, `setDT()` on a valid list, or ideally to avoid class coercion altogether as described in _avoid class coercion_ above.
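A rough sketch of the difference (the loop body is only illustrative):

```r
library(data.table)

# pays the data.table() construction overhead on every iteration
res1 <- lapply(1:1000, function(i) data.table(a = i, b = i^2))

# setDT() marks a valid list as a data.table by reference, which is much cheaper
res2 <- lapply(1:1000, function(i) setDT(list(a = i, b = i^2)))
```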
### lazy evaluation aware benchmarking
@@ -163,13 +198,11 @@ In languages like python which does not support _lazy evaluation_ the following
DT= data.table(a=1L, b=2L)
DT[a==1L]
-col="a"
-filter=1L
-DT[DT[[col]] == filter]
+DT[DT[["a"]] == 1L]
```
R has a _lazy evaluation_ feature which allows an application to inspect and optimize expressions before they get evaluated. In the above case, if we filter using `DT[[col]] == filter` we force the whole LHS to be materialized. This prevents `data.table` from optimizing the expression whenever possible, and it basically falls back to the base R `data.frame` way of doing the subset. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).
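One way to see whether `data.table` was able to optimize the subset is to turn on verbose mode; this is only a sketch and the exact messages depend on the `data.table` version:

```r
library(data.table)
DT <- data.table(a = sample(1e6), b = rnorm(1e6))
setindex(DT, a)

options(datatable.verbose = TRUE)
DT[a == 1L]           # expression visible to data.table, the index can be used
DT[DT[["a"]] == 1L]   # LHS materialized up front, plain vector scan instead
options(datatable.verbose = FALSE)
```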
#### force applications to finish computation
-The are multiple applications which are trying to be as lazy as possible. As a result you might experience that when you run a query against such solution it finishes instantly, but then printing the results takes much more time. It is because the query actually was not computed at the time of calling query but it got computed when its results were required. Because of the above you should ensure that computation took place. It is not a trivial task, the ultimate way to ensure is to dump results to disk but it adds an overhead of writing to disk which is then included in timings of a query we are benchmarking. The easy and cheap way to deal with it could be for example printing dimensions of a results (useful in grouping benchmarks), or printing first and last element (useful in sorting benchmarks).
+There are multiple applications which try to be as lazy as possible. As a result you might experience that when you run a query against such a solution it finishes instantly, but then printing the results takes much more time. That is because the query was not actually computed at the time of the call but was computed (or even only partially computed) when its results were required. Because of that you should ensure that the computation took place completely. This is not a trivial task; the ultimate way to ensure it is to dump the results to disk, but that adds an overhead of writing to disk which is then included in the timings of the query being benchmarked. An easy and cheap way to deal with it is, for example, printing the dimensions of the result (useful in grouping benchmarks), or printing the first and last element (useful in sorting benchmarks).
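A cheap pattern along those lines (shown here with `data.table`, which computes eagerly, so the printing mainly serves as a template to reuse with lazier tools):

```r
library(data.table)
dt <- data.table(g = sample(1e5, 1e7, replace = TRUE), v = rnorm(1e7))

system.time({
  ans <- dt[, .(s = sum(v)), by = g]  # the query being benchmarked
  print(dim(ans))                     # dimensions: useful in grouping benchmarks
  print(ans[1]); print(ans[.N])       # first and last row: useful in sorting benchmarks
})
```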