
Commit 4e4578a

add example requested by Matt
1 parent 1cda608 commit 4e4578a


vignettes/datatable-benchmarking.Rmd

Lines changed: 40 additions & 7 deletions
@@ -106,7 +106,7 @@ There are another cases when it might not be desired, for example when benchmark
 
 ### avoid class coercion
 
-Unless this is what you truly want to measure you should prepare input objects for every tools you are benchmarking in expected class.
+Unless class coercion is what you truly want to measure, prepare the input objects for every tool you are benchmarking in that tool's expected class, so the benchmark measures the actual computation rather than class coercion plus the computation.
 
 ### avoid `microbenchmark(..., times=100)`
 
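A minimal sketch of the class coercion point above, not part of the commit (the sizes and column names are illustrative): prepare each input in the tool's expected class once, outside the timed expressions, so only the subset itself is measured.

```r
library(microbenchmark)
library(data.table)
set.seed(108)

N  = 1e6L
dt = data.table(id = sample(N), value = rnorm(N))
df = as.data.frame(dt)  # prepare the data.frame input up front, outside the benchmark

microbenchmark(
  coerce_inside = {x <- as.data.frame(dt); x[x$id == 5e5L, "value"]}, # coercion timed together with the subset
  prepared      = df[df$id == 5e5L, "value"],                         # only the subset is timed
  times = 10L
)
```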
@@ -118,6 +118,41 @@ Matt once said:
 
 This is very valid. The smaller the time measurement is, the bigger the relative noise is: noise generated by method dispatch, package/class initialization, etc. The main focus of a benchmark should be real use case scenarios.
 
+The example below represents the problem discussed:
+```r
+library(microbenchmark)
+library(data.table)
+set.seed(108)
+
+N = 1e5L
+dt = data.table(id=sample(N), value=rnorm(N))
+setindex(dt, "id")
+df = as.data.frame(dt)
+microbenchmark(
+  dt[id==5e4L, value],
+  df[df$id==5e4L, "value"],
+  times = 1000
+)
+#Unit: microseconds
+#                          expr      min        lq      mean    median        uq       max neval
+#       dt[id == 50000L, value] 1237.964 1359.5635 1466.9513 1392.1735 1443.1725 14500.751  1000
+#  df[df$id == 50000L, "value"]  355.063  391.2695  430.3884  404.7575  429.5605  2481.442  1000
+
+N = 1e7L
+dt = data.table(id=sample(N), value=rnorm(N))
+setindex(dt, "id")
+df = as.data.frame(dt)
+microbenchmark(
+  dt[id==5e6L, value],
+  df[df$id==5e6L, "value"],
+  times = 5
+)
+#Unit: milliseconds
+#                            expr       min        lq     mean    median        uq       max neval
+#       dt[id == 5000000L, value]  1.306013  1.367846  1.59317  1.709714  1.748953  1.833324     5
+#  df[df$id == 5000000L, "value"] 47.359246 47.858230 50.83947 51.774551 53.020058 54.185249     5
+```
+
 ### multithreaded processing
 
 One of the main factors likely to impact timings is the number of threads on your machine. In recent versions of `data.table` some of the functions have been parallelized.
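As a side note to the multithreading remark above (not from the commit): `setDTthreads()` and `getDTthreads()` are the `data.table` functions for controlling and reporting the thread count, so it is worth recording the value alongside the timings; the exact default policy depends on the `data.table` version.

```r
library(data.table)

getDTthreads(verbose = TRUE)  # report how many threads data.table is set to use

setDTthreads(1L)              # pin to one thread for a reproducible single-core comparison
# ... run the single-threaded benchmarks here ...

setDTthreads(0L)              # 0 asks for all available logical CPUs (subject to version defaults)
```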
@@ -149,9 +184,9 @@ setindex(DT, a)
 # }
 ```
 
-### inside a loop prefer `setDT` instead of `data.table()`
+### inside a loop prefer `setDT()` instead of `data.table()`
 
-As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
+As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()`, or better `setDT()`, on a valid list, or ideally to avoid the class coercion altogether, as described in _avoid class coercion_ above.
 
 ### lazy evaluation aware benchmarking
 
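A quick sketch of the `setDT()` versus `data.table()` overhead mentioned above, not from the commit (the iteration counts are arbitrary):

```r
library(microbenchmark)
library(data.table)

# creating many small tables in a loop: data.table() pays its overhead on every
# iteration, while setDT() on an already valid list is considerably cheaper
microbenchmark(
  data.table = for (i in 1:100) data.table(a = 1L, b = 2L),
  setDT      = for (i in 1:100) setDT(list(a = 1L, b = 2L)),
  times = 10L
)
```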
@@ -163,13 +198,11 @@ In languages like python which does not support _lazy evaluation_ the following
 DT = data.table(a=1L, b=2L)
 DT[a == 1L]
 
-col = "a"
-filter = 1L
-DT[DT[[col]] == filter]
+DT[DT[["a"]] == 1L]
 ```
 
 R has a _lazy evaluation_ feature which allows an application to investigate and optimize expressions before they get evaluated. In the above case, filtering with `DT[[col]] == filter` forces the whole LHS to be materialized. This prevents `data.table` from optimizing the expression whenever possible and basically falls back to the base R `data.frame` way of doing a subset. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).
 
 #### force applications to finish computation
 
-The are multiple applications which are trying to be as lazy as possible. As a result you might experience that when you run a query against such solution it finishes instantly, but then printing the results takes much more time. It is because the query actually was not computed at the time of calling query but it got computed when its results were required. Because of the above you should ensure that computation took place. It is not a trivial task, the ultimate way to ensure is to dump results to disk but it adds an overhead of writing to disk which is then included in timings of a query we are benchmarking. The easy and cheap way to deal with it could be for example printing dimensions of a results (useful in grouping benchmarks), or printing first and last element (useful in sorting benchmarks).
+There are multiple applications which try to be as lazy as possible. As a result you might experience that a query against such a solution finishes instantly, but printing the results then takes much more time. That is because the query was not actually computed when it was called; it got computed (or even only partially computed) when its results were required. Because of that you should ensure that computation took place completely. This is not a trivial task: the ultimate way to ensure it is to dump results to disk, but that adds an overhead of writing to disk which is then included in the timings of the query we are benchmarking. An easy and cheap way to deal with it is, for example, to print the dimensions of the result (useful in grouping benchmarks), or to print the first and last element (useful in sorting benchmarks).
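A sketch of the "force the computation" advice above, not from the commit; `run_query()` is a hypothetical stand-in for whatever lazy tool is being benchmarked, assumed to return a data.frame-like result.

```r
# time the query together with cheap checks that force (and document) the result
elapsed = system.time({
  ans <- run_query()                                # may return a lazy or partially computed result
  cat("dim:", dim(ans), "\n")                       # dimensions: useful in grouping benchmarks
  cat("ends:", ans[[1L]][c(1L, nrow(ans))], "\n")   # first and last element: useful in sorting benchmarks
})[["elapsed"]]
elapsed
```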
