
Commit 4e4578a

add example requested by Matt
1 parent 1cda608 commit 4e4578a


vignettes/datatable-benchmarking.Rmd

Lines changed: 40 additions & 7 deletions
@@ -106,7 +106,7 @@ There are another cases when it might not be desired, for example when benchmark
 
 ### avoid class coercion
 
-Unless this is what you truly want to measure you should prepare input objects for every tools you are benchmarking in expected class.
+Unless class coercion is what you truly want to measure, prepare the input objects for every tool you are benchmarking in that tool's expected class, so the benchmark measures the actual computation rather than class coercion plus the computation.
 
 ### avoid `microbenchmark(..., times=100)`
 
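A minimal sketch of the class coercion point above, not part of the commit (the sizes and column names are illustrative): prepare each input in the tool's expected class once, outside the timed expressions, so only the subset itself is measured.

```r
library(microbenchmark)
library(data.table)
set.seed(108)

N  = 1e6L
dt = data.table(id = sample(N), value = rnorm(N))
df = as.data.frame(dt)  # prepare the data.frame input up front, outside the benchmark

microbenchmark(
  coerce_inside = {x <- as.data.frame(dt); x[x$id == 5e5L, "value"]}, # coercion timed together with the subset
  prepared      = df[df$id == 5e5L, "value"],                         # only the subset is timed
  times = 10L
)
```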
@@ -118,6 +118,41 @@ Matt once said:
 
 This is very valid. The smaller the time measurement is, the bigger the relative noise is: noise generated by method dispatch, package/class initialization, etc. The main focus of a benchmark should be real use case scenarios.
 
+The example below represents the problem discussed:
+```r
+library(microbenchmark)
+library(data.table)
+set.seed(108)
+
+N = 1e5L
+dt = data.table(id=sample(N), value=rnorm(N))
+setindex(dt, "id")
+df = as.data.frame(dt)
+microbenchmark(
+  dt[id==5e4L, value],
+  df[df$id==5e4L, "value"],
+  times = 1000
+)
+#Unit: microseconds
+#                          expr      min        lq      mean    median        uq       max neval
+#       dt[id == 50000L, value] 1237.964 1359.5635 1466.9513 1392.1735 1443.1725 14500.751  1000
+#  df[df$id == 50000L, "value"]  355.063  391.2695  430.3884  404.7575  429.5605  2481.442  1000
+
+N = 1e7L
+dt = data.table(id=sample(N), value=rnorm(N))
+setindex(dt, "id")
+df = as.data.frame(dt)
+microbenchmark(
+  dt[id==5e6L, value],
+  df[df$id==5e6L, "value"],
+  times = 5
+)
+#Unit: milliseconds
+#                            expr       min        lq     mean    median        uq       max neval
+#       dt[id == 5000000L, value]  1.306013  1.367846  1.59317  1.709714  1.748953  1.833324     5
+#  df[df$id == 5000000L, "value"] 47.359246 47.858230 50.83947 51.774551 53.020058 54.185249     5
+```
+
 ### multithreaded processing
 
 One of the main factors likely to impact timings is the number of threads on your machine. In recent versions of `data.table` some of the functions have been parallelized.
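As a side note to the multithreading remark above (not from the commit): `setDTthreads()` and `getDTthreads()` are the `data.table` functions for controlling and reporting the thread count, so it is worth recording the value alongside the timings; the exact default policy depends on the `data.table` version.

```r
library(data.table)

getDTthreads(verbose = TRUE)  # report how many threads data.table is set to use

setDTthreads(1L)              # pin to one thread for a reproducible single-core comparison
# ... run the single-threaded benchmarks here ...

setDTthreads(0L)              # 0 asks for all available logical CPUs (subject to version defaults)
```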
@@ -149,9 +184,9 @@ setindex(DT, a)
 # }
 ```
 
-### inside a loop prefer `setDT` instead of `data.table()`
+### inside a loop prefer `setDT()` instead of `data.table()`
 
-As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
+As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()`, or better `setDT()`, on a valid list, or ideally to avoid the class coercion altogether, as described in _avoid class coercion_ above.
 
 ### lazy evaluation aware benchmarking
 
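A quick sketch of the `setDT()` versus `data.table()` overhead mentioned above, not from the commit (the iteration counts are arbitrary):

```r
library(microbenchmark)
library(data.table)

# creating many small tables in a loop: data.table() pays its overhead on every
# iteration, while setDT() on an already valid list is considerably cheaper
microbenchmark(
  data.table = for (i in 1:100) data.table(a = 1L, b = 2L),
  setDT      = for (i in 1:100) setDT(list(a = 1L, b = 2L)),
  times = 10L
)
```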
@@ -163,13 +198,11 @@ In languages like python which does not support _lazy evaluation_ the following
 DT = data.table(a=1L, b=2L)
 DT[a == 1L]
 
-col = "a"
-filter = 1L
-DT[DT[[col]] == filter]
+DT[DT[["a"]] == 1L]
 ```
 
 R has a _lazy evaluation_ feature which allows an application to investigate and optimize expressions before they get evaluated. In the above case, filtering with `DT[[col]] == filter` forces the whole LHS to be materialized. This prevents `data.table` from optimizing the expression whenever possible and basically falls back to the base R `data.frame` way of doing a subset. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).
 
 #### force applications to finish computation
 
-The are multiple applications which are trying to be as lazy as possible. As a result you might experience that when you run a query against such solution it finishes instantly, but then printing the results takes much more time. It is because the query actually was not computed at the time of calling query but it got computed when its results were required. Because of the above you should ensure that computation took place. It is not a trivial task, the ultimate way to ensure is to dump results to disk but it adds an overhead of writing to disk which is then included in timings of a query we are benchmarking. The easy and cheap way to deal with it could be for example printing dimensions of a results (useful in grouping benchmarks), or printing first and last element (useful in sorting benchmarks).
+There are multiple applications which try to be as lazy as possible. As a result you might experience that a query against such a solution finishes instantly, but printing the results then takes much more time. That is because the query was not actually computed when it was called; it got computed (or even only partially computed) when its results were required. Because of that you should ensure that computation took place completely. This is not a trivial task: the ultimate way to ensure it is to dump results to disk, but that adds an overhead of writing to disk which is then included in the timings of the query we are benchmarking. An easy and cheap way to deal with it is, for example, to print the dimensions of the result (useful in grouping benchmarks), or to print the first and last element (useful in sorting benchmarks).
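A sketch of the "force the computation" advice above, not from the commit; `run_query()` is a hypothetical stand-in for whatever lazy tool is being benchmarked, assumed to return a data.frame-like result.

```r
# time the query together with cheap checks that force (and document) the result
elapsed = system.time({
  ans <- run_query()                                # may return a lazy or partially computed result
  cat("dim:", dim(ans), "\n")                       # dimensions: useful in grouping benchmarks
  cat("ends:", ans[[1L]][c(1L, nrow(ans))], "\n")   # first and last element: useful in sorting benchmarks
})[["elapsed"]]
elapsed
```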
