vignettes/datatable-benchmarking.Rmd
@@ -4,16 +4,38 @@ date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Benchmarking data.table}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

This document is meant to guide on measuring the performance of `data.table`: a single place to document best practices and traps to avoid.

***

## General suggestions

Let's assume you are measuring a particular process. It is blazingly fast: it takes only microseconds to evaluate.
What does that mean, and how should you approach such measurements?
The smaller the time measurements are, the relatively bigger the call overhead is. Call overhead can be perceived as noise in the measurement caused by method dispatch, package/class initialization, low-level object constructors, etc. As a result you may naturally want to repeat the timing many times and take the average to deal with the noise. This is a valid approach, but the magnitude of the timing is much more important. What is the impact of an extra 5, or let's say 5000, microseconds if writing the results to the target environment/format takes a minute? One second is 1,000,000 microseconds. Do microseconds, or even milliseconds, make any difference? There are cases where they do, for example when you call a function for every row; then you definitely should care about micro timings. The point is that in most users' benchmarks they won't make a difference. Most common R functions are vectorized, so you are not calling them for every row. If something is blazingly fast for your data and use case, then perhaps you do not have to worry about performance and benchmarks. Unless you want to scale your process; then you should worry, because if something is blazingly fast today it might not be that fast tomorrow, simply because your process will receive more data on input. In consequence you should confirm that your process will scale.

There are multiple dimensions you should consider when examining how your process scales:

- increasing the number of rows on input
- cardinality of the data
- skewness of the data; in most cases this should have the least importance
- increasing the number of columns on input; this is mostly relevant when your input is a matrix. For data frames a variable number of columns should be avoided because it leads to an undefined schema. We suggest modelling your data into a predefined schema so that extra columns are represented (using *melt*/*unpivot*) as new groups of rows.
- presence of NAs in the input
- sortedness of the input

To measure the *scaling factor* for input size you have to measure timings for at least three different sizes, let's say 1 million, 10 million and 100 million rows. Those three measurements allow you to conclude how your process scales. Why three and not two? From two sizes you cannot yet tell whether the process scales linearly or exponentially. In theory, based on that, you can estimate how many rows you would need to receive on input for your process to take, for example, a minute or an hour to finish.
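
For illustration, here is a minimal sketch of measuring timings at three input sizes for a simple grouped aggregation; the column names and group counts are made up for the example, and the largest size needs a few GB of RAM.

```r
library(data.table)

sizes = c(1e6, 1e7, 1e8)  # 1 million, 10 million, 100 million rows
timings = sapply(sizes, function(n) {
  DT = data.table(id = sample(n %/% 100, n, replace = TRUE), v = rnorm(n))
  system.time(DT[, .(s = sum(v)), by = id])[["elapsed"]]
})
data.table(rows = sizes, elapsed = timings)  # inspect how elapsed time grows with size
```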

Once our input is scaled up enough to reduce the impact of call overhead, the next question that springs to mind is: should I repeat the measurement multiple times? The answer strongly depends on your use case, that is, your data processing workflow. If a process is called just once in your workflow, why should you bother about its timing on the second, third... or 100th run? Things like the disk cache might cause subsequent runs to evaluate faster. Other optimizations might be triggered as well, such as memoization of results for a given input, or use of indexes created on the first run. If your workflow does not repeatedly call your process, why should you do so in a benchmark? The main focus of benchmarks should be real use case scenarios.

Do not forget to take extra care of the environment in which you run the benchmark. It should be stripped of startup configuration, so consider running `R --vanilla`. Any extra configuration should be well documented. Be sure to use recent releases of the tools you are benchmarking.

You should also not forget about being polite: if you are about to publish benchmarking results against another library, reach out to its authors to check with them that you are using their library correctly.

***

## Best practices

### fread: clear caches

Ideally each `fread` call should be run in a fresh session with the following commands preceding R execution. This clears the OS file cache in RAM and the HD cache.

@@ -26,7 +48,7 @@ sudo hdparm -t /dev/sda

When comparing `fread` to non-R solutions be aware that R requires values of character columns to be added to _R's global string cache_. This takes time when reading data, but later operations benefit since the character strings have already been cached. Consequently, as well as timing isolated tasks (such as `fread` alone), it's a good idea to benchmark a pipeline of tasks such as reading data, computing operators and producing final output, and to report the total time of the pipeline.
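
For example, a minimal sketch of timing `fread` alone versus a small read-transform-write pipeline; the file name `input.csv`, the column `V1` and the aggregation are assumptions made for the example.

```r
library(data.table)

# isolated task: reading only
t_read = system.time(DT <- fread("input.csv"))

# pipeline: read, aggregate, write the final output
t_pipeline = system.time({
  DT  = fread("input.csv")
  agg = DT[, .(n = .N), by = V1]   # assumes a column named V1 exists
  fwrite(agg, "output.csv")
})

rbind(read_only = t_read, pipeline = t_pipeline)
```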

### subset: threshold for index optimization on compound queries

Index optimization for compound filter queries will not be used when the cross product of the elements provided to filter on exceeds 1e4 elements.

@@ -49,7 +71,7 @@ DT[V1 %in% v & V2 %in% v & V3 %in% v & V4 %in% v, verbose=TRUE]
#...
```
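
To check whether a compound filter stays under that 1e4 threshold, you can compute the cross product of the filter elements yourself; the vectors below are made up for the example.

```r
v1 = c("a", "b", "c")   # 3 elements
v2 = 1:10               # 10 elements
v3 = seq_len(500)       # 500 elements

# 3 * 10 * 500 = 15000 > 1e4, so a filter such as
# DT[V1 %in% v1 & V2 %in% v2 & V3 %in% v3]
# would not use index optimization
prod(length(v1), length(v2), length(v3))
```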

### index aware benchmarking

For convenience `data.table` automatically builds an index on the fields you use to subset data. It adds some overhead to the first subset on particular fields, but greatly reduces the time needed to query those columns in subsequent runs. The best way to measure speed is to measure index creation and the query that uses the index separately. With such timings it is easy to decide what the optimal strategy is for your use case.

`options(datatable.optimize=2L)` will turn off optimization of subsets completely, while `options(datatable.optimize=3L)` will switch it back on.
Those options affect many more optimizations, thus they should not be used when only control of the index is needed. Read more in `?datatable.optimize`.
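
A minimal sketch of timing index creation separately from an indexed subset; the column names, sizes and the filtered value are made up for the example.

```r
library(data.table)
DT = data.table(id = sample(1e5L, 1e7L, replace = TRUE), v = rnorm(1e7))

system.time(setindex(DT, id))   # index creation, paid once
system.time(DT[id == 42L])      # subset that can use the existing index
system.time(DT[id == 42L])      # repeat to see the steady-state timing
```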

### _by reference_ operations

When benchmarking `set*` functions it makes sense to measure only the first run. Those functions update the data.table by reference, so subsequent runs receive an already processed `data.table` on input.

Protecting your `data.table` from being updated by reference operations can be achieved using the `copy` or `data.table:::shallow` functions. Be aware that `copy` might be very expensive as it needs to duplicate the whole object. It is unlikely we want to include the duplication time in the timing of the actual task we are benchmarking.
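
For example, a minimal sketch that keeps the `copy` outside of the measured expression, so only the by-reference operation itself is timed; the column name is made up for the example.

```r
library(data.table)
DT = data.table(a = sample(1e7L))

DT2 = copy(DT)                # duplication happens outside the timing
system.time(setkey(DT2, a))   # only the set* operation is measured
```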

### try to benchmark atomic processes

If your benchmark is meant to be published it will be much more insightful if you split it up to measure the time of the atomic processes. This way your readers can see how much time was spent on reading data from source, cleaning, the actual transformation, and exporting the results.
Of course if your benchmark is meant to present a _full workflow_ then it makes perfect sense to present the total timing; still, splitting the timings may give good insight into the bottlenecks in such a workflow.
There are other cases where this might not be desired, for example when benchmarking _reading a csv_ followed by _grouping_. R requires populating _R's global string cache_, which adds extra overhead when importing character data into an R session. On the other hand, the _global string cache_ might speed up processes like _grouping_. In such cases, when comparing R to other languages, it might be useful to include the total timing.

### avoid class coercion

Unless this is what you truly want to measure, you should prepare the input objects for every tool you are benchmarking in the class it expects.
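
For example, a minimal sketch that prepares both a `data.frame` and a `data.table` version of the input before any timing starts, so class coercion is not included in the measurements; the data and aggregations are made up for the example.

```r
library(data.table)

DF = data.frame(id = sample(1e4L, 1e6L, replace = TRUE), v = runif(1e6))
DT = as.data.table(DF)   # coercion done before benchmarking

system.time(aggregate(v ~ id, DF, sum))      # the data.frame tool gets a data.frame
system.time(DT[, .(s = sum(v)), by = id])    # data.table gets a data.table
```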

### avoid `microbenchmark(..., times=100)`

Be sure to read the _General suggestions_ section at the top of this document, as it covers this topic as well.
Repeating a benchmark many times usually does not fit well for data processing tools. Of course it makes perfect sense for more atomic calculations, but it does not represent the common use case of data processing tasks, which usually consist of batches of sequentially provided transformations, each run once.
Matt once said:

> I'm very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.

This is very valid. The smaller the time measurement, the relatively bigger the noise: noise generated by method dispatch, package/class initialization, etc. The main focus of a benchmark should be real use case scenarios.
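
As an illustration of the quote above, a sketch contrasting many repetitions on tiny data with a few single runs on larger data; the data and the aggregation are made up for the example, and `microbenchmark` is an external package.

```r
library(data.table)
library(microbenchmark)

small = data.table(id = sample(100L, 1e4L, replace = TRUE), v = rnorm(1e4))
big   = data.table(id = sample(1e6L, 1e8L, replace = TRUE), v = rnorm(1e8))  # needs a few GB of RAM

# discouraged: 100 repetitions of a sub-millisecond call, dominated by call overhead
microbenchmark(small[, .(s = sum(v)), by = id], times = 100L)

# preferred: a handful of runs on data large enough to take seconds
for (i in 1:3) print(system.time(big[, .(s = sum(v)), by = id]))
```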

### multithreaded processing

One of the main factors that is likely to impact timings is the number of threads available on your machine. In recent versions of `data.table` some of the functions have been parallelized.
You can control how many threads you want to use with `setDTthreads`.

@@ -107,7 +130,7 @@ getDTthreads() # check how many cores are currently used

Keep in mind that using the `parallel` R package together with `data.table` will force `data.table` to use only a single core. It is therefore recommended to verify core utilization in a resource monitoring tool, for example `htop`.
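
A minimal sketch of measuring the same operation single-threaded and with all available threads; the data and the grouping are made up for the example.

```r
library(data.table)
DT = data.table(id = sample(1e5L, 1e8L, replace = TRUE), v = rnorm(1e8))  # needs a few GB of RAM

setDTthreads(1L)       # single-threaded
getDTthreads()
system.time(DT[, .(s = sum(v)), by = id])

setDTthreads(0L)       # 0 means use all available threads, see ?setDTthreads
getDTthreads()
system.time(DT[, .(s = sum(v)), by = id])
```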

### inside a loop prefer `set` instead of `:=`

Unless you are utilizing an index when doing _sub-assign by reference_, you should prefer the `set` function, which does not impose the overhead of the `[.data.table` method call.

@@ -124,13 +147,13 @@ setindex(DT, a)
# }
```

### inside a loop prefer `setDT` instead of `data.table()`

As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
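
A minimal sketch of the difference inside a loop; the computation is made up for the example.

```r
library(data.table)

res = vector("list", 1000L)
for (i in 1:1000) {
  x = rnorm(10)
  # preferred: build a plain list and convert it by reference
  res[[i]] = setDT(list(iter = rep(i, 10L), x = x, cum = cumsum(x)))
  # slower inside tight loops:
  # res[[i]] = data.table(iter = rep(i, 10L), x = x, cum = cumsum(x))
}
DT = rbindlist(res)
```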

### lazy evaluation aware benchmarking

#### let applications optimize queries

In languages like python, which do not support _lazy evaluation_, the following two filter queries would be processed in exactly the same way.

@@ -145,6 +168,6 @@ DT[DT[[col]] == filter]

R has a _lazy evaluation_ feature which allows an application to investigate and optimize expressions before they get evaluated. In the above case, if we filter using `DT[[col]] == filter` we force the whole LHS to be materialized. This prevents `data.table` from optimizing the expression whenever possible and basically falls back to the base R `data.frame` way of doing the subset. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).

#### force applications to finish computation

There are multiple applications which try to be as lazy as possible. As a result you might find that when you run a query against such a solution it finishes instantly, but then printing the results takes much more time. That is because the query was not actually computed at the time it was called, but only when its results were required. Because of this you should ensure that the computation actually took place. This is not a trivial task; the ultimate way to ensure it is to dump the results to disk, but that adds the overhead of writing to disk, which is then included in the timing of the query we are benchmarking. An easy and cheap way to deal with it can be, for example, printing the dimensions of the result (useful in grouping benchmarks), or printing the first and last elements (useful in sorting benchmarks).
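
A minimal sketch of such a cheap check, written around a hypothetical `lazy_query()` function standing in for whatever lazy tool is being benchmarked.

```r
system.time({
  res = lazy_query()                # hypothetical lazy query under benchmark
  print(dim(res))                   # cheap check that forces/confirms the result exists
  print(res[c(1L, nrow(res)), ])    # first and last row, useful for sorting benchmarks
})
```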