
Commit f414740

Update vignettes/datatable-benchmarking.Rmd
Co-authored-by: Michael Chirico <[email protected]>
1 parent 7501185 commit f414740

File tree: 1 file changed (+1 −1 lines)

vignettes/datatable-benchmarking.Rmd

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ Let's assume you are measuring a particular process. It is blazingly fast, takin
 What does it mean and how to approach such measurements?
 The smaller the time measurements are, the bigger the relative call overhead is. Call overhead can be perceived as noise in the measurement, caused by method dispatch, package/class initialization, low-level object constructors, etc. As a result you may naturally want to measure the timing many times and take the average to deal with the noise. This is a valid approach, but the magnitude of the timing is much more important. What will be the impact of an extra 5, or let's say 5000, microseconds if writing results to the target environment/format takes a minute? 1 second is 1 000 000 microseconds. Do microseconds, or even milliseconds, make any difference? There are cases where they do, for example when you call a function for every row; then you definitely should care about micro timings. The point is that in most users' benchmarks it won't make a difference. Most common R functions are vectorized, so you are not calling them for every row. If something is blazingly fast for your data and use case then perhaps you do not have to worry about performance and benchmarks. Unless you want to scale your process, in which case you should worry, because something that is blazingly fast today might not be that fast tomorrow, simply because your process will receive more data on input. Consequently, you should confirm that your process will scale.
 There are multiple dimensions that you should consider when examining how your process scales:
-- increase numbers of rows on input
+- increase number of rows on input
 - cardinality of data
 - skewness of data - in most cases this should have the least importance
 - increase number of columns on input - this is mostly relevant when your input is a matrix; for data frames a variable number of columns should be avoided as it leads to an undefined schema. We suggest modelling your data into a predefined schema so that the extra columns are represented (using *melt*/*unpivot*) as new groups of rows.
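
To illustrate the paragraph above on call overhead and timing magnitude, here is a minimal base-R sketch (not part of the vignette diff; sizes are arbitrary): a vectorized call pays the call overhead once, while calling a function per element pays it N times, which is when micro-timings start to matter.

```r
# Minimal sketch (base R only, arbitrary sizes): microsecond-level call
# overhead is negligible when the whole task takes seconds, but it adds up
# when a function is called once per element/row.
x <- runif(1e6)

# Vectorized: one call, overhead paid once.
system.time(x + 1)

# Per-element: one call per value, overhead paid 1e6 times.
system.time(vapply(x, function(v) v + 1, numeric(1)))
```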
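And a sketch of the scaling check suggested by the list above (a hypothetical grouping workload, assuming `data.table` is installed; the operation and row counts are placeholders, not from the vignette): re-run the same step on growing inputs and confirm the elapsed time grows as expected.

```r
# Minimal sketch (hypothetical workload): time the same step at increasing
# numbers of input rows to see how the process scales.
library(data.table)

run_once <- function(n) {
  DT <- data.table(id = sample(n %/% 10L, n, replace = TRUE), x = rnorm(n))
  system.time(DT[, .(mean_x = mean(x)), by = id])[["elapsed"]]
}

rows <- c(1e5, 1e6, 1e7)
data.table(rows = rows, elapsed = vapply(rows, run_once, numeric(1)))
```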
