
Commit f414740

Update vignettes/datatable-benchmarking.Rmd
Co-authored-by: Michael Chirico <[email protected]>
1 parent 7501185 commit f414740

File tree: 1 file changed (+1 −1 lines)

vignettes/datatable-benchmarking.Rmd

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ Let's assume you are measuring a particular process. It is blazingly fast, takin
 What does it mean and how to approach such measurements?
 The smaller the time measurements are, the bigger the relative call overhead is. Call overhead can be perceived as noise in the measurement, caused by method dispatch, package/class initialization, low-level object constructors, etc. As a result you may naturally want to measure the timing many times and take the average to deal with the noise. This is a valid approach, but the magnitude of the timing is much more important. What will be the impact of an extra 5, or let's say 5000, microseconds if writing results to the target environment/format takes a minute? 1 second is 1 000 000 microseconds. Do microseconds, or even milliseconds, make any difference? There are cases where they do, for example when you call a function for every row; then you definitely should care about micro timings. The point is that in most users' benchmarks it won't make a difference. Most common R functions are vectorized, so you are not calling them for every row. If something is blazingly fast for your data and use case then perhaps you do not have to worry about performance and benchmarks. Unless you want to scale your process, in which case you should worry, because something that is blazingly fast today might not be that fast tomorrow, simply because your process will receive more data on input. Consequently, you should confirm that your process will scale.
 There are multiple dimensions that you should consider when examining how your process scales:
-- increase numbers of rows on input
+- increase number of rows on input
 - cardinality of data
 - skewness of data - in most cases this should have the least importance
 - increase number of columns on input - this is mostly relevant when your input is a matrix; for data frames a variable number of columns should be avoided as it leads to an undefined schema. We suggest modelling your data into a predefined schema so that the extra columns are represented (using *melt*/*unpivot*) as new groups of rows.
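
To illustrate the paragraph above on call overhead and timing magnitude, here is a minimal base-R sketch (not part of the vignette diff; sizes are arbitrary): a vectorized call pays the call overhead once, while calling a function per element pays it N times, which is when micro-timings start to matter.

```r
# Minimal sketch (base R only, arbitrary sizes): microsecond-level call
# overhead is negligible when the whole task takes seconds, but it adds up
# when a function is called once per element/row.
x <- runif(1e6)

# Vectorized: one call, overhead paid once.
system.time(x + 1)

# Per-element: one call per value, overhead paid 1e6 times.
system.time(vapply(x, function(v) v + 1, numeric(1)))
```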
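And a sketch of the scaling check suggested by the list above (a hypothetical grouping workload, assuming `data.table` is installed; the operation and row counts are placeholders, not from the vignette): re-run the same step on growing inputs and confirm the elapsed time grows as expected.

```r
# Minimal sketch (hypothetical workload): time the same step at increasing
# numbers of input rows to see how the process scales.
library(data.table)

run_once <- function(n) {
  DT <- data.table(id = sample(n %/% 10L, n, replace = TRUE), x = rnorm(n))
  system.time(DT[, .(mean_x = mean(x)), by = id])[["elapsed"]]
}

rows <- c(1e5, 1e6, 1e7)
data.table(rows = rows, elapsed = vapply(rows, run_once, numeric(1)))
```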
