
Commit 7501185

Update vignettes/datatable-benchmarking.Rmd
Co-authored-by: Michael Chirico <[email protected]>
1 parent dd49bd5 commit 7501185

1 file changed: +1 -1 lines changed


vignettes/datatable-benchmarking.Rmd

Lines changed: 1 addition & 1 deletion
```diff
@@ -35,7 +35,7 @@ There are multiple dimensions that you should consider when examining how your p
 - cardinality of data
 - skewness of data - in most cases this should have the least importance
 - increasing the number of columns on input - this is mostly relevant when your input is a matrix; for data frames a variable number of columns should be avoided as it leads to an undefined schema. We suggest modeling your data into a predefined schema so that the extra columns are modeled (using *melt*/*unpivot*) as new groups of rows.
-- presence of NAs in input
+- prevalence of NAs in input
 - sortedness of input
 
 To measure the *scaling factor* for input size you have to measure timings for at least three different sizes, let's say 1 million, 10 million and 100 million rows. Those three measurements will allow you to conclude how your process scales. Why three and not two? From two sizes you cannot yet tell whether the process scales linearly or exponentially. In theory, based on that, you can estimate how many rows you would need to receive on input so that your process would take, for example, a minute or an hour to finish.
```
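
A minimal sketch of the *melt*/*unpivot* approach mentioned in the hunk above: instead of a variable number of measurement columns, extra columns become new groups of rows under a fixed schema. The column names (`m1`..`m3`, `metric`, `value`) are hypothetical and only for illustration.

```r
library(data.table)

# Wide input with a variable number of measurement columns (names are made up).
wide = data.table(id = 1:3, m1 = rnorm(3), m2 = rnorm(3), m3 = rnorm(3))

# Reshape to a fixed schema (id, metric, value): each extra column turns into
# a new group of rows rather than a new column.
long = melt(wide, id.vars = "id", variable.name = "metric", value.name = "value")
long
```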
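And a rough sketch of the scaling-factor measurement described in the last context line of the hunk: time the same workload at three input sizes and compare how elapsed time grows. The grouping query below is only a placeholder for your actual process, and the largest size (100 million rows) assumes several GB of free memory.

```r
library(data.table)

# Three input sizes: 1 million, 10 million and 100 million rows.
sizes = c(1e6, 1e7, 1e8)

# Time a placeholder grouping query at each size; comparing consecutive
# timings indicates whether the process scales roughly linearly or worse.
timings = sapply(sizes, function(n) {
  dt = data.table(id = sample.int(1e4, n, replace = TRUE), x = rnorm(n))
  system.time(dt[, .(s = sum(x)), by = id])[["elapsed"]]
})

data.table(rows = sizes, elapsed = timings)
```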
