## How to create a conda environment for benchmarking
`conda create -n skl_bench -c rapidsai -c conda-forge python=3.7 cuml pandas cudf`

## Algorithm parameters

You can launch benchmarks for each algorithm separately. The tables below list all supported parameters for each algorithm:

- [General](#general)
- [DBSCAN](#dbscan)
- [RandomForestClassifier](#randomforestclassifier)
- [RandomForestRegressor](#randomforestregressor)
- [pairwise_distances](#pairwise_distances)
- [KMeans](#kmeans)
- [KNeighborsClassifier](#kneighborsclassifier)
- [LinearRegression](#linearregression)
- [LogisticRegression](#logisticregression)
- [PCA](#pca)
- [Ridge Regression](#ridge)
- [SVC](#svc)
- [train_test_split](#train_test_split)

#### General
| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
|num-threads|int|-1|The number of threads to use|
|arch|str|?|Machine architecture, for bookkeeping|
|batch|str|?|Batch ID, for bookkeeping|
|prefix|str|sklearn|Prefix string, for bookkeeping|
|header|action|False|Output CSV header|
|verbose|action|False|Output extra debug messages|
|data-format|str|numpy|Data format: *numpy*, *pandas* or *cudf*|
|data-order|str|C|Data order: *C* (row-major, default) or *F* (column-major)|
|dtype|np.dtype|np.float64|Data type: *float64* (default) or *float32*|
|check-finiteness|action|False|Check finiteness during sklearn input checks (disabled by default)|
|output-format|str|csv|Output format: *csv* (default) or *json*|
|time-method|str|mean_min|Method used for time measurements|
|box-filter-measurements|int|100|Maximum number of measurements in box filter|
|inner-loops|int|100|Maximum inner loop iterations (we take the mean over inner iterations)|
|outer-loops|int|100|Maximum outer loop iterations (we take the min over outer iterations)|
|time-limit|float|10|Target time to spend on the benchmark|
|goal-outer-loops|int|10|The number of outer loops to aim for when automatically picking the number of inner loops. If zero, the number of inner loops is not picked automatically|
|seed|int|12345|Seed to pass as random_state|
|dataset-name|str|None|Dataset name|

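The *mean_min* time method and the inner/outer loop parameters can be illustrated with a small sketch: each outer iteration times *inner-loops* calls and records their mean, and the reported time is the minimum over outer iterations, stopping once *time-limit* is exceeded. This is only an illustration of the general scheme under those assumptions, not the benchmark's actual implementation:

```python
import time

def time_mean_min(fn, inner_loops=100, outer_loops=100, time_limit=10.0):
    """Sketch of a mean-min timer: mean over inner iterations,
    min over outer iterations, stopping after time_limit seconds."""
    times = []
    total_start = time.perf_counter()
    for _ in range(outer_loops):
        start = time.perf_counter()
        for _ in range(inner_loops):
            fn()
        times.append((time.perf_counter() - start) / inner_loops)
        if time.perf_counter() - total_start > time_limit:
            break
    return min(times)

best = time_mean_min(lambda: sum(range(1000)), inner_loops=10, outer_loops=5)
print(f"best mean time per call: {best:.2e} s")
```

Taking the mean smooths per-call jitter, while taking the min discards outer iterations disturbed by background load.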
#### DBSCAN
| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| epsilon | float | 10 | Radius of the neighborhood of a point |
| min_samples | int | 5 | The minimum number of samples required in a neighborhood for a point to be considered a core point |

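To illustrate how *epsilon* and *min_samples* interact: a point is a core point when its epsilon-neighborhood (including the point itself) contains at least *min_samples* points. A minimal plain-Python sketch of that check, not cuML's implementation:

```python
from math import dist

def is_core_point(point, data, epsilon=10.0, min_samples=5):
    """A point is a core point if at least min_samples points
    (including itself) lie within distance epsilon of it."""
    neighbors = sum(1 for q in data if dist(point, q) <= epsilon)
    return neighbors >= min_samples

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (100.0, 100.0)]
print(is_core_point((0.5, 0.5), data, epsilon=2.0, min_samples=5))      # True: dense cluster
print(is_core_point((100.0, 100.0), data, epsilon=2.0, min_samples=5))  # False: isolated point
```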
#### RandomForestClassifier

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| criterion | str | gini | *gini* or *entropy*. The function to measure the quality of a split |
| split-algorithm | str | hist | *hist* or *global_quantile*. The algorithm used to determine how nodes are split in the tree |
| num-trees | int | 100 | The number of trees in the forest |
| max-features | float_or_int | None | Upper bound on the number of features used at each split |
| max-depth | int | None | Upper bound on the depth of constructed trees |
| min-samples-split | float_or_int | 2 | The minimum number of samples required to split a node |
| max-leaf-nodes | int | None | Maximum number of leaf nodes per tree |
| min-impurity-decrease | float | 0 | The minimum impurity decrease required to split a node |
| no-bootstrap | action | True | Disable bootstrapping |

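The two *criterion* options measure node impurity differently; a minimal sketch of the standard textbook formulas (not the library's code):

```python
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum over classes of p_k^2."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Shannon entropy: -sum over classes of p_k * log2(p_k)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

print(gini([0, 0, 1, 1]))     # 0.5: maximal Gini impurity for a 50/50 split
print(entropy([0, 0, 1, 1]))  # 1.0: maximal entropy for a 50/50 split
print(gini([0, 0, 0, 0]))     # 0.0: a pure node has zero impurity
```

Both criteria are zero for pure nodes and maximal for an even class split; a split is accepted when it reduces impurity by at least *min-impurity-decrease*.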
#### RandomForestRegressor

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| criterion | str | gini | *gini* or *entropy*. The function to measure the quality of a split |
| split-algorithm | str | hist | *hist* or *global_quantile*. The algorithm used to determine how nodes are split in the tree |
| num-trees | int | 100 | The number of trees in the forest |
| max-features | float_or_int | None | Upper bound on the number of features used at each split |
| max-depth | int | None | Upper bound on the depth of constructed trees |
| min-samples-split | float_or_int | 2 | The minimum number of samples required to split a node |
| max-leaf-nodes | int | None | Maximum number of leaf nodes per tree |
| min-impurity-decrease | float | 0 | The minimum impurity decrease required to split a node |
| no-bootstrap | action | True | Disable bootstrapping |

#### KMeans

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| init | str | | Initial clusters |
| tol | float | 0 | Absolute convergence threshold |
| maxiter | int | 100 | Maximum number of iterations |
| samples-per-batch | int | 32768 | The number of samples per batch |
| n-clusters | int | | The number of clusters |

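An absolute tolerance like *tol* is typically used as a stopping rule for Lloyd iterations: stop once no cluster center moves by more than *tol* between updates. A minimal sketch of that test (the exact rule the benchmark uses is an assumption):

```python
from math import dist

def converged(old_centers, new_centers, tol=0.0):
    """Absolute convergence test: stop when the largest
    center displacement is no greater than tol."""
    return max(dist(o, n) for o, n in zip(old_centers, new_centers)) <= tol

print(converged([(0.0, 0.0)], [(0.0, 0.0)], tol=0.0))  # True: no movement at all
print(converged([(0.0, 0.0)], [(0.5, 0.0)], tol=0.1))  # False: center moved 0.5 > 0.1
```

With the default `tol=0`, iteration only stops when centers are exactly stationary or *maxiter* is reached.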
#### KNeighborsClassifier

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| n-neighbors | int | 5 | The number of neighbors to use |
| weights | str | uniform | Weight function used in prediction |
| method | str | brute | Algorithm used to compute the nearest neighbors |
| metric | str | euclidean | Distance metric to use |

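The *weights* option changes how the k nearest neighbors vote: with *uniform* every neighbor counts equally, while with *distance* closer neighbors count more (weight 1/d is a common choice, assumed here). A plain-Python sketch of the vote, not the library's implementation:

```python
from collections import Counter

def predict(neighbor_labels, neighbor_distances, weights="uniform"):
    """Vote among the k nearest neighbors.
    'uniform': every neighbor counts equally;
    'distance': each neighbor gets weight 1/d."""
    votes = Counter()
    for label, d in zip(neighbor_labels, neighbor_distances):
        votes[label] += 1.0 if weights == "uniform" else 1.0 / d
    return votes.most_common(1)[0][0]

labels, dists = ["a", "b", "b"], [0.1, 1.0, 1.0]
print(predict(labels, dists, weights="uniform"))   # "b": two votes vs one
print(predict(labels, dists, weights="distance"))  # "a": weight 10.0 vs 2.0
```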
#### LinearRegression

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| no-fit-intercept | action | True | Don't fit intercept (assume data already centered) |
| solver | str | eig | *eig* or *svd*. Solver used for training |

#### LogisticRegression

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| no-fit-intercept | action | True | Don't fit intercept |
| solver | str | qn | *qn* or *owl*. Solver to use |
| maxiter | int | 100 | Maximum iterations for the iterative solver |
| C | float | 1.0 | Regularization parameter |
| tol | float | None | Tolerance for the solver |

#### PCA

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| svd-solver | str | full | *auto*, *full* or *jacobi*. SVD solver to use |
| n-components | int | None | The number of components to find |
| whiten | action | False | Perform whitening |

#### Ridge

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| no-fit-intercept | action | True | Don't fit intercept (assume data already centered) |
| solver | str | eig | *eig*, *cd* or *svd*. Solver used for training |
| alpha | float | 1.0 | Regularization strength |

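To show what *alpha* does: ridge regression shrinks coefficients toward zero, and in the one-feature, no-intercept case this has a simple closed form, `w = (x . y) / (x . x + alpha)`. A worked sketch of that special case (an illustration of the math, not the solvers above):

```python
def ridge_1d(x, y, alpha=1.0):
    """Closed-form ridge for one feature, no intercept:
    w = (x . y) / (x . x + alpha). Larger alpha shrinks w toward 0."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return xy / (xx + alpha)

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # y = 2x exactly, so x.y = 28, x.x = 14
print(ridge_1d(x, y, alpha=0.0))   # 2.0: no regularization recovers the true slope
print(ridge_1d(x, y, alpha=14.0))  # 1.0: heavy regularization halves the coefficient
```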
#### SVC

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| C | float | 0.01 | SVM slack parameter |
| kernel | str | linear | *linear* or *rbf*. SVM kernel function |
| gamma | float | None | Parameter for kernel="rbf" |
| maxiter | int | 2000 | Maximum iterations for the iterative solver |
| max-cache-size | int | 64 | Maximum cache size for SVM |
| tol | float | 1e-16 | Tolerance passed to sklearn.svm.SVC |
| no-shrinking | action | True | Don't use shrinking heuristic |

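The *gamma* parameter controls the width of the RBF kernel, `K(x, y) = exp(-gamma * ||x - y||^2)`: larger gamma makes similarity fall off faster with distance. A minimal sketch of the standard formula:

```python
from math import exp

def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)

print(rbf_kernel((0.0, 0.0), (0.0, 0.0), gamma=0.5))  # 1.0: identical points
print(rbf_kernel((0.0, 0.0), (3.0, 4.0), gamma=0.1))  # exp(-0.1 * 25) = exp(-2.5)
```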
#### train_test_split

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| train-size | float | 0.75 | Size of the training subset |
| test-size | float | 0.25 | Size of the testing subset |
| do-not-shuffle | action | False | Do not shuffle the data before splitting |

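The split semantics can be sketched in plain Python: shuffle (unless disabled), then cut the data at the *train-size* fraction. This is an illustration of the general behavior under those assumptions, not the benchmarked implementation:

```python
import random

def train_test_split(data, train_size=0.75, shuffle=True, seed=12345):
    """Split data into train/test subsets; optionally shuffle first."""
    data = list(data)
    if shuffle:
        random.Random(seed).shuffle(data)
    n_train = int(len(data) * train_size)
    return data[:n_train], data[n_train:]

train, test = train_test_split(range(100), train_size=0.75)
print(len(train), len(test))  # 75 25
```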