Commit 04ddc78

Update README.md (#29)
* Update README.md
* fixed a typo
* apply general comments to new Benchmarks Documentation
* add appropriate links
* Add info about IDP to sklearn README
* highlight code sample
* rename supported algorithms
* Change daal4py env build instruction
* add ml frameworks to Prerequisites
* apply Bill comments
  remove Config JSON Schema paragraph
  change main description of repository
  apply minor text changes
* Update configs/README.md
  Co-authored-by: Ekaterina Mekhnetsova <[email protected]>
* Update README.md
  Co-authored-by: Ekaterina Mekhnetsova <[email protected]>
* Apply suggestions from code review
  Apply all suggestions
  Co-authored-by: Ekaterina Mekhnetsova <[email protected]>
* apply some changes to the table cells about capital letter, 'The' and full stop

Co-authored-by: Ekaterina Mekhnetsova <[email protected]>
1 parent c7e0abd commit 04ddc78

File tree

7 files changed

+675
-5
lines changed

README.md

Lines changed: 79 additions & 5 deletions
@@ -1,18 +1,92 @@
1+
12
# scikit-learn_bench
23

3-
Benchmark for optimizations to scikit-learn in the Intel(R) Distribution for
4-
Python*. See benchmark results [here](https://intelpython.github.io/scikit-learn_bench).
4+
**scikit-learn_bench** benchmarks various implementations of machine learning algorithms across data analytics frameworks. Scikit-learn_bench can be extended to add new frameworks and algorithms. It currently supports the [scikit-learn](https://scikit-learn.org/), [DAAL4PY](https://intelpython.github.io/daal4py/), [cuML](https://github.com/rapidsai/cuml), and [XGBoost](https://github.com/dmlc/xgboost) frameworks for commonly used [machine learning algorithms](#supported-algorithms).
5+
6+
See benchmark results [here](https://intelpython.github.io/scikit-learn_bench).
7+
8+
9+
## Table of contents
10+
11+
* [Prerequisites](#prerequisites)
12+
* [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
13+
* [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
14+
* [Supported algorithms](#supported-algorithms)
15+
* [Algorithms parameters](#algorithms-parameters)
16+
* [Legacy automatic building and running](#legacy-automatic-building-and-running)
517

618
## Prerequisites
7-
- python and scikit-learn to run python versions
19+
- `python` and `scikit-learn` to run the Python versions of the benchmarks
820
- pandas when using its DataFrame as input data format
921
- `icc`, `ifort`, `mkl`, `daal` to compile and run native benchmarks
22+
- machine learning frameworks that you want to test. See [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking) for details on setting up the environment.
1023

1124
## How to create conda environment for benchmarking
12-
`conda create -n skl_bench -c intel python=3.7 scikit-learn pandas`
25+
26+
Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework.
27+
28+
* [**scikit-learn**](https://github.com/PivovarA/scikit-learn_bench/blob/master/sklearn/README.md#how-to-create-conda-environment-for-benchmarking)
29+
* [**daal4py**](https://github.com/PivovarA/scikit-learn_bench/blob/master/daal4py/README.md#how-to-create-conda-environment-for-benchmarking)
30+
* [**cuml**](https://github.com/PivovarA/scikit-learn_bench/blob/master/cuml/README.md#how-to-create-conda-environment-for-benchmarking)
31+
* [**xgboost**](https://github.com/PivovarA/scikit-learn_bench/tree/master/xgboost/README.md#how-to-create-conda-environment-for-benchmarking)
32+
1333

1434
## Running Python benchmarks with runner script
15-
`python runner.py --config config_example.json [--output-format json --verbose]`
35+
36+
Run `python runner.py --config configs/config_example.json [--output-format json --verbose]` to launch benchmarks.
37+
38+
runner options:
39+
* ``config`` : the path to the configuration file
40+
* ``dummy-run`` : run the configuration parser and dataset generation without running the benchmarks
41+
* ``verbose`` : print additional information while the benchmarks run
42+
* ``output-format`` : the output format of the benchmark results, *json* or *csv*
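
As a rough illustration of how these options combine, the sketch below assembles the runner command line in Python. The helper name `build_runner_command` is hypothetical and not part of the repository, and the `--dummy-run` spelling is assumed by analogy with the `--config`, `--output-format` and `--verbose` flags shown in the command above. The command is only built here, not executed.

```python
import sys


def build_runner_command(config, output_format=None, verbose=False, dummy_run=False):
    """Assemble the argument list for runner.py (hypothetical helper)."""
    cmd = [sys.executable, "runner.py", "--config", config]
    if output_format is not None:
        # The README lists json and csv as the only output formats.
        if output_format not in ("json", "csv"):
            raise ValueError("output format must be 'json' or 'csv'")
        cmd += ["--output-format", output_format]
    if verbose:
        cmd.append("--verbose")
    if dummy_run:
        # Assumed flag spelling; the README lists the option as "dummy-run".
        cmd.append("--dummy-run")
    return cmd


cmd = build_runner_command(
    "configs/config_example.json", output_format="json", verbose=True
)
print(" ".join(cmd[1:]))
```

From the repository root, the resulting list could then be passed to `subprocess.run(cmd, check=True)`.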
43+
44+
Benchmarks currently support the following frameworks:
45+
* **scikit-learn**
46+
* **daal4py**
47+
* **cuml**
48+
* **xgboost**
49+
50+
The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms.
51+
52+
You can configure benchmarks by editing a config file. Check [config.json schema](https://github.com/PivovarA/scikit-learn_bench/blob/master/configs/README.md) for more details.
53+
54+
## Supported algorithms
55+
56+
| algorithm | benchmark name | sklearn | daal4py | cuml | xgboost |
57+
|---|---|---|---|---|---|
58+
|**[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)**|dbscan|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
59+
|**[RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**|df_clfs|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
60+
|**[RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)**|df_regr|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
61+
|**[pairwise_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)**|distances|:white_check_mark:|:white_check_mark:|:x:|:x:|
62+
|**[KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)**|kmeans|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
63+
|**[KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)**|knn_clsf|:white_check_mark:|:x:|:white_check_mark:|:x:|
64+
|**[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)**|linear|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
65+
|**[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)**|log_reg|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
66+
|**[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)**|pca|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
67+
|**[Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)**|ridge|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
68+
|**[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)**|svm|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
69+
|**[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**|train_test_split|:white_check_mark:|:x:|:white_check_mark:|:x:|
70+
|**[GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)**|gbt|:x:|:x:|:x:|:white_check_mark:|
71+
|**[GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)**|gbt|:x:|:x:|:x:|:white_check_mark:|
72+
73+
## Algorithms parameters
74+
75+
You can launch benchmarks for each algorithm separately.
76+
To do this, go to the directory with the benchmark:
77+
78+
    cd <framework>
79+
80+
Run the following command:
81+
82+
    python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
83+
84+
You can find the list of supported parameters for each algorithm here:
85+
86+
* [**scikit-learn**](https://github.com/PivovarA/scikit-learn_bench/blob/master/sklearn/README.md#algorithms-parameters)
87+
* [**daal4py**](https://github.com/PivovarA/scikit-learn_bench/blob/master/daal4py/README.md#algorithms-parameters)
88+
* [**cuml**](https://github.com/PivovarA/scikit-learn_bench/blob/master/cuml/README.md#algorithms-parameters)
89+
* [**xgboost**](https://github.com/PivovarA/scikit-learn_bench/tree/master/xgboost/README.md#algorithms-parameters)
1690

1791
## Legacy automatic building and running
1892
- Run `make`. This will generate data, compile benchmarks, and run them.

configs/README.md

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
## Config JSON Schema

Configure benchmarks by editing the `config.json` file.
You can configure some algorithm parameters, datasets, a list of frameworks to use, and the usage of some environment variables.
Refer to the tables below for descriptions of all fields in the configuration file.

- [Root Config Object](#root-config-object)
- [Common Object](#common-object)
- [Case Object](#case-object)
- [Dataset Object](#dataset-object)
- [Training Object](#training-object)
- [Testing Object](#testing-object)

### Root Config Object
| Field Name | Type | Description |
| ----- | ---- |------------ |
|omp_env| array[string] | For xgboost only. Specify an environment variable to set the number of OpenMP threads |
|common| [Common Object](#common-object)| **REQUIRED** Common benchmark settings: frameworks and input data settings |
|cases| array[[Case Object](#case-object)] | **REQUIRED** List of algorithms, their parameters and training data |

### Common Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|lib| array[string] | **REQUIRED** List of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost* |
|data-format| array[string] | **REQUIRED** Input data format. Data formats: *numpy*, *pandas* or *cudf* |
|data-order| array[string] | **REQUIRED** Input data order. Data order: *C* (row-major, default) or *F* (column-major) |
|dtype| array[string] | **REQUIRED** Input data type. Data type: *float64* (default) or *float32* |
|check-finiteness| array[] | Check finiteness in sklearn input check (disabled by default) |

### Case Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|lib| array[string] | **REQUIRED** List of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost* |
|algorithm| string | **REQUIRED** Benchmark name |
|dataset| array[[Dataset Object](#dataset-object)] | **REQUIRED** Input data specifications |
|benchmark parameters| array[Any] | **REQUIRED** Algorithm parameters. A list of supported parameters can be found here |

### Dataset Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|source| string | **REQUIRED** Data source. It can be *synthetic* or *csv* |
|type| string | **REQUIRED** For *synthetic* data only. The type of task for which the dataset is generated. It can be *classification*, *blobs* or *regression* |
|n_classes| int | For *synthetic* data and for *classification* type only. The number of classes (or labels) of the classification problem |
|n_clusters| int | For *synthetic* data and for *blobs* type only. The number of centers to generate |
|n_features| int | **REQUIRED** For *synthetic* data only. The number of features to generate |
|name| string | Name of the dataset |
|training| [Training Object](#training-object) | **REQUIRED** Training data specifications |
|testing| [Testing Object](#testing-object) | **REQUIRED** Testing data specifications |

### Training Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
| n_samples | int | The total number of training points |
| x | str | The path to the training samples |
| y | str | The path to the training labels |

### Testing Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
| n_samples | int | The total number of testing points |
| x | str | The path to the testing samples |
| y | str | The path to the testing labels |
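
To make the schema more concrete, here is a minimal sketch that builds a configuration as a Python dictionary, checks the required root and common fields, and serializes it to JSON. All values are illustrative assumptions rather than a configuration shipped with the repository. Algorithm parameters are left out because the tables above do not give their literal key names, and the helper `missing_fields` is hypothetical.

```python
import json

# Illustrative values only; field names follow the tables above.
config = {
    "common": {
        "lib": ["sklearn"],
        "data-format": ["numpy"],
        "data-order": ["C"],
        "dtype": ["float64"],
    },
    "cases": [
        {
            "lib": ["sklearn"],
            "algorithm": "kmeans",
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "blobs",
                    "n_clusters": 10,
                    "n_features": 50,
                    "training": {"n_samples": 1000},
                    "testing": {"n_samples": 100},
                }
            ],
        }
    ],
}

# Fields marked REQUIRED in the Root Config and Common Object tables.
REQUIRED_ROOT = ("common", "cases")
REQUIRED_COMMON = ("lib", "data-format", "data-order", "dtype")


def missing_fields(obj, required):
    """Return the required field names that are absent from obj."""
    return [name for name in required if name not in obj]


problems = missing_fields(config, REQUIRED_ROOT)
problems += missing_fields(config["common"], REQUIRED_COMMON)
text = json.dumps(config, indent=4)
print(problems or "all required root and common fields present")
```

Writing `text` to a file yields a document that can be passed to the runner through its `config` option, provided the remaining required fields for the chosen algorithm are filled in.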
File renamed without changes.

cuml/README.md

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@

## How to create conda environment for benchmarking
`conda create -n skl_bench -c rapidsai -c conda-forge python=3.7 cuml pandas cudf`

## Algorithms parameters

You can launch benchmarks for each algorithm separately. The tables below list all supported parameters for each algorithm:

- [General](#general)
- [DBSCAN](#dbscan)
- [RandomForestClassifier](#randomforestclassifier)
- [RandomForestRegressor](#randomforestregressor)
- [pairwise_distances](#pairwise_distances)
- [KMeans](#kmeans)
- [KNeighborsClassifier](#kneighborsclassifier)
- [LinearRegression](#linearregression)
- [LogisticRegression](#logisticregression)
- [PCA](#pca)
- [Ridge Regression](#ridge)
- [SVC](#svc)
- [train_test_split](#train_test_split)

#### General
| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
|num-threads|int|-1|The number of threads to use|
|arch|str|?|Machine architecture, for bookkeeping|
|batch|str|?|Batch ID, for bookkeeping|
|prefix|str|sklearn|Prefix string, for bookkeeping|
|header|action|False|Output CSV header|
|verbose|action|False|Output extra debug messages|
|data-format|str|numpy|Data formats: *numpy*, *pandas* or *cudf*|
|data-order|str|C|Data order: C (row-major, default) or F (column-major)|
|dtype|np.dtype|np.float64|Data type: *float64* (default) or *float32*|
|check-finiteness|action|False|Check finiteness in sklearn input check (disabled by default)|
|output-format|str|csv|Output format: *csv* (default) or *json*|
|time-method|str|mean_min|Method used for time measurements|
|box-filter-measurements|int|100|Maximum number of measurements in box filter|
|inner-loops|int|100|Maximum inner loop iterations (the mean is taken over inner iterations)|
|outer-loops|int|100|Maximum outer loop iterations (the minimum is taken over outer iterations)|
|time-limit|float|10|Target time to spend on benchmarking|
|goal-outer-loops|int|10|The number of outer loops to aim for when automatically picking the number of inner loops. If zero, do not automatically decide the number of inner loops|
|seed|int|12345|Seed to pass as random_state|
|dataset-name|str|None|Dataset name|
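
The `inner-loops` and `outer-loops` descriptions imply a "mean of inner iterations, minimum over outer iterations" measurement. The sketch below shows that idea in isolation. It is not the repository's timing code: the function name `time_mean_min` is hypothetical, and `time-limit`, `goal-outer-loops` and the box filter are deliberately ignored.

```python
import time


def time_mean_min(func, inner_loops=100, outer_loops=100):
    """Return the minimum over outer loops of the mean time per inner call."""
    best = float("inf")
    for _ in range(outer_loops):
        start = time.perf_counter()
        for _ in range(inner_loops):
            func()
        # Mean time of one call within this outer iteration.
        mean_time = (time.perf_counter() - start) / inner_loops
        # Keep the smallest mean seen so far.
        best = min(best, mean_time)
    return best


elapsed = time_mean_min(lambda: sum(range(1000)), inner_loops=10, outer_loops=5)
print(f"{elapsed:.3e} s per call")
```

Averaging over inner iterations smooths out timer resolution for fast calls, while taking the minimum over outer iterations discards runs disturbed by other system activity.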


#### DBSCAN
| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| epsilon | float | 10 | Radius of neighborhood of a point |
| min_samples | int | 5 | The minimum number of samples required in a neighborhood to consider a point a core point |

#### RandomForestClassifier

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| criterion | str | gini | *gini* or *entropy*. The function to measure the quality of a split |
| split-algorithm | str | hist | *hist* or *global_quantile*. The algorithm to determine how nodes are split in the tree |
| num-trees | int | 100 | The number of trees in the forest |
| max-features | float_or_int | None | Upper bound on features used at each split |
| max-depth | int | None | Upper bound on depth of constructed trees |
| min-samples-split | float_or_int | 2 | Minimum number of samples for node splitting |
| max-leaf-nodes | int | None | Maximum leaf nodes per tree |
| min-impurity-decrease | float | 0 | Needed impurity decrease for node splitting |
| no-bootstrap | store_false | True | Don't control bootstrapping |

#### RandomForestRegressor

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| criterion | str | gini | *gini* or *entropy*. The function to measure the quality of a split |
| split-algorithm | str | hist | *hist* or *global_quantile*. The algorithm to determine how nodes are split in the tree |
| num-trees | int | 100 | The number of trees in the forest |
| max-features | float_or_int | None | Upper bound on features used at each split |
| max-depth | int | None | Upper bound on depth of constructed trees |
| min-samples-split | float_or_int | 2 | Minimum number of samples for node splitting |
| max-leaf-nodes | int | None | Maximum leaf nodes per tree |
| min-impurity-decrease | float | 0 | Needed impurity decrease for node splitting |
| no-bootstrap | action | True | Don't control bootstrapping |

#### KMeans

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| init | str | | Initial clusters |
| tol | float | 0 | Absolute threshold |
| maxiter | int | 100 | Maximum number of iterations |
| samples-per-batch | int | 32768 | The number of samples per batch |
| n-clusters | int | | The number of clusters |

#### KNeighborsClassifier

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| n-neighbors | int | 5 | The number of neighbors to use |
| weights | str | uniform | Weight function used in prediction |
| method | str | brute | Algorithm used to compute the nearest neighbors |
| metric | str | euclidean | Distance metric to use |

#### LinearRegression

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| no-fit-intercept | action | True | Don't fit intercept (assume data already centered) |
| solver | str | eig | *eig* or *svd*. Solver used for training |

#### LogisticRegression

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| no-fit-intercept | action | True | Don't fit intercept |
| solver | str | qn | *qn* or *owl*. Solver to use |
| maxiter | int | 100 | Maximum iterations for the iterative solver |
| C | float | 1.0 | Regularization parameter |
| tol | float | None | Tolerance for solver |
115+
| tol | float | None | Tolerance for solver |
116+
117+
#### PCA
118+
119+
| parameter Name | Type | default value | description |
120+
| ----- | ---- |---- |---- |
121+
| svd-solver | str | full | *auto*, *full* or *jacobi*. SVD solver to use |
122+
| n-components | int | None | The number of components to find |
123+
| whiten | action | False | Perform whitening |
124+
125+
#### Ridge
126+
127+
| parameter Name | Type | default value | description |
128+
| ----- | ---- |---- |---- |
129+
| no-fit-intercept | action | True | Don't fit intercept (assume data already centered) |
130+
| solver | str | eig | *eig*, *cd* or *svd*. Solver used for training |
131+
| alpha | float | 1.0 | Regularization strength |
132+
133+
#### SVC
134+
135+
| parameter Name | Type | default value | description |
136+
| ----- | ---- |---- |---- |
137+
| C | float | 0.01 | SVM slack parameter |
138+
| kernel | str | linear | *linear* or *rbf*. SVM kernel function |
139+
| gamma | float | None | Parameter for kernel="rbf" |
140+
| maxiter | int | 2000 | Maximum iterations for the iterative solver |
141+
| max-cache-size | int | 64 | Maximum cache size for SVM. |
142+
| tol | float | 1e-16 | Tolerance passed to sklearn.svm.SVC |
143+
| no-shrinking | action | True | Don't use shrinking heuristic |

#### train_test_split

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| train-size | float | 0.75 | Size of training subset |
| test-size | float | 0.25 | Size of testing subset |
| do-not-shuffle | action | False | Do not perform data shuffle before splitting |
