
Commit b2e8a28

some conclusions in the readme
1 parent 0b226e1 commit b2e8a28

File tree

1 file changed: 14 additions, 4 deletions


README.md

Lines changed: 14 additions & 4 deletions
@@ -1,8 +1,18 @@
 SmallDatasetBenchmarks
 ======================
-This repo is for testing models on small (classification) datasets.
+This repo is for testing machine learning models on small (classification) datasets.

-It uses a subset of this dataset: UCI++, a huge collection of preprocessed datasets for supervised classification problems in ARFF format
-[![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.13748.svg)](http://dx.doi.org/10.5281/zenodo.13748)
+The relevant figures are produced in `figures.ipynb`. The actual experimental set-up is in the files that start with `01_*`, `02_*`, etc. Nested cross-validation is used to obtain unbiased estimates of generalization performance. The splits are stratified random with fixed seeds, so the conclusions of these experiments are unlikely to hold for "real" data where test/production data is not IID with the training data.
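The nested, stratified, fixed-seed set-up described above could look roughly like the sketch below, using scikit-learn. This is a minimal illustration, not the repo's actual code (which lives in the `01_*`, `02_*` files); the synthetic dataset, the model, and the parameter grid are all stand-ins.

```python
# Minimal sketch of nested cross-validation with stratified, seeded splits.
# The dataset, estimator, and grid are illustrative stand-ins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)  # small stand-in dataset

# Inner loop tunes hyperparameters; outer loop gives the unbiased estimate.
# Fixed random_state values make the stratified random splits reproducible.
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner,
    scoring="roc_auc",
)
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(scores.mean())
```

Note that the inner search is refit on each outer training fold, so the outer scores never see data used for tuning.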

-Note that UCI++ reuses the same data in different configurations and often you can't tell what's a categorical feature.
+All that said, here are some observations:
+- Non-linear models are better than linear ones, even for datasets with < 100 samples.
+- SVM and logistic regression perform similarly, but there are two datasets where SVM is the only algorithm that does not fail catastrophically. However, logistic regression with the `elasticnet` penalty never drops below 0.5 area under the ROC curve.
+- LightGBM works well. Giving it more hyperparameters to try is a good idea. The `hyperopt` package did better than `scikit-optimize` and `Optuna` (not shown), but that could be user error.
+- AutoGluon works really well and is the best approach for predictive power, but you need to give it enough time: a 2-minute budget (per fold) was not enough, while 5 minutes was enough for datasets up to 10k samples.
+
+Data
+----
+The data is a subset of this dataset-of-datasets: "UCI++, a huge collection of preprocessed datasets for supervised classification problems in ARFF format
+[![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.13748.svg)](http://dx.doi.org/10.5281/zenodo.13748)"
+
+Note that UCI++ reuses the same datasets in different configurations and often you can't tell which features are categorical.

0 commit comments
