Commit 84a8b74

v1.4.0: less dependencies for standalone RealMLP, new features, etc.
1 parent 94c7e34 commit 84a8b74

29 files changed, +2168 -164 lines changed

README.md

Lines changed: 13 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -15,17 +15,18 @@ on our benchmarks.
1515

1616
![Meta-test benchmark results](./figures/meta-test_benchmark_results.png)
1717

18-
## Installation
18+
## Installation (new in 1.4.0: optional model dependencies)
1919

2020
```bash
21-
pip install pytabkit
21+
pip install pytabkit[models]
2222
```
2323

24+
- RealMLP (and TabM) can be used without the `[models]` part.
2425
- If you want to use **TabR**, you have to manually install
2526
[faiss](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md),
2627
which is only available on **conda**.
2728
- Please install torch separately if you want to control the version (CPU/GPU etc.)
28-
- Use `pytabkit[autogluon,extra,hpo,bench,dev]` to install additional dependencies for
29+
- Use `pytabkit[models,autogluon,extra,hpo,bench,dev]` to install additional dependencies for
2930
AutoGluon models, extra preprocessing,
3031
hyperparameter optimization methods beyond random search (hyperopt/SMAC),
3132
the benchmarking part, and testing/documentation. For the hpo part,
@@ -169,6 +170,15 @@ and https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
169170

170171
## Releases (see git tags)
171172

173+
- v1.4.0:
174+
- moved some imports to the new `models` optional dependencies
175+
to have a more light-weight RealMLP installation
176+
- Added GPU support for CatBoost (not guaranteed to produce exactly the same results)
177+
- Ensembling now saves models after training if a path is supplied, to reduce memory usage
178+
- Added more search spaces
179+
- fixed error in multiquantile output when the passed y was one-dimensional
180+
instead of having shape `(n_samples, 1)`
181+
- Added some examples to the documentation
172182
- v1.3.0:
173183
- Added multiquantile regression for RealMLP:
174184
see the [documentation](https://pytabkit.readthedocs.io/en/latest/models/quantile_reg.html)
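
Since the README now distinguishes a lightweight install from `pytabkit[models]`, it can help to check which optional packages are importable before choosing a model. The following is a minimal sketch using only the standard library; the helper name `missing_optional_packages` is hypothetical and not part of pytabkit, and the module names are taken from the `models` extra listed in `pyproject.toml`.

```python
import importlib.util


def missing_optional_packages(module_names):
    """Return the module names from module_names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level modules of some packages from the `models` extra (assumed import names).
models_extra = ["xgboost", "catboost", "lightgbm", "skorch"]

missing = missing_optional_packages(models_extra)
if missing:
    print("Not installed:", ", ".join(missing))
    print("Consider: pip install pytabkit[models]")
else:
    print("All checked optional model packages are available.")
```

`importlib.util.find_spec` only looks the module up without importing it, so the check stays cheap even for heavy packages.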

docs/requirements.txt

Lines changed: 3 additions & 2 deletions
@@ -8,7 +8,7 @@ lightgbm>=4.1
88
matplotlib>=3.0
99
msgpack>=1.0
1010
myst_parser>=3.0
11-
numpy>=1.25,<2.0
11+
numpy>=1.25
1212
openml>=0.14
1313
openpyxl>=3.0
1414
pandas>=2.0
@@ -22,11 +22,12 @@ pytorch_lightning>=2.0
2222
pyyaml>=5.0
2323
ray>=2.8
2424
requests>=2.0
25-
scikit-learn>=1.3,<1.6
25+
scikit-learn>=1.3
2626
seaborn>=0.0.13
2727
skorch>=0.15
2828
sphinx>=7.0
2929
sphinx_rtd_theme>=2.0
30+
torch>=2.0
3031
torch>=2.0,<2.6
3132
torchmetrics>=1.2.1
3233
tqdm

docs/source/index.rst

Lines changed: 2 additions & 0 deletions
@@ -12,11 +12,13 @@ Tabular ML models in pytabkit.models
1212
models/00_overview
1313
models/01_sklearn_interfaces
1414
models/02_hpo
15+
models/examples
1516
models/nn_classes
1617
models/03_training_implementation
1718
models/quantile_reg
1819

1920

21+
2022
Tabular benchmarking using pytabkit.bench
2123
====================================
2224

docs/source/models/examples.md

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
1+
# Examples
2+
3+
## Refitting RealMLP on train+val data using the best epoch from a previous run
4+
5+
You can refit RealMLP by simply using `n_refit=1`
6+
(or, better, larger values to ensemble multiple NNs).
7+
But in case you want more control, you can do it manually
8+
(e.g., if you only want to refit the best configuration from HPO,
9+
but you're not using the HPO within pytabkit).
10+
11+
```python
12+
import numpy as np
13+
from sklearn.model_selection import train_test_split
14+
15+
from pytabkit import RealMLP_TD_Regressor
16+
17+
np.random.seed(0)
18+
19+
X = np.random.randn(500, 5)
20+
y = np.random.randn(500)
21+
22+
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
23+
24+
reg = RealMLP_TD_Regressor(verbosity=2, random_state=0)
25+
reg.fit(X_train, y_train, X_val, y_val)
26+
27+
refit = RealMLP_TD_Regressor(verbosity=2, stop_epoch=list(reg.fit_params_['stop_epoch'].values())[0], val_fraction=0.0, random_state=0)
28+
refit.fit(X, y)
29+
```
30+
31+
## Fitting again after HPO on a smaller subset
32+
33+
Here is an example of how to run HPO on a smaller subset
34+
and fit the best configuration again with validation.
35+
(It might be better to just use `n_refit` in the HPO classifier/regressor instead.)
36+
37+
```python
38+
import numpy as np
39+
from sklearn.model_selection import train_test_split
40+
41+
from pytabkit import LGBM_HPO_TPE_Regressor, LGBM_TD_Regressor
42+
43+
# This is an example on how to fit a HPO method on a smaller subset of the data,
44+
# and then refit the best hyperparams on the full dataset
45+
np.random.seed(0)
46+
47+
X = np.random.randn(500, 5)
48+
y = np.random.randn(500)
49+
50+
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.9, random_state=0)
51+
52+
# use 90% for validation to train faster
53+
# if there is too much validation data, validation data might be the bottleneck, then you should pass
54+
model = LGBM_HPO_TPE_Regressor(val_fraction=0.9, n_hyperopt_steps=5)
55+
model.fit(X, y)
56+
57+
# unfortunately params are not always called the same way, so we need to rename a few
58+
params = model.fit_params_['hyper_fit_params']
59+
params['subsample'] = params.pop('bagging_fraction')
60+
params['colsample_bytree'] = params.pop('feature_fraction')
61+
params['lr'] = params.pop('learning_rate')
62+
63+
# unfortunately, it is hard right now to check if this is exactly the same config,
64+
# as this might set some default params that are not used in the HPO config
65+
model_refit = LGBM_TD_Regressor(**params)
66+
model_refit.fit(X, y)
67+
```
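
The renaming step in the second example calls `params.pop(...)` three times, which modifies the dictionary in place and raises `KeyError` if a key is absent. A minimal alternative sketch is shown here, assuming the same three name pairs as in the example; the helper `rename_params` and the sample values are hypothetical and only illustrate the idea.

```python
# Mapping from the names returned by the HPO run to the names used by the TD regressor,
# copied from the three renames in the example above.
RENAMES = {
    "bagging_fraction": "subsample",
    "feature_fraction": "colsample_bytree",
    "learning_rate": "lr",
}


def rename_params(params, renames=RENAMES):
    """Return a renamed copy of params; keys without a mapping are kept unchanged."""
    return {renames.get(key, key): value for key, value in params.items()}


# Illustrative values only, not the output of a real HPO run.
hpo_params = {"bagging_fraction": 0.8, "feature_fraction": 0.7, "learning_rate": 0.05}
print(rename_params(hpo_params))
```

Because the helper returns a copy, the dictionary stored on the fitted HPO model is left untouched and the renaming can be repeated safely.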

pyproject.toml

Lines changed: 37 additions & 26 deletions
@@ -13,6 +13,7 @@ keywords = ['tabular data', 'scikit-learn', 'deep learning', 'gradient boosting'
1313
authors = [
1414
{ name = "David Holzmüller" }, #, email = "[email protected]" },
1515
{ name = "Léo Grinsztajn" }, #, email = "[email protected]" },
16+
{ name = "Ingo Steinwart" }, #, email = "[email protected]" },
1617
]
1718
classifiers = [
1819
"Development Status :: 4 - Beta",
@@ -26,40 +27,50 @@ classifiers = [
2627
"License :: OSI Approved :: Apache Software License",
2728
]
2829
dependencies = [
30+
"torch>=2.0",
31+
"numpy>=1.25", # hopefully don't need <2.0 anymore?
32+
"pandas>=2.0",
33+
"scikit-learn>=1.3",
34+
# these could be made optional with lazy imports
35+
# older versions of torchmetrics (<1.2.1) have a bug that makes certain metrics used in TabR slow:
36+
# https://github.com/Lightning-AI/torchmetrics/pull/2184
37+
"torchmetrics>=1.2.1",
38+
# can also install the newer lightning package with more dependencies instead, it will be prioritized
39+
"pytorch_lightning>=2.0",
40+
"psutil>=5.0", # used for getting logical CPU count in the sklearn base and for getting process RAM usage
41+
"dill", # more powerful pickle, used for file-saving and multiprocessing
42+
]
43+
44+
[project.optional-dependencies]
45+
models = [
2946
# use <2.6 for now since it can run into pickling issues with skorch if the skorch version is too old
3047
# see https://github.com/skorch-dev/skorch/commit/be93b7769d61aa22fb928d2e89e258c629bfeaf9
3148
"torch>=2.0,<2.6",
32-
"numpy>=1.25,<2.0",
33-
"pandas>=2.0",
34-
"scikit-learn>=1.3,<1.6",
3549
"xgboost>=2.0",
3650
"catboost>=1.2",
3751
"lightgbm>=4.1",
38-
# older versions of torchmetrics (<1.2.1) have a bug that makes certain metrics used in tabr slow:
39-
# https://github.com/Lightning-AI/torchmetrics/pull/2184
40-
"torchmetrics>=1.2.1",
41-
# can also install the newer lightning package with more dependencies instead, it will be prioritized
42-
"pytorch_lightning>=2.0",
43-
"skorch>=0.15", # for rtdl models
44-
"dask[dataframe]>=2023", # this is here because of a pandas warning:
52+
# for rtdl models (MLP, ResNet) but also lightly used in TabR
53+
# note that scikit-learn 1.6 needs skorch >= 1.1.0
54+
"skorch>=0.15",
55+
"dask[dataframe]>=2023", # this is here because of a pandas warning:
4556
# "Dask dataframe query planning is disabled because dask-expr is not installed"
4657
# "packaging", # unclear why this is here?
47-
"tqdm", # for TabM with verbosity >= 1
48-
"psutil>=5.0",
49-
# more classification metrics and post-hoc calibrators (could be an optional dependency)
58+
59+
"tqdm", # for TabM with verbosity >= 1
60+
61+
# more classification metrics and post-hoc calibrators
62+
# not necessary unless these things are actually used
5063
"probmetrics>=0.0.1",
5164

52-
# packages for saving objects in different formats
53-
"dill",
65+
# saving objects in yaml/msgpack
66+
# needed if used in utils.serialize() / deserialize()
5467
"pyyaml>=5.0",
5568
"msgpack>=1.0",
5669
# apparently msgpack_numpy fixed some bug in using numpy arrays in msgpack?
57-
# but apparently it can also cause a bug in ray due to its monkey-patching of msgpack functions
58-
# in theory we shouldn't be using if for numpy arrays at the moment, not sure why the need for this occured
70+
# but apparently it can also cause a bug in ray due to its monkey-patching of msgpack functions# in theory we shouldn't be using if for numpy arrays at the moment, not sure why the need for this occured
71+
# maybe it occured because we tried to save hyperparameters that were numpy scalars instead of python scalars
5972
# "msgpack_numpy>=0.4",
6073
]
61-
62-
[project.optional-dependencies]
6374
autogluon = [
6475
"autogluon.tabular[all]>=1.0",
6576
"autogluon.multimodal>=1.0",
@@ -73,10 +84,10 @@ hpo = [
7384
"hyperopt>=0.2",
7485
]
7586
bench = [
76-
"fire", # argparse utilities
77-
"ray>=2.8", # parallelization
78-
"pynvml>=11.0", # NVIDIA GPU utilization
79-
"openml>=0.14", # OpenML data download
87+
"fire", # argparse utilities
88+
"ray>=2.8", # parallelization
89+
"pynvml>=11.0", # NVIDIA GPU utilization
90+
"openml>=0.14", # OpenML data download
8091
# ----- UCI import ------
8192
"requests>=2.0",
8293
"patool>=1.0",
@@ -102,12 +113,12 @@ path = "pytabkit/__about__.py"
102113

103114
[tool.hatch.envs.default]
104115
installer = "uv"
105-
features = ["bench","autogluon","extra","hpo","dev"]
116+
features = ["models", "bench", "autogluon", "extra", "hpo", "dev"]
106117

107118
[tool.hatch.envs.hatch-test]
108119
installer = "uv"
109-
features = ["bench","dev"]
110-
#features = ["bench","autogluon","extra","hpo","dev"]
120+
features = ["models", "bench", "dev"]
121+
#features = ["models","bench","autogluon","extra","hpo","dev"]
111122

112123
[tool.hatch.build.targets.sdist]
113124
package = ['pytabkit']
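
The comment "these could be made optional with lazy imports" in the dependency list refers to importing a package only at the point where it is used. A minimal sketch of that pattern follows, using only the standard library; the helper `require` and its error message are hypothetical and do not reflect how pytabkit implements its imports.

```python
import importlib


def require(module_name, extra):
    """Import module_name on demand and point to the matching extra if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install it with: pip install pytabkit[{extra}]"
        ) from exc


# Usage: the import happens inside the function that needs it, not at module import time.
def dump_config(config):
    json = require("json", "models")  # stand-in for an optional package
    return json.dumps(config)


print(dump_config({"n_refit": 1}))
```

With this pattern, a missing optional package only causes an error when the corresponding feature is actually called, which is what allows the base install to stay small.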

pytabkit/__about__.py

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@
22
#
33
# SPDX-License-Identifier: Apache-2.0
44

5-
__version__ = "1.3.0"
5+
__version__ = "1.4.0"

pytabkit/bench/alg_wrappers/general.py

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ def run(self, task_package: TaskPackage, logger: Logger, assigned_resources: Nod
3333
:param tmp_folders: Temporary folders, one for each train/test split, to save temporary data to.
3434
:return: A dictionary of lists of ResultManager objects.
3535
The dict key is the predict params name, which is used as a suffix for the alg_name,
36-
and each list contains ResultManagers for each train/test split.
36+
and each list contains ResultManagers for each train/test split.
3737
"""
3838
raise NotImplementedError()
3939

pytabkit/bench/alg_wrappers/interface_wrappers.py

Lines changed: 37 additions & 3 deletions
@@ -18,7 +18,8 @@
1818
from pytabkit.bench.run.results import ResultManager
1919
from pytabkit.models.alg_interfaces.other_interfaces import RFSubSplitInterface, SklearnMLPSubSplitInterface, \
2020
KANSubSplitInterface, GrandeSubSplitInterface, GBTSubSplitInterface, RandomParamsRFAlgInterface, \
21-
TabPFN2SubSplitInterface, TabICLSubSplitInterface
21+
TabPFN2SubSplitInterface, TabICLSubSplitInterface, RandomParamsExtraTreesAlgInterface, RandomParamsKNNAlgInterface, \
22+
ExtraTreesSubSplitInterface, KNNSubSplitInterface, RandomParamsLinearModelAlgInterface, LinearModelSubSplitInterface
2223
from pytabkit.bench.scheduling.resources import NodeResources
2324
from pytabkit.models.alg_interfaces.alg_interfaces import AlgInterface, MultiSplitWrapperAlgInterface
2425
from pytabkit.models.alg_interfaces.base import SplitIdxs, RequiredResources
@@ -317,7 +318,7 @@ def _create_alg_interface_impl(self, task_package: TaskPackage) -> AlgInterface:
317318
n_splits = len(task_package.split_infos)
318319
return MultiSplitWrapperAlgInterface(
319320
single_split_interfaces=[self.create_single_alg_interface(n_cv, task_type)
320-
for i in range(n_splits)])
321+
for i in range(n_splits)], **self.config)
321322

322323

323324
class SubSplitInterfaceWrapper(MultiSplitAlgInterfaceWrapper):
@@ -333,7 +334,7 @@ def create_sub_split_interface(self, task_type: TaskType) -> AlgInterface:
333334
def create_single_alg_interface(self, n_cv: int, task_type: TaskType) \
334335
-> AlgInterface:
335336
return SingleSplitWrapperAlgInterface([self.create_sub_split_interface(task_type)
336-
for i in range(n_cv)])
337+
for i in range(n_cv)], **self.config)
337338

338339

339340
class NNInterfaceWrapper(AlgInterfaceWrapper):
@@ -455,6 +456,21 @@ def create_sub_split_interface(self, task_type: TaskType) -> AlgInterface:
455456
return RFSubSplitInterface(**self.config)
456457

457458

459+
class ExtraTreesInterfaceWrapper(SubSplitInterfaceWrapper):
460+
def create_sub_split_interface(self, task_type: TaskType) -> AlgInterface:
461+
return ExtraTreesSubSplitInterface(**self.config)
462+
463+
464+
class KNNInterfaceWrapper(SubSplitInterfaceWrapper):
465+
def create_sub_split_interface(self, task_type: TaskType) -> AlgInterface:
466+
return KNNSubSplitInterface(**self.config)
467+
468+
469+
class LinearModelInterfaceWrapper(SubSplitInterfaceWrapper):
470+
def create_sub_split_interface(self, task_type: TaskType) -> AlgInterface:
471+
return LinearModelSubSplitInterface(**self.config)
472+
473+
458474
class GBTInterfaceWrapper(SubSplitInterfaceWrapper):
459475
def create_sub_split_interface(self, task_type: TaskType) -> AlgInterface:
460476
return GBTSubSplitInterface(**self.config)
@@ -544,3 +560,21 @@ class RandomParamsRFInterfaceWrapper(AlgInterfaceWrapper):
544560
def __init__(self, model_idx: int, **config):
545561
# model_idx should be the random search iteration (i.e. start from zero)
546562
super().__init__(RandomParamsRFAlgInterface, model_idx=model_idx, **config)
563+
564+
565+
class RandomParamsExtraTreesInterfaceWrapper(AlgInterfaceWrapper):
566+
def __init__(self, model_idx: int, **config):
567+
# model_idx should be the random search iteration (i.e. start from zero)
568+
super().__init__(RandomParamsExtraTreesAlgInterface, model_idx=model_idx, **config)
569+
570+
571+
class RandomParamsKNNInterfaceWrapper(AlgInterfaceWrapper):
572+
def __init__(self, model_idx: int, **config):
573+
# model_idx should be the random search iteration (i.e. start from zero)
574+
super().__init__(RandomParamsKNNAlgInterface, model_idx=model_idx, **config)
575+
576+
577+
class RandomParamsLinearModelInterfaceWrapper(AlgInterfaceWrapper):
578+
def __init__(self, model_idx: int, **config):
579+
# model_idx should be the random search iteration (i.e. start from zero)
580+
super().__init__(RandomParamsLinearModelAlgInterface, model_idx=model_idx, **config)

pytabkit/bench/data/common.py

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,8 @@ class TaskSource:
77
OPENML_CLASS_BIN_EXTRA = 'openml-class-bin-extra'
88
OPENML_REGRESSION = 'openml-reg'
99
AUTOML_CLASS_SMALL = 'automl-class-small'
10+
TABARENA_CLASS = 'tabarena-class'
11+
TABARENA_REG = 'tabarena-reg'
1012
CUSTOM = 'custom'
1113

1214

pytabkit/bench/eval/evaluation.py

Lines changed: 15 additions & 6 deletions
@@ -67,12 +67,21 @@ def select_eval_modes(self, eval_modes: List[Tuple[str, str, str]]) -> List[Tupl
6767
modes = [mode for mode in eval_modes if mode[0] == val]
6868
if len(modes) > 0:
6969
# maximize n_models
70-
idx = np.argmax([int(mode[1]) for mode in modes])
71-
idx_min = np.argmin([int(mode[1]) for mode in modes])
72-
mode = modes[idx]
73-
result.append((f' [{name}-{mode[1]}]', mode))
74-
if idx_min != idx:
75-
result.append((f' [{name}-{modes[idx_min][1]}]', modes[idx_min]))
70+
bag_sizes = [int(mode[1]) for mode in modes]
71+
max_cv = np.max(bag_sizes)
72+
min_cv = np.min(bag_sizes)
73+
bag_sizes = list({max_cv, min_cv}) # only have one element if they're equal
74+
75+
for bag_size in bag_sizes:
76+
# make sure to always select model '0' to avoid non-determinism
77+
result.append((f' [{name}-{bag_size}]', (val, str(bag_size), '0')))
78+
79+
# idx = np.argmax([int(mode[1]) for mode in modes])
80+
# idx_min = np.argmin([int(mode[1]) for mode in modes])
81+
# mode = modes[idx]
82+
# result.append((f' [{name}-{mode[1]}]', mode))
83+
# if idx_min != idx:
84+
# result.append((f' [{name}-{modes[idx_min][1]}]', modes[idx_min]))
7685

7786
return result
7887
