
OptuML: Hyperparameter Optimization for Machine Learning Algorithms using Optuna

 ⣰⡁ ⡀⣀ ⢀⡀ ⣀⣀    ⢀⡀ ⣀⡀ ⣰⡀ ⡀⢀ ⣀⣀  ⡇   ⠄ ⣀⣀  ⣀⡀ ⢀⡀ ⡀⣀ ⣰⡀   ⡎⢱ ⣀⡀ ⣰⡀ ⠄ ⣀⣀  ⠄ ⣀⣀ ⢀⡀ ⡀⣀
 ⢸  ⠏  ⠣⠜ ⠇⠇⠇   ⠣⠜ ⡧⠜ ⠘⠤ ⠣⠼ ⠇⠇⠇ ⠣   ⠇ ⠇⠇⠇ ⡧⠜ ⠣⠜ ⠏  ⠘⠤   ⠣⠜ ⡧⠜ ⠘⠤ ⠇ ⠇⠇⠇ ⠇ ⠴⠥ ⠣⠭ ⠏ 

OptuML (Optuna + ML) is a Python module providing hyperparameter optimization for machine learning algorithms using the Optuna framework. The module offers a scikit-learn compatible API with enhanced features for robust optimization.


tl;dr

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from optuml import Optimizer

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create and train optimizer
clf = Optimizer(algorithm="RandomForestClassifier", n_trials=50, cv=5, scoring="accuracy")
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(accuracy)
# e.g. 0.9111111111111111 (varies between runs; no random_state is set here)
print(y_pred[:10])
# e.g. [1 1 1 1 0 0 2 2 0 0]

tl;dr why this module?

I want to make a fair comparison of ML methods, where 'fair' means that each method has tuned hyperparameters, making it the best version of itself.

Key Features

  • Comprehensive Algorithm Support: Full scikit-learn algorithm zoo plus CatBoost and XGBoost
  • Full Scikit-learn Compatibility: Seamless integration with pipelines, cross-validation, and all sklearn tools
  • Robust Optimization: Powered by Optuna with early stopping, timeout protection, and parallel execution
  • Type-Safe Design: Separate optimizers for classification and regression with proper type checking
  • Production Ready: Cross-platform compatibility, comprehensive error handling, and extensive validation
  • Flexible Configuration: Control every aspect of the optimization process
  • Algorithm Benchmarking: Optimize and rank multiple algorithms at once, see Algorithm Benchmarking

Installation

Option A: pip (recommended)

pip install optuml

With optional algorithm support:

pip install optuml[all]          # CatBoost + XGBoost + LightGBM
pip install optuml[catboost]     # CatBoost only
pip install optuml[xgboost]      # XGBoost only
pip install optuml[lightgbm]     # LightGBM only

or upgrade:

pip install optuml --upgrade

Option B: Manual installation

# Install required dependencies
pip install optuna scikit-learn numpy

# Optional: Install additional algorithms
pip install catboost xgboost lightgbm

# Download the module
wget https://raw.githubusercontent.com/filipsPL/optuml/main/optuml/optuml.py

Quick Start

Classification Example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from optuml import Optimizer

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train optimizer
clf = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=50,
    cv=5,
    scoring="accuracy",
    random_state=42,
    show_progress_bar=True
)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# View results
print(f"Accuracy: {accuracy:.3f}")
print(f"Best parameters: {clf.best_params_}")
print(f"Optimization took: {clf.study_time_:.2f} seconds")
print(f"Trials completed: {clf.n_trials_completed_}")

Regression Example

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from optuml import Optimizer

# Load data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train optimizer
reg = Optimizer(
    algorithm="XGBRegressor",
    n_trials=100,
    cv=5,
    scoring="r2",
    early_stopping_patience=10,  # Stop if no improvement for 10 trials
    n_jobs=-1,  # Use all CPU cores for CV
    verbose=True
)
reg.fit(X_train, y_train)

# Evaluate
y_pred = reg.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.3f}")

Supported Algorithms

Classification Algorithms

Algorithm Description Key Features
SVC Support Vector Classifier Non-linear kernels, probability estimates
LogisticRegression Logistic Regression L1/L2/Elastic-Net regularization
RidgeClassifier Ridge Classifier L2 regularization, fast linear model
KNeighborsClassifier k-Nearest Neighbors Distance weighting, various metrics
RandomForestClassifier Random Forest Feature importance, OOB score
ExtraTreesClassifier Extremely Randomized Trees Faster than RF, reduced variance
AdaBoostClassifier AdaBoost Boosted ensemble, learning rate tuning
GradientBoostingClassifier Gradient Boosting Sequential boosting, feature subsampling
HistGradientBoostingClassifier Histogram Gradient Boosting Fast GBDT, native NaN support
MLPClassifier Neural Network Multiple architectures, early stopping
GaussianNB Gaussian Naive Bayes Fast, probabilistic
QDA Quadratic Discriminant Analysis Non-linear boundaries
DecisionTreeClassifier Decision Tree Multiple criteria, pruning
SGDClassifier Stochastic Gradient Descent Multiple losses, L1/L2/ElasticNet, online
CatBoostClassifier* CatBoost Categorical features, GPU support
XGBClassifier* XGBoost Regularization, missing values
LGBMClassifier* LightGBM Fast GBDT, leaf-wise growth

Regression Algorithms

Algorithm Description Key Features
SVR Support Vector Regression Epsilon-insensitive loss
LinearRegression Linear Regression Simple, interpretable
Ridge Ridge Regression L2 regularization, stable on collinear
Lasso Lasso Regression L1 regularization, feature selection
ElasticNet Elastic Net L1+L2 regularization, sparse solutions
KNeighborsRegressor k-Nearest Neighbors Local regression
RandomForestRegressor Random Forest Reduces overfitting
ExtraTreesRegressor Extremely Randomized Trees Faster than RF, reduced variance
AdaBoostRegressor AdaBoost Sequential learning
GradientBoostingRegressor Gradient Boosting Sequential boosting, feature subsampling
HistGradientBoostingRegressor Histogram Gradient Boosting Fast GBDT, native NaN support
MLPRegressor Neural Network Non-linear patterns
DecisionTreeRegressor Decision Tree Non-parametric
SGDRegressor Stochastic Gradient Descent Multiple losses, L1/L2/ElasticNet, online
CatBoostRegressor* CatBoost Handles categoricals
XGBRegressor* XGBoost High performance
LGBMRegressor* LightGBM Fast GBDT, leaf-wise growth

*Optional dependencies (install separately)
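A quick way to check which of the starred optional back-ends are importable in your environment is a standard-library sketch like this (this is plain Python, not an OptuML API):

```python
import importlib.util

# Check which optional gradient-boosting back-ends are installed.
optional = ("catboost", "xgboost", "lightgbm")
available = {pkg: importlib.util.find_spec(pkg) is not None for pkg in optional}
for pkg, ok in available.items():
    print(f"{pkg}: {'available' if ok else 'missing'}")
```

If a package shows up as missing, install it as described under Installation above.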

Advanced Features

Early Stopping

Stop optimization when no improvement is observed:

optimizer = Optimizer(
    algorithm="XGBClassifier",
    n_trials=1000,
    early_stopping_patience=20  # Stop after 20 trials without improvement
)

Parallel Cross-Validation

Speed up optimization using multiple CPU cores:

optimizer = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=100,
    cv=10,
    n_jobs=-1  # Use all available cores
)

Custom Scoring Metrics

Use any scikit-learn compatible scoring metric:

optimizer = Optimizer(
    algorithm="SVC",
    scoring="roc_auc",  # For classification
    # scoring="neg_mean_squared_error",  # For regression
    # scoring="f1_weighted",  # For imbalanced classes
)

Timeout Protection

Set time limits for optimization:

optimizer = Optimizer(
    algorithm="MLPClassifier",
    timeout=300,  # Total optimization timeout (5 minutes)
    cv_timeout=30,  # Per-trial timeout (30 seconds)
    n_trials=1000  # Will stop at timeout even if trials remain
)

Access to Optuna Study

Get detailed optimization information:

# After fitting
optimizer.fit(X_train, y_train)

# Access the Optuna study object
study = optimizer.study_
print(f"Best trial: {study.best_trial.number}")
print(f"Best value: {study.best_value:.4f}")

# Plot optimization history (requires plotly)
import optuna.visualization as vis
fig = vis.plot_optimization_history(study)
fig.show()

# Plot parameter importances
fig = vis.plot_param_importances(study)
fig.show()

Pipeline Integration

Full compatibility with scikit-learn pipelines:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create pipeline with OptuML
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('optimizer', Optimizer(algorithm="SVC", n_trials=50))
])

# Use like any sklearn pipeline
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)

Algorithm Benchmarking

AlgorithmBenchmark runs every supported algorithm (or a chosen subset) on your data, optimizes each one independently, and reports a ranked comparison; because it is not itself a scikit-learn estimator, it is free of the sklearn estimator-API constraints. See the sample script and outputs.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from optuml import AlgorithmBenchmark

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

bench = AlgorithmBenchmark(
    task="classification",   # "classification" or "regression"
    n_trials=50,
    random_state=42,
)
bench.fit(X_train, y_train)

# Ranked results as a DataFrame (requires pandas) or list of dicts
print(bench.summary())
#                      algorithm  best_score  n_trials_completed  fit_time error
# 0         RandomForestClassifier    0.983333                  50      4.21  None
# 1           ExtraTreesClassifier    0.975000                  50      3.87  None
# ...

print(bench.best_algorithm_)   # e.g. "RandomForestClassifier"
print(bench.best_score_)       # best CV score across all algorithms

# Use the winning estimator directly
predictions = bench.best_estimator_.predict(X_test)

# Or drill into any individual optimizer
rf_optimizer = bench.optimizers_["RandomForestClassifier"]
print(rf_optimizer.best_params_)

To benchmark a specific subset of algorithms:

bench = AlgorithmBenchmark(
    task="regression",
    algorithms=["Ridge", "RandomForestRegressor", "XGBRegressor"],
    n_trials=50,
    scoring="r2",
)
bench.fit(X_train, y_train)

Run algorithms in parallel across CPU cores with n_jobs_algorithms:

bench = AlgorithmBenchmark(
    task="classification",
    n_trials=50,
    n_jobs_algorithms=-1,   # one process per algorithm, all cores
)

AlgorithmBenchmark Parameters

Parameter Type Default Description
task str required "classification" or "regression"
algorithms list or "all" "all" Algorithms to benchmark
n_trials int 50 Optuna trials per algorithm
timeout float/None None Per-algorithm study timeout (seconds)
cv int 5 Cross-validation folds
scoring str/None Auto* Scoring metric
cv_timeout float 120 Per-trial CV timeout (seconds)
random_state int/None None Random seed forwarded to every Optimizer
early_stopping_patience int/None None Early stopping patience per algorithm
n_jobs int 1 Parallel CV jobs inside each Optimizer
n_jobs_algorithms int 1 Algorithms to run in parallel (-1 = all cores)
verbose bool/int False Verbosity forwarded to each Optimizer

*Auto defaults: "accuracy" for classification, "r2" for regression

Attributes after fit()

Attribute Description
best_algorithm_ Name of the best-scoring algorithm
best_estimator_ Fitted sklearn estimator from the winning optimizer
best_score_ Best CV score across all algorithms
best_params_ Hyperparameters of the winning optimizer
results_ List of per-algorithm result dicts (including failures)
optimizers_ dict[algorithm_name, Optimizer] for full introspection

Type-Specific Optimizers

For more control, use the specific optimizer classes:

from optuml.optuml import ClassifierOptimizer, RegressorOptimizer

# Classifier with all classifier-specific methods
clf = ClassifierOptimizer(
    algorithm="RandomForestClassifier",
    n_trials=100
)
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)
decision = clf.decision_function(X_test)  # If supported

# Regressor with regression-specific defaults
reg = RegressorOptimizer(
    algorithm="RandomForestRegressor",
    n_trials=100,
    scoring="r2"  # Default for regressors
)

API Reference

Main Classes

Optimizer

Universal optimizer that automatically selects between classification and regression.

ClassifierOptimizer

Specialized optimizer for classification algorithms with methods like predict_proba() and decision_function().

RegressorOptimizer

Specialized optimizer for regression algorithms with appropriate default scoring metrics.

Common Parameters

Parameter Type Default Description
algorithm str required ML algorithm to optimize
n_trials int 100 Number of optimization trials
cv int 5 Cross-validation folds
scoring str/None Auto* Scoring metric for CV
direction str "maximize" Optimization direction
timeout float/None None Total optimization timeout (seconds)
cv_timeout float 120 Single CV evaluation timeout
random_state int/None None Random seed for reproducibility
n_jobs int 1 Parallel jobs for CV (-1 for all cores)
early_stopping_patience int/None None Trials without improvement before stopping
verbose bool/int False Verbosity level
show_progress_bar bool False Show optimization progress

*Auto defaults: "accuracy" for classifiers, "r2" for regressors

Methods

Method Description Available For
fit(X, y) Optimize hyperparameters and train All
predict(X) Make predictions All
score(X, y) Evaluate model performance All
predict_proba(X) Predict class probabilities Classifiers
decision_function(X) Get decision values Some classifiers
get_params() Get optimizer parameters All
set_params(**params) Set optimizer parameters All

Attributes (after fitting)

Attribute Description
best_estimator_ Trained model with best parameters
best_params_ Best hyperparameters found
best_score_ Best cross-validation score
study_ Optuna study object
study_time_ Total optimization time
n_trials_completed_ Number of completed trials
classes_ Class labels (classifiers only)
n_features_in_ Number of input features
feature_names_in_ Feature names (if available)

Troubleshooting

Issue: "No successful trials completed"

Solution: Increase cv_timeout or reduce cv folds:

optimizer = Optimizer(algorithm="SVC", cv_timeout=300, cv=3)

Issue: CatBoost/XGBoost/LightGBM not available

Solution: Install optional dependencies:

pip install optuml[all]
# or individually:
pip install catboost xgboost lightgbm

Issue: Optimization takes too long

Solutions:

  1. Use parallel CV: n_jobs=-1
  2. Set timeout: timeout=600
  3. Use early stopping: early_stopping_patience=10
  4. Reduce trials: n_trials=50

Issue: Memory errors with large datasets

Solutions:

  1. Use algorithms with lower memory footprint (e.g., LogisticRegression, SGDClassifier, or SGDRegressor)
  2. Reduce CV folds

Best Practices

  1. Start with fewer trials: Begin with n_trials=20-50 for exploration, then increase for final optimization

  2. Use appropriate scoring metrics:

    • Imbalanced classification: "f1_weighted", "roc_auc"
    • Regression: "r2", "neg_mean_squared_error"
  3. Enable early stopping for large trial counts:

    Optimizer(n_trials=1000, early_stopping_patience=20)
  4. Set random state for reproducibility:

    Optimizer(random_state=42)
  5. Use parallel processing for faster optimization:

    Optimizer(n_jobs=-1)

Benchmark results

See this page for benchmark results.

Citation

If you use OptuML in your research, please cite:

@software{stefaniak_optuml_2024,
  author       = {Filip Stefaniak},
  title        = {OptuML: Hyperparameter Optimization for Multiple Machine Learning Algorithms using Optuna},
  year         = {2024},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17305963},
  url          = {https://doi.org/10.5281/zenodo.17305963}
}