# OptuML
OptuML (Optuna + ML) is a Python module providing hyperparameter optimization for machine learning algorithms using the Optuna framework. The module offers a scikit-learn compatible API with enhanced features for robust optimization.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from optuml import Optimizer

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create and train optimizer
clf = Optimizer(algorithm="RandomForestClassifier", n_trials=50, cv=5, scoring="accuracy")
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(accuracy)
# 0.9111111111111111

print(y_pred[:10])
# [1 1 1 1 0 0 2 2 0 0]
```

I want to make a fair comparison of ML methods, where "fair" means that each method has tuned hyperparameters, making it the best version of itself.
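That fair-comparison idea can be sketched without any ML libraries at all: tune each method over its own search space first, then compare the tuned scores. Below is a minimal stdlib stand-in, with random search in place of Optuna's samplers; `tune` and the two toy objectives are hypothetical illustrations, not part of OptuML.

```python
import random

random.seed(0)  # reproducibility, mirroring random_state

def tune(evaluate, search_space, n_trials=50):
    """Random-search stand-in for Optuna: try n_trials configurations
    and keep the best (params, score) pair."""
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: random.choice(values) for name, values in search_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Two toy "methods", each with its own hyperparameter space.
# A fair comparison tunes both before comparing them.
method_a = lambda p: 0.90 - 0.05 * abs(p["depth"] - 4)
method_b = lambda p: 0.95 - 0.10 * abs(p["c"] - 1.0)

best_a = tune(method_a, {"depth": [2, 3, 4, 5, 6]})
best_b = tune(method_b, {"c": [0.1, 1.0, 10.0]})
print(best_a)  # best depth found and its score
print(best_b)  # best c found and its score
```

Only after both methods are at their tuned best does comparing their scores say anything about the methods themselves.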
- Comprehensive Algorithm Support: Full scikit-learn algorithm zoo plus CatBoost and XGBoost
- Full Scikit-learn Compatibility: Seamless integration with pipelines, cross-validation, and all sklearn tools
- Robust Optimization: Powered by Optuna with early stopping, timeout protection, and parallel execution
- Type-Safe Design: Separate optimizers for classification and regression with proper type checking
- Production Ready: Cross-platform compatibility, comprehensive error handling, and extensive validation
- Flexible Configuration: Control every aspect of the optimization process
- Benchmarking: optimize and rank multiple algorithms at once (see the AlgorithmBenchmark section below)
```
pip install optuml
```

With optional algorithm support:

```
pip install optuml[all]       # CatBoost + XGBoost + LightGBM
pip install optuml[catboost]  # CatBoost only
pip install optuml[xgboost]   # XGBoost only
pip install optuml[lightgbm]  # LightGBM only
```

or upgrade:

```
pip install optuml --upgrade
```

Alternatively, install the dependencies manually and download the module:

```
# Install required dependencies
pip install optuna scikit-learn numpy

# Optional: install additional algorithms
pip install catboost xgboost

# Download the module
wget https://raw.githubusercontent.com/filipsPL/optuml/main/optuml/optuml.py
```

Classification:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from optuml import Optimizer

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train optimizer
clf = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=50,
    cv=5,
    scoring="accuracy",
    random_state=42,
    show_progress_bar=True
)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# View results
print(f"Accuracy: {accuracy:.3f}")
print(f"Best parameters: {clf.best_params_}")
print(f"Optimization took: {clf.study_time_:.2f} seconds")
print(f"Trials completed: {clf.n_trials_completed_}")
```

Regression:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

from optuml import Optimizer

# Load data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train optimizer
reg = Optimizer(
    algorithm="XGBRegressor",
    n_trials=100,
    cv=5,
    scoring="r2",
    early_stopping_patience=10,  # Stop if no improvement for 10 trials
    n_jobs=-1,                   # Use all CPU cores for CV
    verbose=True
)
reg.fit(X_train, y_train)

# Evaluate
y_pred = reg.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.3f}")
```

| Algorithm | Description | Key Features |
|---|---|---|
| `SVC` | Support Vector Classifier | Non-linear kernels, probability estimates |
| `LogisticRegression` | Logistic Regression | L1/L2/Elastic-Net regularization |
| `RidgeClassifier` | Ridge Classifier | L2 regularization, fast linear model |
| `KNeighborsClassifier` | k-Nearest Neighbors | Distance weighting, various metrics |
| `RandomForestClassifier` | Random Forest | Feature importance, OOB score |
| `ExtraTreesClassifier` | Extremely Randomized Trees | Faster than RF, reduced variance |
| `AdaBoostClassifier` | AdaBoost | Boosted ensemble, learning rate tuning |
| `GradientBoostingClassifier` | Gradient Boosting | Sequential boosting, feature subsampling |
| `HistGradientBoostingClassifier` | Histogram Gradient Boosting | Fast GBDT, native NaN support |
| `MLPClassifier` | Neural Network | Multiple architectures, early stopping |
| `GaussianNB` | Gaussian Naive Bayes | Fast, probabilistic |
| `QDA` | Quadratic Discriminant Analysis | Non-linear boundaries |
| `DecisionTreeClassifier` | Decision Tree | Multiple criteria, pruning |
| `SGDClassifier` | Stochastic Gradient Descent | Multiple losses, L1/L2/ElasticNet, online |
| `CatBoostClassifier`* | CatBoost | Categorical features, GPU support |
| `XGBClassifier`* | XGBoost | Regularization, missing values |
| `LGBMClassifier`* | LightGBM | Fast GBDT, leaf-wise growth |
| Algorithm | Description | Key Features |
|---|---|---|
| `SVR` | Support Vector Regression | Epsilon-insensitive loss |
| `LinearRegression` | Linear Regression | Simple, interpretable |
| `Ridge` | Ridge Regression | L2 regularization, stable on collinear data |
| `Lasso` | Lasso Regression | L1 regularization, feature selection |
| `ElasticNet` | Elastic Net | L1+L2 regularization, sparse solutions |
| `KNeighborsRegressor` | k-Nearest Neighbors | Local regression |
| `RandomForestRegressor` | Random Forest | Reduces overfitting |
| `ExtraTreesRegressor` | Extremely Randomized Trees | Faster than RF, reduced variance |
| `AdaBoostRegressor` | AdaBoost | Sequential learning |
| `GradientBoostingRegressor` | Gradient Boosting | Sequential boosting, feature subsampling |
| `HistGradientBoostingRegressor` | Histogram Gradient Boosting | Fast GBDT, native NaN support |
| `MLPRegressor` | Neural Network | Non-linear patterns |
| `DecisionTreeRegressor` | Decision Tree | Non-parametric |
| `SGDRegressor` | Stochastic Gradient Descent | Multiple losses, L1/L2/ElasticNet, online |
| `CatBoostRegressor`* | CatBoost | Handles categoricals |
| `XGBRegressor`* | XGBoost | High performance |
| `LGBMRegressor`* | LightGBM | Fast GBDT, leaf-wise growth |
*Optional dependencies (install separately)
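The starred entries depend on optional packages. One way to check whether a backend is importable before requesting its algorithm is the generic stdlib pattern below; `backend_available` is a hypothetical helper, not an OptuML API.

```python
import importlib.util

def backend_available(module_name: str) -> bool:
    """True if the optional package (e.g. 'xgboost', 'catboost',
    'lightgbm') can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# Pick an algorithm string based on what is installed.
algorithm = "XGBClassifier" if backend_available("xgboost") else "GradientBoostingClassifier"
print(algorithm)
```

This avoids an ImportError at fit time when an optional backend is missing.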
Stop optimization when no improvement is observed:

```python
optimizer = Optimizer(
    algorithm="XGBClassifier",
    n_trials=1000,
    early_stopping_patience=20  # Stop after 20 trials without improvement
)
```

Speed up optimization using multiple CPU cores:

```python
optimizer = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=100,
    cv=10,
    n_jobs=-1  # Use all available cores
)
```

Use any scikit-learn compatible scoring metric:

```python
optimizer = Optimizer(
    algorithm="SVC",
    scoring="roc_auc",  # For classification
    # scoring="neg_mean_squared_error",  # For regression
    # scoring="f1_weighted",  # For imbalanced classes
)
```

Set time limits for optimization:

```python
optimizer = Optimizer(
    algorithm="MLPClassifier",
    timeout=300,    # Total optimization timeout (5 minutes)
    cv_timeout=30,  # Per-trial timeout (30 seconds)
    n_trials=1000   # Will stop at timeout even if trials remain
)
```

Get detailed optimization information:

```python
# After fitting
optimizer.fit(X_train, y_train)

# Access the Optuna study object
study = optimizer.study_
print(f"Best trial: {study.best_trial.number}")
print(f"Best value: {study.best_value:.4f}")

# Plot optimization history (requires plotly)
import optuna.visualization as vis
fig = vis.plot_optimization_history(study)
fig.show()

# Plot parameter importances
fig = vis.plot_param_importances(study)
fig.show()
```

Full compatibility with scikit-learn pipelines:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create pipeline with OptuML
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('optimizer', Optimizer(algorithm="SVC", n_trials=50))
])

# Use like any sklearn pipeline
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```

AlgorithmBenchmark runs every supported algorithm (or a chosen subset) on your data, optimizes each one independently, and reports a ranked comparison, without any scikit-learn estimator constraints. See the sample script and outputs.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from optuml import AlgorithmBenchmark

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

bench = AlgorithmBenchmark(
    task="classification",  # "classification" or "regression"
    n_trials=50,
    random_state=42,
)
bench.fit(X_train, y_train)

# Ranked results as a DataFrame (requires pandas) or list of dicts
print(bench.summary())
#                 algorithm  best_score  n_trials_completed  fit_time  error
# 0  RandomForestClassifier    0.983333                  50      4.21   None
# 1    ExtraTreesClassifier    0.975000                  50      3.87   None
# ...

print(bench.best_algorithm_)  # e.g. "RandomForestClassifier"
print(bench.best_score_)      # best CV score across all algorithms

# Use the winning estimator directly
predictions = bench.best_estimator_.predict(X_test)

# Or drill into any individual optimizer
rf_optimizer = bench.optimizers_["RandomForestClassifier"]
print(rf_optimizer.best_params_)
```

To benchmark a specific subset of algorithms:

```python
bench = AlgorithmBenchmark(
    task="regression",
    algorithms=["Ridge", "RandomForestRegressor", "XGBRegressor"],
    n_trials=50,
    scoring="r2",
)
bench.fit(X_train, y_train)
```

Run algorithms in parallel across CPU cores with `n_jobs_algorithms`:

```python
bench = AlgorithmBenchmark(
    task="classification",
    n_trials=50,
    n_jobs_algorithms=-1,  # one process per algorithm, all cores
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `task` | str | required | "classification" or "regression" |
| `algorithms` | list or "all" | "all" | Algorithms to benchmark |
| `n_trials` | int | 50 | Optuna trials per algorithm |
| `timeout` | float/None | None | Per-algorithm study timeout (seconds) |
| `cv` | int | 5 | Cross-validation folds |
| `scoring` | str/None | Auto* | Scoring metric |
| `cv_timeout` | float | 120 | Per-trial CV timeout (seconds) |
| `random_state` | int/None | None | Random seed forwarded to every Optimizer |
| `early_stopping_patience` | int/None | None | Early stopping patience per algorithm |
| `n_jobs` | int | 1 | Parallel CV jobs inside each Optimizer |
| `n_jobs_algorithms` | int | 1 | Algorithms to run in parallel (-1 = all cores) |
| `verbose` | bool/int | False | Verbosity forwarded to each Optimizer |
*Auto defaults: "accuracy" for classification, "r2" for regression
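Both `n_jobs` knobs follow the usual scikit-learn convention where -1 means "use every available core". A short sketch of that convention; `resolve_n_jobs` is a hypothetical helper for illustration, not part of the package:

```python
import os

def resolve_n_jobs(n_jobs: int) -> int:
    """Map the sklearn-style n_jobs convention to a worker count:
    -1 means all available cores, positive values are used as-is."""
    cpus = os.cpu_count() or 1
    if n_jobs == -1:
        return cpus
    return max(1, n_jobs)

print(resolve_n_jobs(-1))  # number of CPU cores on this machine
print(resolve_n_jobs(4))   # 4
```

Note that `n_jobs_algorithms=-1` parallelizes across algorithms while `n_jobs=-1` parallelizes the CV folds inside each one; combining both can oversubscribe the CPU.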
| Attribute | Description |
|---|---|
| `best_algorithm_` | Name of the best-scoring algorithm |
| `best_estimator_` | Fitted sklearn estimator from the winning optimizer |
| `best_score_` | Best CV score across all algorithms |
| `best_params_` | Hyperparameters of the winning optimizer |
| `results_` | List of per-algorithm result dicts (including failures) |
| `optimizers_` | dict[algorithm_name, Optimizer] for full introspection |
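Since `results_` includes failed runs, ranking it yourself takes one small step: sort successful entries by score and push failures to the end. A sketch assuming result dicts shaped like the `summary()` columns above; `rank_results` is a hypothetical helper:

```python
def rank_results(results):
    """Sort per-algorithm result dicts best-score-first,
    keeping failed runs (error is not None) at the end."""
    ok = [r for r in results if r.get("error") is None]
    failed = [r for r in results if r.get("error") is not None]
    return sorted(ok, key=lambda r: r["best_score"], reverse=True) + failed

results = [
    {"algorithm": "SVC", "best_score": 0.95, "error": None},
    {"algorithm": "MLPClassifier", "best_score": None, "error": "timeout"},
    {"algorithm": "RandomForestClassifier", "best_score": 0.98, "error": None},
]
print([r["algorithm"] for r in rank_results(results)])
# ['RandomForestClassifier', 'SVC', 'MLPClassifier']
```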
For more control, use the specific optimizer classes:
```python
from optuml.optuml import ClassifierOptimizer, RegressorOptimizer

# Classifier with all classifier-specific methods
clf = ClassifierOptimizer(
    algorithm="RandomForestClassifier",
    n_trials=100
)
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)
decision = clf.decision_function(X_test)  # If supported

# Regressor with regression-specific defaults
reg = RegressorOptimizer(
    algorithm="RandomForestRegressor",
    n_trials=100,
    scoring="r2"  # Default for regressors
)
```

`Optimizer`: universal optimizer that automatically selects between classification and regression.

`ClassifierOptimizer`: specialized optimizer for classification algorithms, with methods like `predict_proba()` and `decision_function()`.

`RegressorOptimizer`: specialized optimizer for regression algorithms, with appropriate default scoring metrics.
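Because all three classes expose the standard estimator surface, downstream code can stay generic and simply probe for the optional methods. A small illustration of that duck-typing pattern, stdlib only; the `Dummy*` classes are stand-ins, not OptuML classes:

```python
def soft_predict(model, X):
    """Use predict_proba when the estimator offers it, else fall back to predict."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)
    return model.predict(X)

class DummyClassifier:
    def predict(self, X):
        return [0 for _ in X]
    def predict_proba(self, X):
        return [[0.7, 0.3] for _ in X]

class DummyRegressor:
    def predict(self, X):
        return [0.0 for _ in X]

print(soft_predict(DummyClassifier(), [1, 2]))  # [[0.7, 0.3], [0.7, 0.3]]
print(soft_predict(DummyRegressor(), [1, 2]))   # [0.0, 0.0]
```

The same pattern works with `decision_function`, which only some classifiers provide.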
| Parameter | Type | Default | Description |
|---|---|---|---|
| `algorithm` | str | required | ML algorithm to optimize |
| `n_trials` | int | 100 | Number of optimization trials |
| `cv` | int | 5 | Cross-validation folds |
| `scoring` | str/None | Auto* | Scoring metric for CV |
| `direction` | str | "maximize" | Optimization direction |
| `timeout` | float/None | None | Total optimization timeout (seconds) |
| `cv_timeout` | float | 120 | Single CV evaluation timeout (seconds) |
| `random_state` | int/None | None | Random seed for reproducibility |
| `n_jobs` | int | 1 | Parallel jobs for CV (-1 for all cores) |
| `early_stopping_patience` | int/None | None | Trials without improvement before stopping |
| `verbose` | bool/int | False | Verbosity level |
| `show_progress_bar` | bool | False | Show optimization progress |
*Auto defaults: "accuracy" for classifiers, "r2" for regressors
| Method | Description | Available For |
|---|---|---|
| `fit(X, y)` | Optimize hyperparameters and train | All |
| `predict(X)` | Make predictions | All |
| `score(X, y)` | Evaluate model performance | All |
| `predict_proba(X)` | Predict class probabilities | Classifiers |
| `decision_function(X)` | Get decision values | Some classifiers |
| `get_params()` | Get optimizer parameters | All |
| `set_params(**params)` | Set optimizer parameters | All |
| Attribute | Description |
|---|---|
| `best_estimator_` | Trained model with best parameters |
| `best_params_` | Best hyperparameters found |
| `best_score_` | Best cross-validation score |
| `study_` | Optuna study object |
| `study_time_` | Total optimization time (seconds) |
| `n_trials_completed_` | Number of completed trials |
| `classes_` | Class labels (classifiers only) |
| `n_features_in_` | Number of input features |
| `feature_names_in_` | Feature names (if available) |
If CV evaluations time out, increase `cv_timeout` or reduce the number of CV folds:

```python
optimizer = Optimizer(algorithm="SVC", cv_timeout=300, cv=3)
```

If an optional algorithm fails to import, install the missing dependencies:

```
pip install optuml[all]
# or individually:
pip install catboost xgboost lightgbm
```

If optimization is slow:

- Use parallel CV: `n_jobs=-1`
- Set a timeout: `timeout=600`
- Use early stopping: `early_stopping_patience=10`
- Reduce trials: `n_trials=50`

If memory usage is too high:

- Use algorithms with a lower memory footprint (e.g., `LogisticRegression`, `SGDClassifier`, or `SGDRegressor`)
- Reduce CV folds

Best practices:

- Start with fewer trials: begin with `n_trials=20-50` for exploration, then increase for the final optimization
- Use appropriate scoring metrics:
  - Imbalanced classification: `"f1_weighted"`, `"roc_auc"`
  - Regression: `"r2"`, `"neg_mean_squared_error"`
- Enable early stopping for large trial counts: `Optimizer(n_trials=1000, early_stopping_patience=20)`
- Set a random state for reproducibility: `Optimizer(random_state=42)`
- Use parallel processing for faster optimization: `Optimizer(n_jobs=-1)`
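The `early_stopping_patience` rule is easy to state precisely: stop once the best score has not improved for that many consecutive trials. A small stdlib sketch of that semantics; `should_stop` is illustrative, not the package's internal implementation:

```python
def should_stop(scores, patience):
    """True once the running best has not improved for `patience` trials."""
    if len(scores) <= patience:
        return False
    best_index = scores.index(max(scores))  # trial where the best was first reached
    return len(scores) - 1 - best_index >= patience

history = [0.80, 0.90, 0.89, 0.90, 0.88]
print(should_stop(history, patience=3))  # best first reached 3 trials ago -> True
print(should_stop(history, patience=4))  # not yet -> False
```

With `patience=20` and `n_trials=1000`, the study simply ends early at whichever trial first satisfies this condition.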
See this page for benchmark results.
If you use OptuML in your research, please cite:
```bibtex
@software{stefaniak_optuml_2024,
  author    = {Filip Stefaniak},
  title     = {OptuML: Hyperparameter Optimization for Multiple Machine Learning Algorithms using Optuna},
  year      = {2024},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17305963},
  url       = {https://doi.org/10.5281/zenodo.17305963}
}
```