
OptuML: Hyperparameter Optimization for Machine Learning Algorithms using Optuna

 ⣰⡁ ⡀⣀ ⢀⡀ ⣀⣀    ⢀⡀ ⣀⡀ ⣰⡀ ⡀⢀ ⣀⣀  ⡇   ⠄ ⣀⣀  ⣀⡀ ⢀⡀ ⡀⣀ ⣰⡀   ⡎⢱ ⣀⡀ ⣰⡀ ⠄ ⣀⣀  ⠄ ⣀⣀ ⢀⡀ ⡀⣀
 ⢸  ⠏  ⠣⠜ ⠇⠇⠇   ⠣⠜ ⡧⠜ ⠘⠤ ⠣⠼ ⠇⠇⠇ ⠣   ⠇ ⠇⠇⠇ ⡧⠜ ⠣⠜ ⠏  ⠘⠤   ⠣⠜ ⡧⠜ ⠘⠤ ⠇ ⠇⠇⠇ ⠇ ⠴⠥ ⠣⠭ ⠏ 

OptuML (Optuna + ML) is a Python module providing hyperparameter optimization for machine learning algorithms using the Optuna framework. The module offers a scikit-learn compatible API with enhanced features for robust optimization.


tl;dr

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from optuml import Optimizer

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create and train optimizer
clf = Optimizer(algorithm="RandomForestClassifier", n_trials=50, cv=5, scoring="accuracy")
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(accuracy)
# e.g. 0.9111111111111111 (varies between runs; no random_state is set here)
print(y_pred[:10])
# e.g. [1 1 1 1 0 0 2 2 0 0]

tl;dr why this module?

I want to make a fair comparison of ML methods, where 'fair' means that each method has tuned hyperparameters, making it the best version of itself.

Key Features

  • Comprehensive Algorithm Support: Full scikit-learn algorithm zoo plus CatBoost and XGBoost
  • Full Scikit-learn Compatibility: Seamless integration with pipelines, cross-validation, and all sklearn tools
  • Robust Optimization: Powered by Optuna with early stopping, timeout protection, and parallel execution
  • Type-Safe Design: Separate optimizers for classification and regression with proper type checking
  • Production Ready: Cross-platform compatibility, comprehensive error handling, and extensive validation
  • Flexible Configuration: Control every aspect of the optimization process
  • Algorithm Benchmarking: Optimize and rank multiple algorithms at once, see Algorithm Benchmarking

Installation

Option A: pip (recommended)

pip install optuml

With optional algorithm support:

pip install optuml[all]          # CatBoost + XGBoost + LightGBM
pip install optuml[catboost]     # CatBoost only
pip install optuml[xgboost]      # XGBoost only
pip install optuml[lightgbm]     # LightGBM only

or upgrade:

pip install optuml --upgrade

Option B: Manual installation

# Install required dependencies
pip install optuna scikit-learn numpy

# Optional: Install additional algorithms
pip install catboost xgboost lightgbm

# Download the module
wget https://raw.githubusercontent.com/filipsPL/optuml/main/optuml/optuml.py

Quick Start

Classification Example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from optuml import Optimizer

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train optimizer
clf = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=50,
    cv=5,
    scoring="accuracy",
    random_state=42,
    show_progress_bar=True
)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# View results
print(f"Accuracy: {accuracy:.3f}")
print(f"Best parameters: {clf.best_params_}")
print(f"Optimization took: {clf.study_time_:.2f} seconds")
print(f"Trials completed: {clf.n_trials_completed_}")

Regression Example

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from optuml import Optimizer

# Load data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train optimizer
reg = Optimizer(
    algorithm="XGBRegressor",
    n_trials=100,
    cv=5,
    scoring="r2",
    early_stopping_patience=10,  # Stop if no improvement for 10 trials
    n_jobs=-1,  # Use all CPU cores for CV
    verbose=True
)
reg.fit(X_train, y_train)

# Evaluate
y_pred = reg.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.3f}")

Supported Algorithms

Classification Algorithms

Algorithm Description Key Features
SVC Support Vector Classifier Non-linear kernels, probability estimates
LogisticRegression Logistic Regression L1/L2/Elastic-Net regularization
RidgeClassifier Ridge Classifier L2 regularization, fast linear model
KNeighborsClassifier k-Nearest Neighbors Distance weighting, various metrics
RandomForestClassifier Random Forest Feature importance, OOB score
ExtraTreesClassifier Extremely Randomized Trees Faster than RF, reduced variance
AdaBoostClassifier AdaBoost Boosted ensemble, learning rate tuning
GradientBoostingClassifier Gradient Boosting Sequential boosting, feature subsampling
HistGradientBoostingClassifier Histogram Gradient Boosting Fast GBDT, native NaN support
MLPClassifier Neural Network Multiple architectures, early stopping
GaussianNB Gaussian Naive Bayes Fast, probabilistic
QDA Quadratic Discriminant Analysis Non-linear boundaries
DecisionTreeClassifier Decision Tree Multiple criteria, pruning
SGDClassifier Stochastic Gradient Descent Multiple losses, L1/L2/ElasticNet, online
CatBoostClassifier* CatBoost Categorical features, GPU support
XGBClassifier* XGBoost Regularization, missing values
LGBMClassifier* LightGBM Fast GBDT, leaf-wise growth

Regression Algorithms

Algorithm Description Key Features
SVR Support Vector Regression Epsilon-insensitive loss
LinearRegression Linear Regression Simple, interpretable
Ridge Ridge Regression L2 regularization, stable on collinear
Lasso Lasso Regression L1 regularization, feature selection
ElasticNet Elastic Net L1+L2 regularization, sparse solutions
KNeighborsRegressor k-Nearest Neighbors Local regression
RandomForestRegressor Random Forest Reduces overfitting
ExtraTreesRegressor Extremely Randomized Trees Faster than RF, reduced variance
AdaBoostRegressor AdaBoost Sequential learning
GradientBoostingRegressor Gradient Boosting Sequential boosting, feature subsampling
HistGradientBoostingRegressor Histogram Gradient Boosting Fast GBDT, native NaN support
MLPRegressor Neural Network Non-linear patterns
DecisionTreeRegressor Decision Tree Non-parametric
SGDRegressor Stochastic Gradient Descent Multiple losses, L1/L2/ElasticNet, online
CatBoostRegressor* CatBoost Handles categoricals
XGBRegressor* XGBoost High performance
LGBMRegressor* LightGBM Fast GBDT, leaf-wise growth

*Optional dependencies (install separately)
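A quick way to check which of the starred optional back-ends are importable in your environment is a standard-library sketch like this (this is plain Python, not an OptuML API):

```python
import importlib.util

# Check which optional gradient-boosting back-ends are installed.
optional = ("catboost", "xgboost", "lightgbm")
available = {pkg: importlib.util.find_spec(pkg) is not None for pkg in optional}
for pkg, ok in available.items():
    print(f"{pkg}: {'available' if ok else 'missing'}")
```

If a package shows up as missing, install it as described under Installation above.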

Advanced Features

Early Stopping

Stop optimization when no improvement is observed:

optimizer = Optimizer(
    algorithm="XGBClassifier",
    n_trials=1000,
    early_stopping_patience=20  # Stop after 20 trials without improvement
)

Parallel Cross-Validation

Speed up optimization using multiple CPU cores:

optimizer = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=100,
    cv=10,
    n_jobs=-1  # Use all available cores
)

Custom Scoring Metrics

Use any scikit-learn compatible scoring metric:

optimizer = Optimizer(
    algorithm="SVC",
    scoring="roc_auc",  # For classification
    # scoring="neg_mean_squared_error",  # For regression
    # scoring="f1_weighted",  # For imbalanced classes
)

Timeout Protection

Set time limits for optimization:

optimizer = Optimizer(
    algorithm="MLPClassifier",
    timeout=300,  # Total optimization timeout (5 minutes)
    cv_timeout=30,  # Per-trial timeout (30 seconds)
    n_trials=1000  # Will stop at timeout even if trials remain
)

Access to Optuna Study

Get detailed optimization information:

# After fitting
optimizer.fit(X_train, y_train)

# Access the Optuna study object
study = optimizer.study_
print(f"Best trial: {study.best_trial.number}")
print(f"Best value: {study.best_value:.4f}")

# Plot optimization history (requires plotly)
import optuna.visualization as vis
fig = vis.plot_optimization_history(study)
fig.show()

# Plot parameter importances
fig = vis.plot_param_importances(study)
fig.show()

Pipeline Integration

Full compatibility with scikit-learn pipelines:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create pipeline with OptuML
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('optimizer', Optimizer(algorithm="SVC", n_trials=50))
])

# Use like any sklearn pipeline
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)

Algorithm Benchmarking

AlgorithmBenchmark runs every supported algorithm (or a chosen subset) on your data, optimizes each one independently, and reports a ranked comparison; because it is not itself a scikit-learn estimator, it is free of the sklearn estimator-API constraints. See the sample script and outputs.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from optuml import AlgorithmBenchmark

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

bench = AlgorithmBenchmark(
    task="classification",   # "classification" or "regression"
    n_trials=50,
    random_state=42,
)
bench.fit(X_train, y_train)

# Ranked results as a DataFrame (requires pandas) or list of dicts
print(bench.summary())
#                      algorithm  best_score  n_trials_completed  fit_time error
# 0         RandomForestClassifier    0.983333                  50      4.21  None
# 1           ExtraTreesClassifier    0.975000                  50      3.87  None
# ...

print(bench.best_algorithm_)   # e.g. "RandomForestClassifier"
print(bench.best_score_)       # best CV score across all algorithms

# Use the winning estimator directly
predictions = bench.best_estimator_.predict(X_test)

# Or drill into any individual optimizer
rf_optimizer = bench.optimizers_["RandomForestClassifier"]
print(rf_optimizer.best_params_)

To benchmark a specific subset of algorithms:

bench = AlgorithmBenchmark(
    task="regression",
    algorithms=["Ridge", "RandomForestRegressor", "XGBRegressor"],
    n_trials=50,
    scoring="r2",
)
bench.fit(X_train, y_train)

Run algorithms in parallel across CPU cores with n_jobs_algorithms:

bench = AlgorithmBenchmark(
    task="classification",
    n_trials=50,
    n_jobs_algorithms=-1,   # one process per algorithm, all cores
)

AlgorithmBenchmark Parameters

Parameter Type Default Description
task str required "classification" or "regression"
algorithms list or "all" "all" Algorithms to benchmark
n_trials int 50 Optuna trials per algorithm
timeout float/None None Per-algorithm study timeout (seconds)
cv int 5 Cross-validation folds
scoring str/None Auto* Scoring metric
cv_timeout float 120 Per-trial CV timeout (seconds)
random_state int/None None Random seed forwarded to every Optimizer
early_stopping_patience int/None None Early stopping patience per algorithm
n_jobs int 1 Parallel CV jobs inside each Optimizer
n_jobs_algorithms int 1 Algorithms to run in parallel (-1 = all cores)
verbose bool/int False Verbosity forwarded to each Optimizer

*Auto defaults: "accuracy" for classification, "r2" for regression

Attributes after fit()

Attribute Description
best_algorithm_ Name of the best-scoring algorithm
best_estimator_ Fitted sklearn estimator from the winning optimizer
best_score_ Best CV score across all algorithms
best_params_ Hyperparameters of the winning optimizer
results_ List of per-algorithm result dicts (including failures)
optimizers_ dict[algorithm_name, Optimizer] for full introspection

Type-Specific Optimizers

For more control, use the specific optimizer classes:

from optuml.optuml import ClassifierOptimizer, RegressorOptimizer

# Classifier with all classifier-specific methods
clf = ClassifierOptimizer(
    algorithm="RandomForestClassifier",
    n_trials=100
)
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)
decision = clf.decision_function(X_test)  # If supported

# Regressor with regression-specific defaults
reg = RegressorOptimizer(
    algorithm="RandomForestRegressor",
    n_trials=100,
    scoring="r2"  # Default for regressors
)

API Reference

Main Classes

Optimizer

Universal optimizer that automatically selects between classification and regression.

ClassifierOptimizer

Specialized optimizer for classification algorithms with methods like predict_proba() and decision_function().

RegressorOptimizer

Specialized optimizer for regression algorithms with appropriate default scoring metrics.

Common Parameters

Parameter Type Default Description
algorithm str required ML algorithm to optimize
n_trials int 100 Number of optimization trials
cv int 5 Cross-validation folds
scoring str/None Auto* Scoring metric for CV
direction str "maximize" Optimization direction
timeout float/None None Total optimization timeout (seconds)
cv_timeout float 120 Single CV evaluation timeout
random_state int/None None Random seed for reproducibility
n_jobs int 1 Parallel jobs for CV (-1 for all cores)
early_stopping_patience int/None None Trials without improvement before stopping
verbose bool/int False Verbosity level
show_progress_bar bool False Show optimization progress

*Auto defaults: "accuracy" for classifiers, "r2" for regressors

Methods

Method Description Available For
fit(X, y) Optimize hyperparameters and train All
predict(X) Make predictions All
score(X, y) Evaluate model performance All
predict_proba(X) Predict class probabilities Classifiers
decision_function(X) Get decision values Some classifiers
get_params() Get optimizer parameters All
set_params(**params) Set optimizer parameters All

Attributes (after fitting)

Attribute Description
best_estimator_ Trained model with best parameters
best_params_ Best hyperparameters found
best_score_ Best cross-validation score
study_ Optuna study object
study_time_ Total optimization time
n_trials_completed_ Number of completed trials
classes_ Class labels (classifiers only)
n_features_in_ Number of input features
feature_names_in_ Feature names (if available)

Troubleshooting

Issue: "No successful trials completed"

Solution: Increase cv_timeout or reduce cv folds:

optimizer = Optimizer(algorithm="SVC", cv_timeout=300, cv=3)

Issue: CatBoost/XGBoost/LightGBM not available

Solution: Install optional dependencies:

pip install optuml[all]
# or individually:
pip install catboost xgboost lightgbm

Issue: Optimization takes too long

Solutions:

  1. Use parallel CV: n_jobs=-1
  2. Set timeout: timeout=600
  3. Use early stopping: early_stopping_patience=10
  4. Reduce trials: n_trials=50

Issue: Memory errors with large datasets

Solutions:

  1. Use algorithms with lower memory footprint (e.g., LogisticRegression, SGDClassifier, or SGDRegressor)
  2. Reduce CV folds

Best Practices

  1. Start with fewer trials: Begin with n_trials=20-50 for exploration, then increase for final optimization

  2. Use appropriate scoring metrics:

    • Imbalanced classification: "f1_weighted", "roc_auc"
    • Regression: "r2", "neg_mean_squared_error"
  3. Enable early stopping for large trial counts:

    Optimizer(n_trials=1000, early_stopping_patience=20)
  4. Set random state for reproducibility:

    Optimizer(random_state=42)
  5. Use parallel processing for faster optimization:

    Optimizer(n_jobs=-1)

Benchmark results

See this page for benchmark results.

Citation

If you use OptuML in your research, please cite:

@software{stefaniak_optuml_2024,
  author       = {Filip Stefaniak},
  title        = {OptuML: Hyperparameter Optimization for Multiple Machine Learning Algorithms using Optuna},
  year         = {2024},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17305963},
  url          = {https://doi.org/10.5281/zenodo.17305963}
}