10.18.1

mathias-von-ottenbreit · mathias-von-ottenbreit · commit afc0518ed9ad · 2025-10-30T20:59:43.000+01:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,22 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+## [10.18.1] - 2025-10-30
+
+### Fixed
+- **Improved Backward Compatibility for Saved Models:** Resolved an issue where loading models trained with older versions of `aplr` would fail due to missing attributes. The `__setstate__` method now initializes new preprocessing-related attributes to `None` for older models, ensuring they can be loaded and used without `AttributeError` exceptions.
+- **Stability for Unfitted Models:** Fixed a crash that occurred when calling `predict` on an unfitted `APLRClassifier`. The model now raises a `RuntimeError` with an informative message instead of crashing.
+- **Restored Flexibility for `X_names` Parameter:** Fixed a regression from v10.18.0 where the `X_names` parameter no longer accepted `numpy.ndarray` or other list-like inputs. The parameter now correctly handles these types again, restoring flexibility for non-DataFrame inputs.
+
+## [10.18.0] - 2025-10-29
+
+### Added
+- **Automatic Data Preprocessing with `pandas.DataFrame`**:
+  - When a `pandas.DataFrame` is passed as input `X`, the model now automatically handles missing values and categorical features.
+  - **Missing Value Imputation**: Columns with missing values (`NaN`) are imputed using the column's median. A new binary feature (e.g., `feature_name_missing`) is created to indicate where imputation occurred. The median calculation takes `sample_weight` into account.
+  - **Categorical Feature Encoding**: Columns with `object` or `category` data types are automatically one-hot encoded. At prediction time, the model creates a column for every category seen during training and sets to zero the columns whose category does not appear in the prediction data.
+
+### Changed
+- **Enhanced Flexibility in `APLRClassifier`**: The classifier now automatically converts numeric target arrays (e.g., `[0, 1, 0, 1]`) into string representations, simplifying setup for classification tasks.
+- **Updated Documentation and Examples**: The API reference and examples have been updated to reflect the new automatic preprocessing capabilities.
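The column alignment described in the 10.18.0 "Categorical Feature Encoding" entry can be illustrated with a dependency-free sketch. This is not the `aplr` implementation, which uses `pandas.get_dummies`. The helper names `one_hot`, `fit_columns`, and `transform` are hypothetical, and rows are plain dicts rather than DataFrame rows. In this sketch a level that was never seen in training is simply dropped, which is an assumption of the example.

```python
def one_hot(rows, categorical):
    """One-hot encode the given categorical keys in a list of dict rows."""
    encoded = []
    for row in rows:
        out = {}
        for key, value in row.items():
            if key in categorical:
                out[f"{key}_{value}"] = 1  # e.g. "color_red"
            else:
                out[key] = value
        encoded.append(out)
    return encoded


def fit_columns(rows, categorical):
    """Record, in first-seen order, every column produced from the training rows."""
    columns = []
    for row in one_hot(rows, categorical):
        for name in row:
            if name not in columns:
                columns.append(name)
    return columns


def transform(rows, categorical, training_columns):
    """Encode rows, then align them to the columns recorded during training.

    Training columns absent from a row are filled with 0; columns that were
    never seen during training are dropped (an assumption of this sketch).
    """
    return [
        [row.get(name, 0) for name in training_columns]
        for row in one_hot(rows, categorical)
    ]


train = [{"size": 1.0, "color": "red"}, {"size": 2.0, "color": "blue"}]
columns = fit_columns(train, {"color"})
aligned = transform([{"size": 3.0, "color": "red"}], {"color"}, columns)
```

Here `columns` is `["size", "color_red", "color_blue"]` and `aligned` is `[[3.0, 1, 0]]`: the `color_blue` column still exists at prediction time even though no prediction row contains that level.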
diff --git a/aplr/aplr.py b/aplr/aplr.py
@@ -22,6 +22,9 @@ def _common_X_preprocessing(self, X, is_fitting: bool, X_names=None):
         """Common preprocessing for fit and predict."""
         is_dataframe_input = isinstance(X, pd.DataFrame)
 
+        if X_names is not None:
+            X_names = list(X_names)
+
         if not is_dataframe_input:
             try:
                 X_numeric = np.array(X, dtype=np.float64)
@@ -35,11 +38,11 @@ def _common_X_preprocessing(self, X, is_fitting: bool, X_names=None):
                     X.columns = X_names
                 else:
                     X.columns = [f"X{i}" for i in range(X.shape[1])]
-            elif hasattr(self, "X_names_") and len(self.X_names_) == X.shape[1]:
+            elif self.X_names_ and len(self.X_names_) == X.shape[1]:
                 X.columns = self.X_names_
         else:  # X is already a DataFrame
             X = X.copy()  # Always copy to avoid modifying original
-            if not is_fitting and hasattr(self, "X_names_"):
+            if not is_fitting and self.X_names_:
                 # Check if input columns for prediction match training columns (before OHE)
                 if set(X.columns) != set(self.X_names_):
                     raise ValueError(
@@ -52,11 +55,18 @@ def _common_X_preprocessing(self, X, is_fitting: bool, X_names=None):
             self.categorical_features_ = list(
                 X.select_dtypes(include=["category", "object"]).columns
             )
+            # Ensure it's an empty list if no categorical features, not None
+            if not self.categorical_features_:
+                self.categorical_features_ = []
 
+        # Apply OHE if categorical_features_ were found during fitting.
         if self.categorical_features_:
             X = pd.get_dummies(X, columns=self.categorical_features_, dummy_na=False)
             if is_fitting:
                 self.ohe_columns_ = list(X.columns)
+                # Ensure it's an empty list if no OHE columns, not None
+                if not self.ohe_columns_:
+                    self.ohe_columns_ = []
             else:
                 missing_cols = set(self.ohe_columns_) - set(X.columns)
                 for c in missing_cols:
@@ -65,13 +75,17 @@ def _common_X_preprocessing(self, X, is_fitting: bool, X_names=None):
 
         if is_fitting:
             self.na_imputed_cols_ = [col for col in X.columns if X[col].isnull().any()]
+            # Ensure it's an empty list if no NA imputed columns, not None
+            if not self.na_imputed_cols_:
+                self.na_imputed_cols_ = []
 
+        # Apply NA indicator if na_imputed_cols_ were found during fitting.
         if self.na_imputed_cols_:
             for col in self.na_imputed_cols_:
                 X[col + "_missing"] = X[col].isnull().astype(int)
 
-        if not is_fitting:
-            for col in self.median_values_:
+        if not is_fitting and self.median_values_:
+            for col in self.median_values_:  # Keys are column names
                 if col in X.columns:
                     X[col] = X[col].fillna(self.median_values_[col])
 
@@ -131,11 +145,30 @@ def _preprocess_X_fit(self, X, X_names, sample_weight):
     def _preprocess_X_predict(self, X):
         X = self._common_X_preprocessing(X, is_fitting=False)
 
-        if hasattr(self, "final_training_columns_"):
+        # Enforce column order from training if it was set.
+        if self.final_training_columns_:
             X = X[self.final_training_columns_]
 
         return X.values.astype(np.float64)
 
+    def __setstate__(self, state):
+        """Handles unpickling for backward compatibility."""
+        self.__dict__.update(state)
+
+        # For backward compatibility, initialize new attributes to None if they don't exist,
+        # indicating the model was trained before these features were introduced.
+        new_attributes = [
+            "X_names_",
+            "categorical_features_",
+            "ohe_columns_",
+            "na_imputed_cols_",
+            "median_values_",
+            "final_training_columns_",
+        ]
+        for attr in new_attributes:
+            if not hasattr(self, attr):
+                setattr(self, attr, None)
+
 
 class APLRRegressor(BaseAPLR):
     def __init__(
@@ -261,6 +294,7 @@ def __init__(
         self.ohe_columns_ = []
         self.na_imputed_cols_ = []
         self.X_names_ = []
+        self.final_training_columns_ = []
 
         # Creating aplr_cpp and setting parameters
         self.APLRRegressor = aplr_cpp.APLRRegressor()
@@ -702,6 +736,7 @@ def __init__(
         self.ohe_columns_ = []
         self.na_imputed_cols_ = []
         self.X_names_ = []
+        self.final_training_columns_ = []
 
         # Creating aplr_cpp and setting parameters
         self.APLRClassifier = aplr_cpp.APLRClassifier()
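The backward-compatibility pattern in the `aplr/aplr.py` changes above can be sketched in isolation. The class below is a hypothetical stand-in, not the `aplr` estimator, and `restore` imitates by hand what `pickle` does on load (create the instance without calling `__init__`, then call `__setstate__`). It also suggests why the diff replaces `hasattr(...)` checks with truthiness checks: once `__setstate__` guarantees the attributes exist, `None` (legacy model) and `[]` (nothing recorded) can be handled by the same branch.

```python
class LegacyFriendlyModel:
    """Hypothetical stand-in for an estimator; not the aplr class itself."""

    # Attributes assumed to have been introduced in a later release.
    _NEW_ATTRIBUTES = ("X_names_", "median_values_", "final_training_columns_")

    def __init__(self):
        self.X_names_ = []
        self.median_values_ = {}
        self.final_training_columns_ = []

    def __setstate__(self, state):
        # Restore whatever the old object stored, then backfill what it lacks.
        self.__dict__.update(state)
        for attr in self._NEW_ATTRIBUTES:
            if not hasattr(self, attr):
                setattr(self, attr, None)

    def ordered_columns(self, columns):
        # The truthiness check covers both None (legacy model) and [] (unset).
        if self.final_training_columns_:
            return [c for c in self.final_training_columns_ if c in columns]
        return list(columns)


def restore(state):
    """Rebuild an instance the way unpickling does: skip __init__, call __setstate__."""
    model = LegacyFriendlyModel.__new__(LegacyFriendlyModel)
    model.__setstate__(state)
    return model


legacy = restore({"m": 100})  # state saved before the new attributes existed
modern = restore({"m": 100, "final_training_columns_": ["a", "b"]})
```

With this in place, `legacy.ordered_columns(["b", "a"])` leaves the input order untouched, while `modern.ordered_columns(["b", "a"])` returns the training order `["a", "b"]`.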
diff --git a/cpp/APLRClassifier.h b/cpp/APLRClassifier.h
@@ -21,6 +21,7 @@ class APLRClassifier
     void invert_second_model_in_two_class_case(APLRRegressor &second_model);
     void calculate_validation_metrics();
     void calculate_unique_term_affiliations();
+    void throw_error_if_not_fitted();
     void cleanup_after_fit();
 
 public:
@@ -306,8 +307,18 @@ void APLRClassifier::cleanup_after_fit()
     response_values.clear();
 }
 
+void APLRClassifier::throw_error_if_not_fitted()
+{
+    if (categories.empty())
+    {
+        throw std::runtime_error("This APLRClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.");
+    }
+}
+
 MatrixXd APLRClassifier::predict_class_probabilities(const MatrixXd &X, bool cap_predictions_to_minmax_in_training)
 {
+    throw_error_if_not_fitted();
+
     MatrixXd predictions{MatrixXd::Constant(X.rows(), categories.size(), 0.0)};
     for (size_t i = 0; i < categories.size(); ++i)
     {
@@ -328,6 +339,7 @@ MatrixXd APLRClassifier::predict_class_probabilities(const MatrixXd &X, bool cap
 
 std::vector<std::string> APLRClassifier::predict(const MatrixXd &X, bool cap_predictions_to_minmax_in_training)
 {
+    throw_error_if_not_fitted();
     std::vector<std::string> predictions(X.rows());
     MatrixXd predicted_class_probabilities{predict_class_probabilities(X, cap_predictions_to_minmax_in_training)};
     for (size_t row = 0; row < predicted_class_probabilities.rows(); ++row)
@@ -342,6 +354,7 @@ std::vector<std::string> APLRClassifier::predict(const MatrixXd &X, bool cap_pre
 
 MatrixXd APLRClassifier::calculate_local_feature_contribution(const MatrixXd &X)
 {
+    throw_error_if_not_fitted();
     MatrixXd output{MatrixXd::Constant(X.rows(), unique_term_affiliations.size(), 0)};
     std::vector<std::string> predictions{predict(X, false)};
     for (size_t row = 0; row < predictions.size(); ++row)
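The guard added in `cpp/APLRClassifier.h` follows a common estimator pattern: check a piece of state that only `fit` populates, and fail early with a clear message. The sketch below shows the same idea in Python. `TinyClassifier` is a hypothetical majority-class model invented for this illustration and its message text only paraphrases the one in the diff. Like the C++ guard, it treats an empty `categories` list as "not fitted".

```python
class TinyClassifier:
    """Hypothetical majority-class classifier used only to illustrate the guard."""

    def __init__(self):
        self.categories = []  # populated by fit(); empty means "not fitted"

    def fit(self, y):
        labels = [str(value) for value in y]  # numeric targets become strings
        self.categories = sorted(set(labels))
        self.majority_ = max(self.categories, key=labels.count)
        return self

    def _throw_error_if_not_fitted(self):
        if not self.categories:
            raise RuntimeError(
                "This TinyClassifier instance is not fitted yet. "
                "Call 'fit' before using this estimator."
            )

    def predict(self, n_rows):
        # Guard first, so an unfitted model fails with a clear message
        # instead of touching state that does not exist yet.
        self._throw_error_if_not_fitted()
        return [self.majority_] * n_rows


try:
    TinyClassifier().predict(2)
    raised = False
except RuntimeError:
    raised = True

predictions = TinyClassifier().fit([0, 1, 1]).predict(2)
```

Calling `predict` before `fit` sets `raised` to `True`; after fitting on `[0, 1, 1]`, `predictions` is `["1", "1"]`.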
diff --git a/documentation/APLR 10.18.1.pdf b/documentation/APLR 10.18.1.pdf
diff --git a/setup.py b/setup.py
@@ -25,7 +25,7 @@
 
 setuptools.setup(
     name="aplr",
-    version="10.18.0",
+    version="10.18.1",
     description="Automatic Piecewise Linear Regression",
     ext_modules=[sfc_module],
     author="Mathias von Ottenbreit",