
APLR version 10.18.0 - automatic handling of missing values and categorical features


@mathias-von-ottenbreit mathias-von-ottenbreit released this 29 Oct 20:13
· 1 commit to main since this release

This release streamlines the data preprocessing workflow. The model now accepts pandas.DataFrame inputs directly and automates common data preparation steps (missing value imputation and categorical feature encoding), making it easier to go from raw data to a trained model.

Key Features and Enhancements

1. Automatic Data Preprocessing with pandas.DataFrame

When a pandas.DataFrame is passed as input X to APLRRegressor or APLRClassifier, the model now automatically performs the following preprocessing steps, reducing the need for manual data preparation:

  • Missing Value Imputation:

    • For columns containing missing values (NaN), the model automatically imputes them using the column's median.
    • To preserve information about the original missingness, a new binary feature (e.g., feature_name_missing) is created for each column that had values imputed.
    • The median calculation correctly handles sample_weight if provided during fitting, ensuring a weighted median is used for imputation.
  • Categorical Feature Encoding:

    • Columns with object or category data types are automatically identified and one-hot encoded.
    • Unseen category levels at prediction time do not cause errors: the model creates a column for every category seen during training, and a row with an unseen level is set to zero in all of those columns.
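
The preprocessing steps above can be illustrated with a small pandas sketch. This is not APLR's internal implementation; it only mimics the behaviour described in these notes. The helper names (weighted_median, fit_preprocessing, transform), the one-hot column naming (column_level), the tie-breaking rule of the weighted median, and the use of "non-numeric dtype" as a stand-in for "object or category dtype" are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight.

    One common definition of a weighted median; APLR's exact rule may differ.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    index = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return values[index]


def fit_preprocessing(X, sample_weight=None):
    """Learn imputation medians and category levels from the training frame."""
    if sample_weight is None:
        sample_weight = np.ones(len(X))
    sample_weight = np.asarray(sample_weight, dtype=float)
    medians, levels = {}, {}
    for name in X.columns:
        column = X[name]
        if not pd.api.types.is_numeric_dtype(column):
            # Treated as categorical: remember every level seen in training.
            levels[name] = sorted(column.dropna().unique())
        elif column.isna().any():
            # Weighted median over the observed (non-missing) rows only.
            observed = column.notna().to_numpy()
            medians[name] = weighted_median(
                column.to_numpy()[observed], sample_weight[observed]
            )
    return medians, levels


def transform(X, medians, levels):
    """Apply the learned imputation and one-hot encoding to a frame."""
    out = pd.DataFrame(index=X.index)
    for name in X.columns:
        if name in levels:
            # One column per training level; unseen levels give all zeros.
            for level in levels[name]:
                out[f"{name}_{level}"] = (X[name] == level).astype(float)
        else:
            column = X[name].astype(float)
            if name in medians:
                out[f"{name}_missing"] = column.isna().astype(float)
                column = column.fillna(medians[name])
            out[name] = column
    return out


train = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "city": ["Oslo", "Bergen", "Oslo", "Bergen"],
})
weights = [1.0, 1.0, 1.0, 3.0]  # the last row dominates the weighted median

medians, levels = fit_preprocessing(train, sample_weight=weights)
train_ready = transform(train, medians, levels)

# A new row with a missing value and a city never seen during training.
new = pd.DataFrame({"age": [np.nan], "city": ["Trondheim"]})
new_ready = transform(new, medians, levels)
print(train_ready)
print(new_ready)
```

In this sketch the heavily weighted last row pulls the imputation value for age to 60 rather than the unweighted median of 40, and the unseen city "Trondheim" produces zeros in both city columns instead of raising an error.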

2. Enhanced Flexibility in APLRClassifier

The APLRClassifier is now more versatile with respect to the target variable y. It automatically converts numeric target arrays (e.g., [0, 1, 0, 1]) into string representations. This simplifies the setup for classification tasks, as you no longer need to manually pre-convert your target variable.
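
As a rough illustration of what this conversion amounts to, the sketch below turns a numeric target into string labels by hand. It only mirrors the idea described above; how APLRClassifier formats each value internally (for example, for floating-point targets) is not stated in these notes, so the use of str() here is an assumption.

```python
# Numeric target as it might be passed to the classifier.
y_numeric = [0, 1, 0, 1]

# Manual equivalent of the conversion described above (assumed formatting).
y_as_strings = [str(label) for label in y_numeric]

# The distinct class labels are now strings rather than integers.
class_labels = sorted(set(y_as_strings))
print(y_as_strings, class_labels)
```

If the classifier stores labels this way, predicted classes would also be strings, so comparisons against predictions would likely need "1" rather than 1.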

3. Updated Documentation and Examples

The API reference and examples have been updated to document the automatic preprocessing capabilities and show how to use them.