
0.129.1

Breaking changes

None.

New features

  • Added support for causalml.inference.meta.BaseXClassifier and BaseXRegressor classes (aka X-learners).

  • Added support for causalml.propensity.ElasticNetPropensityModel and LogisticRegressionPropensityModel classes.

Minor improvements and fixes

None.

0.129.0

Breaking changes

None.

New features

The initial implementation supports causalml.inference.meta.BaseSClassifier and BaseSRegressor classes (aka S-learners), as well as causalml.inference.meta.BaseTClassifier and BaseTRegressor classes (aka T-learners).

CausalML estimators are typically fitted outside of pipelines, because their fit method requires an extra treatment argument. However, for the export to PMML, it is still advisable to wrap the data pre-processor and the final estimator into a PMMLPipeline object:

import numpy

from causalml.inference.meta import BaseTRegressor
from sklearn.compose import ColumnTransformer
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

transformer = ColumnTransformer(...)
regressor = BaseTRegressor(...)

# Stepwise fit
Xt = transformer.fit_transform(X)
regressor.fit(X = Xt, treatment = ..., y = y)

# Wrapper for pre-fitted steps
pipeline = PMMLPipeline([
  ("transformer", transformer),
  ("regressor", regressor)
])
pipeline.active_fields = numpy.asarray(X.columns.values)
pipeline.target_fields = numpy.asarray(["uplift"])

sklearn2pmml(pipeline, "Pipeline.pmml")

When assembling a pipeline for S-learners, the first column of the pre-processed dataset must contain ordinally encoded treatment values.

Use the FeatureUnion meta-transformer to prepend the treatment column to feature columns:

from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import OrdinalEncoder

treatment_transformer = ColumnTransformer([
  ("keep", OrdinalEncoder(), ["treatment"])
], remainder = "drop")

pipeline = PMMLPipeline([
  ("transformer", FeatureUnion([
    # The treatment column
    ("treatment", treatment_transformer),
    # Subsequent feature columns
    ("features", transformer)
  ])),
  ("regressor", regressor)
])

Minor improvements and fixes

  • Refined Java exception types and messages.

0.128.1

Breaking changes

None.

New features

  • Added support for ngboost.NGBSurvival class.

Minor improvements and fixes

  • Added tree model conversion options to AdaBoost and NGBoost estimators.

The goal is to enable tree conversion options for all ensemble estimators whose (default) base estimator is Scikit-Learn's decision tree regressor.

For example, exporting the same NGBoost regressor first in an optimized (flattened hierarchy, multi-way splits) and then in a native-looking (deep hierarchy, binary splits) representation:

from ngboost import NGBRegressor
from ngboost.distns import LogNormal
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
  ("regressor", NGBRegressor(Dist = LogNormal))
])
pipeline.fit(X, y)

pipeline.configure(prune = True, compact = True)
sklearn2pmml(pipeline, "NGBoost_optimized.pmml")

pipeline.configure(prune = False, compact = False)
sklearn2pmml(pipeline, "NGBoost_native-sklearn.pmml")
  • Refined Java exception types and messages.

0.128.0

Breaking changes

None.

New features

  • Added support for NGBoost package.

The initial implementation supports ngboost.NGBClassifier and NGBRegressor classes.

The main selling point of NGBRegressor over conventional regressors is heteroscedastic prediction intervals. Contrary to LLM claims, the PMML standard is capable of representing this functionality fully and very efficiently.

The export of prediction intervals is available for Normal and LogNormal distributions, by setting the confidence_level conversion option to the appropriate value (float for static intervals, boolean True or string for dynamic intervals).

For example, generating static 95% prediction intervals:

from ngboost import NGBRegressor
from ngboost.distns import Normal
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
  ("regressor", NGBRegressor(Dist = Normal))
])
pipeline.fit(X, y)

# Generate 'upper(<y>)' and 'lower(<y>)' output fields
pipeline.configure(confidence_level = 0.95)
sklearn2pmml(pipeline, "NGBoost_Q95.pmml")

See Converting Scikit-Learn NGBoost pipelines to PMML

See SkLearn2PMML-377

  • Added support for FeatureUnion.transformer_weights attribute.

See JPMML-SkLearn-63

Minor improvements and fixes

  • Added support for passthrough and drop pseudo-transformers in FeatureUnion.

  • Refined Java exception types and messages.

0.127.2

Breaking changes

None.

New features

  • Refactored the handling of missing value and invalid value treatment decorations.

Such decorations control how the model checks and sanitizes its inputs. They should be declared once at the very beginning of the pipeline (eg. using CategoricalDomain, ContinuousDomain domain decorators), and remain unchanged after that.

The refactored behaviour is to protect the applied decoration against replacement with more permissive configurations. For example, if the invalid value treatment has been set to "replace with predefined value" (eg. CategoricalDomain(invalid_value_treatment = "as_value", invalid_value_replacement = "__other")), then any subsequent attempt to replace it with "pass through as-is" (eg. OneHotEncoder(handle_unknown = "warn")) will be ignored.

Previously, all replacements succeeded. This led to very unintuitive bugs, where explicit decorations did not have any effect, because they got silently overridden by implicit decorations.

See SkLearn2PMML-428

Minor improvements and fixes

  • Improved support for Pandas 3.0.0.

0.127.1

Breaking changes

None.

New features

  • Added support for Pandas 3.0.0.

Pandas 3 breaks things (relative to Pandas 2) in two ways. First, many public-facing core classes have been moved to different modules (eg. pandas.core.Series has become pandas.Series, and pandas.core.arrays.string_.StringArray has become pandas.arrays.StringArray). Second, string arrays are selectively persisted in Parquet data format.

The JPMML-SkLearn library should now be able to handle pickled Pandas 3 objects. However, if Pandas-related unpickling errors are still raised, then consider reverting to Pandas 2.

Minor improvements and fixes

  • Improved support for Pandas arrays.

See SkLearn2PMML-420

  • Ensured compatibility with NumPy 2.4.1.

0.127.0

Breaking changes

  • The sklearn2pmml.sklearn2pmml() utility function now prints an informational message (stating the PMML file location and size) after a successful conversion. The first such instance of each day prints an extra informational banner message.

Previously, this utility function did not print anything, which might lead users to think that it was muted by design.

New features

See JPMML-SkLearn-64. An eight-year-old issue!

  • Added support for TfidfVectorizer.norm attribute.

The current implementation is correct and complete, but is a little inefficient.

See SkLearn2PMML-98

  • Added support for flaml.automl.model.SGDEstimator class.

  • Added support for missing values in EBM models.

See SkLearn2PMML-440

Minor improvements and fixes

  • Fixed the PMMLPipeline.predict_transform(X) method for multi-output models.

  • Ensured compatibility with FLAML 2.5.0.

0.126.1

Breaking changes

None.

New features

  • Added support for negative indexing in column selectors.

For example, dropping the last two columns from a dataset:

from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer([
  ("tail", "drop", [-2, -1])
])

Minor improvements and fixes

  • Fixed customization.

Customizations can now operate on the same element any number of times.

Previously, the first customization left the PMML DOM in an inconsistent state, preventing subsequent customizations from correctly identifying and matching it (typical error message XPath expression <XPath> is not associated with a PMML object).

See SkLearn2PMML-469

  • Refined Java exception types and messages.

0.126.0

Breaking changes

  • Refactored the post-processing.

The final transformed state of PMMLPipeline.predict_transformer, PMMLPipeline.predict_proba_transformer and PMMLPipeline.apply_transformer is now discarded.

Previously, if this state contained not-yet-materialized features (ie. features not backed by any DerivedField elements), they were automatically materialized as OutputField elements.

All feature materializations must be explicit in order to keep the model schema clean from unnecessary output fields. Use the newly introduced FeatureExporter transformer for that (see below).

New features

  • Added support for FLAML package.

FLAML is a lightweight AutoML and hyperparameter tuning framework.

The AutoML.fit() method sets the AutoML.model attribute to some flaml.automl.model.SKLearnEstimator wrapper object.

The sklearn2pmml.sklearn2pmml() utility function can now accept this object directly in most cases (Scikit-Learn, LightGBM and XGBoost estimators). There is no need to decompose it manually anymore.

from flaml import AutoML
from sklearn2pmml import sklearn2pmml

automl = AutoML()
automl.fit(X, y, task = ...)

sklearn2pmml(automl.model, "FLAML.pmml")
  • Added support for sklearn2pmml.postprocessing.FeatureExporter class.

This is a meta-transformer for the post-processing part of the workflow.

It creates and attaches an OutputField element to the final model element for each of its incoming features. Data scientists can leverage this meta-transformer to enrich the model schema with custom output fields, which refer to primary and secondary transformations in the pipeline.

Use the Recaller meta-transformer for looking up the desired data pre-processing step(s) by name:

from sklearn.pipeline import make_pipeline
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.cross_reference import Recaller
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.postprocessing import FeatureExporter

custom_output = make_pipeline(
  Recaller(memory = None, names = ["func(field1)", "func(field2)"]), # List of DerivedField names
  FeatureExporter(names = ["feature1", "feature2"]) # List of OutputField names
)

pipeline = PMMLPipeline(...)
pipeline.fit(X, y)

# Activate post-processing
pipeline.predict_transformer = custom_output

sklearn2pmml(pipeline, "Pipeline.pmml")

See SkLearn2PMML-469

Minor improvements and fixes

  • Fixed the lookup of fields during post-processing.

A Recaller can now reference any DerivedField element by name.

  • Improved logging.

  • Ensured compatibility with InterpretML 0.7.4.

0.125.2

Breaking changes

None.

New features

  • Improved logging.

The Java exception chain is clearly separated from other log messages using EXCEPTION and Caused by headers, making logs much easier to skim, parse and interpret.

Minor improvements and fixes

  • Refined Java exception types and messages.

0.125.1

Breaking changes

None.

New features

None.

Minor improvements and fixes

  • Refined Java exception types and messages.

  • Ensured compatibility with Dill 0.4.0, Joblib 1.5.3, NumPy 2.3.5 and Pandas 2.3.3.

0.125.0

Breaking changes

None intended, but some breaking changes in Scikit-Learn 1.8.0 may not be fully handled yet.

New features

  • Added support for the floor division operator // in expressions.

  • Added support for chained comparison operations (eg. 0 <= x < 1) in expressions and predicates.

  • Added support for raise statements in Python functions.

The exception class and exception message are mapped to an X-Error element (a JPMML vendor extension).

  • Added support for assert statements in Python functions.

  • Added support for pow() and round(x, digits?) built-in functions.

  • Added support for abs(), exp2() and round(x, digits?) NumPy functions.

  • Added support for the float("NaN") expression.

In Scikit-Learn, math.nan and numpy.nan are typically regarded as missing values, not invalid values. Therefore, this expression is translated to a PMML missing value constant.

  • Added support for string-valued type names.

Minor improvements and fixes

  • Refined Java exception types and messages.

  • Ensured compatibility with Scikit-Learn 1.8.0.

  • Updated Java libraries.

0.124.0

Breaking changes

None.

New features

See JPMML-SkLearn-70

Minor improvements and fixes

  • Refined Java exception types and messages.

  • Made ColumnTransformer and DataFrameMapper converters more lenient towards occasional NumPy scalars.

  • Fixed the parsing of Scikit-Learn version strings.

  • Ensured compatibility with Category-Encoders 2.9.0, Imbalanced-Learn 0.14.0, InterpretML 0.7.3, OptBinning 0.21.0 and Scikit-Lego 0.9.6.

0.123.1

Breaking changes

None.

New features

  • Added support for abs(), max() and min() built-in functions.

  • Added support for NumPy and Pandas module aliases (np and pd, respectively) in expressions and predicates.
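The PMML translations of these built-in functions follow their plain-Python semantics, which can be sanity-checked directly (the module aliases np and pd would appear inside expression strings handed to the converter):

```python
# Plain-Python semantics of the newly supported built-ins
assert abs(-3) == 3
assert max(1, 2.5, -7) == 2.5
assert min(1, 2.5, -7) == -7
```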

Minor improvements and fixes

  • Ensured compatibility with XGBoost 3.1.1.

  • Updated Java libraries.

0.123.0

Breaking changes

None.

New features

  • Added support for bool(), float(), int() and str() built-in type cast functions.

Value conversions are implemented in a maximally Pythonic way. For example, the bool() function performs a "truthiness check" on its argument:

# Empty string evaluates to False
assert not bool("")

# All non-empty strings evaluate to True
assert bool("False")
assert bool("True")
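The other cast functions behave likewise (a plain-Python illustration, not the converter itself):

```python
# Follows Python semantics, not XML Schema semantics
assert float("1e3") == 1000.0
assert int(3.9) == 3 # Truncates towards zero
assert str(True) == "True"
```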

Minor improvements and fixes

  • Refined Java exception types. Specifically, the org.jpmml.python.AttributeException class was split into org.jpmml.python.MissingAttributeException and org.jpmml.python.InvalidAttributeException subclasses.

  • Ensured compatibility with StatsModels 0.14.5.

0.122.2

Breaking changes

None.

New features

Minor improvements and fixes

  • Fixed compatibility with Scikit-Learn 1.5.2 and earlier.

The bug was introduced in the 0.122.0 version, in relation to the __sklearn_tags__ attribute capture.

0.122.1

Breaking changes

None.

New features

None.

Minor improvements and fixes

  • Improved interaction between OrdinalEncoder transformer and XGBoost estimators.

XGBoost estimators now check ordinally encoded features, and exclude the last category value if it matches the missing attribute value. Previously, this value was encoded into categorical split elements.

  • Improved support for OneHotEncoder.handle_unknown attribute.

  • Ensured compatibility with Scikit-Learn 1.7.2 and XGBoost 3.0.5.

0.122.0

Breaking changes

  • Changed the default value of the --unpickle command-line option from False to True.

It means that the command-line application is now coupled to the library versions of the active Python environment. This may cause problems if the pickle file references classes from incompatible library versions, or from unavailable libraries.

The old behaviour can be restored by specifying the --no-unpickle command-line option:

sklearn2pmml --no-unpickle -i pipeline.pkl -o pipeline.pmml

New features

If the Python class has an __sklearn_tags__ attribute, then its value will be captured as a dict and persisted as a _pmml_sklearn attribute.

The JPMML-SkLearn library may use this information for PMML refinement.

  • Improved support for GridSearchCV and RandomizedSearchCV classes.

Fitted estimator searcher objects can now be passed directly to the sklearn2pmml.sklearn2pmml() utility function. Previously, they required manual dissection for that.

from sklearn.model_selection import GridSearchCV
from sklearn2pmml import sklearn2pmml

searcher = GridSearchCV(estimator = ..., param_grid = ...)
searcher.fit(X, y)

# Legacy approach
# sklearn2pmml(searcher.best_estimator_, "BestEstimator.pmml")

sklearn2pmml(searcher, "BestEstimator.pmml")
  • Made PMMLPipeline constructor validate the steps argument.

The Pipeline parent class validates steps during the fit(X, y) method call. However, PMMLPipeline objects are often constructed from pre-fitted components, in which case the fit(X, y) method call is intentionally skipped to prevent overwriting the existing state.

Minor improvements and fixes

  • Improved support for FunctionTransformer.func attribute.

The export to PMML is typically not possible if this attribute refers to a user-defined function (the pickle file stores the function name, but not its body).

The new behaviour is to raise an error, which suggests replacing the FunctionTransformer transformer with a sklearn2pmml.preprocessing.ExpressionTransformer transformer that does not suffer from partial serializability issues.
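The underlying partial-serializability issue is easy to demonstrate with plain pickle, which stores a module-qualified reference to a function rather than its body:

```python
import math
import pickle

# The payload contains the names "math" and "sqrt", but no bytecode;
# unpickling in another environment requires the same function to be
# importable there
payload = pickle.dumps(math.sqrt)
assert b"math" in payload and b"sqrt" in payload
```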

  • Improved support for xgboost.Booster.fmap attribute.

  • Fixed compatibility with XGBoost 3.0.1 and newer.

See JPMML-XGBoost-77

0.121.1

Breaking changes

None.

New features

Minor improvements and fixes

  • Fixed target category levels for BernoulliNB estimator.

  • Improved integration between CountVectorizer transformer and BernoulliNB estimator.

  • Ensured compatibility with Scikit-Learn 1.7.1.

0.121.0

Breaking changes

None.

New features

See JPMML-SkLearn-20. An eight-and-a-half-year-old issue!

Minor improvements and fixes

None.

0.120.0

Breaking changes

  • Moved escape_func parameter from sklearn2pmml.make_pmml_pipeline() utility function to sklearn2pmml.sklearn2pmml().

The default value of this parameter is a sklearn2pmml._escape() utility function reference, which should be fine for all pure Scikit-Learn workflows.

Before:

from sklearn2pmml import _escape, make_pmml_pipeline, sklearn2pmml

estimator = joblib.load("estimator.pkl")

# The escape_func argument may be omitted
pmml_pipeline = make_pmml_pipeline(estimator, escape_func = _escape)

sklearn2pmml(pmml_pipeline, "estimator.pmml")

After:

from sklearn2pmml import _escape, sklearn2pmml

estimator = joblib.load("estimator.pkl")

# The escape_func argument may be omitted
sklearn2pmml(estimator, "estimator.pmml", escape_func = _escape)
  • Removed sklearn2pmml.pycaret.make_pmml_pipeline() utility function.

PyCaret workflows require setting the escape_func argument to the sklearn2pmml.pycaret._escape() utility function reference.

New features

The JPMML-SkLearn library is unable to access and traverse Python class hierarchies. Therefore, it only knows about Python classes whose fully-qualified class names have been previously whitelisted.

A custom estimator or transformer class can be made "recognizable" by defining a pmml_base_class_ attribute that points to some whitelisted class (typically, a parent class):

from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml

class MultinomialClassifier(LogisticRegression):

  def __init__(self):
    super().__init__(multi_class = "multinomial")
    # Stipulate that the fitted state of this class conforms to that of the LogisticRegression class
    self.pmml_base_class_ = LogisticRegression

classifier = MultinomialClassifier()
classifier.fit(X, y)

sklearn2pmml(classifier, "classifier.pmml")

The sklearn2pmml._escape() utility function automatically sets the pmml_base_class_ attribute for all objects that meet the following criteria:

  1. The class has been defined in the current script (ie. obj.__module__ == "__main__").
  2. The class has exactly one parent class, which is not abstract, and inherits from sklearn.base.BaseEstimator or sklearn.base.TransformerMixin base classes.
  3. The class does not override fit(X, y), predict(X) and/or transform(X) methods.

Minor improvements and fixes

  • Added --unpickle command-line option.

0.119.1

Breaking changes

None.

New features

Minor improvements and fixes

  • Updated telemetry.

0.119.0

Breaking changes

None.

New features

  • Added support for Scikit-Learn 1.7.X.

  • Added sklearn2pmml.sklearn_pandas.patch_sklearn() utility function.

The sklearn_pandas.DataFrameMapper class depends on the sklearn.utils.tosequence() utility function, which was removed in Scikit-Learn 1.7.0. It is currently unclear if and when the sklearn_pandas package will be updated to reflect this breaking change.

A data scientist can safely "patch" the latest Scikit-Learn version for SkLearn-Pandas 2.2.0 needs by calling the patch_sklearn utility function during Python environment initialization:

from sklearn2pmml.sklearn_pandas import patch_sklearn

patch_sklearn()

from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper(...)

Minor improvements and fixes

  • Improved support for pandas.Int64Dtype data type in the DiscreteDomain.fit(X, y) method.

  • Ensured compatibility with Category-Encoders 2.8.1.

0.118.0

Breaking changes

None.

New features

  • Added support for Scikit-Learn 1.6.X.

  • Added telemetry.

Minor improvements and fixes

  • Ensured compatibility with InterpretML 0.6.11, Scikit-Lego 0.9.5 and Treeple 0.10.3 (formerly Scikit-Tree).

0.117.0

Breaking changes

None.

New features

  • Added support for sklearn2pmml.preprocessing.MultiCastTransformer class.

Implements casts in multi-column mode.

Before:

from sklearn2pmml.preprocessing import CastTransformer

transformer = ColumnTransformer([
  ("first", CastTransformer(dtype = "category"), [0]),
  ("second", CastTransformer(dtype = "category"), [1]),
  ("third", CastTransformer(dtype = "category"), [2])
])

After:

from sklearn2pmml.preprocessing import MultiCastTransformer

transformer = ColumnTransformer([
  ("cat", MultiCastTransformer(dtypes = ["category", "category", "category"]), [0, 1, 2])
])
  • Added CutTransformer.dtype attribute.

If set to a "proto" categorical data type, the transform(X) method now yields a Pandas series. Previously, the output was force-converted into a 2D NumPy array of shape (n_samples, 1).

This allows easier interfacing with categorical data type-aware steps such as LightGBM and XGBoost estimators.

  • Added LookupTransformer.dtype and MultiLookupTransformer.dtype attributes.

Minor improvements and fixes

  • Added support for numpy.datetime64 data type.

See SkLearn2PMML-454

  • Added name availability check into the field renaming logic.

See SkLearn2PMML-455

0.116.4

Breaking changes

None.

New features

  • Added Unicode support to Python-to-PMML translator.

It is now possible to use international languages (eg. Chinese) in Python string literals.

Minor improvements and fixes

  • Ensured compatibility with XGBoost 3.0.0.

  • Updated Java libraries.

0.116.3

Breaking changes

None.

New features

  • Improved the parsing of Python functions.

Restored support for assignment statements (temporarily removed in SkLearn2PMML 0.116.2), and added support for collecting multiple "partial" statements into a single "complete" statement.

See SkLearn2PMML-447

  • Added support for type hints.

The Python-to-PMML translator does its best to infer type information for DefineFunction and DerivedField elements based on their child expression element. However, the inference algorithm is not guaranteed to yield maximally specific results at all times.

A data scientist can now use PEP 484-style type hints to stipulate the intended function return type, and variable types:

def _binarize(x) -> int:
  zeroOrPositive: bool = (x >= 0)
  if zeroOrPositive:
    return 1
  return 0

Minor improvements and fixes

None.

0.116.2

Breaking changes

  • Refactored the ExpressionTransformer class to verify that the target PMML element for the ExpressionTransformer.map_missing_to, ExpressionTransformer.default_value and ExpressionTransformer.invalid_value_treatment attributes is clearly known. If any ambiguities are found, an error is raised suggesting that the expression be refactored from its inline string representation to a UDF representation.

Previously, it was assumed that the target PMML element is the "outermost" transformer element (typically, the Apply element).

Before:

from sklearn2pmml.preprocessing import ExpressionTransformer

# Ambiguous, because yields a hierarchy of Apply elements,
# where the target Apply element (`Apply@function="greaterOrEqual"`)
# is shielded by non-target Apply element (`Apply@function="if"`).
# Furthermore, the map_missing_to value should be boolean, not integer
transformer = ExpressionTransformer("1 if X[0] >= 0 else 0", map_missing_to = 0)

After:

from sklearn2pmml.preprocessing import ExpressionTransformer
from sklearn2pmml.util import Expression

def _binarize(x):
  return (1 if x >= 0 else 0)

# Unambiguous, because yields a single Apply element (`Apply@function="_binarize"`)
transformer = ExpressionTransformer(Expression("_binarize(X[0])", function_defs = [_binarize]), map_missing_to = 0)

See SkLearn2PMML-446

  • Refactored the PMML representation of UDFs.

Previously, they were translated to DerivedField elements, whereas now they are translated to DefineFunction elements.

New features

None.

Minor improvements and fixes

  • Improved the parsing of Python statements.

See SkLearn2PMML-447

  • Ensured compatibility with Python 3.13.

0.116.1

Breaking changes

  • Refactored the cast to Python string data types (str and "unicode").

Previously, the cast was implemented using data container-native apply methods (eg. X.astype(str)). However, these methods act destructively with regard to constant values such as None, numpy.nan, pandas.NA and pandas.NaT, replacing them with the corresponding string literals. For example, a None constant becomes a "None" string.
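The destructive behaviour is easy to reproduce with a plain element-wise cast (stdlib-only illustration):

```python
# A container-native cast turns missing-value constants into string
# literals, losing their "missingness"
values = [None, float("NaN"), "kept"]
casted = [str(v) for v in values]
assert casted == ["None", "nan", "kept"]
```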

The sklearn2pmml.util.cast() utility function that implements casts across SkLearn2PMML transformers now contains an extra masking logic to detect and preserve missing value constants unchanged. This is critical for the correct functioning of downstream missing value-aware steps such as imputers, encoders and expression transformers.

See SkLearn2PMML-445

New features

  • Added LagTransformer.block_indicators and RollingAggregateTransformer.block_indicators attributes.

These attributes enhance the base transformation with "group by" functionality.

For example, calculating a moving average over a mixed stock prices dataset:

from sklearn2pmml.preprocessing import RollingAggregateTransformer

mapper = DataFrameMapper([
  (["stock", "price"], RollingAggregateTransformer(function = "avg", n = 100, block_indicators = ["stock"]))
], input_df = True, df_out = True)
  • Added package up-to-date check.

The Java side of the package computes the timedelta between the current timestamp and the package build timestamp before doing any actual work. If this timedelta is greater than 180 days (6 months), a warning is issued. If it is greater than 360 days (12 months), an error is raised.

Minor improvements and fixes

  • Added LagTransformer.get_feature_names_out() and RollingAggregateTransformer.get_feature_names_out() methods.

  • Fixed the cast of wildcard features.

Previously, if a CastTransformer transformer was applied to a wildcard feature, then the newly assigned operational type was not guaranteed to stick.

See SkLearn2PMML-445

0.116.0

Breaking changes

  • Renamed sklearn2pmml.preprocessing.Aggregator class to AggregateTransformer.

In order to support archived pipeline objects, the SkLearn2PMML package shall keep recognizing the old name variant alongside the new one.

New features

The post-fit tuned target is exposed in the model schema as an extra thresholded(<target field name>) output field.

  • Added support for sklearn2pmml.preprocessing.LagTransformer class.

Implements a "shift" operation using PMML's Lag element.

  • Added support for sklearn2pmml.preprocessing.RollingAggregateTransformer class.

Implements a "rolling aggregate" operation using PMML's Lag element.

The PMML implementation differs from Pandas' default implementation in that it excludes the current row. For example, when using a window size of five, PMML considers the five rows preceding the current row (ie. X.rolling(window = 5, closed = "left")), whereas Pandas considers the four rows preceding the current row plus the current row (ie. X.rolling(window = 5, closed = "right")).

A Pandas-equivalent "rolling aggregate" operation can be emulated using AggregateTransformer and LagTransformer transformers directly.
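The difference in window semantics can be sketched in plain Python (an illustrative helper, not part of the package):

```python
def rolling_avg(values, i, n, closed):
  # closed = "left": n rows strictly preceding row i (PMML Lag semantics)
  # closed = "right": n - 1 preceding rows plus row i (Pandas default)
  if closed == "left":
    window = values[max(0, i - n):i]
  else:
    window = values[max(0, i - n + 1):i + 1]
  return sum(window) / len(window)

prices = [1, 2, 3, 4, 5, 6, 7, 8]

assert rolling_avg(prices, 5, 5, "left") == 3.0 # rows 0..4
assert rolling_avg(prices, 5, 5, "right") == 4.0 # rows 1..5
```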

Minor improvements and fixes

None.

0.115.0

Breaking changes

None.

New features

  • Added support for sklearn2pmml.preprocessing.StringLengthTransformer class.

Minor improvements and fixes

  • Fixed the StringNormalizer.transform(X) method to preserve the original data container shape.

See SkLearn2PMML-443

  • Ensured compatibility with PCRE2 0.5.0.

The 0.5.X development branch underwent breaking changes, with the goal of migrating from a proprietary API to a Python re-compatible API. For example, the compiled pattern object now provides both search(x) and sub(replacement, x) convenience methods.

  • Ensured compatibility with BorutaPy 0.4.3, Category-Encoders 2.6.4, CHAID 5.4.2, Hyperopt-sklearn 1.0.3, Imbalanced-Learn 0.13.0, InterpretML 0.6.9, OptBinning 0.20.1, PyCaret 3.3.2, Scikit-Lego 0.9.4, Scikit-Tree 0.8.0 and TPOT 0.12.2.

0.114.0

Breaking changes

  • Required Java 11 or newer.

New features

None.

Minor improvements and fixes

None.

0.113.0

Breaking changes

None.

New features

None.

Minor improvements and fixes

  • Updated Java libraries.

0.112.1

Breaking changes

None.

New features

  • Added support for in-place file conversion.

If the estimator parameter to the sklearn2pmml.sklearn2pmml(estimator, pmml_path) utility function is a path-like object (eg. pathlib.Path or string), then the Python side of the SkLearn2PMML package passes it forward to the Java side as-is (without attempting to load or modify it).

This opens the door for the safe conversion of legacy and/or potentially harmful Pickle files.

For example, attempting to convert an estimator file of unknown origin and composition to a PMML document:

from sklearn2pmml import sklearn2pmml

sklearn2pmml("/path/to/estimator.pkl", "estimator.pmml")

Minor improvements and fixes

  • Added --version command-line option.

Checking the version of the currently installed sklearn2pmml command-line application:

sklearn2pmml --version
  • Fixed the version standardization transformation.

0.112.0

Breaking changes

  • Required Python 3.8 or newer.

This requirement stems from underlying package requirements, most notably that of the NumPy package (numpy>=1.24).

Portions of the SkLearn2PMML package are still usable with earlier Python versions. For example, the sklearn2pmml.sklearn2pmml(estimator, pmml_path) utility function should work with any Python 2.7, 3.4 or newer version.

  • Migrated setup from distutils to setuptools.

  • Migrated unit tests from nose to pytest.

Testing the package (from a source checkout):

python -m pytest .

New features

  • Added command-line interface to the sklearn2pmml.sklearn2pmml() utility function.

Sample usage:

python -m sklearn2pmml --input pipeline.pkl --output pipeline.pmml

Getting help:

python -m sklearn2pmml --help
  • Added sklearn2pmml command-line application.

Sample usage:

sklearn2pmml -i pipeline.pkl -o pipeline.pmml

Minor improvements and fixes

None.

0.111.2

Breaking changes

None.

New features

  • Separated version transformation into two parts - version standardization (from vendor-extended PMML 4.4 to standard PMML 4.4) and version downgrade (from PMML 4.4 to any earlier PMML version).

Minor improvements and fixes

  • Eliminated the use of temporary file(s) during version transformation.

  • Improved version downgrade.

0.111.1

Breaking changes

  • Refactored the downgrading of PMML schema versions.

Previously, the downgrade failed if the generated PMML document was not strictly compatible with the requested PMML schema version. Also, the downgrade failed if there were any vendor extension attributes or elements around (ie. attributes prefixed with x- or elements prefixed with X-).

The new behaviour is to allow the downgrade to run to completion, and display a grave warning (together with the full list of incompatibilities) in the end.

See SkLearn2PMML-433

  • Updated logging configuration.

The Java backend used to employ SLF4J's default logging configuration, which prints two lines per logging event - the first line being metadata, and the second line being the actual data.

The new logging configuration prints one line per logging event. The decision was to drop the leading metadata line in order to de-clutter the console.

New features

  • Added support for using pcre2 module functions in expressions and predicates.

For example, performing text replacement operation on a string column:

from sklearn2pmml.preprocessing import ExpressionTransformer

# Replace sequences of one or more 'B' characters with a single 'c' character
transformer = ExpressionTransformer("pcre2.substitute('B+', 'c', X[0])")

Minor improvements and fixes

  • Added support for single-quoted multiline strings in expressions and predicates.

0.111.0

Breaking changes

  • Assume re as the default regular expression (RE) flavour.

  • Removed support for multi-column mode from StringNormalizer class. String transformations are unique and rare enough that they should be specified on a column-by-column basis.

New features

  • Added MatchesTransformer.re_flavour and ReplaceTransformer.re_flavour attributes. The Python environment allows choosing between different RE engines, whose RE syntaxes vary to a material degree. Unambiguous identification of the RE engine improves the portability of RE transformers between applications (train vs. deployment) and environments.

Supported RE flavours:

RE flavour Implementation
pcre PCRE package
pcre2 PCRE2 package
re Built-in re module

PMML only supports Perl Compatible Regular Expression (PCRE) syntax.

It is recommended to use some PCRE-based RE engine on Python side as well to minimize the chance of "communication errors" between Python and PMML environments.

  • Added sklearn2pmml.preprocessing.regex.make_regex_engine(pattern, re_flavour) utility function.

This utility function pre-compiles and wraps the specified RE pattern into a sklearn2pmml.preprocessing.regex.RegExEngine object.

The RegExEngine class provides matches(x) and replace(replacement, x) methods, which correspond to PMML's matches and replace built-in functions, respectively.

For example, unit testing a RE engine:

from sklearn2pmml.preprocessing.regex import make_regex_engine

regex_engine = make_regex_engine("B+", re_flavour = "pcre2")

assert regex_engine.matches("ABBA") == True
assert regex_engine.replace("c", "ABBA") == "AcA"

See SkLearn2PMML-228

  • Refactored StringNormalizer.transform(X) and SubstringTransformer.transform(X) methods to support pandas.Series input and output.

See SkLearn2PMML-434

Minor improvements and fixes

  • Ensured compatibility with Scikit-Learn 1.5.1 and 1.5.2.

0.110.0

Breaking changes

None.

New features

  • Added pmml_schema parameter to the sklearn2pmml.sklearn2pmml(estimator, pmml_path) utility function.

This parameter allows downgrading the PMML schema version from the default 4.4 version to any 3.X or 4.X version. However, the downgrade is "soft", meaning that it only succeeds if the in-memory PMML document is naturally compatible with the requested PMML schema version. The downgrade fails if structural changes would be needed.

Exporting a pipeline into a PMML schema version 4.3 document:

from sklearn2pmml import sklearn2pmml

pipeline = Pipeline([...])
pipeline.fit(X, y)

sklearn2pmml(pipeline, "Pipeline-v43.pmml", pmml_schema = "4.3")

See SkLearn2PMML-423

Complex downgrades will be implemented based on customer needs.

Minor improvements and fixes

None.

0.109.0

Breaking changes

None.

New features

  • Added support for Scikit-Learn 1.5.X.

  • Added support for yeo-johnson power transform method in PowerTransformer class.

This method is the default for this transformer.
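For example, fitting a PowerTransformer using the default method (a minimal sketch; the PMML export step is omitted):

```python
import numpy

from sklearn.preprocessing import PowerTransformer

X = numpy.array([[1.0], [2.0], [3.0], [4.0]])

# "yeo-johnson" is the default method; stated explicitly for clarity
transformer = PowerTransformer(method = "yeo-johnson")
Xt = transformer.fit_transform(X)
```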

Minor improvements and fixes

  • Fixed the initialization of Python expression evaluation environments.

The environment is "pre-loaded" with a small number of Python built-in (math, re) and third-party (numpy, pandas, scipy and optionally pcre) module imports.

All imports use canonical module names (eg. import numpy). There is no module name aliasing taking place (eg. import numpy as np). Therefore, the evaluatable Python expressions must also spell out canonical module names.

See SkLearn2PMML-421
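The effect of canonical module names can be sketched with a plain eval(..) call (an illustrative approximation; the actual evaluation environment pre-loads more modules):

```python
import math

import numpy

# Canonical module names only - no aliases such as "np"
env = {"math": math, "numpy": numpy}

# The expression must spell out "numpy.log", not "np.log"
result = eval("numpy.log(X[0])", env, {"X": [1.0]})
```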

  • Added support for log link function in ExplainableBoostingRegressor class.

See SkLearn2PMML-422

0.108.0

Breaking changes

None.

New features

See InterpretML-536

Minor improvements and fixes

  • Ensured compatibility with Scikit-Learn 1.4.2.

0.107.1

Breaking changes

None.

New features

This class implements the isolation forest algorithm using oblique tree models. It is claimed to outperform the H2OIsolationForestEstimator class, which does the same using plain (ie. non-oblique) tree models.

  • Made lightgbm.Booster class directly exportable to PMML.

The SkLearn2PMML package now supports both LightGBM Training API and Scikit-Learn API:

from lightgbm import train, Dataset
from sklearn2pmml import sklearn2pmml

ds = Dataset(data = X, label = y)

booster = train(params = {...}, train_set = ds)

sklearn2pmml(booster, "LightGBM.pmml")
  • Made xgboost.Booster class directly exportable to PMML.

The SkLearn2PMML package now supports both XGBoost Learning API and Scikit-Learn API:

from xgboost import train, DMatrix
from sklearn2pmml import sklearn2pmml

dmatrix = DMatrix(data = X, label = y)

booster = train(params = {...}, dtrain = dmatrix)

sklearn2pmml(booster, "XGBoost.pmml")
  • Added xgboost.Booster.fmap attribute.

This attribute allows overriding the embedded feature map with a user-defined feature map.

The main use case is refining the category levels of categorical features.

A suitable feature map object can be generated from the training dataset using the sklearn2pmml.xgboost.make_feature_map(X) utility function:

from xgboost import train, DMatrix
from sklearn2pmml.xgboost import make_feature_map

# Enable categorical features
dmatrix = DMatrix(X, label = y, enable_categorical = True)

# Generate a feature map with detailed description of all continuous and categorical features in the dataset
fmap = make_feature_map(X)

booster = train(params = {...}, dtrain = dmatrix)
booster.fmap = fmap
  • Added input_float conversion option for XGBoost models.

Minor improvements and fixes

None.

0.107.0

Breaking changes

None.

New features

For example, training and exporting an ExtendedIsolationForest outlier detector into a PMML document:

from sklearn.datasets import load_iris
from sktree.ensemble import ExtendedIsolationForest
from sklearn2pmml import sklearn2pmml

iris_X, iris_y = load_iris(return_X_y = True, as_frame = True)

eif = ExtendedIsolationForest(n_estimators = 13)
eif.fit(iris_X)

sklearn2pmml(eif, "ExtendedIsolationForestIris.pmml")

See SKTree-255

Minor improvements and fixes

None.

0.106.0

Breaking changes

  • Upgraded JPMML-SkLearn library from 1.7(.56) to 1.8(.0).

This is a major API upgrade. The 1.8.X development branch is already source and binary incompatible with earlier 1.5.X through 1.7.X development branches, with more breaking changes to follow suit.

Custom SkLearn2PMML plugins would need to be upgraded and rebuilt.

New features

None.

Minor improvements and fixes

  • Ensured compatibility with Python 3.12.

  • Ensured compatibility with Dill 0.3.8.

0.105.2

Breaking changes

None.

New features

None.

Minor improvements and fixes

  • Improved support for categorical encoding over mixed datatype column sets.

Scikit-Learn transformers such as OneHotEncoder, OrdinalEncoder and TargetEncoder can be applied to several columns in one go. Previously, it was assumed that all columns shared the same data type. If that assumption was violated in practice, they were all force-cast to the string data type.

The JPMML-SkLearn library now detects and maintains the data type on a single column basis.
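For example, encoding two columns of different data types in one go (a minimal Scikit-Learn-side sketch):

```python
import pandas

from sklearn.preprocessing import OneHotEncoder

# A string column and an integer column, encoded together
X = pandas.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 1]})

encoder = OneHotEncoder()
Xt = encoder.fit_transform(X)

# Each column retains its own data type in the learned categories
print(encoder.categories_)
```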

  • Made Category-Encoders classes directly exportable to PMML.

For example, training and exporting a BaseNEncoder transformer into a PMML document for manual analysis and interpretation purposes:

from category_encoders import BaseNEncoder
from sklearn2pmml import sklearn2pmml

transformer = BaseNEncoder(base = 3)
transformer.fit(X, y = None)

sklearn2pmml(transformer, "Base3Encoder.pmml")
  • Fixed support for (category_encoders.utils.)BaseEncoder.feature_names_in_ attribute.

According to SLEP007, the value of a feature_names_in_ attribute should be an array of strings.

Category-Encoders transformers are using a list of strings instead.

  • Refactored ExpressionClassifier and ExpressionRegressor constructors.

The evaluatable object can now also be a string literal.

0.105.1

Breaking changes

None.

New features

The SplineTransformer class computes a B-spline for a feature, which is then used to expand the feature into new features that correspond to B-spline basis elements.

This class is not suitable for simple feature and prediction scaling purposes (eg. calibration of computed probabilities). Consider using the sklearn2pmml.preprocessing.BSplineTransformer class in such a situation.

Scikit-Learn tree and tree ensemble models prepare their inputs by first casting them to (numpy.)float32, and then to (numpy.)float64 (exactly so, even if the input value already happened to be of (numpy.)float64 data type).

PMML does not provide effective means for implementing "chained casts"; the chain must be broken down into elementary cast operations, each of which is represented using a standalone DerivedField element. For example, preparing the "Sepal.Length" field of the iris dataset:

<PMML>
  <DataDictionary>
    <DataField name="Sepal.Length" optype="continuous" dataType="double">
      <Interval closure="closedClosed" leftMargin="4.3" rightMargin="7.9"/>
    </DataField>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="float(Sepal.Length)" optype="continuous" dataType="float">
      <FieldRef field="Sepal.Length"/>
    </DerivedField>
    <DerivedField name="double(float(Sepal.Length))" optype="continuous" dataType="double">
      <FieldRef field="float(Sepal.Length)"/>
    </DerivedField>
  </TransformationDictionary>
</PMML>

Activating the input_float conversion option:

pipeline = PMMLPipeline([
  ("classifier", DecisionTreeClassifier())
])
pipeline.fit(iris_X, iris_y)

# Default mode
pipeline.configure(input_float = False)
sklearn2pmml(pipeline, "DecisionTree-default.pmml")

# "Input float" mode
pipeline.configure(input_float = True)
sklearn2pmml(pipeline, "DecisionTree-input_float.pmml")

This conversion option updates the data type of the "Sepal.Length" data field from double to float, thereby eliminating the need for the first DerivedField element of the two:

<PMML>
  <DataDictionary>
    <DataField name="Sepal.Length" optype="continuous" dataType="float">
      <Interval closure="closedClosed" leftMargin="4.300000190734863" rightMargin="7.900000095367432"/>
    </DataField>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="double(Sepal.Length)" optype="continuous" dataType="double">
      <FieldRef field="Sepal.Length"/>
    </DerivedField>
  </TransformationDictionary>
</PMML>

Changing the data type of a field may have side effects if the field contributes to more than one feature. The effectiveness and safety of configuration options should be verified by integration testing.

  • Added H2OEstimator.pmml_classes_ attribute.

This attribute allows customizing target category levels. It comes in handy when working with ordinal targets, where the H2O.ai framework requires that target category levels are encoded from their original representation to integer index representation.

A fitted H2O.ai ordinal classifier predicts integer indices, which must be manually decoded in the application layer. The JPMML-SkLearn library is able to "erase" this encode-decode helper step from the workflow, resulting in a clean and efficient PMML document:

from h2o.estimators import H2OGeneralizedLinearEstimator
from sklearn2pmml import sklearn2pmml

ordinal_classifier = H2OGeneralizedLinearEstimator(family = "ordinal")
ordinal_classifier.fit(...)

# Customize target category levels
# Note that the default lexicographic ordering of labels is different from their intended ordering
ordinal_classifier.pmml_classes_ = ["bad", "poor", "fair", "good", "excellent"]

sklearn2pmml(ordinal_classifier, "OrdinalClassifier.pmml")

Minor improvements and fixes

  • Fixed the categorical encoding of missing values.

This bug manifested itself when the input column was mixing different data type values. For example, a sparse string column, where non-missing values are strings, and missing values are floating-point numpy.NaN values.

Scikit-Learn documentation warns against mixing string and numeric values within a single column, but it can happen inadvertently when reading a sparse dataset into a Pandas' dataframe using standard library functions (eg. the pandas.read_csv() function).

  • Added Pandas to package dependencies.

See SkLearn2PMML-418

  • Ensured compatibility with H2O.ai 3.46.0.1.

  • Ensured compatibility with BorutaPy 0.3.post0 (92e4b4e).

0.105.0

Breaking changes

None.

New features

  • Added Domain.n_features_in_ and Domain.feature_names_in_ attributes.

This brings domain decorators to conformance with "physical" Scikit-Learn input inspection standards such as SLEP007 and SLEP010.

Domain decorators are natively about "logical" input inspection (ie. establishing and enforcing model's applicability domain).

By combining these two complementary areas of functionality, they now make a great first step for any pipeline:

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn2pmml.decoration import ContinuousDomain

iris_X, iris_y = load_iris(return_X_y = True, as_frame = True)

pipeline = Pipeline([
  # Collect column-oriented model's applicability domain
  ("domain", ContinuousDomain()),
  ("classifier", ...)
])
pipeline.fit(iris_X, iris_y)

# Dynamic properties, delegate to (the attributes of-) the first step
print(pipeline.n_features_in_)
print(pipeline.feature_names_in_)
  • Added MultiDomain.n_features_in_ and MultiDomain.feature_names_in_ attributes.

  • Added support for missing values in tree and tree ensemble models.

Scikit-Learn 1.3 extended the Tree data structure with a missing_go_to_left field. This field indicates the default split direction for each split, and is always present and populated whether the training dataset actually contained any missing values or not.

As a result, Scikit-Learn 1.3 tree models are able to accept and make predictions on sparse datasets, even if they were trained on a fully dense dataset. There is currently no mechanism for a data scientist to tag tree models as "can or cannot be used with missing values".

The JPMML-SkLearn library implements two Tree data structure conversion modes, which can be toggled using the allow_missing conversion option. The default mode corresponds to Scikit-Learn 0.18 through 1.2 behaviour, where a missing input causes the evaluation process to immediately bail out with a missing prediction. The "missing allowed" mode corresponds to Scikit-Learn 1.3 and newer behaviour, where a missing input is ignored, and the evaluation proceeds to the pre-defined child branch until a final non-missing prediction is reached.

Right now, the data scientist must activate the latter mode manually, by configuring allow_missing = True:

from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
  ("classifier", DecisionTreeClassifier())
])
pipeline.fit(X, y)

# Default mode
pipeline.configure(allow_missing = False)
sklearn2pmml(pipeline, "DecisionTree-default.pmml")

# "Missing allowed" mode
pipeline.configure(allow_missing = True)
sklearn2pmml(pipeline, "DecisionTree-missing_allowed.pmml")

Both conversion modes generate standard PMML markup. However, the "missing allowed" mode results in slightly bigger PMML documents (say, up to 10-15%), because the default split direction is encoded using extra Node@defaultChild and Node@id attributes. The size difference disappears when the tree model is compacted.

  • Added support for nullable Pandas' scalar data types.

If the dataset contains sparse columns, then they should be cast from the default NumPy object data type to the most appropriate nullable Pandas' scalar data type. The cast may be performed using a data type object (eg. pandas.BooleanDtype, pandas.Int64Dtype, pandas.Float32Dtype) or its string alias (eg. boolean, Int64, Float32).

This kind of "type hinting" is instrumental to generating high(er) quality PMML documents.
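For example, casting a sparse integer column to the nullable Int64 data type:

```python
import pandas

# A sparse integer column defaults to the float64 data type
s = pandas.Series([1, None, 3])

# "Type hinting" via a nullable Pandas' scalar data type
s = s.astype("Int64")
```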

Minor improvements and fixes

  • Added ExpressionRegressor.normalization_method attribute.

This attribute allows performing some of the most common normalizations atomically.

The list of supported values is none and exp.

  • Refactored ExpressionClassifier.normalization_method attribute.

The list of supported values is none, logit, simplemax and softmax.

  • Fixed the formatting of non-finite tree split values.

It is possible that some tree splits perform comparisons against the positive infinity to indicate "always true" and "always false" conditions (eg. x <= +Inf and x > +Inf, respectively).

Previously, infinite values were formatted using Java's default formatting method, which resulted in Java-style -Infinity and Infinity string literals. They are now detected and replaced with PMML-style -INF and INF (case insensitive) string literals, respectively.
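The replacement logic can be sketched in Python as follows (an illustrative approximation; the actual formatting happens in the Java backend):

```python
import math

def format_split_value(value: float) -> str:
    # Replace Java-style "-Infinity"/"Infinity" with PMML-style "-INF"/"INF"
    if math.isinf(value):
        return "-INF" if value < 0 else "INF"
    return repr(value)
```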

  • Ensured compatibility with CHAID 5.4.1.

0.104.1

Breaking changes

  • Removed sklearn2pmml.ensemble.OrdinalClassifier class.

The uses of this class should be replaced with the uses of the sklego.meta.OrdinalClassifier class (see below), which implements exactly the same algorithm, and offers extra functionality such as calibration and parallelized fitting.

New features

  • Added support for sklego.meta.OrdinalClassifier class.
from pandas import CategoricalDtype, Series
from sklearn.linear_model import LogisticRegression
from sklego.meta import OrdinalClassifier

# A proper ordinal target
y_bin = Series(_bin(y), dtype = CategoricalDtype(categories = [...], ordered = True), name = "bin(y)")

classifier = OrdinalClassifier(LogisticRegression(), use_calibration = True, ...)
# Map categories from objects to integer codes
classifier.fit(X, y_bin.cat.codes.values)

# Store the categories mapping:
# the `OrdinalClassifier.classes_` attribute holds the integer codes,
# and the `OrdinalClassifier.pmml_classes_` attribute holds the corresponding objects
classifier.pmml_classes_ = y_bin.dtype.categories

See Scikit-Lego-607

Minor improvements and fixes

  • Removed the SkLearn-Pandas package from installation requirements.

The sklearn_pandas.DataFrameMapper meta-transformer is giving way to the sklearn.compose.ColumnTransformer meta-transformer in most common pipelines.
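For example, a ColumnTransformer-based equivalent of a typical DataFrameMapper mapping (a minimal sketch with hypothetical column names):

```python
import pandas

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pandas.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": ["a", "b", "a"]})

# Scale the continuous column, one-hot encode the categorical column
transformer = ColumnTransformer([
  ("cont", StandardScaler(), ["x1"]),
  ("cat", OneHotEncoder(), ["x2"])
])
Xt = transformer.fit_transform(X)
```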

  • Fixed the base-N encoding of missing values.

This bug manifested itself when missing values were assigned to a category of their own.

This bug was discovered when rebuilding integration tests with Category-Encoders 2.6(.3). It is currently unclear if the base-N encoding algorithm had its behaviour changed between Category-Encoders 2.5 and 2.6 development lines.

In any case, when using SkLearn2PMML 0.104.1 or newer, it is advisable to upgrade to Category-Encoders 2.6.0 or newer.

  • Ensured compatibility with Category-Encoders 2.6.3, Imbalanced-Learn 0.12.0, OptBinning 0.19.0 and Scikit-Lego 0.7.4.

0.104.0

Breaking changes

  • Updated Scikit-Learn installation requirement from 0.18+ to 1.0+.

This change helps the SkLearn2PMML package to better cope with breaking changes in Scikit-Learn APIs. The underlying JPMML-SkLearn library retains the maximum version coverage, because it is dealing with Scikit-Learn serialized state (Pickle/Joblib or Dill), which is considerably more stable.

New features

  • Added support for Scikit-Learn 1.4.X.

The JPMML-SkLearn library integration tests were rebuilt with Scikit-Learn 1.4.0 and 1.4.1.post1 versions. All supported transformers and estimators passed cleanly.

See SkLearn2PMML-409 and JPMML-SkLearn-195

  • Added support for BaseHistGradientBoosting._preprocessor attribute.

This attribute gets initialized automatically if a HistGradientBoostingClassifier or HistGradientBoostingRegressor estimator is inputted with categorical features.

In Scikit-Learn 1.0 through 1.3 it is necessary to pre-process categorical features manually. The indices of (ordinally-) encoded columns must be tracked and passed to the estimator using the categorical_features parameter:

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

mapper = DataFrameMapper(
  [([cont_col], ContinuousDomain()) for cont_col in cont_cols] +
  [([cat_col], [CategoricalDomain(), OrdinalEncoder()]) for cat_col in cat_cols]
)

regressor = HistGradientBoostingRegressor(categorical_features = [...])

pipeline = Pipeline([
  ("mapper", mapper),
  ("regressor", regressor)
])
pipeline.fit(X, y)

In Scikit-Learn 1.4, this workflow simplifies to the following:

# Activate full Pandas' support by specifying `input_df = True` and `df_out = True` 
mapper = DataFrameMapper(
  [([cont_col], ContinuousDomain()) for cont_col in cont_cols] +
  [([cat_col], CategoricalDomain(dtype = "category")) for cat_col in cat_cols]
, input_df = True, df_out = True)

# Auto-detect categorical features by their data type
regressor = HistGradientBoostingRegressor(categorical_features = "from_dtype")

pipeline = Pipeline([
  ("mapper", mapper),
  ("regressor", regressor)
])
pipeline.fit(X, y)

# Print out feature type information
# This list should contain one or more `True` values
print(pipeline._final_estimator.is_categorical_)

Minor improvements and fixes

  • Improved support for ColumnTransformer.transformers attribute.

Column selection using dense boolean arrays.
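For example, selecting columns using a dense boolean array (a minimal sketch with hypothetical column names):

```python
import numpy
import pandas

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X = pandas.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})

# Select the first and third columns using a dense boolean array
mask = numpy.asarray([True, False, True])

transformer = ColumnTransformer([
  ("scale", StandardScaler(), mask)
])
Xt = transformer.fit_transform(X)
```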

0.103.3

Breaking changes

  • Refactored the PMMLPipeline.customize(customizations: [str]) method into PMMLPipeline.customize(command: str, xpath_expr: str, pmml_element: str).

This method may be invoked any number of times. Each invocation appends a sklearn2pmml.customization.Customization object to the pmml_customizations_ attribute of the final estimator.

The command argument is one of SQL-inspired keywords insert, update or delete (to insert a new element, or to update or delete an existing element, respectively). The xpath_expr is an XML Path (XPath) expression for pinpointing the action site. The XPath expression is evaluated relative to the main model element. The pmml_element is a PMML fragment string.

For example, suppressing the secondary results by deleting the Output element:

pipeline = PMMLPipeline([
  ("classifier", ...)
])
pipeline.fit(X, y)
pipeline.customize(command = "delete", xpath_expr = "//:Output")

New features

  • Added sklearn2pmml.metrics module.

This module provides high-level BinaryClassifierQuality, ClassifierQuality and RegressorQuality pmml classes for the automated generation of PredictiveModelQuality elements for most common estimator types.

Refactoring the v0.103.0 code example:

from sklearn2pmml.metrics import ModelExplanation, RegressorQuality

pipeline = PMMLPipeline([
  ("regressor", ...)
])
pipeline.fit(X, y)

model_explanation = ModelExplanation()
predictive_model_quality = RegressorQuality(pipeline, X, y, target_field = y.name) \
  .with_all_metrics()
model_explanation.append(predictive_model_quality)

pipeline.customize(command = "insert", pmml_element = model_explanation.tostring())
  • Added sklearn2pmml.util.pmml module.

Minor improvements and fixes

  • Added EstimatorProxy.classes_ property.

  • Extracted sklearn2pmml.configuration and sklearn2pmml.customization modules.

0.103.2

Breaking changes

  • Refactored the transform(X) methods of SkLearn2PMML custom transformers to maximally preserve the original type and dimensionality of data containers.

For example, if the input to a single-column transformation is a Pandas' series, and the nature of the transformation allows for it, then the output will also be a Pandas' series. Previously, the output was force-converted into a 2D NumPy array of shape (n_samples, 1).

This change should go unnoticed for the majority of pipelines, as most Scikit-Learn transformers and estimators are quite lenient towards what they accept as input. Any conflicts can be resolved by converting and/or reshaping the data container to a 2D NumPy array manually.

New features

  • Improved support for Pandas' categorical data type.

There is now a clear distinction between "proto" and "post" states of a data type object. A "proto" object is a category string literal or an empty pandas.CategoricalDtype object. A "post" object is a fully initialized pandas.CategoricalDtype object that has been retrieved from some data container (typically, a training dataset).

  • Added ExpressionTransformer.dtype_ attribute.

A fitted ExpressionTransformer object now holds data type information using two attributes. First, the dtype attribute holds the "proto" state - what was requested. Second, the dtype_ attribute holds the "post" state - what was actually found and delivered.

For example:

transformer = ExpressionTransformer(..., dtype = "category")
Xt = transformer.fit_transform(X, y)

# Prints "category" string literal
print(transformer.dtype)

# Prints pandas.CategoricalDtype object
print(transformer.dtype_)
print(transformer.dtype_.categories)
  • Added SeriesConstructor meta-transformer.

This meta-transformer supersedes the DataFrameConstructor meta-transformer for single-column data container conversion needs.

Minor improvements and fixes

  • Added ExpressionTransformer.fit_transform(X, y) method.

  • Added DataFrameConstructor.get_feature_names_out() and SeriesConstructor.get_feature_names_out() methods.

This makes these two meta-transformers compatible with Scikit-Learn's set_output API.

0.103.1

Breaking changes

None.

New features

  • Added support for pandas.CategoricalDtype data type to the DiscreteDomain class and its subclasses.

It has been possible to set the DiscreteDomain.dtype parameter to a Pandas' categorical data type for quite some time. However, up until this point, the JPMML-SkLearn library did not interact with this extra information in any way, because the valid value space (VVS) was constructed solely based on the DiscreteDomain.data_values_ attribute.

The Pandas' categorical data type is not relevant in pure Scikit-Learn workflows. However, it is indispensable for the proper representation of categorical features in LightGBM and XGBoost workflows.

Default usage (the VVS is learned automatically from the training dataset):

domain = CategoricalDomain(..., dtype = "category")

Advanced usage (the VVS is pre-defined):

vvs = [...]

# The DiscreteDomain.data_values parameter expects a list-like of list-likes, hence the double indexing syntax
domain = CategoricalDomain(..., data_values = [vvs], dtype = CategoricalDtype(categories = vvs))

See SkLearn2PMML-411

Minor improvements and fixes

  • Fixed the invalid value replacement for the "as_missing" treatment.

This bug manifested itself in configurations where the DiscreteDomain.missing_value_replacement parameter was unset (meaning "leave as default missing value"), and the DiscreteDomain.missing_values parameter was set to a non-None value (meaning "the default missing value is ").

  • Updated JPMML-LightGBM dependency.

0.103.0

Breaking changes

None.

New features

  • Added PMMLPipeline.customize(customizations) method.

This method accepts one or more PMML fragment strings, which will be embedded into the main model element after all the automated PMML generation routines have been completed. The customizations may replace existing elements, or define completely new elements.

The intended use case is defining model metadata such as ModelStats and ModelExplanation elements.

For example, embedding regression model quality information:

from lxml import etree

pipeline = PMMLPipeline([
  ("regressor", ...)
])
pipeline.fit(X, y)

# Calculate R squared
score = pipeline.score(X, y)

# Generate a PMML 4.4 fragment
model_explanation = etree.Element("{http://www.dmg.org/PMML-4_4}ModelExplanation")
predictive_model_quality = etree.SubElement(model_explanation, "{http://www.dmg.org/PMML-4_4}PredictiveModelQuality")
predictive_model_quality.attrib["targetField"] = y.name
predictive_model_quality.attrib["r-squared"] = str(score)

pipeline.customize(etree.tostring(model_explanation))

See SkLearn2PMML-410

Minor improvements and fixes

  • Fixed the scoping of target fields in StackingClassifier and StackingRegressor estimators.

See JPMML-SkLearn-192

  • Updated all JPMML-Converter library dependencies to latest versions.

0.102.0

Breaking changes

  • Changed the default value of Domain.with_statistics attribute from True to False.

This attribute controls the calculation of descriptive statistics during the fitting. The calculation of some descriptive statistics is costly (eg. interquartile range, median, standard deviation), which causes a notable slow-down of the Domain.fit(X, y) method.

The descriptive statistics about the training dataset are stored using the ModelStats element under the main model element (ie. the /PMML/<Model>/ModelStats element). They are there for information purposes only. Their presence or absence does not affect the predictive capabilities of the model in any way.

New features

  • Fixed the Domain.transform(X) method to preserve the X argument unchanged.

If the domain decorator needs to modify the dataset in any way (eg. performing missing or invalid value replacement), then it will create a copy of the argument dataset before modifying it. Otherwise, the argument dataset is passed through as-is.

This aligns domain decorators with Scikit-Learn API guidelines that transformers and transformer-likes should not tamper with the original dataset.

  • Added support for One-Model-Per-Target (OMPT)-style multi-target XGBoost estimators.

When the XGBClassifier.fit(X, y) and XGBRegressor.fit(X, y) methods are passed a multi-column y dataset, XGBoost trains an OMPT-style multi-target model by default.

An OMPT-style multi-target model is functionally identical to a collection of single-target models, as all targets are handled one-by-one both during fitting and prediction. In other words, the use of MultiOutputClassifier and MultiOutputRegressor meta-estimators is now deprecated when modelling multi-target datasets with XGBoost estimators.

Before:

from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

X = ...
# A multi-column 2D array
ynd = ...

regressor = MultiOutputRegressor(XGBRegressor())
regressor.fit(X, ynd)

After:

regressor = XGBRegressor()
regressor.fit(X, ynd)
  • Ensured XGBoost 2.0 compatibility:
    • Improved the partitioning of the main trees array into sub-arrays based on model type (boosting vs. bagging) and target cardinality (single-target vs. multi-target).
    • Improved support for early stopping.

See JPMML-XGBoost v1.8.2

Earlier SkLearn2PMML package versions may accept and convert XGBoost 2.0 models without errors, but the resulting PMML document may contain an ensemble model with a wrong selection and/or wrong number of member tree models in it. These kinds of conversion issues can be easily detected by embedding the model verification dataset into the model.

Minor improvements and fixes

  • Improved support for XGBClassifier.classes_ property.

This member was promoted from attribute to property during the XGBoost 1.7 to 2.0 upgrade, thereby making it "invisible" in non-Python environments.

The temporary workaround was to manually re-assign this property to an XGBClassifier.pmml_classes_ attribute. While the above workaround continues to be relevant with advanced targets (eg. string-valued category levels), it is no longer needed for default targets.

See SkLearn2PMML-402

  • Added GBDTLRClassifier.classes_ property.

0.101.0

Breaking changes

  • Renamed the DiscreteDomain.data_ attribute to data_values_.

New features

  • Added support for multi-column mode to the DiscreteDomain class and its subclasses (CategoricalDomain and OrdinalDomain).

This brings discrete domain decorators to functional parity with continuous domain decorators, which have been supporting both single-column and multi-column mode for years.

Before:

from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

cat_cols = [...]
cont_cols = [...]

mapper = DataFrameMapper(
  # Map continuous columns in one go
  [([cont_cols], ContinuousDomain(...))] +
  # Map categorical columns one by one
  [([cat_col], CategoricalDomain(...)) for cat_col in cat_cols]
)

After:

mapper = DataFrameMapper([
  # Map both continuous and categorical columns in one go
  ([cont_cols], ContinuousDomain(...)),
  ([cat_cols], CategoricalDomain(...))
])
  • Added support for user-defined valid value spaces:
    • ContinuousDomain.data_min and ContinuousDomain.data_max parameters (scalars, or a list-like of scalars depending on the multiplicity).
    • The DiscreteDomain.data_values parameter (a list-like, or a list-like of list-likes depending on multiplicity).

This allows the data scientist to specify valid value spaces that differ from (and are typically wider than) the valid value space that would be inferred from the training dataset during fitting.

Extending the valid value space for the "iris" dataset:

from sklearn.datasets import load_iris
from sklearn2pmml.decoration import ContinuousDomain
from sklearn_pandas import DataFrameMapper

iris_X, iris_y = load_iris(return_X_y = True, as_frame = True)

columns = iris_X.columns.values

# Extend all feature bounds to [0.0 .. 10.0]
data_usermin = [0.0] * len(columns)
data_usermax = [10.0] * len(columns)

mapper = DataFrameMapper([
  (columns.tolist(), ContinuousDomain(data_min = data_usermin, data_max = data_usermax))
])
mapper.fit_transform(iris_X, iris_y)
  • Improved support for "category" data type in the CastTransformer.fit(X, y) method.

If the CastTransformer.dtype parameter value is "category" (i.e. a string literal), then the fit method will auto-detect valid category levels, and will set the CastTransformer.dtype_ attribute to a pandas.CategoricalDtype object instead. Subsequent transform method invocations are now guaranteed to exhibit stable transformation behaviour. Previously, each method call computed its own set of valid category levels.
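The stabilized behaviour can be sketched with plain Pandas (the actual logic lives inside CastTransformer): auto-detect a fixed pandas.CategoricalDtype once during fitting, then reuse it for every subsequent cast.

```python
import pandas

train = pandas.Series(["a", "b", "a", None])

# "Fitting": auto-detect valid category levels once, skipping missing values
dtype = pandas.CategoricalDtype(categories = sorted(train.dropna().unique()))

# "Transforming": every cast now uses the same fixed set of levels
test = pandas.Series(["b", "c"])
cast = test.astype(dtype)

# The unseen level "c" maps to a missing value instead of widening the dtype
assert cast.isna().tolist() == [False, True]
```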

  • Added the Domain class to the sklearn.base.OneToOneFeatureMixin class hierarchy.

This makes domain decorators compatible with Scikit-Learn's set_output API.

Choosing a data container for transformation results:

from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer([
  ("cont", ContinuousDomain(...), cont_cols),
  ("cat", CategoricalDomain(...), cat_cols)
])

# Force the output data container to be a Pandas' dataframe (rather than a NumPy array)
transformer.set_output(transform = "pandas")
  • Added CastTransformer and IdentityTransformer classes to the sklearn.base.OneToOneFeatureMixin class hierarchy.

This makes these two transformers compatible with Scikit-Learn's set_output API.

  • Added Memorizer.get_feature_names_out() and Recaller.get_feature_names_out() methods.

This makes memory managers compatible with Scikit-Learn's set_output API.

Minor improvements and fixes

  • Updated formal package requirements to scikit-learn >= 1.0, numpy >= 1.22(.4) and pandas >= 1.5(.3).

  • Optimized ContinuousDomain.fit(X, y) and DiscreteDomain.fit(X, y) methods.

  • Stopped auto-adding the DiscreteDomain.missing_value_replacement parameter value into the valid value space of discrete domains.

The missing value replacement value should occur in the training set naturally. If not, it would be more appropriate to manually define the valid value space using the newly introduced DiscreteDomain.data_values parameter.

  • Improved handling of missing values in the CastTransformer.fit(X, y) method.

Previously, it was possible for the float("NaN") value to be included in the list of valid category levels when casting sparse string columns to the categorical data type.

  • Added sklearn2pmml.util.to_numpy(X) utility function.
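For illustration, a behaviourally similar helper could look as follows (a hypothetical re-implementation, not the actual sklearn2pmml.util code):

```python
import numpy
import pandas

def to_numpy(X):
    # Hypothetical sketch: unwrap Pandas containers via their to_numpy()
    # method, and pass everything else through numpy.asarray
    if isinstance(X, (pandas.DataFrame, pandas.Series)):
        return X.to_numpy()
    return numpy.asarray(X)

Xnd = to_numpy(pandas.DataFrame({"x" : [1, 2], "y" : [3, 4]}))
assert isinstance(Xnd, numpy.ndarray)
assert Xnd.shape == (2, 2)
```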