Add TOPKAT and PROB-STD methods with tests by Kacper-Kozubowski · Pull Request #480 · MLCIL/scikit-fingerprints

Kacper-Kozubowski · 2025-08-08T07:29:17Z

Changes

Added PROB-STD and TOPKAT applicability domain (AD) checkers, and provided tests. Part of #424

Checklist before requesting a review

Docstrings added/updated in public functions and classes
Tests added, reasonable test coverage (at least ~90%, make test-coverage)
Sphinx docs added/updated and render properly (make docs and see docs/_build/index.html)

j-adamczyk · 2025-08-08T08:54:59Z

skfp/applicability_domain/prob_std.py

+    distribution that lies on the wrong side of the classification threshold (0.5).
+
+    This approach requires a fitted ensemble model exposing the ``estimators_``
+    attribute (e.g., RandomForestRegressor or BaggingRegressor), where each


Add backticks `` for model names. Also, just RandomForestRegressor is enough as an example

We should also support classifiers with a .predict_proba() method here.

j-adamczyk · 2025-08-08T08:56:09Z

skfp/applicability_domain/prob_std.py

+    References
+    ----------
+    .. [1] `Klingspohn, W., Mathea, M., ter Laak, A. et al.
+        Efficiency of different measures for defining the applicability


Add quotes " for paper name, and move journal name to next line

j-adamczyk · 2025-08-08T08:59:44Z

skfp/applicability_domain/prob_std.py

+    def _compute_prob_std(self, X: np.ndarray) -> np.ndarray:
+        X = validate_data(self, X=X, reset=False)
+
+        preds = np.array([est.predict(X) for est in self.model.estimators_]).T


Why transpose this?
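
On the transpose question: a minimal sketch, assuming each sub-estimator returns one prediction per sample. `OffsetEstimator` is a hypothetical stand-in for the entries of `estimators_`, not part of the PR. Stacking the per-estimator predictions gives an array of shape (n_estimators, n_samples). The transpose only changes which axis the statistics are reduced over.

```python
import numpy as np


class OffsetEstimator:
    """Hypothetical stand-in for one fitted sub-estimator of an ensemble."""

    def __init__(self, offset: float):
        self.offset = offset

    def predict(self, X: np.ndarray) -> np.ndarray:
        # One prediction per row of X.
        return X.sum(axis=1) + self.offset


X = np.ones((5, 3))  # 5 samples, 3 features
estimators = [OffsetEstimator(o) for o in (0.1, 0.2, 0.3, 0.4)]

# Stacking a list of 1D prediction arrays puts estimators on axis 0.
stacked = np.array([est.predict(X) for est in estimators])  # (n_estimators, n_samples)
per_sample = stacked.T  # (n_samples, n_estimators)

# Both give one standard deviation per sample.
std_without_transpose = stacked.std(axis=0)
std_with_transpose = per_sample.std(axis=1)
```

Under these assumptions the transpose is a readability choice: it keeps samples on the first axis, matching the usual scikit-learn layout, but reducing over `axis=0` on the untransposed array yields the same values.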

j-adamczyk · 2025-08-08T08:59:57Z

skfp/applicability_domain/topkat.py

+from skfp.bases.base_ad_checker import BaseADChecker
+
+
+class TopKatADChecker(BaseADChecker):


TOPKAT, all capital letters

j-adamczyk · 2025-08-08T09:00:22Z

skfp/applicability_domain/topkat.py

+    and a weighted distance (dOPS) from the center is computed.
+
+    Samples are considered in-domain if their dOPS is below a threshold. By default,
+    this threshold is computed as ``5 * D / (2 * N)``, where:


Use :math: instead of backticks for math formulas, in entire docstring

j-adamczyk · 2025-08-08T09:01:15Z

skfp/applicability_domain/topkat.py

+        self.S_ = (2 * X - self.X_max_ - self.X_min_) / np.where(
+            (self.X_max_ - self.X_min_) != 0, (self.X_max_ - self.X_min_), 1.0
+        )


Break this into separate variables. You are also using max - min at least 3 times here.
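
One way the suggested split could look, as a sketch rather than the code that was eventually merged. The function name `scale_to_s_space` and its local variable names are illustrative assumptions. The arithmetic is the same as in the hunk above.

```python
import numpy as np


def scale_to_s_space(X: np.ndarray, X_min: np.ndarray, X_max: np.ndarray) -> np.ndarray:
    """Feature-wise scaling of X so that training values fall in [-1, 1]."""
    # Compute max - min once and reuse it.
    feature_range = X_max - X_min

    # Constant features have zero range; dividing by 1 instead maps them to 0.
    safe_range = np.where(feature_range != 0, feature_range, 1.0)

    return (2 * X - X_max - X_min) / safe_range


X_train = np.array([[0.0, 5.0, 1.0], [10.0, 5.0, 3.0], [5.0, 5.0, 2.0]])
S = scale_to_s_space(X_train, X_train.min(axis=0), X_train.max(axis=0))
```

Here the middle column is constant, so its scaled values are all 0 instead of triggering a division by zero.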

j-adamczyk · 2025-08-08T09:01:40Z

skfp/applicability_domain/topkat.py

+
+        threshold = self.threshold
+        if threshold is None:
+            threshold = (5 * self.num_dims) / (2 * self.num_points)


Add one empty line after the if block for readability.
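
For reference, a small sketch of the default-threshold logic from this hunk with the requested spacing. `resolve_threshold` is a hypothetical helper name used only for illustration. The formula `5 * D / (2 * N)` is the one quoted in the docstring.

```python
from typing import Optional


def resolve_threshold(threshold: Optional[float], num_dims: int, num_points: int) -> float:
    """Return the user-supplied threshold, or the default 5 * D / (2 * N)."""
    if threshold is None:
        threshold = (5 * num_dims) / (2 * num_points)

    return threshold


default_value = resolve_threshold(None, num_dims=10, num_points=100)
explicit_value = resolve_threshold(0.3, num_dims=10, num_points=100)
```

With 10 features and 100 training samples the default evaluates to 0.25. An explicit value is passed through unchanged.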

my-alaska · 2025-08-11T05:19:43Z

skfp/applicability_domain/topkat.py

+        self.num_points = X.shape[0]
+        self.num_dims = X.shape[1]
+
+        self.S_ = (2 * X - self.X_max_ - self.X_min_) / np.where(


S_ might be a confusing name. Especially for future maintainers without mathematical background. Please add code comments that briefly explain what we are computing. It's very technical so there's no need to include this information in documentation

Also if I'm seeing correctly, S_ is not used in other methods. I think there's no need to make it a member of the class. Correct me if I'm wrong

j-adamczyk · 2025-08-12T11:16:58Z

skfp/applicability_domain/prob_std.py

+        either ``.predict(X)`` or ``.predict_proba(X)`` method on each sub-estimator.
+        If not provided, a default :class:`~sklearn.ensemble.RandomForestRegressor` will be created.
+
+    threshold : float, default=0.2


Why this threshold?

It's just a heuristic I assumed to be a generally good starting point. Should we use something like 0.5 or 1.0 instead?
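
To put the candidate values in context, here is a rough sketch of a PROB-STD style score. It assumes the score is the mass of a normal distribution, fitted to the per-estimator predictions, that lies on the wrong side of 0.5, as the docstring excerpt describes. The function name and the handling of zero standard deviation are assumptions, not the PR's implementation.

```python
import math

import numpy as np


def prob_std_score(preds: np.ndarray) -> np.ndarray:
    """preds: (n_samples, n_estimators) array of values in [0, 1]."""
    preds = np.asarray(preds, dtype=float)
    mean = preds.mean(axis=1)
    std = preds.std(axis=1)

    # Distance of the mean prediction from the 0.5 decision threshold.
    margin = np.abs(mean - 0.5)

    with np.errstate(divide="ignore", invalid="ignore"):
        z = -margin / std
    # Zero spread: no mass on the wrong side unless the mean sits exactly at 0.5.
    z = np.where(std == 0, np.where(margin > 0, -np.inf, 0.0), z)

    # Standard normal CDF at z, i.e. the mass beyond the threshold.
    erf = np.vectorize(math.erf)
    return 0.5 * (1.0 + erf(z / math.sqrt(2.0)))


scores = prob_std_score(
    np.array([[0.9, 0.9, 0.9], [0.6, 0.8, 0.7], [0.1, 0.9, 0.5]])
)
```

Under this definition the score can never exceed 0.5, because at most half of a symmetric distribution lies beyond a threshold on the far side of its mean. A threshold of 0.5 or 1.0 would therefore accept every sample, which may be relevant when choosing the default.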

j-adamczyk · 2025-08-12T11:17:48Z

skfp/applicability_domain/prob_std.py

+    This approach supports both regression models (using ``.predict(X)`` with outputs
+    interpretable as positive-class probabilities in [0, 1], e.g., regressors trained
+    on binary targets) and binary classifiers (using ``.predict_proba(X)`` and the
+    probability of the positive class). The ensemble model must expose the ``estimators_``


The text in parentheses is too long. Just make these regular sentences.

j-adamczyk · 2025-08-12T11:18:30Z

skfp/applicability_domain/prob_std.py

+    ):
+        if self.model is None:
+            X, y = validate_data(self, X, y, ensure_2d=False)
+            self.model_ = RandomForestRegressor(n_estimators=10, random_state=0)


Do not set n_estimators, rely on default sklearn value

j-adamczyk · 2025-08-12T11:19:12Z

skfp/applicability_domain/prob_std.py

+            if preds.shape[2] == 2:
+                preds = preds[:, :, 1]  # shape: (n_estimators, n_samples)
+            else:
+                raise ValueError("Only binary classifiers are supported.")


Move to validate_params
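
A sketch of what an up-front check might look like, assuming the classifier exposes the fitted `classes_` attribute as scikit-learn classifiers do. The helper names are hypothetical, and how this would plug into the library's own parameter validation is not shown in this thread.

```python
from types import SimpleNamespace

import numpy as np


def check_binary_classifier(model) -> None:
    """Reject classifiers with more or fewer than two classes before predicting."""
    classes = getattr(model, "classes_", None)
    if classes is not None and len(classes) != 2:
        raise ValueError("Only binary classifiers are supported.")


def positive_class_proba(stacked_proba: np.ndarray) -> np.ndarray:
    """stacked_proba: (n_estimators, n_samples, 2) -> (n_estimators, n_samples)."""
    return stacked_proba[:, :, 1]


# Stand-ins for fitted models; only the attribute used above is present.
binary_model = SimpleNamespace(classes_=np.array([0, 1]))
multiclass_model = SimpleNamespace(classes_=np.array([0, 1, 2]))

check_binary_classifier(binary_model)  # passes silently

try:
    check_binary_classifier(multiclass_model)
    multiclass_rejected = False
except ValueError:
    multiclass_rejected = True

proba = positive_class_proba(np.full((4, 3, 2), 0.5))
```

With the check done once during validation, the slicing step no longer needs its own `else` branch.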

j-adamczyk · 2025-08-12T11:19:43Z

skfp/applicability_domain/topkat.py

+    - ``D`` is the number of input features,
+    - ``N`` is the number of training samples.


Use :math:`D` here, and similarly :math:`N`

j-adamczyk · 2025-08-12T11:20:30Z

skfp/applicability_domain/topkat.py

+    .. [1] Gombar, V. K. (1996).
+       Method and apparatus for validation of model-based predictions.
+       U.S. Patent No. 6,036,349. Washington, DC: U.S. Patent and Trademark Office.


Use backticks, put quotation marks around the name, add a link to the Google Patents page as in the other citations, and remove the year from the author list

j-adamczyk · 2025-08-12T11:21:28Z

skfp/applicability_domain/topkat.py

+        # TOPKAT S-space: feature-wise scaling of X to [-1, 1].
+        # Avoid division by zero: where range==0, denom=1 => scaled value will be 0.
+        self.denom_ = np.where((self.range_) != 0, (self.range_), 1.0)
+        S = (2 * X - self.X_max_ - self.X_min_) / self.denom_


Here you can also use range_ right?

range_ is X_max_ - X_min_, but here we essentially have - (X_max + X_min) so we can't directly replace it with range_.
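
The point can be checked numerically with a small sketch. The variable names are illustrative only. The numerator subtracts the sum `X_max + X_min`, while the range is the difference `X_max - X_min`, so the range only replaces the denominator. The same scaling can also be written as min-max scaling stretched to [-1, 1].

```python
import numpy as np

X = np.array([[0.0, 2.0], [5.0, 4.0], [10.0, 8.0]])
X_min = X.min(axis=0)
X_max = X.max(axis=0)

value_range = X_max - X_min  # difference: belongs in the denominator
bounds_sum = X_max + X_min  # sum: what the numerator subtracts

# Form used in the PR.
s_from_sum = (2 * X - bounds_sum) / value_range

# Equivalent min-max form: 2 * (X - min) / (max - min) - 1.
s_from_minmax = 2 * (X - X_min) / value_range - 1

# Substituting the range into the numerator gives a different result.
s_wrong = (2 * X - value_range) / value_range
```

The two correct forms agree and span exactly [-1, 1] on the training data. The substituted version matches only for features whose minimum is 0, as in the first column here.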

j-adamczyk · 2025-08-12T11:22:28Z

skfp/applicability_domain/topkat.py

+        X = validate_data(self, X=X, reset=False)
+
+        # Apply the same S-space transform as in fit().
+        Ssample = (2 * X - self.X_max_ - self.X_min_) / self.denom_


self.range_ can be used here, right?

Add TOPKAT & PROB-STD methods with tests

965da97

Kacper-Kozubowski requested review from j-adamczyk, mjste and my-alaska as code owners August 8, 2025 07:29

j-adamczyk requested changes Aug 8, 2025

my-alaska requested changes Aug 11, 2025

Kacper Kozubowski added 2 commits August 11, 2025 13:09

Refactor ProbStd and TOPKAT

7c27fd3

Refactor ProbStd

7b423ba

Kacper-Kozubowski requested review from j-adamczyk and my-alaska August 12, 2025 06:50

j-adamczyk requested changes Aug 12, 2025

Refactor ProbStd and small documentation changes

ffa6656

Kacper-Kozubowski requested a review from j-adamczyk August 13, 2025 08:12

Small fixes for parameter validation

714fec2

j-adamczyk previously approved these changes Aug 13, 2025

Fix tests

37a743f

j-adamczyk dismissed their stale review via 37a743f August 13, 2025 12:13

j-adamczyk self-requested a review August 13, 2025 21:39

j-adamczyk previously approved these changes Aug 13, 2025

my-alaska previously approved these changes Aug 15, 2025

Sync with master

2f4d79e

j-adamczyk dismissed stale reviews from my-alaska and themself via 2f4d79e August 15, 2025 18:07

j-adamczyk self-requested a review August 15, 2025 18:08

j-adamczyk previously approved these changes Aug 15, 2025

Try to fix CI

a2430de

j-adamczyk dismissed their stale review via a2430de August 19, 2025 10:36

j-adamczyk added 3 commits August 19, 2025 17:10

Try to fix CI

ace45e6

Try to fix CI

333323e

Try to fix CI

240150e

j-adamczyk approved these changes Aug 19, 2025

j-adamczyk merged commit db8de97 into master Aug 19, 2025
13 checks passed

j-adamczyk deleted the ad_probstd_topkat branch August 19, 2025 15:43


3 participants
