Add TOPKAT and PROB-STD methods with tests #480
| distribution that lies on the wrong side of the classification threshold (0.5).
|
| This approach requires a fitted ensemble model exposing the ``estimators_``
| attribute (e.g., RandomForestRegressor or BaggingRegressor), where each
Add backticks `` around model names. Also, RandomForestRegressor alone is enough as an example.
We should also support classifiers with a .predict_proba() method here.
| References
| ----------
| .. [1] `Klingspohn, W., Mathea, M., ter Laak, A. et al.
| Efficiency of different measures for defining the applicability
Add quotation marks around the paper name, and move the journal name to the next line.
| def _compute_prob_std(self, X: np.ndarray) -> np.ndarray:
|     X = validate_data(self, X=X, reset=False)
|
|     preds = np.array([est.predict(X) for est in self.model.estimators_]).T
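For readers following the hunk above, here is a minimal sketch of how a PROB-STD-style score could be derived from the `preds` matrix. It is not the PR's implementation. It assumes each sample's ensemble outputs are summarized by a normal distribution with the ensemble mean and the population standard deviation, and that the score is the mass of that distribution on the wrong side of the 0.5 threshold. The function name `prob_std_score` is hypothetical.

```python
import math

import numpy as np


def prob_std_score(preds: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Hypothetical helper: preds has shape (n_samples, n_estimators), values in [0, 1]."""
    mean = preds.mean(axis=1)
    std = preds.std(axis=1)  # assumption: population std (ddof=0)

    scores = np.empty(len(mean))
    for i, (m, s) in enumerate(zip(mean, std)):
        if s == 0:
            # All estimators agree, so no mass crosses the threshold
            # (unless the mean sits exactly on it).
            scores[i] = 0.5 if m == threshold else 0.0
            continue

        # Mass of N(m, s) that lies below the threshold (normal CDF).
        below = 0.5 * (1.0 + math.erf((threshold - m) / (s * math.sqrt(2.0))))

        # "Wrong side" is below the threshold for a positive prediction,
        # and above it for a negative prediction.
        scores[i] = below if m >= threshold else 1.0 - below

    return scores
```

Under these assumptions the score falls in [0, 0.5]: values near 0 mean the ensemble is confident and consistent, and values near 0.5 mean the prediction is close to a coin flip.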
skfp/applicability_domain/topkat.py
Outdated
| from skfp.bases.base_ad_checker import BaseADChecker
|
|
| class TopKatADChecker(BaseADChecker):
skfp/applicability_domain/topkat.py
Outdated
| and a weighted distance (dOPS) from the center is computed.
|
| Samples are considered in-domain if their dOPS is below a threshold. By default,
| this threshold is computed as ``5 * D / (2 * N)``, where:
Use :math: instead of backticks for math formulas, in entire docstring
skfp/applicability_domain/topkat.py
Outdated
| self.S_ = (2 * X - self.X_max_ - self.X_min_) / np.where(
|     (self.X_max_ - self.X_min_) != 0, (self.X_max_ - self.X_min_), 1.0
| )
Break this into separate variables. You are also using max - min at least 3 times here.
| threshold = self.threshold
| if threshold is None:
|     threshold = (5 * self.num_dims) / (2 * self.num_points)
Add one empty line after the if block for readability.
skfp/applicability_domain/topkat.py
Outdated
| self.num_points = X.shape[0]
| self.num_dims = X.shape[1]
|
| self.S_ = (2 * X - self.X_max_ - self.X_min_) / np.where(
S_ might be a confusing name, especially for future maintainers without a mathematical background. Please add code comments that briefly explain what we are computing. It's very technical, so there's no need to include this information in the documentation.
Also if I'm seeing correctly, S_ is not used in other methods. I think there's no need to make it a member of the class. Correct me if I'm wrong
| either ``.predict(X)`` or ``.predict_proba(X)`` method on each sub-estimator.
| If not provided, a default :class:`~sklearn.ensemble.RandomForestRegressor` will be created.
|
| threshold : float, default=0.2
It's just a heuristic I assumed to be a generally good starting point. Should we use something like 0.5 or 1.0 instead?
| This approach supports both regression models (using ``.predict(X)`` with outputs
| interpretable as positive-class probabilities in [0, 1], e.g., regressors trained
| on binary targets) and binary classifiers (using ``.predict_proba(X)`` and the
| probability of the positive class). The ensemble model must expose the ``estimators_``
The text in parentheses is too long. Just make these regular sentences.
| ):
|     if self.model is None:
|         X, y = validate_data(self, X, y, ensure_2d=False)
|         self.model_ = RandomForestRegressor(n_estimators=10, random_state=0)
Do not set n_estimators; rely on the default sklearn value.
| if preds.shape[2] == 2:
|     preds = preds[:, :, 1]  # shape: (n_estimators, n_samples)
| else:
|     raise ValueError("Only binary classifiers are supported.")
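To make the shapes in this hunk concrete, here is a small sketch of stacking per-estimator `predict_proba` outputs and keeping only the positive-class column. The helper name `positive_class_probs` is hypothetical, and the inputs are plain arrays standing in for `est.predict_proba(X)` results, so the example runs without a fitted model.

```python
import numpy as np


def positive_class_probs(proba_per_estimator: list) -> np.ndarray:
    """Hypothetical helper: each list item mimics one estimator's predict_proba(X) output."""
    # Stacked shape: (n_estimators, n_samples, n_classes)
    preds = np.array(proba_per_estimator)

    if preds.ndim != 3 or preds.shape[2] != 2:
        raise ValueError("Only binary classifiers are supported.")

    # Keep the positive-class column; shape: (n_estimators, n_samples)
    return preds[:, :, 1]
```

Note that the result is estimator-major, so a transpose would be needed to match the `(n_samples, n_estimators)` layout produced by the regressor branch with `.T`.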
skfp/applicability_domain/topkat.py
Outdated
| - ``D`` is the number of input features,
| - ``N`` is the number of training samples.
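As a quick worked example of the default threshold described in this docstring, the sketch below evaluates `5 * D / (2 * N)` directly. The function and parameter names are hypothetical; the PR itself reads these values from `X.shape`.

```python
def default_topkat_threshold(n_features: int, n_samples: int) -> float:
    """Hypothetical helper: default dOPS threshold 5 * D / (2 * N)."""
    return (5 * n_features) / (2 * n_samples)


# With D = 4 features and N = 100 training samples: 20 / 200 = 0.1
example_threshold = default_topkat_threshold(4, 100)
```

The threshold grows with the number of features and shrinks as more training samples become available.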
skfp/applicability_domain/topkat.py
Outdated
| .. [1] Gombar, V. K. (1996).
| Method and apparatus for validation of model-based predictions.
| U.S. Patent No. 6,036,349. Washington, DC: U.S. Patent and Trademark Office.
Add backticks and quotation marks around the name, add a link to the Google Patents page as in the other citations, and remove the year from the author list.
| # TOPKAT S-space: feature-wise scaling of X to [-1, 1].
| # Avoid division by zero: where range==0, denom=1 => scaled value will be 0.
| self.denom_ = np.where((self.range_) != 0, (self.range_), 1.0)
| S = (2 * X - self.X_max_ - self.X_min_) / self.denom_
Here you can also use range_, right?
range_ is X_max_ - X_min_, but the numerator here is essentially 2 * X - (X_max_ + X_min_), so we can't directly replace it with range_.
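The point of this exchange can be illustrated with a standalone sketch of the scaling step: the range only appears in the denominator, while the numerator subtracts the sum of the per-feature maximum and minimum. The function name `s_space_transform` and its arguments are hypothetical stand-ins for the fitted attributes in the PR.

```python
import numpy as np


def s_space_transform(X: np.ndarray, X_min: np.ndarray, X_max: np.ndarray) -> np.ndarray:
    """Hypothetical helper: scale each feature to [-1, 1] using training min and max."""
    range_ = X_max - X_min

    # Constant features have zero range; dividing by 1 instead avoids
    # division by zero. For training points of such a feature the
    # numerator is 0, so the scaled value is 0.
    denom = np.where(range_ != 0, range_, 1.0)

    # Numerator is 2 * X - (X_max + X_min): a sum, not the range.
    return (2 * X - X_max - X_min) / denom


X_train = np.array([[0.0, 5.0], [10.0, 5.0]])
S = s_space_transform(X_train, X_train.min(axis=0), X_train.max(axis=0))
# First feature maps to -1 and 1; the constant second feature maps to 0.
```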
| X = validate_data(self, X=X, reset=False)
|
| # Apply the same S-space transform as in fit().
| Ssample = (2 * X - self.X_max_ - self.X_min_) / self.denom_
self.range_ can be used here, right?
Changes
Added PROB-STD and TOPKAT applicability domain (AD) checkers, and provided tests. Part of #424
Checklist before requesting a review
make test-coverage)
make docs and see docs/_build/index.html)