.. _conditional_feature_importance:


==============================
Conditional Feature Importance
==============================

Conditional Feature Importance (CFI) is a model-agnostic approach for quantifying the
relevance of individual features, or groups of features, in predictive models. It is a
perturbation-based method that compares the predictive performance of a model on
unmodified test data (following the same distribution as the training data) to its
performance when the studied feature is conditionally perturbed. Thus, this approach
does not require retraining the model.

.. figure:: ../generated/gallery/examples/images/sphx_glr_plot_cfi_001.png
   :target: ../generated/gallery/examples/plot_cfi.html
   :align: center


Theoretical index
------------------

Conditional Feature Importance (CFI) is a model-agnostic method for estimating feature
importance through conditional perturbations. Specifically, it constructs a perturbed
version of the feature, :math:`X_j^p`, sampled independently from the conditional
distribution :math:`P(X_j | X_{-j})`, such that its association with the output is removed:
:math:`X_j^p \perp\!\!\!\perp Y \mid X_{-j}`. The predictive model is then evaluated on the
modified feature vector :math:`\tilde X = [X_1, ..., X_j^p, ..., X_p]`, and the
importance of the feature is quantified by the resulting drop in model performance:

.. math::
    \psi_j^{CFI} = \mathbb{E} [\mathcal{L}(y, \mu(\tilde X))] - \mathbb{E} [\mathcal{L}(y, \mu(X))].

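
In practice, these expectations are replaced by empirical averages over a held-out test
set. A natural empirical counterpart (a sketch, with :math:`n` test samples
:math:`(x_i, y_i)` and :math:`\tilde x_i` the conditionally perturbed version of
:math:`x_i`) is

.. math::
    \hat\psi_j^{CFI} = \frac{1}{n} \sum_{i=1}^{n} \Big[ \mathcal{L}\big(y_i, \mu(\tilde x_i)\big) - \mathcal{L}\big(y_i, \mu(x_i)\big) \Big].
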
|
The target quantity estimated by CFI is the Total Sobol Index (TSI), see
:ref:`total_sobol_index`. Indeed,

.. math::
    \frac{1}{2} \psi_j^{CFI}
    = \psi_j^{TSI}
    = \mathbb{E} [\mathcal{L}(y, \mu_{-j}(X_{-j}))] - \mathbb{E} [\mathcal{L}(y, \mu(X))],

where, in regression, :math:`\mu_{-j}(X_{-j}) = \mathbb{E}[Y | X_{-j}]` is the
theoretical model that does not use the :math:`j^{th}` feature.

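
To see where the factor :math:`\tfrac{1}{2}` comes from, here is a short sketch assuming a
squared loss and an additive noise model :math:`Y = \mu(X) + \epsilon` with
:math:`\epsilon` independent of :math:`X`. Because :math:`X_j^p` is drawn from
:math:`P(X_j | X_{-j})` independently of :math:`X_j`, the predictions :math:`\mu(X)` and
:math:`\mu(\tilde X)` are, conditionally on :math:`X_{-j}`, independent with the same
distribution and common conditional mean :math:`\mu_{-j}(X_{-j})`. Hence, since the noise
term cancels in the loss difference,

.. math::
    \psi_j^{CFI}
    = \mathbb{E}\big[(\mu(X) - \mu(\tilde X))^2\big]
    = 2\, \mathbb{E}\big[\operatorname{Var}(\mu(X) \mid X_{-j})\big]
    = 2\, \mathbb{E}\big[(\mu(X) - \mu_{-j}(X_{-j}))^2\big]
    = 2\, \psi_j^{TSI}.
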
Estimation procedure
--------------------

The estimation of CFI relies on the ability to sample the perturbed feature matrix
:math:`\tilde X`, and specifically to sample :math:`X_j^p` independently from the conditional
distribution, :math:`X_j^p \overset{\text{i.i.d.}}{\sim} P(X_j | X_{-j})`, while breaking the
association with the output :math:`Y`. Any conditional sampler can be used. A valid
and efficient approach is conditional permutation (:footcite:t:`Chamma_NeurIPS2023`).
This procedure decomposes the :math:`j^{th}` feature into a part that
is predictable from the other features and a residual term that is
independent of the other features:

.. math::
    X_j = \nu_j(X_{-j}) + \epsilon_j, \quad \text{with} \quad \epsilon_j \perp\!\!\!\perp X_{-j} \text{ and } \mathbb{E}[\epsilon_j] = 0.

Here :math:`\nu_j(X_{-j}) = \mathbb{E}[X_j | X_{-j}]` is the conditional expectation of
:math:`X_j` given the other features. In practice, :math:`\nu_j` is unknown and has to be
estimated from the data using a predictive model.

The perturbed feature :math:`X_j^p` is then generated by keeping the predictable part
:math:`\nu_j(X_{-j})` unchanged and replacing the residual :math:`\epsilon_j` with a
randomly permuted version :math:`\epsilon_j^p`:

.. math::
    X_j^p = \nu_j(X_{-j}) + \epsilon_j^p, \quad \text{with} \quad \epsilon_j^p \sim \text{Perm}(\epsilon_j).

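
As an illustration, here is a minimal sketch of this conditional permutation step for a
single feature, written with plain NumPy and scikit-learn rather than the hidimstat API
(the function name and the choice of a linear model for :math:`\nu_j` are illustrative;
for brevity the conditional model is fitted on the evaluation data itself, whereas in the
examples below it is fitted on the training split, and the permutation is typically
repeated and averaged)::

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def conditional_permutation_importance(model, X_test, y_test, j, rng):
        """Sketch of the CFI estimate for feature ``j`` with a squared loss."""
        X_minus_j = np.delete(X_test, j, axis=1)

        # 1. Model the predictable part nu_j(X_{-j}) = E[X_j | X_{-j}].
        nu_j = LinearRegression().fit(X_minus_j, X_test[:, j])
        predictable = nu_j.predict(X_minus_j)

        # 2. Permute the residuals, which are (approximately) independent of X_{-j}.
        residuals = X_test[:, j] - predictable
        X_perturbed = X_test.copy()
        X_perturbed[:, j] = predictable + rng.permutation(residuals)

        # 3. Importance = increase in loss under the conditional perturbation.
        loss_reference = mean_squared_error(y_test, model.predict(X_test))
        loss_perturbed = mean_squared_error(y_test, model.predict(X_perturbed))
        return loss_perturbed - loss_reference

    # Usage: importance_j = conditional_permutation_importance(
    #     model, X_test, y_test, j=0, rng=np.random.default_rng(0))
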
|
.. note:: **Estimation of** :math:`\nu_j`

    To generate the perturbed feature :math:`X_j^p`, a model for :math:`\nu_j` is required.
    Estimating :math:`\nu_j` amounts to modeling the relationship between features, which is
    arguably an easier task than estimating the relationship between the features and the
    target. This 'model-X' assumption is discussed, for instance, in
    :footcite:t:`Chamma_NeurIPS2023` and :footcite:t:`candes2018panning`.
    For example, in genetics, features such as single nucleotide polymorphisms (SNPs)
    are the basis of complex biological processes that result in an outcome (phenotype),
    such as a disease. Predicting the phenotype from SNPs is challenging, whereas
    modeling the relationships between SNPs is often easier due to known correlation
    structures in the genome (linkage disequilibrium). As a result, simple predictive
    models such as regularized linear models or decision trees can be used to estimate
    :math:`\nu_j`.

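
In the hidimstat examples below, the estimator used for :math:`\nu_j` on continuous
features is the one passed through the ``imputation_model_continuous`` argument. For
instance, swapping in a cross-validated ridge regression is a one-line change (a sketch;
``model`` stands for an already-fitted predictive model, as in the regression example
below)::

    from sklearn.linear_model import RidgeCV
    from hidimstat import CFI

    # `model` is an already-fitted predictive model, as in the regression example below.
    cfi = CFI(estimator=model, imputation_model_continuous=RidgeCV())

Any other scikit-learn-compatible regressor can be substituted in the same way.
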
Inference
---------
Under standard assumptions, such as an additive noise model :math:`Y = \mu(X) + \epsilon`,
Conditional Feature Importance (CFI) allows for conditional independence testing, which
determines whether a feature provides any unique information for the model's predictions
that is not already captured by the other features. Essentially, we test whether the
output is independent of the studied feature given the rest of the input:

.. math::
    \mathcal{H}_0: Y \perp\!\!\!\perp X_j | X_{-j}.

The core of this inference is to test the statistical significance of the loss
differences estimated by CFI. Consequently, a one-sample test on the loss differences
(or a paired test on the losses) needs to be performed.

Two technical challenges arise in this context:

* When cross-validation (for instance, k-fold) is used to estimate CFI, the loss
  differences obtained from different folds are not independent. Consequently,
  performing a simple t-test on the loss differences is not valid. This issue can be
  addressed by a corrected t-test accounting for this dependence, such as the one
  proposed in :footcite:t:`nadeau1999inference`; a sketch of this correction is given
  after this list.
* Vanishing variance: under the null hypothesis, the loss difference converges to zero,
  but its variance also vanishes because the importance is a quadratic functional
  (:footcite:t:`verdinelli2024feature`). This makes the standard one-sample t-test
  invalid. This second issue can be handled by correcting the variance estimate or by
  using other nonparametric tests.

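
As an illustration of the first point, here is a minimal sketch of the corrected
resampled t-test of :footcite:t:`nadeau1999inference` applied to per-split loss
differences (the function name and arguments are illustrative, not the hidimstat
implementation; ``n_train`` and ``n_test`` are the sizes of the training and test parts
of each split)::

    import numpy as np
    from scipy import stats

    def corrected_ttest(loss_diff_per_split, n_train, n_test):
        """One-sided corrected resampled t-test on per-split CFI loss differences."""
        d = np.asarray(loss_diff_per_split, dtype=float)
        k = d.shape[0]
        # The naive variance factor 1/k is replaced by 1/k + n_test/n_train to account
        # for the dependence induced by overlapping training sets across splits.
        corrected_var = (1.0 / k + n_test / n_train) * d.var(ddof=1)
        t_stat = d.mean() / np.sqrt(corrected_var)
        p_value = stats.t.sf(t_stat, df=k - 1)  # H1: the importance is positive
        return t_stat, p_value

The second challenge (the vanishing variance under the null) is not addressed by this
correction and requires the variance adjustments or nonparametric alternatives mentioned
above.
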
Regression example
------------------
The following example illustrates the use of CFI on a regression task::

    >>> from sklearn.datasets import make_regression
    >>> from sklearn.linear_model import LinearRegression
    >>> from sklearn.model_selection import train_test_split
    >>> from hidimstat import CFI

    >>> X, y = make_regression(n_features=2)
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y)
    >>> model = LinearRegression().fit(X_train, y_train)

    >>> cfi = CFI(estimator=model, imputation_model_continuous=LinearRegression())
    >>> cfi = cfi.fit(X_train, y_train)
    >>> features_importance = cfi.importance(X_test, y_test)

Classification example
----------------------
To measure feature importance in a classification task, a classification loss should be
used. In addition, the prediction method of the estimator should output the corresponding
type of prediction (probabilities or classes). The following example illustrates the use
of CFI on a classification task::

    >>> from sklearn.datasets import make_classification
    >>> from sklearn.ensemble import RandomForestClassifier
    >>> from sklearn.linear_model import LinearRegression
    >>> from sklearn.metrics import log_loss
    >>> from sklearn.model_selection import train_test_split
    >>> from hidimstat import CFI

    >>> X, y = make_classification(n_features=4)
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y)
    >>> model = RandomForestClassifier().fit(X_train, y_train)
    >>> cfi = CFI(
    ...     estimator=model,
    ...     imputation_model_continuous=LinearRegression(),
    ...     loss=log_loss,
    ...     method="predict_proba",
    ... )
    >>> cfi = cfi.fit(X_train, y_train)
    >>> features_importance = cfi.importance(X_test, y_test)

References
----------
.. footbibliography::