[DOC] Section 1 of user guide/definition of concepts #408

@@ -3,4 +3,128 @@

======================
Definition of concepts
======================

Variable Importance
-------------------

Global Variable Importance (VI) aims to assign a measure of relevance to each
feature :math:`X^j` with respect to a target :math:`y` in the data-generating
process. In Machine Learning, it can be seen as a measure of how much a
variable contributes to the predictive power of a model. We can then define
"important" variables as those whose absence degrades the model's performance
:footcite:p:`Covert2020`.

If ``VI`` is a variable importance method, ``X`` a matrix of input variables,
and ``y`` the target variable, the importance of all the variables can be
estimated as follows:

.. code-block:: python

    # instantiate the variable importance method
    vi = VI()
    # fit the models used internally by the method
    vi.fit(X, y)
    # compute the importance scores and the associated p-values
    importance = vi.importance(X, y)
    # the per-feature importance is also stored as a fitted attribute
    importance = vi.importances_

This allows us to rank the variables from most to least important.

Here, ``VI`` can be a variable importance method implemented in HiDimStat,
such as Leave One Covariate Out :class:`hidimstat.LOCO` (other methods will
support the same API soon).
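
As a concrete illustration, here is a minimal sketch of this pattern with
:class:`hidimstat.LOCO` on synthetic data. The exact constructor arguments
(in particular the ``estimator`` argument) are assumptions made for the sake
of the example; refer to the API reference for the exact signature.

.. code-block:: python

    # minimal sketch: the LOCO constructor arguments are assumptions,
    # check the hidimstat.LOCO API reference for the exact signature
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from hidimstat import LOCO

    # toy regression data
    X, y = make_regression(n_samples=200, n_features=10, random_state=0)

    # base learner whose loss degradation defines the importance
    estimator = LinearRegression().fit(X, y)

    vi = LOCO(estimator)  # assumed constructor argument
    vi.fit(X, y)  # fit the leave-one-covariate-out submodels
    importance = vi.importance(X, y)  # importance scores and p-values
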

Variable Selection
------------------

(Controlled) Variable selection is then the next step: it consists in
selecting the significant features in a way that provides statistical
guarantees, e.g. control of the type-I error or of the False Discovery Rate
(FDR).

For example, if we want to select the variables with a p-value lower than
a threshold ``p``, we can do:

.. code-block:: python

    # select the variables whose p-value is below the threshold p
    vi.selection(threshold_pvalue=p)

This step is important to make insightful discoveries. Even though variable
importance provides a ranking, the estimation step introduces uncertainty, so
statistical control is needed to perform a reliable selection.
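
In practice, the threshold often comes from a multiple-testing procedure. As a
generic illustration (independent of the HiDimStat API), the sketch below
applies the Benjamini-Hochberg procedure to an array of p-values with
``statsmodels`` in order to control the FDR:

.. code-block:: python

    # generic illustration: select features from an array of p-values
    # (one per feature) while controlling the FDR at level 0.1
    import numpy as np
    from statsmodels.stats.multitest import multipletests

    pvalues = np.array([0.001, 0.20, 0.03, 0.47, 0.004])
    rejected, pvalues_corrected, _, _ = multipletests(
        pvalues, alpha=0.1, method="fdr_bh"
    )
    selected_features = np.where(rejected)[0]  # indices of the selected features
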

Variable Selection vs Variable Importance
-----------------------------------------

In the literature, there is a gap between *variable selection* and *variable
importance*, as most methods are dedicated to one of these goals exclusively
:footcite:p:`reyerolobo2025principledapproachcomparingvariable`. For instance,
Conditional Feature Importance (:class:`hidimstat.CFI`) typically serves only
as a measure of importance without offering statistical guarantees, whereas
Model-X Knockoffs (:class:`hidimstat.model_x_knockoff`) generally provide
selection but little beyond that. For this reason, we have adapted the methods
to provide both types of information while preserving their standard names.

Types of VI methods
-------------------

There are two main types of VI methods implemented in HiDimStat:

1. Marginal methods: these methods assign importance to all the features that
   are related to the output, even if this relation is caused by spurious
   correlation. They are implicitly related to testing whether
   :math:`X^j\perp\!\!\!\!\perp Y`.

Comment on lines +79 to +80:
Collaborator (Author): Maybe that sounds better?
Collaborator: It is because they do not directly test whether X is independent of Y; they are variable importance measures, not just selection procedures. That is why I would say that implicitly they are related to this testing, but they do not consist of this testing.
Collaborator (Author): Ok, makes sense!

   An example of such methods is Leave One Covariate In (LOCI).

2. Conditional methods: these methods assign importance only to features that
   provide exclusive information beyond what is already captured by the
   others, i.e., they contribute unique knowledge. They are related to
   Conditional Independence Testing, which consists in testing whether
   :math:`X^j\perp\!\!\!\!\perp Y\mid X^{-j}`. Examples of such methods are
   :class:`hidimstat.LOCO` and :class:`hidimstat.CFI`.

Generally, conditional methods address the issue of false positives that often
arise with marginal methods, which may assign importance to variables simply
because they are correlated with truly important ones. By focusing on unique
contributions, conditional methods help preserve parsimony, yielding a smaller
and more meaningful subset of important features. However, in certain cases,
the distinction between marginal and conditional methods can be more subtle;
see
:ref:`sphx_glr_generated_gallery_examples_plot_conditional_vs_marginal_xor_data.py`.
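
As a toy, hypothetical illustration of this difference (independent of the
HiDimStat API), consider two strongly correlated features of which only one
truly drives the response: a marginal score flags both, while a conditional,
LOCO-style score (the drop in :math:`R^2` when a feature is removed) only
flags the truly informative one.

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 1000
    x1 = rng.normal(size=n)
    x2 = x1 + 0.5 * rng.normal(size=n)  # x2 is a noisy copy of x1
    y = x1 + rng.normal(size=n)         # only x1 drives y
    X = np.column_stack([x1, x2])

    # marginal view: both features are strongly correlated with y
    marginal = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(2)]

    # conditional (LOCO-style) view: drop in R^2 when removing one feature
    full_r2 = LinearRegression().fit(X, y).score(X, y)
    conditional = []
    for j in range(2):
        X_minus_j = np.delete(X, j, axis=1)
        reduced_r2 = LinearRegression().fit(X_minus_j, y).score(X_minus_j, y)
        conditional.append(full_r2 - reduced_r2)

    # marginal scores are both around 0.6-0.7, whereas the conditional score
    # is close to 0.1 for x1 and close to 0 for x2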

High dimension and correlation
------------------------------

In high-dimensional and highly correlated settings, estimation becomes
particularly challenging, as it is difficult to clearly distinguish important
features from unimportant ones. For such problems, a preliminary filtering
step can be applied to avoid duplicate or redundant input features, or,
alternatively, one can consider grouping them :footcite:p:`Chamma_AAAI2024`.
Grouping consists of treating together features that represent the same
underlying concept. This approach extends naturally to many methods,
for example :class:`hidimstat.CFI`.
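
As a hypothetical sketch of such a preliminary step (plain scikit-learn, not
the HiDimStat API), correlated features can be clustered into groups with
``FeatureAgglomeration``; the resulting cluster labels can then serve as a
group definition for a grouped importance analysis. Whether and how such a
dictionary is passed to a HiDimStat method is an assumption here; see the API
reference of the method you use.

.. code-block:: python

    # group features by similarity before running a (grouped) importance method
    from sklearn.cluster import FeatureAgglomeration
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=200, n_features=50, random_state=0)

    # cluster the 50 features into 10 groups
    agglo = FeatureAgglomeration(n_clusters=10).fit(X)

    # dictionary mapping a group name to the indices of its features
    groups = {
        f"group_{k}": [j for j, label in enumerate(agglo.labels_) if label == k]
        for k in range(10)
    }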

Statistical Inference
---------------------

Given the variability inherent in estimation, it is necessary to apply
statistical control to the discoveries made. Simply selecting the most
important features without such control is not valid. Different forms of
guarantees can be employed, such as controlling the type-I error or the
False Discovery Rate. This step is directly related to the task of
Variable Selection.
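
To see why a ranking alone is not enough, here is a hypothetical pure-noise
simulation (independent of HiDimStat): with many features and few samples, the
top-ranked feature can look quite "important" even though no feature carries
any signal, which is exactly what a p-value or FDR-controlling procedure
guards against.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_features = 50, 200

    # pure noise: no feature is related to the target
    X = rng.normal(size=(n_samples, n_features))
    y = rng.normal(size=n_samples)

    # marginal "importance" of each feature: absolute correlation with y
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])

    # the best-ranked feature typically reaches a correlation around 0.4,
    # purely by chance, so selecting the top of the ranking without
    # statistical control would be a false discovery
    print(scores.max())
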

Collaborator (Author): I might be very wrong, but isn't this section somewhat redundant with the Variable Selection section? Could it be incorporated into the Variable Selection section?
Collaborator: Yes, but I am not sure how. Indeed, it is important to make explicit that the power of the library is to provide statistical guarantees too.

References
----------

.. footbibliography::