From b2a620d50d7036192b7cd9ed2f6791940a02c4cc Mon Sep 17 00:00:00 2001
From: Himanshu Aggarwal
Date: Mon, 15 Sep 2025 12:50:52 +0200
Subject: [PATCH 1/8] init definition of concepts

---
 docs/src/concepts.rst     | 47 ++++++++++++++++++++++++++++++++++++++-
 docs/tools/references.bib | 11 +++++++++
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/docs/src/concepts.rst b/docs/src/concepts.rst
index 30524faea..3a2a45f8b 100644
--- a/docs/src/concepts.rst
+++ b/docs/src/concepts.rst
@@ -3,4 +3,49 @@
 ======================
 Definition of concepts
-======================
\ No newline at end of file
+======================
+
+Variable Importance
+-------------------
+
+Variable Importance (VI) is a measure of how much a variable contributes to
+the predictive power of a model. We can then define "important" variables
+as those whose absence degrades the model's performance
+:footcite:p:`Covert2020`.
+
+So if ``VI`` is a variable importance method, ``X`` a variable matrix, and ``y``
+the target variable, the importance of a variable can be estimated as follows:
+
+.. code-block::
+
+    # instantiate the object
+    vi = VI()
+    # fit the models used by the method
+    vi.fit(X, y)
+    # compute the importance and the pvalues
+    importance = vi.importance(X, y)
+    # get importance for each feature
+    importance = vi.importances_
+    # get pvalues
+    pvalue = vi.pvalues_
+
+
+(Controlled) Variable Selection
+-------------------------------
+
+Variable selection is then the next step that entails selecting the
+significant features in a way that provides statistical guarantees,
+e.g., control of the type-I error or of the False Discovery Rate (FDR).
+
+So, if we want to select the variables with a p-value lower than a threshold
+``p``, we can do:
+
+.. code-block::
+
+    # select the variables whose p-value is below the threshold
+    vi.selection(threshold_pvalue=p)
+
+References
+----------
+
+.. footbibliography::

diff --git a/docs/tools/references.bib b/docs/tools/references.bib
index fe04dc12f..75b4aa026 100644
--- a/docs/tools/references.bib
+++ b/docs/tools/references.bib
@@ -135,6 +135,16 @@ @article{chevalier_statistical_2020
   year = {2020}
 }
 
+@inproceedings{Covert2020,
+  title = {Understanding {{Global Feature Contributions With Additive Importance Measures}}},
+  booktitle = {Advances in {{Neural Information Processing Systems}}},
+  author = {Covert, Ian and Lundberg, Scott M. and Lee, Su-In},
+  year = {2020},
+  volume = {33},
+  pages = {17212--17223},
+  publisher = {Curran Associates, Inc.}
+}
+
 @article{eshel2003yule,
   author = {Eshel, Gidon},
   journal = {Internet resource},
@@ -348,3 +358,4 @@ @article{zhang2014confidence
   volume = {76},
   year = {2014}
 }
+

From 240d499ebaa7cc10e68fb30eb7258178da328c53 Mon Sep 17 00:00:00 2001
From: Himanshu Aggarwal
Date: Mon, 15 Sep 2025 14:27:05 +0200
Subject: [PATCH 2/8] point to specific VI classes

---
 docs/src/concepts.rst | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/src/concepts.rst b/docs/src/concepts.rst
index 3a2a45f8b..ff2d6da53 100644
--- a/docs/src/concepts.rst
+++ b/docs/src/concepts.rst
@@ -29,6 +29,9 @@ the target variable, the importance of a variable can be estimated as follows:
     # get pvalues
     pvalue = vi.pvalues_
 
+Here, ``VI`` can be any variable importance method implemented in HiDimStat,
+such as :class:`hidimstat.LOCO`, :class:`hidimstat.CFI`, :class:`hidimstat.PFI`,
+:class:`hidimstat.D0CRT`, etc.
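+
+For instance, a minimal sketch with :class:`hidimstat.D0CRT` (assuming it
+follows the generic ``VI`` API above; any required constructor arguments are
+omitted here and may differ in practice) could look like:
+
+.. code-block::
+
+    from hidimstat import D0CRT
+
+    # hypothetical: assumes D0CRT follows the generic ``VI`` pattern above
+    vi = D0CRT()
+    vi.fit(X, y)
+    vi.importance(X, y)
+    # one importance score and one p-value per feature
+    print(vi.importances_)
+    print(vi.pvalues_)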
 
 (Controlled) Variable Selection
 -------------------------------

From 9a710906251f1cb97c463e351339f4cbd5839dde Mon Sep 17 00:00:00 2001
From: Himanshu Aggarwal
Date: Mon, 15 Sep 2025 14:36:21 +0200
Subject: [PATCH 3/8] only d0crt works rn

---
 docs/src/concepts.rst | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/docs/src/concepts.rst b/docs/src/concepts.rst
index ff2d6da53..d4c617292 100644
--- a/docs/src/concepts.rst
+++ b/docs/src/concepts.rst
@@ -29,9 +29,8 @@ the target variable, the importance of a variable can be estimated as follows:
     # get pvalues
     pvalue = vi.pvalues_
 
-Here, ``VI`` can be any variable importance method implemented in HiDimStat,
-such as :class:`hidimstat.LOCO`, :class:`hidimstat.CFI`, :class:`hidimstat.PFI`,
-:class:`hidimstat.D0CRT`, etc.
+Here, ``VI`` can be a variable importance method implemented in HiDimStat,
+such as :class:`hidimstat.D0CRT`.
 
 (Controlled) Variable Selection
 -------------------------------

From 0eb1d1de2782470f70ed8913fe94d69cb13bd805 Mon Sep 17 00:00:00 2001
From: Himanshu Aggarwal
Date: Mon, 15 Sep 2025 14:43:34 +0200
Subject: [PATCH 4/8] section on types of variable imp methods

---
 docs/src/concepts.rst | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/docs/src/concepts.rst b/docs/src/concepts.rst
index d4c617292..0afead946 100644
--- a/docs/src/concepts.rst
+++ b/docs/src/concepts.rst
@@ -47,6 +47,32 @@ So, if we want to select the variables with a p-value lower than a threshold
     # select the variables whose p-value is below the threshold
     vi.selection(threshold_pvalue=p)
 
+
+Types of Variable Importance methods
+------------------------------------
+
+There are two main types of variable importance methods implemented in
+HiDimStat:
+
+1. Conditional methods: these methods estimate the importance of a variable
+   conditionally to all the other variables. Examples of such methods are
+   :class:`hidimstat.LOCO` and :class:`hidimstat.CFI`.
+
+2. Marginal methods: these methods estimate the importance of a variable
+   marginally to all the other variables. Examples of such methods are
+   :class:`hidimstat.PFI` and :class:`hidimstat.D0CRT`.
+
+The main difference between these two types of methods is that conditional
+methods are more computationally expensive but they can handle correlated
+variables better than marginal methods :footcite:p:`Covert2020`.
+
+In particular, marginal methods can be too conservative when variables are
+highly correlated, leading to a loss of power in the variable selection step.
+However, marginal methods are more scalable to high-dimensional datasets
+and they can be used when the number of samples is smaller than the number of
+variables, which is not the case for conditional methods.
+
+
 References
 ----------

From 84aadca0da73979bd3501a041b698aaf15a02c89 Mon Sep 17 00:00:00 2001
From: Himanshu Aggarwal
Date: Mon, 15 Sep 2025 14:54:46 +0200
Subject: [PATCH 5/8] minor

---
 docs/src/concepts.rst | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/docs/src/concepts.rst b/docs/src/concepts.rst
index 0afead946..2760ec92f 100644
--- a/docs/src/concepts.rst
+++ b/docs/src/concepts.rst
@@ -30,7 +30,8 @@ the target variable, the importance of a variable can be estimated as follows:
     pvalue = vi.pvalues_
 
 Here, ``VI`` can be a variable importance method implemented in HiDimStat,
-such as :class:`hidimstat.D0CRT`.
+such as :class:`hidimstat.D0CRT` (other methods will support the same API
+soon).
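+
+As a short, hypothetical illustration (``feature_names`` is an illustrative
+list of column names, not part of the API), the fitted attributes can be
+used to rank the variables:
+
+.. code-block::
+
+    import numpy as np
+
+    # indices of the features, sorted from most to least important
+    ranking = np.argsort(vi.importances_)[::-1]
+    for idx in ranking:
+        print(feature_names[idx], vi.importances_[idx])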
 
 (Controlled) Variable Selection
 -------------------------------
@@ -47,12 +48,10 @@ So, if we want to select the variables with a p-value lower than a threshold
     # select the variables whose p-value is below the threshold
     vi.selection(threshold_pvalue=p)
 
+Types of VI methods
+-------------------
 
-Types of Variable Importance methods
-------------------------------------
-
-There are two main types of variable importance methods implemented in
-HiDimStat:
+There are two main types of VI methods implemented in HiDimStat:
 
 1. Conditional methods: these methods estimate the importance of a variable
    conditionally to all the other variables. Examples of such methods are
    :class:`hidimstat.LOCO` and :class:`hidimstat.CFI`.
@@ -72,7 +71,6 @@ However, marginal methods are more scalable to high-dimensional datasets
 and they can be used when the number of samples is smaller than the number of
 variables, which is not the case for conditional methods.
 
-
 References
 ----------

From c5f4c3ac7aeaff347c71923fe058c962acdd29d2 Mon Sep 17 00:00:00 2001
From: angelReyero
Date: Mon, 15 Sep 2025 17:15:51 +0200
Subject: [PATCH 6/8] definition

---
 docs/src/concepts.rst | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/docs/src/concepts.rst b/docs/src/concepts.rst
index 2760ec92f..25112e223 100644
--- a/docs/src/concepts.rst
+++ b/docs/src/concepts.rst
@@ -8,13 +8,16 @@ Definition of concepts
 Variable Importance
 -------------------
 
-Variable Importance (VI) is a measure of how much a variable contributes to
-the predictive power of a model. We can then define "important" variables
-as those whose absence degrades the model's performance
-:footcite:p:`Covert2020`.
+Variable Importance (VI) aims to assign a measure of
+relevance to each feature :math:`X^j` with respect to a target :math:`y` in the
+data-generating process. In Machine Learning, it can be seen as a measure
+of how much a variable contributes to the predictive power of a model. We
+can then define "important" variables as those whose absence degrades
+the model's performance :footcite:p:`Covert2020`.
 
 So if ``VI`` is a variable importance method, ``X`` a variable matrix, and ``y``
-the target variable, the importance of a variable can be estimated as follows:
+the target variable, the importance of all the variables
+can be estimated as follows:
 
 .. code-block::
 
@@ -26,8 +29,8 @@ the target variable, the importance of a variable can be estimated as follows:
     importance = vi.importance(X, y)
     # get importance for each feature
    importance = vi.importances_
-    # get pvalues
-    pvalue = vi.pvalues_
+
+It allows us to rank the variables from most to least important.
 
 Here, ``VI`` can be a variable importance method implemented in HiDimStat,
 such as :class:`hidimstat.D0CRT` (other methods will support the same API
@@ -71,6 +74,21 @@ However, marginal methods are more scalable to high-dimensional datasets
 and they can be used when the number of samples is smaller than the number of
 variables, which is not the case for conditional methods.
 
+
+High-dimensionality and correlation
+-----------------------------------
+
+Problem: with high-dimension
+
+Solution: prior filtering of redundant variables or considering grouping. Brief definition of grouping.
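+
+As a rough sketch of the prior-filtering idea (the ``0.95`` threshold and
+the greedy scheme below are illustrative choices, not a HiDimStat API):
+
+.. code-block::
+
+    import numpy as np
+
+    # X is assumed to be a NumPy array of shape (n_samples, n_features);
+    # absolute correlation between every pair of features
+    corr = np.abs(np.corrcoef(X, rowvar=False))
+    # greedily keep a feature only if it is not almost perfectly
+    # correlated with a feature that was already kept
+    keep = []
+    for j in range(X.shape[1]):
+        if all(corr[j, k] < 0.95 for k in keep):
+            keep.append(j)
+    X_filtered = X[:, keep]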
+
+
+
+Statistical Inference
+---------------------
+
+
+
 References
 ----------

From e0bb2386626847cdaa720e85bf327f97e7b60a78 Mon Sep 17 00:00:00 2001
From: angelReyero
Date: Tue, 16 Sep 2025 16:30:58 +0200
Subject: [PATCH 7/8] Statistical Inference and concept description

---
 docs/src/concepts.rst | 71 ++++++++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 24 deletions(-)

diff --git a/docs/src/concepts.rst b/docs/src/concepts.rst
index 25112e223..83e0fcb01 100644
--- a/docs/src/concepts.rst
+++ b/docs/src/concepts.rst
@@ -8,7 +8,7 @@ Definition of concepts
 Variable Importance
 -------------------
 
-Variable Importance (VI) aims to assign a measure of
+Global Variable Importance (VI) aims to assign a measure of
 relevance to each feature :math:`X^j` with respect to a target :math:`y` in the
 data-generating process. In Machine Learning, it can be seen as a measure
@@ -33,60 +33,83 @@ can be estimated as follows:
 
 It allows us to rank the variables from most to least important.
 
 Here, ``VI`` can be a variable importance method implemented in HiDimStat,
-such as :class:`hidimstat.D0CRT` (other methods will support the same API
+such as :class:`hidimstat.LOCO` (other methods will support the same API
 soon).
 
-(Controlled) Variable Selection
+Variable Selection
 -------------------------------
 
-Variable selection is then the next step that entails selecting the
+(Controlled) Variable selection is then the next step that entails selecting the
 significant features in a way that provides statistical guarantees,
 e.g., control of the type-I error or of the False Discovery Rate (FDR).
 
-So, if we want to select the variables with a p-value lower than a threshold
-``p``, we can do:
+For example, if we want to select the variables with a p-value lower than
+a threshold ``p``, we can do:
 
 .. code-block::
 
     # select the variables whose p-value is below the threshold
     vi.selection(threshold_pvalue=p)
 
+This step is important to make insightful discoveries. Even if variable
+importance provides a ranking, due to the estimation step, we need
+statistical control to do reliable selection.
+
+
+
 
 Types of VI methods
 -------------------
 
 There are two main types of VI methods implemented in HiDimStat:
 
-1. Conditional methods: these methods estimate the importance of a variable
-   conditionally to all the other variables. Examples of such methods are
-   :class:`hidimstat.LOCO` and :class:`hidimstat.CFI`.
+1. Marginal methods: these methods provide importance to all the features
+that are related to the output, even if it is caused by spurious correlation. They
+are related to testing if :math:`X^j\perp\!\!\!\!\perp Y`.
+Example of such methods is LOCI.
 
-2. Marginal methods: these methods estimate the importance of a variable
-   marginally to all the other variables. Examples of such methods are
-   :class:`hidimstat.PFI` and :class:`hidimstat.D0CRT`.
+2. Conditional methods: these methods assign importance only to features that
+provide exclusive information beyond what is already captured by the others,
+i.e., they contribute unique knowledge. They are related to Conditional
+Independence Testing, which consists in testing if
+:math:`X^j\perp\!\!\!\!\perp Y\mid X^{-j}`. Examples of such methods are
+:class:`hidimstat.LOCO` and :class:`hidimstat.CFI`.
 
-The main difference between these two types of methods is that conditional
-methods are more computationally expensive but they can handle correlated
-variables better than marginal methods :footcite:p:`Covert2020`.
-In particular, marginal methods can be too conservative when variables are
-highly correlated, leading to a loss of power in the variable selection step.
-However, marginal methods are more scalable to high-dimensional datasets
-and they can be used when the number of samples is smaller than the number of
-variables, which is not the case for conditional methods.
+Generally, conditional methods address the issue of false positives that often
+arise with marginal methods, which may assign importance to variables just
+because they are correlated with truly important ones. By focusing on unique
+contributions, conditional methods help preserve parsimony, yielding a smaller
+and more meaningful subset of important features. However, in certain cases, the
+distinction between marginal and conditional methods can be more subtle. See
+:ref:`sphx_glr_generated_gallery_examples_plot_conditional_vs_marginal_xor_data.py`.
 
-High-dimensionality and correlation
+High-dimension and correlation
 -----------------------------------
 
-Problem: with high-dimension
-
-Solution: prior filtering of redundant variables or considering grouping. Brief definition of grouping.
+In high-dimensional and highly correlated settings, estimation becomes
+particularly challenging, as it is difficult to clearly distinguish important
+features from unimportant ones. For such problems, a preliminary filtering step
+can be applied to avoid having duplicate or redundant input features, or
+alternatively, one can consider grouping them :footcite:p:`Chamma_AAAI2024`.
+Grouping consists of jointly treating features that represent the same
+underlying concept. This approach extends naturally to many methods,
+for example :class:`hidimstat.CFI`.
 
 
 Statistical Inference
 ---------------------
 
+Given the variability inherent in estimation, it is necessary to apply
+statistical control to the discoveries made. Simply selecting the most important
+features without such control is not valid. Different forms of guarantees can
+be employed, such as controlling the type-I error or the False Discovery Rate.
+This step is directly related to the task of Variable Selection.
 
 
 
 References

From 2b9e618ef59c0678068f2d494ebe77cb57d03711 Mon Sep 17 00:00:00 2001
From: angelReyero
Date: Wed, 17 Sep 2025 12:21:48 +0200
Subject: [PATCH 8/8] Add explicit information about the gap between variable
 importance and selection

---
 docs/src/concepts.rst | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/docs/src/concepts.rst b/docs/src/concepts.rst
index 83e0fcb01..849577936 100644
--- a/docs/src/concepts.rst
+++ b/docs/src/concepts.rst
@@ -33,11 +33,11 @@ can be estimated as follows:
 It allows us to rank the variables from most to least important.
 
 Here, ``VI`` can be a variable importance method implemented in HiDimStat,
-such as :class:`hidimstat.LOCO` (other methods will support the same API
+such as Leave One Covariate Out :class:`hidimstat.LOCO` (other methods will support the same API
 soon).
 
 Variable Selection
 -------------------
@@ -55,6 +55,18 @@ This step is important to make insightful discoveries. Even if variable
 importance provides a ranking, due to the estimation step, we need
 statistical control to do reliable selection.
 
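+
+As a hypothetical end-to-end sketch (the Bonferroni-style threshold below is
+one classical way to control the family-wise error rate; it is only an
+illustrative choice, not a HiDimStat recommendation):
+
+.. code-block::
+
+    import numpy as np
+
+    # Bonferroni correction: divide the target level by the number of tests
+    p = 0.05 / X.shape[1]
+    vi.selection(threshold_pvalue=p)
+    # indices of the features whose p-value passes the corrected threshold
+    selected = np.where(vi.pvalues_ < p)[0]
+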
+Variable Selection vs Variable Importance
+------------------------------------------
+
+In the literature, there is a gap between *variable selection* and
+*variable importance*, as most methods are dedicated to one of these goals
+exclusively :footcite:p:`reyerolobo2025principledapproachcomparingvariable`.
+For instance, Conditional Feature Importance (:class:`hidimstat.CFI`) typically
+serves only as a measure of importance without offering statistical guarantees,
+whereas Model-X Knockoffs (:class:`hidimstat.model_x_knockoff`) generally
+provide selection but little beyond that. For this reason, we have adapted the
+methods to provide both types of information while preserving their standard
+names.
 
 Types of VI methods
 -------------------
 
 There are two main types of VI methods implemented in HiDimStat:
 
 1. Marginal methods: these methods provide importance to all the features
 that are related to the output, even if it is caused by spurious correlation. They
 are related to testing if :math:`X^j\perp\!\!\!\!\perp Y`.
-Example of such methods is LOCI.
+An example of such a method is Leave One Covariate In (LOCI).
 
 2. Conditional methods: these methods assign importance only to features that
 provide exclusive information beyond what is already captured by the others,
 i.e., they contribute unique knowledge. They are related to Conditional
 
 High-dimension and correlation
 -------------------------------
 
 In high-dimensional and highly correlated settings, estimation becomes
 particularly challenging, as it is difficult to clearly distinguish important