diff --git a/docs/src/concepts.rst b/docs/src/concepts.rst
index 30524faea..849577936 100644
--- a/docs/src/concepts.rst
+++ b/docs/src/concepts.rst
@@ -3,4 +3,128 @@
 
 ======================
 Definition of concepts
-======================
\ No newline at end of file
+======================
+
+Variable Importance
+-------------------
+
+Global Variable Importance (VI) aims to assign a measure of relevance to each
+feature :math:`X^j` with respect to a target :math:`y` in the data-generating
+process. In Machine Learning, it can be seen as a measure of how much a
+variable contributes to the predictive power of a model. We can then define
+"important" variables as those whose absence degrades the model's performance
+:footcite:p:`Covert2020`.
+
+So if ``VI`` is a variable importance method, ``X`` a feature matrix and ``y``
+the target variable, the importance of all the variables can be estimated as
+follows:
+
+.. code-block::
+
+    # instantiate the method
+    vi = VI()
+    # fit the models used internally by the method
+    vi.fit(X, y)
+    # compute the importance scores and the p-values
+    importance = vi.importance(X, y)
+    # the per-feature importances are also stored as an attribute
+    importance = vi.importances_
+
+This allows us to rank the variables from most to least important.
+
+Here, ``VI`` can be a variable importance method implemented in HiDimStat,
+such as Leave One Covariate Out (:class:`hidimstat.LOCO`); other methods will
+support the same API soon.
+
+Variable Selection
+------------------
+
+(Controlled) Variable selection is then the next step: it consists in selecting
+the significant features in a way that provides statistical guarantees, e.g. on
+the type-I error or the False Discovery Rate (FDR).
+
+For example, if we want to select the variables with a p-value lower than a
+threshold ``p``, we can do:
+
+.. code-block::
+
+    # select the features whose p-value is below the threshold
+    vi.selection(threshold_pvalue=p)
+
+This step is important to make insightful discoveries. Even though variable
+importance provides a ranking, the estimation step introduces variability, so
+statistical control is needed for reliable selection.
+
+Variable Selection vs Variable Importance
+-----------------------------------------
+
+In the literature, there is a gap between *variable selection* and
+*variable importance*, as most methods are dedicated to one of these goals
+exclusively :footcite:p:`reyerolobo2025principledapproachcomparingvariable`.
+For instance, Conditional Feature Importance (:class:`hidimstat.CFI`) typically
+serves only as a measure of importance without offering statistical guarantees,
+whereas Model-X Knockoffs (:class:`hidimstat.model_x_knockoff`) generally
+provide selection but little beyond that. For this reason, we have adapted the
+methods to provide both types of information while preserving their standard
+names.
+
+Types of VI methods
+-------------------
+
+There are two main types of VI methods implemented in HiDimStat:
+
+1. Marginal methods: these methods assign importance to all the features that
+   are related to the output, even if this relation is caused by a spurious
+   correlation. They are related to testing whether
+   :math:`X^j \perp\!\!\!\!\perp Y`. An example of such a method is
+   Leave One Covariate In (LOCI).
+
+2. Conditional methods: these methods assign importance only to features that
+   provide exclusive information beyond what is already captured by the others,
+   i.e., they contribute unique knowledge. They are related to Conditional
+   Independence Testing, which consists in testing whether
+   :math:`X^j \perp\!\!\!\!\perp Y \mid X^{-j}`. Examples of such methods are
+   :class:`hidimstat.LOCO` and :class:`hidimstat.CFI`.
+
+Generally, conditional methods address the issue of false positives that often
+arise with marginal methods, which may assign importance to variables just
+because they are correlated with truly important ones. By focusing on unique
+contributions, conditional methods help preserve parsimony, yielding a smaller
+and more meaningful subset of important features. However, in certain cases, the
+distinction between marginal and conditional methods can be more subtle. See
+:ref:`sphx_glr_generated_gallery_examples_plot_conditional_vs_marginal_xor_data.py`.
+
+High-dimension and correlation
+------------------------------
+
+In high-dimensional and highly correlated settings, estimation becomes
+particularly challenging, as it is difficult to clearly distinguish important
+features from unimportant ones. For such problems, a preliminary filtering step
+can be applied to avoid duplicate or redundant input features, or,
+alternatively, one can consider grouping them :footcite:p:`Chamma_AAAI2024`.
+Grouping consists of treating together features that represent the same
+underlying concept. This approach extends naturally to many methods,
+for example :class:`hidimstat.CFI`.
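+
+To make the grouping idea concrete, here is a rough sketch reusing the generic
+``VI`` placeholder from above. The ``groups`` argument and the group
+definitions below are purely illustrative assumptions, not the actual
+:class:`hidimstat.CFI` signature; refer to the API documentation for the exact
+parameter name and expected format.
+
+.. code-block::
+
+    # hypothetical grouping: map each underlying concept to its columns
+    groups = {
+        "demographics": [0, 1],      # e.g. age, sex
+        "blood_markers": [2, 3, 4],  # correlated lab measurements
+        "imaging": [5, 6],
+    }
+    # assumed keyword argument; the actual signature may differ
+    vi = VI(groups=groups)
+    vi.fit(X, y)
+    # one importance score (and p-value) per group instead of per feature
+    group_importance = vi.importance(X, y)
+
+The intended benefit is that the importance of a shared concept is no longer
+diluted across near-duplicate columns: a whole block of redundant features is
+perturbed or removed together, which makes estimation easier in the correlated
+setting described above.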
+
+Statistical Inference
+---------------------
+
+Given the variability inherent in estimation, it is necessary to apply
+statistical control to the discoveries made. Simply selecting the most
+important features without such control is not valid. Different forms of
+guarantees can be employed, such as controlling the type-I error or the False
+Discovery Rate. This step is directly related to the task of Variable
+Selection.
+
+References
+----------
+
+.. footbibliography::
diff --git a/docs/tools/references.bib b/docs/tools/references.bib
index 1bcedb8e8..08993a569 100644
--- a/docs/tools/references.bib
+++ b/docs/tools/references.bib
@@ -143,6 +143,16 @@ @article{chevalier_statistical_2020
   year = {2020}
 }
 
+@inproceedings{Covert2020,
+  title = {Understanding {{Global Feature Contributions With Additive Importance Measures}}},
+  booktitle = {Advances in {{Neural Information Processing Systems}}},
+  author = {Covert, Ian and Lundberg, Scott M and Lee, Su-In},
+  year = {2020},
+  volume = {33},
+  pages = {17212--17223},
+  publisher = {Curran Associates, Inc.}
+}
+
 @article{eshel2003yule,
   author = {Eshel, Gidon},
   journal = {Internet resource},
@@ -368,3 +378,4 @@ @article{zhang2014confidence
   volume = {76},
   year = {2014}
 }
+