docs/src/high_dimension.rst
* powerless: As dimensionality and correlation increase, it becomes harder and harder to isolate the contribution of each variable, meaning that conditional inference is ill-posed.
This is illustrated in the above example, where the Desparsified Lasso struggles
to identify relevant features. We need some data to start::
    n_samples = 100
    shape = (40, 40)

    # generating the data
    from hidimstat._utils.scenario import multivariate_simulation_spatial

    X_init, y, beta, epsilon = multivariate_simulation_spatial(
        n_samples=n_samples, shape=shape,  # other simulation arguments omitted here
    )
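With ``shape = (40, 40)``, the simulated images yield 40 × 40 = 1,600 pixel features for only 100 samples, which is exactly the regime where conditional inference on individual features loses power.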
As discussed earlier, feature grouping is a meaningful solution to deal with such cases: it reduces the number of features to condition on, and generally also decreases the level of correlation between features.
.. seealso::

   * The :ref:`Grouping documentation <grouping>`
As hinted in :footcite:t:`meinshausen2009pvalues`, an efficient way to deal with such a configuration is to take the per-group average of the features: this leads to a *reduced design*. After inference, all the features in a given group obtain the p-value of the group representative. When the inference engine is the Desparsified Lasso, the resulting method is called Clustered Desparsified Lasso, or **CluDL**.
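To make the *reduced design* idea concrete, here is a minimal, self-contained sketch in plain NumPy, with a hypothetical grouping of six features into three groups: columns are averaged within each group, inference runs on the compressed matrix, and every member of a group then inherits the p-value of its group::

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 6))           # toy design with 6 features
    groups = np.array([0, 0, 1, 1, 2, 2])   # hypothetical grouping into 3 groups

    # Reduced design: one column per group, holding the per-group average.
    n_groups = groups.max() + 1
    X_reduced = np.column_stack(
        [X[:, groups == g].mean(axis=1) for g in range(n_groups)]
    )
    print(X_reduced.shape)  # (100, 3): inference now conditions on 3 columns only

    # After inference on X_reduced, each original feature gets the p-value of
    # its group representative (the group-level p-values below are made up).
    pval_groups = np.array([0.01, 0.40, 0.73])
    pval_features = pval_groups[groups]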
Using the same example as before, we start by defining a clustering method that will perform the grouping. For image data, Ward clustering is a good default model::
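    # Ward clustering constrained to the image grid, so that each cluster
    # groups spatially contiguous (hence highly correlated) pixels.
    import numpy as np
    from sklearn.cluster import FeatureAgglomeration
    from sklearn.feature_extraction.image import grid_to_graph

    n_clusters = 200  # illustrative value: the reduced design has 200 columns
    connectivity = grid_to_graph(n_x=shape[0], n_y=shape[1])
    ward = FeatureAgglomeration(
        n_clusters=n_clusters, connectivity=connectivity, linkage="ward"
    )

    # CluDL itself: the two helpers below are *assumed* to live next to the
    # ensemble variants imported later on this page; their exact names,
    # arguments and return values should be checked against the hidimstat
    # API reference.
    from hidimstat.ensemble_clustered_inference import (
        clustered_inference,
        clustered_inference_pvalue,
    )

    ward_, beta_hat, theta_hat, precision_diag = clustered_inference(
        X_init, y, ward, n_clusters
    )
    beta_hat, pval, pval_corr, one_minus_pval, one_minus_pval_corr = (
        clustered_inference_pvalue(
            n_samples, None, ward_, beta_hat, theta_hat, precision_diag
        )
    )

    # Illustrative selection rule: keep features whose corrected p-value is
    # below the 5% level.
    selected_cdl = pval_corr < 0.05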
    print(f'Clustered Desparsified Lasso selected {np.sum(selected_cdl)} features among {np.sum(beta > 0)}')
Note that inference is also way faster on the compressed representation.
The issue is that very-high-dimensional data (biological, images, etc.) do not have any canonical grouping structure. Hence, the grouping has to be learned from the data, typically with a clustering technique. However, the resulting clusters bring some undesirable randomness: fitting the clustering on slightly different data would lead to different clusters. Since there is no globally optimal clustering, the wiser solution is to *average* the results across clusterings. Since it may not be a good idea to average p-values, an alternative *ensembling* or *aggregation* strategy is used instead. When the inference engine is the Desparsified Lasso, the resulting method is called Ensemble of Clustered Desparsified Lasso, or **EnCluDL**.
The behavior is illustrated here::
    from hidimstat.ensemble_clustered_inference import (
        ensemble_clustered_inference,
        ensemble_clustered_inference_pvalue,
    )

    # ensemble of clustered desparsified lasso (EnCluDL)
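    # Sketch only: the arguments and return values below are assumed to
    # parallel the CluDL calls above; each randomized clustering yields its
    # own group-level estimates, which are then aggregated into a single
    # corrected p-value (and selection) per feature.
    list_ward, list_beta_hat, list_theta_hat, list_precision_diag = (
        ensemble_clustered_inference(X_init, y, ward, n_clusters)
    )
    beta_hat, selected_encludl = ensemble_clustered_inference_pvalue(
        n_samples,
        None,
        list_ward,
        list_beta_hat,
        list_theta_hat,
        list_precision_diag,
    )

    print(f'EnCluDL selected {np.sum(selected_encludl)} features among {np.sum(beta > 0)}')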