
Commit 3eab43f

committed: updated doc
1 parent 155a0bb commit 3eab43f


docs/src/high_dimension.rst

Lines changed: 87 additions & 24 deletions
@@ -14,7 +14,7 @@ In some cases, data represent high-dimensional measurements of some phenomenon o
* powerless: As dimensionality and correlation increase, it becomes harder and harder to isolate the contribution of each variable, meaning that conditional inference is ill-posed.

This is illustrated in the above example, where the Desparsified Lasso struggles
to identify relevant features. We need some data to start::

    n_samples = 100
    shape = (40, 40)
@@ -24,48 +24,100 @@ to identify relevant features::
    # generating the data
    from hidimstat._utils.scenario import multivariate_simulation_spatial
    X_init, y, beta, epsilon = multivariate_simulation_spatial(
        n_samples, shape, roi_size, signal_noise_ratio=10., smooth_X=1
    )

Then we perform inference on this data using the Desparsified Lasso::

    from hidimstat.desparsified_lasso import (
        desparsified_lasso,
        desparsified_lasso_pvalue,
    )
    beta_hat, sigma_hat, precision_diagonal = desparsified_lasso(X_init, y)
    _, pval_corr, _, one_minus_pval_corr, cb_min, cb_max = (
        desparsified_lasso_pvalue(X_init.shape[0], beta_hat, sigma_hat, precision_diagonal)
    )

    # compute estimated support
    import numpy as np
    alpha = .05
    selected_dl = np.logical_or(pval_corr < alpha, one_minus_pval_corr < alpha)
    print(f'Desparsified Lasso selected {np.sum(selected_dl)} features among {np.sum(beta > 0)}')
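
Since each feature corresponds to a pixel of the 40 x 40 grid, it helps to look at the estimated support as an image. Below is a minimal sketch, assuming ``matplotlib`` is installed (it is not needed by the steps above); it simply reshapes the boolean support computed above::

    import matplotlib.pyplot as plt

    # map the flat support vector back onto the 2D simulation grid
    support_map = selected_dl.reshape(shape)
    plt.imshow(support_map, cmap="gray")
    plt.title("Support estimated by the Desparsified Lasso")
    plt.show()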

Feature Grouping and its shortcomings
-------------------------------------

As discussed earlier, feature grouping is a meaningful way to deal with such cases: it reduces the number of features to condition on, and generally also decreases the level of correlation between features.

.. seealso::

   * The :ref:`Grouping documentation <grouping>`

As hinted in :footcite:t:`meinshausen2009pvalues`, an efficient way to deal with such a configuration is to take the per-group average of the features: this leads to a *reduced design*. After inference, all the features in a given group obtain the p-value of the group representative. When the inference engine is the Desparsified Lasso, the resulting method is called Clustered Desparsified Lasso, or **CluDL**.
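
To make the *reduced design* idea concrete, here is a small, self-contained sketch of the principle (a toy illustration with hand-picked numbers, not the actual code path used by hidimstat)::

    import numpy as np

    # toy example: 4 features grouped into 2 groups
    X_toy = np.array([[1., 2., 10., 20.],
                      [3., 4., 30., 40.]])
    groups = np.array([0, 0, 1, 1])  # feature -> group labels

    # reduced design: one column per group, obtained by averaging its features
    X_reduced = np.column_stack(
        [X_toy[:, groups == g].mean(axis=1) for g in np.unique(groups)]
    )

    # after inference on the reduced design, each group-level p-value is
    # broadcast back to every feature of its group
    pval_groups = np.array([0.01, 0.5])
    pval_features = pval_groups[groups]  # [0.01, 0.01, 0.5, 0.5]
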
Using the same example as before, we start by defining a clustering method that will perform the grouping. For image data, Ward clustering is a good default choice::

    from sklearn.feature_extraction import image
    from sklearn.cluster import FeatureAgglomeration

    n_clusters = 200
    connectivity = image.grid_to_graph(n_x=shape[0], n_y=shape[1])
    ward = FeatureAgglomeration(
        n_clusters=n_clusters, connectivity=connectivity, linkage="ward"
    )

Equipped with this, we can use CluDL::

    from sklearn.preprocessing import StandardScaler
    from hidimstat.ensemble_clustered_inference import (
        clustered_inference,
        clustered_inference_pvalue,
    )
    ward_, beta_hat, theta_hat, omega_diag = clustered_inference(
        X_init, y, ward, n_clusters, scaler_sampling=StandardScaler()
    )
    _, _, pval_corr, _, one_minus_pval_corr = (
        clustered_inference_pvalue(X_init.shape[0], False, ward_, beta_hat, theta_hat, omega_diag)
    )

    # compute estimated support
    selected_cdl = np.logical_or(pval_corr < alpha, one_minus_pval_corr < alpha)
    print(f'Clustered Desparsified Lasso selected {np.sum(selected_cdl)} features among {np.sum(beta > 0)}')

Note that inference is also much faster on the compressed representation.

The issue is that very-high-dimensional data (biological, images, etc.) do not have any canonical grouping structure. Hence, one has to rely on a grouping obtained from the data, typically with a clustering technique. However, the resulting clusters bring some undesirable randomness: fitting the clustering on slightly different data would lead to different clusters. Since there is no globally optimal clustering, the wiser solution is to *average* the results across clusterings. Since it may not be a good idea to average p-values, an alternative *ensembling* or *aggregation* strategy is used instead. When the inference engine is the Desparsified Lasso, the resulting method is called Ensemble of Clustered Desparsified Lasso, or **EnCluDL**.
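
To give an idea of what such an aggregation can look like, here is a minimal sketch of the quantile-based rule of :footcite:t:`meinshausen2009pvalues` (shown only to convey the principle; the aggregation actually implemented in hidimstat may differ in its details)::

    import numpy as np

    def quantile_aggregate(pvals, gamma=0.5):
        # gamma-quantile of the p-values, rescaled by 1 / gamma and capped at 1
        return min(1.0, np.quantile(pvals, gamma) / gamma)

    # p-values obtained for one feature across five different clusterings
    pvals_across_clusterings = np.array([0.001, 0.004, 0.02, 0.3, 0.6])
    print(quantile_aggregate(pvals_across_clusterings))  # 0.04
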
The complete EnCluDL workflow is illustrated here::

    from hidimstat.ensemble_clustered_inference import (
        ensemble_clustered_inference,
        ensemble_clustered_inference_pvalue,
    )

    # ensemble of clustered desparsified lasso (EnCluDL)
    list_ward, list_beta_hat, list_theta_hat, list_omega_diag = (
        ensemble_clustered_inference(
            X_init,
            y,
            ward,
            n_clusters,
            scaler_sampling=StandardScaler(),
        )
    )
    beta_hat, selected_ecdl = ensemble_clustered_inference_pvalue(
        n_samples,
        False,
        list_ward,
        list_beta_hat,
        list_theta_hat,
        list_omega_diag,
        fdr=alpha,
    )
    print(f'Ensemble of Clustered Desparsified Lasso selected {np.sum(selected_ecdl)} features among {np.sum(beta > 0)}')

.. topic:: **Full example**

@@ -74,3 +126,14 @@ Example

What type of control does this Ensemble of Clustered inference come with?
--------------------------------------------------------------------------
The details of the method are described in :footcite:t:`chevalier2022spatially`.

References
----------
.. footbibliography::
