===========================
Inference in high dimension
===========================

Naive inference in high dimension is ill-posed
----------------------------------------------

In some cases, data represent high-dimensional measurements of some phenomenon of interest (e.g. imaging or genotyping). The common characteristic of these problems is that they are very high-dimensional and involve correlated features. Both aspects are clearly detrimental to conditional inference, making it both expensive and powerless:

* Expensive: most learners are quadratic or cubic in the number of features. Moreover, per-feature inference generally entails a loop over features.
* Powerless: as dimensionality and correlation increase, it becomes harder and harder to isolate the contribution of each variable, meaning that conditional inference is ill-posed (see the short sketch below).
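
The second point can be made concrete with a minimal, self-contained sketch (plain NumPy, independent of hidimstat): when features are numerous and correlated, the design matrix is rank-deficient, so the contribution of a single feature is not identifiable without extra assumptions::

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_features = 100, 1600
    # build strongly correlated features from a 10-dimensional latent signal
    latent = rng.standard_normal((n_samples, 10))
    X = latent[:, rng.integers(0, 10, size=n_features)]
    X += 0.1 * rng.standard_normal((n_samples, n_features))

    # the design has rank at most n_samples, far below n_features:
    # many different coefficient vectors explain the data equally well
    print(np.linalg.matrix_rank(X), "<<", n_features)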

This is illustrated in the following example, where the Desparsified Lasso
struggles to identify relevant features. We need some data to start::

    n_samples = 100
    shape = (40, 40)
    n_features = shape[1] * shape[0]
    roi_size = 4  # size of the edge of the four predictive regions

    # generating the data
    from hidimstat._utils.scenario import multivariate_simulation_spatial
    X_init, y, beta, epsilon = multivariate_simulation_spatial(
        n_samples, shape, roi_size, signal_noise_ratio=10., smooth_X=1
    )

Then we perform inference on this data using the Desparsified Lasso::

    from hidimstat.desparsified_lasso import (
        desparsified_lasso,
        desparsified_lasso_pvalue,
    )
    beta_hat, sigma_hat, precision_diagonal = desparsified_lasso(X_init, y)
    _, pval_corr, _, one_minus_pval_corr, cb_min, cb_max = (
        desparsified_lasso_pvalue(X_init.shape[0], beta_hat, sigma_hat, precision_diagonal)
    )
    # compute the estimated support
    import numpy as np
    alpha = .05
    selected_dl = np.logical_or(pval_corr < alpha, one_minus_pval_corr < alpha)
    print(f'Desparsified Lasso selected {np.sum(selected_dl)} features among {np.sum(beta > 0)}')

Feature Grouping and its shortcomings
-------------------------------------

As discussed earlier, feature grouping is a meaningful solution to deal with such cases: it reduces the number of features to condition on, and generally also decreases the level of correlation between features.

.. seealso::

   * The :ref:`Grouping documentation <grouping>`

As hinted in :footcite:t:`meinshausen2009pvalues`, an efficient way to deal with such configurations is to take the per-group average of the features: this leads to a *reduced design*. After inference, all the features in a given group obtain the p-value of the group representative. When the inference engine is the Desparsified Lasso, the resulting method is called Clustered Desparsified Lasso, or **CluDL**.
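
To make the notion of *reduced design* concrete, here is a minimal NumPy sketch (the ``reduce_design`` helper and the ``labels`` vector are hypothetical illustrations, not hidimstat objects): each column of the reduced design is the average of the original features belonging to one group::

    import numpy as np

    def reduce_design(X, labels):
        """Average the columns of X within each group given by labels."""
        groups = np.unique(labels)
        X_reduced = np.column_stack(
            [X[:, labels == g].mean(axis=1) for g in groups]
        )
        return X_reduced, groups

    # toy example: 6 features grouped into 3 clusters
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 6))
    labels = np.array([0, 0, 1, 1, 2, 2])
    X_reduced, groups = reduce_design(X, labels)
    print(X_reduced.shape)  # (5, 3)

After inference on the reduced design, the p-value obtained for a group is assigned to every original feature of that group.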

Using the same example as previously, we start by defining a clustering method that will perform the grouping. For image data, Ward clustering is a good default model; the connectivity matrix below constrains the agglomeration so that only spatially neighboring pixels can be merged into the same cluster::

    from sklearn.feature_extraction import image
    from sklearn.cluster import FeatureAgglomeration
    n_clusters = 200
    connectivity = image.grid_to_graph(n_x=shape[0], n_y=shape[1])
    ward = FeatureAgglomeration(
        n_clusters=n_clusters, connectivity=connectivity, linkage="ward"
    )
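
As a quick, optional check (not a step of the CluDL pipeline), fitting this object with the standard scikit-learn API shows that the connectivity constraint yields spatially contiguous groups::

    # purely illustrative: inspect the grouping produced by the Ward object
    labels = ward.fit(X_init).labels_     # cluster label of each pixel
    print(labels.reshape(shape)[:5, :5])  # neighboring pixels tend to share labels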

Equipped with this, we can use CluDL::

    from sklearn.preprocessing import StandardScaler
    from hidimstat.ensemble_clustered_inference import (
        clustered_inference,
        clustered_inference_pvalue,
    )
    ward_, beta_hat, theta_hat, omega_diag = clustered_inference(
        X_init, y, ward, n_clusters, scaler_sampling=StandardScaler()
    )
    _, _, pval_corr, _, one_minus_pval_corr = (
        clustered_inference_pvalue(X_init.shape[0], False, ward_, beta_hat, theta_hat, omega_diag)
    )

    # compute the estimated support
    selected_cdl = np.logical_or(pval_corr < alpha, one_minus_pval_corr < alpha)
    print(f'Clustered Desparsified Lasso selected {np.sum(selected_cdl)} features among {np.sum(beta > 0)}')

Note that inference is also much faster on the compressed representation.

The issue is that very high-dimensional data (biological, images, etc.) do not have any canonical grouping structure. Hence, the grouping has to be obtained from the data, typically with a clustering technique. However, the resulting clusters bring some undesirable randomness: fitting the clustering on slightly different data would lead to different clusters. Since there is no globally optimal clustering, the wiser solution is to *average* the results across clusterings. Since it may not be a good idea to average p-values directly, an alternative *ensembling* or *aggregation* strategy is used instead. When the inference engine is the Desparsified Lasso, the resulting method is called Ensemble of Clustered Desparsified Lasso, or **EnCluDL**.
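
To give an intuition of what *aggregation* can mean, here is a minimal sketch of one classical rule, quantile aggregation in the spirit of :footcite:t:`meinshausen2009pvalues`; this is only an illustration, not necessarily the exact rule implemented in hidimstat::

    import numpy as np

    def aggregate_pvalues(pvals, gamma=0.5):
        """Quantile-aggregate p-values obtained from several clusterings.

        pvals has shape (n_clusterings, n_features); gamma is the quantile level.
        """
        quantile = np.quantile(pvals, gamma, axis=0)
        return np.minimum(1.0, quantile / gamma)

    # toy example: 5 clusterings, 3 features
    rng = np.random.default_rng(0)
    pvals = rng.uniform(size=(5, 3))
    print(aggregate_pvalues(pvals))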

The behavior is illustrated here; the target error rate ``fwer_target`` is set to an arbitrary illustrative value::

    from hidimstat.ensemble_clustered_inference import (
        ensemble_clustered_inference,
        ensemble_clustered_inference_pvalue,
    )

    # ensemble of clustered desparsified lasso (EnCluDL)
    list_ward, list_beta_hat, list_theta_hat, list_omega_diag = (
        ensemble_clustered_inference(
            X_init,
            y,
            ward,
            n_clusters,
            scaler_sampling=StandardScaler(),
        )
    )
    fwer_target = 0.1  # target error rate (illustrative value)
    beta_hat, selected_ecdl = ensemble_clustered_inference_pvalue(
        n_samples,
        False,
        list_ward,
        list_beta_hat,
        list_theta_hat,
        list_omega_diag,
        fdr=fwer_target,
    )
    print(f'Ensemble of Clustered Desparsified Lasso selected {np.sum(selected_ecdl)} features among {np.sum(beta > 0)}')

.. topic:: **Full example**

   See the following example for a full file running the analysis:
   :ref:`sphx_glr_generated_gallery_examples_plot_2D_simulation_example.py`

What type of control does Ensemble of Clustered Inference come with?
----------------------------------------------------------------------

Ensemble of Clustered Inference is not a local method, so control cannot be maintained at each brain site in isolation.
The notion of a false positive must be mitigated by the non-local character of the inference performed.
Thus, we introduce the concept of a :math:`\delta`-false positive:

A detection is a :math:`\delta`-false positive if it lies at a distance greater than :math:`\delta` from the support, i.e. the set of truly relevant features.
Thus, what is controlled is the :math:`\delta`-FWER, i.e. the probability of reporting at least one :math:`\delta`-false positive.

In other words, EnCluDL will likely only report detections at a distance less than :math:`\delta` from the true support.

What is :math:`\delta`? It is the diameter of the clusters used in the CluDL procedure.
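
To make this concrete, here is a small illustrative check (plain NumPy/SciPy, not part of hidimstat) that counts, for the selection obtained above, the detections lying farther than :math:`\delta` pixels from the true support; the value of ``delta`` below is arbitrary::

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    # distance (in pixels) from every pixel to the true support
    support = (beta > 0).reshape(shape)
    dist_to_support = distance_transform_edt(~support)

    delta = 4  # illustrative tolerance, e.g. the typical cluster diameter in pixels
    detections = np.asarray(selected_ecdl, dtype=bool).reshape(shape)
    n_delta_fp = np.sum(detections & (dist_to_support > delta))
    print(f"{n_delta_fp} delta-false positives at tolerance {delta}")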

The details of the method and the underlying guarantees are described in :footcite:t:`chevalier2022spatially`.

References
----------

.. footbibliography::