
Commit 11c7c79

DOC add redundancy
1 parent bdfdab7 commit 11c7c79

File tree

5 files changed

+250 -79 lines changed


doc/multioutput.rst

Lines changed: 1 addition & 1 deletion
@@ -6,4 +6,4 @@
 Multioutput feature selection
 ==============================
 
-We can use :class:`FastCan` for multioutput feature selection.
+We can use :class:`FastCan` to handle multioutput feature selection.
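
A minimal sketch of the multioutput case, using only the :class:`FastCan` API shown
elsewhere in these docs (the data and the two-column target below are illustrative
assumptions, not part of the documentation):

    import numpy as np
    from fastcan import FastCan

    rng = np.random.default_rng(0)
    X = rng.random((100, 5))
    # Two-column target matrix: each output depends on different features.
    Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])

    # Fit against the multioutput target; selected indices are in selector.indices_.
    selector = FastCan(n_features_to_select=4, verbose=0).fit(X, Y)
    print(selector.indices_)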

doc/redundancy.rst

Lines changed: 27 additions & 1 deletion
@@ -6,4 +6,30 @@
 Feature redundancy
 ==================
 
-:class:`FastCan` can effectively skip the redundant features.
+:class:`FastCan` can effectively skip linearly redundant features.
+Here, a feature :math:`x_r\in \mathbb{R}^{N\times 1}` is said to be linearly
+redundant to a set of features :math:`X\in \mathbb{R}^{N\times n}` if
+:math:`x_r` can be obtained from an affine transformation of :math:`X`, given by
+
+.. math::
+    x_r = Xa + b
+
+where :math:`a\in \mathbb{R}^{n\times 1}` and :math:`b\in \mathbb{R}^{N\times 1}`.
+In other words, the feature can be acquired by a linear transformation of :math:`X`,
+i.e. :math:`Xa`, and a translation, i.e. :math:`+b`.
+
+This capability of :class:`FastCan` benefits from the
+`Modified Gram-Schmidt <https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process>`_ process,
+which produces large rounding errors when linearly redundant features appear.
+
+.. rubric:: References
+
+* `"Canonical-correlation-based fast feature selection for structural
+  health monitoring" <https://doi.org/10.1016/j.ymssp.2024.111895>`_
+  Zhang, S., Wang, T., Worden, K., Sun, L., & Cross, E. J.
+  Mechanical Systems and Signal Processing, 223:111895 (2025).
+
+.. rubric:: Examples
+
+* See :ref:`sphx_glr_auto_examples_plot_redundancy.py` for an example of
+  feature selection on datasets with redundant features.
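
A minimal sketch of this behaviour, using only the :class:`FastCan` API shown in these
docs (the data and coefficients below are illustrative assumptions):

    import numpy as np
    from fastcan import FastCan

    rng = np.random.default_rng(0)
    X = rng.random((100, 4))
    # Make column 3 linearly redundant: an affine transformation of columns 0 and 1.
    X[:, 3] = X[:, [0, 1]] @ np.array([0.5, 2.0]) + 1.0
    # The target depends on columns 0 and 1 (dependent informative) and 2 (independent).
    y = X[:, 0] + X[:, 1] + X[:, 2]

    # Selecting three features should skip the redundant column 3.
    selector = FastCan(n_features_to_select=3, verbose=0).fit(X, y)
    print(selector.indices_)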

doc/unsupervised.rst

Lines changed: 9 additions & 52 deletions
@@ -10,72 +10,29 @@ We can use :class:`FastCan` to do unsupervised feature selection.
 The unsupervised application of :class:`FastCan` tries to select features that
 maximize the sum of the squared canonical correlation (SSC) with the principal
 components (PCs) acquired from PCA (principal component analysis) of the feature
-matrix :math:`X`.
+matrix :math:`X`. See the example below.
 
 >>> from sklearn.decomposition import PCA
 >>> from sklearn import datasets
 >>> from fastcan import FastCan
 >>> iris = datasets.load_iris()
 >>> X = iris["data"]
->>> y = iris["target"]
->>> f_names = iris["feature_names"]
->>> t_names = iris["target_names"]
 >>> pca = PCA(n_components=2)
 >>> X_pcs = pca.fit_transform(X)
->>> selector = FastCan(n_features_to_select=2, verbose=0)
->>> selector.fit(X, X_pcs[:, :2])
+>>> selector = FastCan(n_features_to_select=2, verbose=0).fit(X, X_pcs[:, :2])
 >>> selector.indices_
 array([2, 1], dtype=int32)
 
 .. note::
     There is no guarantee that this unsupervised :class:`FastCan` will select
     the optimal subset of the features, which has the highest SSC with PCs,
     because :class:`FastCan` selects features in a greedy manner, which may lead to
-    suboptimal results. See the following plots.
+    suboptimal results.
 
-.. plot::
-    :context: close-figs
-    :align: center
-
-    from itertools import combinations
-    import matplotlib.pyplot as plt
-    from sklearn.cross_decomposition import CCA
-
-    def ssc(X, y):
-        """Sum of the squared canonical correlation coefficients.
-        Parameters
-        ----------
-        X : array-like of shape (n_samples, n_features)
-            Feature matrix.
-
-        y : array-like of shape (n_samples, n_outputs)
-            Target matrix.
-
-        Returns
-        -------
-        ssc : float
-            Sum of the squared canonical correlation coefficients.
-        """
-        n_components = min(X.shape[1], y.shape[1])
-        cca = CCA(n_components=n_components)
-        X_c, y_c = cca.fit_transform(X, y)
-        corrcoef = np.diagonal(
-            np.corrcoef(X_c, y_c, rowvar=False),
-            offset=n_components
-        )
-        return sum(corrcoef**2)
-
-    comb = list(combinations([0, 1, 2, 3], 2))
-    fig, axs = plt.subplots(ncols=3, nrows=2, figsize=(8, 6), layout="constrained")
-    for i in range(2):
-        for j in range(3):
-            f1_idx = comb[i*3+j][0]
-            f2_idx = comb[i*3+j][1]
-            score = ssc(X[:, [f1_idx, f2_idx]], X_pcs)
-            scatter = axs[i, j].scatter(X[:, f1_idx], X[:, f2_idx], c=y)
-            axs[i, j].set(xlabel=f_names[f1_idx], ylabel=f_names[f2_idx])
-            axs[i, j].set_title(f"SSC: {score:.3f}")
-    for spine in axs[1, 0].spines.values():
-        spine.set_edgecolor('red')
-    _ = axs[1, 2].legend(scatter.legend_elements()[0], t_names, loc="lower right")
+.. rubric:: References
 
+* `"Automatic Selection of Optimal Structures for Population-Based
+  Structural Health Monitoring" <https://doi.org/10.1007/978-3-031-34946-1_10>`_
+  Wang, T., Worden, K., Wagg, D.J., Cross, E.J., Maguire, A.E., Lin, W.
+  In: Conference Proceedings of the Society for Experimental Mechanics Series.
+  Springer, Cham. (2023).
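
For a quick check of the greedy result, one could score every two-feature subset
against the PCs. A minimal sketch, assuming the ``ssc`` helper that this commit imports
from ``fastcan`` in examples/plot_speed.py, and reusing ``X`` and ``X_pcs`` from the
doctest above:

    from itertools import combinations
    from fastcan import ssc

    # SSC of every 2-feature subset of the iris features with the first two PCs.
    scores = {
        pair: ssc(X[:, list(pair)], X_pcs)
        for pair in combinations(range(X.shape[1]), 2)
    }
    best_pair = max(scores, key=scores.get)  # compare with selector.indices_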

examples/plot_redundancy.py

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
"""
===================================================
Feature selection performance on redundant features
===================================================

In this example, we compare the performance of feature selectors on
datasets that contain redundant features.
Here, four types of features should be distinguished:

* Useless features: the features do not contribute to the target
* Dependent informative features: the features contribute to the target and form
  the redundant features
* Redundant features: the features are constructed by a linear transformation of
  the dependent informative features
* Independent informative features: the features contribute to the target but
  do not contribute to the redundant features

.. note::
    If we do not distinguish dependent and independent informative features, and
    instead use the informative features to form both the target and the redundant
    features, the task will be much easier.
"""

# Authors: Sikai Zhang
# SPDX-License-Identifier: MIT

# %%
# Define dataset generator
# ------------------------

import numpy as np


def make_redundant(
    n_samples,
    n_features,
    dep_info_ids,
    indep_info_ids,
    redundant_ids,
    random_seed,
):
    """Make a dataset with linearly redundant features.

    Parameters
    ----------
    n_samples : int
        The number of samples.

    n_features : int
        The number of features.

    dep_info_ids : list[int]
        The indices of dependent informative features.

    indep_info_ids : list[int]
        The indices of independent informative features.

    redundant_ids : list[int]
        The indices of redundant features.

    random_seed : int
        Random seed.

    Returns
    -------
    X : array-like of shape (n_samples, n_features)
        Feature matrix.

    y : array-like of shape (n_samples,)
        Target vector.
    """
    rng = np.random.default_rng(random_seed)
    info_ids = dep_info_ids + indep_info_ids
    n_dep_info = len(dep_info_ids)
    n_info = len(info_ids)
    n_redundant = len(redundant_ids)
    informative_coef = rng.random(n_info)
    redundant_coef = rng.random((n_dep_info, n_redundant))

    X = rng.random((n_samples, n_features))
    y = np.dot(X[:, info_ids], informative_coef)

    X[:, redundant_ids] = X[:, dep_info_ids] @ redundant_coef
    return X, y

# %%
# Define score function
# ---------------------
# This function is used to compute the number of correct features missed by selectors.
#
# * For independent informative features, selectors should select all of them
# * For dependent informative features, selectors only need to select any
#   ``n_dep_info``-combination of the set ``dep_info_ids`` + ``redundant_ids``. That
#   means if the indices of dependent informative features are :math:`[0, 1]` and the
#   indices of the redundant features are :math:`[5]`, then the correctly selected
#   indices can be any of :math:`[0, 1]`, :math:`[0, 5]`, and :math:`[1, 5]`.

def get_n_missed(
    dep_info_ids,
    indep_info_ids,
    redundant_ids,
    selected_ids
):
    """Get the number of correct features missed by a selector."""
    n_redundant = len(redundant_ids)
    n_missed_indep = len(np.setdiff1d(indep_info_ids, selected_ids))
    n_missed_dep = len(
        np.setdiff1d(dep_info_ids + redundant_ids, selected_ids)
    ) - n_redundant
    if n_missed_dep < 0:
        n_missed_dep = 0
    return n_missed_indep + n_missed_dep
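
# Illustrative sanity check of the rule described above (hypothetical selections,
# added for clarity): with dependent informative features [0, 1], no independent
# informative features, and redundant feature [5], selecting [1, 5] misses nothing,
# while selecting [1, 3] misses one feature.
assert get_n_missed([0, 1], [], [5], [1, 5]) == 0
assert get_n_missed([0, 1], [], [5], [1, 3]) == 1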

# %%
# Prepare selectors
# -----------------
# We compare :class:`fastcan.FastCan` with eight selectors of :mod:`sklearn`, which
# include the Select From Model (SFM) algorithm, the Recursive Feature Elimination
# (RFE) algorithm, the Sequential Feature Selection (SFS) algorithm, and the
# Select K Best (SKB) algorithm.
# The list of selectors is given below:
#
# * fastcan: the :class:`fastcan.FastCan` selector
# * skb_reg: the SKB algorithm ranking features with ANOVA (analysis of variance)
#   F-statistics and p-values
# * skb_mir: the SKB algorithm ranking features with mutual information for regression
# * sfm_lsvr: the SFM algorithm with a linear support vector regressor
# * sfm_rfr: the SFM algorithm with a random forest regressor
# * rfe_lsvr: the RFE algorithm with a linear support vector regressor
# * rfe_rfr: the RFE algorithm with a random forest regressor
# * sfs_lsvr: the forward SFS algorithm with a linear support vector regressor
# * sfs_rfr: the forward SFS algorithm with a random forest regressor


from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import (
    RFE,
    SelectFromModel,
    SelectKBest,
    SequentialFeatureSelector,
    f_regression,
    mutual_info_regression,
)
from sklearn.svm import LinearSVR

from fastcan import FastCan

lsvr = LinearSVR(C=1, dual="auto", max_iter=100000, random_state=0)
rfr = RandomForestRegressor(n_estimators=10, random_state=0)


N_SAMPLES = 1000
N_FEATURES = 10
DEP_INFO_IDS = [2, 4, 7, 9]
INDEP_INFO_IDS = [0, 1, 6]
REDUNDANT_IDS = [5, 8]
N_SELECTED = len(DEP_INFO_IDS + INDEP_INFO_IDS)
N_REPEATED = 10

selector_dict = {
    "fastcan": FastCan(N_SELECTED, verbose=0),
    "skb_reg": SelectKBest(f_regression, k=N_SELECTED),
    "skb_mir": SelectKBest(mutual_info_regression, k=N_SELECTED),
    "sfm_lsvr": SelectFromModel(lsvr, max_features=N_SELECTED, threshold=-np.inf),
    "sfm_rfr": SelectFromModel(rfr, max_features=N_SELECTED, threshold=-np.inf),
    "rfe_lsvr": RFE(lsvr, n_features_to_select=N_SELECTED, step=1),
    "rfe_rfr": RFE(rfr, n_features_to_select=N_SELECTED, step=1),
    "sfs_lsvr": SequentialFeatureSelector(lsvr, n_features_to_select=N_SELECTED, cv=2),
    "sfs_rfr": SequentialFeatureSelector(rfr, n_features_to_select=N_SELECTED, cv=2),
}

# %%
# Run test
# --------

N_SELECTORS = len(selector_dict)
n_missed = np.zeros((N_REPEATED, N_SELECTORS), dtype=int)

for i in range(N_REPEATED):
    X, y = make_redundant(
        n_samples=N_SAMPLES,
        n_features=N_FEATURES,
        dep_info_ids=DEP_INFO_IDS,
        indep_info_ids=INDEP_INFO_IDS,
        redundant_ids=REDUNDANT_IDS,
        random_seed=i,
    )
    for j, selector in enumerate(selector_dict.values()):
        result_ids = selector.fit(X, y).get_support(indices=True)
        n_missed[i, j] = get_n_missed(
            dep_info_ids=DEP_INFO_IDS,
            indep_info_ids=INDEP_INFO_IDS,
            redundant_ids=REDUNDANT_IDS,
            selected_ids=result_ids,
        )

# %%
# Plot results
# ------------
# :class:`fastcan.FastCan` correctly selects all informative features with zero missed
# features.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
rects = ax.bar(selector_dict.keys(), n_missed.sum(0), width=0.5)
ax.bar_label(rects, n_missed.sum(0), padding=3)
plt.xlabel("Selector")
plt.ylabel("No. of missed features")
plt.title("Performance of selectors on datasets with linearly redundant features")
plt.show()

examples/plot_speed.py

Lines changed: 2 additions & 25 deletions
@@ -37,35 +37,12 @@
 # canonical correlation coefficients may be more than one, the feature ranking
 # criterion used here is the sum of the squared canonical correlation coefficients.
 
-from sklearn.cross_decomposition import CCA
-
-def ssc(X, y):
-    """Sum of the squared canonical correlation coefficients.
-    Parameters
-    ----------
-    X : array-like of shape (n_samples, n_features)
-        Feature matrix.
-
-    y : array-like of shape (n_samples, n_outputs)
-        Target matrix.
-
-    Returns
-    -------
-    ssc : float
-        Sum of the squared canonical correlation coefficients.
-    """
-    n_components = min(X.shape[1], y.shape[1])
-    cca = CCA(n_components=n_components)
-    X_c, y_c = cca.fit_transform(X, y)
-    corrcoef = np.diagonal(
-        np.corrcoef(X_c, y_c, rowvar=False),
-        offset=n_components
-    )
-    return sum(corrcoef**2)
+from fastcan import ssc
 
 
 def baseline(X, y, t):
     """Baseline method using CCA from sklearn.
+
     Parameters
     ----------
     X : array-like of shape (n_samples, n_features)