
Commit fae8244

updating chapter on squashingscaler
1 parent 6aec6ee commit fae8244

File tree

4 files changed: +182 -12 lines changed

content/chapters/05_feat_eng_numerical.qmd

Lines changed: 75 additions & 12 deletions
@@ -10,21 +10,84 @@ format:
---

Now that we can apply selections to any column we want thanks to `ApplyToCols` and
the selectors, it is time to scale numerical features safely.

## Numerical features with outliers

When dealing with numerical features that contain outliers (including infinite
values), standard scaling methods can be problematic. Outliers can dramatically
affect the centering and scaling of the entire dataset, causing the scaled
inliers to be compressed into a narrow range.

Consider this example:

```{python}
from helpers import (
    generate_data_with_outliers,
    plot_feature_with_outliers,
)

values = generate_data_with_outliers()
plot_feature_with_outliers(values)
```

In this case, most of the values are in the range `[-2, 2]`, but there are some
large outliers, roughly in the range `[-50, 50]`, that can cause issues when the
feature needs to be scaled.

### Regular scalers and their limitations

The **StandardScaler** centers with the mean and scales by the standard
deviation, both computed over all values. With outliers present, these
statistics become unreliable: the outliers inflate the standard deviation, so
after dividing by it the inlier values are squashed into a narrow band.

The **RobustScaler** uses quantiles (by default the 25th and 75th percentiles)
instead of the mean and standard deviation, which makes it more resistant to
outliers. However, it does not bound the output values, so extreme outliers can
still produce very large scaled values.
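The effect is easy to demonstrate with a small sketch (the toy array below is made up for illustration; only NumPy and scikit-learn are assumed):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Nine inliers spread over [-2, 2] plus one large outlier (illustrative data).
data = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.0],
                 [0.5], [1.0], [1.5], [2.0], [1000.0]])

standard_scaled = StandardScaler().fit_transform(data)
robust_scaled = RobustScaler().fit_transform(data)

# The outlier inflates the standard deviation, so after standard scaling
# the nine inliers collapse into a tiny interval.
inlier_spread_standard = standard_scaled[:9].max() - standard_scaled[:9].min()

# Quantile-based scaling preserves the spread of the inliers, but the
# outlier itself is still mapped to a very large value.
inlier_spread_robust = robust_scaled[:9].max() - robust_scaled[:9].min()
```

Comparing `inlier_spread_standard` (close to zero) with `inlier_spread_robust` (order of one) makes the difference between the two scalers concrete, while `robust_scaled.max()` shows that the `RobustScaler` output is unbounded.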
### SquashingScaler: A robust solution

The `SquashingScaler` combines robust centering with smooth clipping to handle
outliers effectively. It works in two stages:

#### Stage 1: Robust scaling

- Centers the median to zero
- Scales using quantile-based statistics (by default, the interquartile range)
- For columns where the quantiles are equal, falls back to a custom min-max scaling
- For columns with constant values, fills with zeros

#### Stage 2: Soft clipping

- Applies a smooth squashing function to the robustly scaled value $z$:
  $x_{\text{out}} = \frac{z}{\sqrt{1 + (z/B)^2}}$
  where $B$ is the `max_absolute_value` parameter
- Constrains all values to the range
  $[-\texttt{max\_absolute\_value}, \texttt{max\_absolute\_value}]$ (default: 3)
- Maps infinite values to the corresponding boundaries
- Preserves NaN values unchanged
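As an illustration of Stage 2, here is a minimal NumPy sketch of the soft-clipping step (a simplified stand-in, not skrub's actual implementation, which also handles the per-column robust scaling of Stage 1):

```python
import numpy as np

def soft_clip(z, B=3.0):
    """Smoothly squash already-scaled values into [-B, B].

    Computes x_out = z / sqrt(1 + (z / B)**2); infinities map
    exactly onto the boundaries -B and +B, and NaN stays NaN.
    """
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    finite = np.isfinite(z)
    out[finite] = z[finite] / np.sqrt(1.0 + (z[finite] / B) ** 2)
    out[np.isposinf(z)] = B
    out[np.isneginf(z)] = -B
    out[np.isnan(z)] = np.nan
    return out

values = np.array([0.0, 1.0, -2.0, 50.0, np.inf, -np.inf, np.nan])
squashed = soft_clip(values)
# Small values pass through almost unchanged, large values approach +/-3,
# infinities land exactly on the bounds, and NaN is preserved.
```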

### Key advantages of `SquashingScaler`

The `SquashingScaler` has several advantages over traditional scalers:

- **Outlier-resistant**: outliers do not affect the scaling of the inliers,
  unlike with the `StandardScaler`.
- **Bounded output**: all values stay in a predictable range, which is ideal for
  neural networks and linear models.
- **Handles edge cases**: the scaler works with infinite values and constant
  columns.
- **Preserves missing data**: NaN values are kept unchanged.

A disadvantage of the `SquashingScaler` is that it is **non-invertible**:
extreme values are squashed toward the boundaries (and infinities land exactly
on them), so the original values cannot be exactly recovered from the
transformed ones.
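The practical non-invertibility can be seen numerically. The squashing function $x_{\text{out}} = z/\sqrt{1 + (z/B)^2}$ has an algebraic inverse $z = x/\sqrt{1 - (x/B)^2}$ for $|x| < B$, but near the boundary a tiny change in the squashed value corresponds to an enormous change in the original, and $x = \pm B$ has no finite preimage at all. A small sketch, using plain NumPy with the default $B = 3$:

```python
import numpy as np

B = 3.0  # default max_absolute_value

def squash(z):
    return z / np.sqrt(1.0 + (z / B) ** 2)

def unsquash(x):
    # Algebraic inverse; only defined for |x| < B.
    return x / np.sqrt(1.0 - (x / B) ** 2)

# Moderate values round-trip fine...
roundtrip_error = abs(unsquash(squash(2.0)) - 2.0)

# ...but wildly different large inputs squash to nearly the same output,
# so the inverse cannot distinguish them in floating point.
gap = abs(squash(1e9) - squash(1e6))
```

Here `roundtrip_error` is essentially zero, while `gap` shows that inputs differing by a factor of a thousand become numerically indistinguishable after squashing.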
## Comparison with other scalers

When compared on data with outliers:

- **StandardScaler** compresses the inliers, because the outliers inflate the scaling statistics
- **RobustScaler** preserves the relative scale of the inliers but allows extreme outlier values
- **SquashingScaler** keeps the inliers in a reasonable range while smoothly bounding all values

If we plot the impact of each scaler on the result, this is what we can see:

```{python}
from helpers import scale_feature_and_plot

scale_feature_and_plot(values)
```
Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
from .generate_synthetic_data import *
from .plot_squashing_scaler import *
File renamed without changes.
Lines changed: 105 additions & 0 deletions

@@ -0,0 +1,105 @@
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler, StandardScaler
from skrub import SquashingScaler


def generate_data_with_outliers():
    np.random.seed(0)  # for reproducibility
    values = np.random.rand(100, 1)
    n_outliers = 15
    outlier_indices = np.random.choice(values.shape[0], size=n_outliers, replace=False)
    values[outlier_indices] = np.random.rand(n_outliers, 1) * 100 - 50
    return values


def plot_feature_with_outliers(values):
    """Plot a feature with outliers and annotate it."""
    x = np.arange(values.shape[0])
    fig, axs = plt.subplots(1, layout="constrained", figsize=(6, 4))

    axs.plot(x, values)
    _ = axs.set(title="Feature with outliers", ylabel="value", xlabel="Sample ID")
    axs.axhspan(-2, 2, color="gray", alpha=0.15)

    x_data, y_data = [30, 2]
    desc = "Data is mostly\nin [-2, 2]"
    axs.annotate(
        desc,
        xy=(x_data, y_data),
        xytext=(0.15, 0.8),
        textcoords="axes fraction",
        arrowprops=dict(arrowstyle="->", color="red"),
    )

    x_outlier, y_outlier = np.argmax(values), np.max(values)
    desc = "There are large\noutliers throughout."
    _ = axs.annotate(
        desc,
        xy=(x_outlier, y_outlier),
        xytext=(0.6, 0.85),
        textcoords="axes fraction",
        arrowprops=dict(arrowstyle="->", color="red"),
    )


def scale_feature_and_plot(values):
    squash_scaler = SquashingScaler()
    squash_scaled = squash_scaler.fit_transform(values)

    robust_scaler = RobustScaler()
    robust_scaled = robust_scaler.fit_transform(values)

    standard_scaler = StandardScaler()
    standard_scaled = standard_scaler.fit_transform(values)

    x = np.arange(values.shape[0])
    fig, axs = plt.subplots(1, 2, layout="constrained", figsize=(8, 5))

    ax = axs[0]
    ax.plot(x, sorted(values), label="Original Values", linewidth=2.5)
    ax.plot(x, sorted(squash_scaled), label="SquashingScaler")
    ax.plot(x, sorted(robust_scaled), label="RobustScaler", linestyle="--")
    ax.plot(x, sorted(standard_scaled), label="StandardScaler")

    # Add a horizontal band in [-4, +4]
    ax.axhspan(-4, 4, color="gray", alpha=0.15)
    ax.set(title="Original data", xlim=[0, values.shape[0]], xlabel="Percentile")
    ax.legend()

    ax = axs[1]
    ax.plot(x, sorted(values), label="Original Values", linewidth=2.5)
    ax.plot(x, sorted(squash_scaled), label="SquashingScaler")
    ax.plot(x, sorted(robust_scaled), label="RobustScaler", linestyle="--")
    ax.plot(x, sorted(standard_scaled), label="StandardScaler")

    ax.set(ylim=[-4, 4])
    ax.set(title="In range [-4, 4]", xlim=[0, values.shape[0]], xlabel="Percentile")

    # Highlight the bounds of the SquashingScaler
    ax.axhline(y=3, alpha=0.2)
    ax.axhline(y=-3, alpha=0.2)

    fig.suptitle(
        "Comparison of different scalers on sorted data with outliers", fontsize=20
    )
    fig.supylabel("Value")

    desc = "The RobustScaler is\naffected by outliers"
    axs[0].annotate(
        desc,
        xy=(0, -70),
        xytext=(0.4, 0.2),
        textcoords="axes fraction",
        arrowprops=dict(arrowstyle="->", color="red"),
    )

    desc = "The SquashingScaler is\nclipped to a finite value"
    _ = axs[1].annotate(
        desc,
        xy=(0, -3),
        xytext=(0.4, 0.2),
        textcoords="axes fraction",
        arrowprops=dict(arrowstyle="->", color="red"),
    )
