This repository presents the illustrations from the paper "A Comprehensive Survey on Imbalanced Regression: Definitions, Solutions, and Future Directions". The notebook used to obtain the results in the paper is made available, and the protocol is explained below.
Imbalanced regression is inherently influenced by multiple factors, including the level of imbalance, sample size, and the complexity of the regression task. This section aims to illustrate and analyze the phenomenon through a series of simulations by varying these characteristics. We assess the sensitivity of predictive performance to the following factors:
- Sample size — 5 ordered levels: [200, 500, 1000, 2000, 5000]
- Regression complexity — 5 unordered levels: [1, 2, 3, 4, 5]. Details below
- Imbalance level — 5 ordered levels: [0.5, 0, –0.5, –1, –1.5]. Details below
To mitigate random variation, each configuration is evaluated across 5 independent runs using different random seeds. To ensure model-agnostic conclusions, we apply 10 models from the H2O AutoML library, covering a range of algorithms such as Distributed Random Forest, Extremely Randomized Trees, Regularized Generalized Linear Models, Gradient Boosting Machines, Extreme Gradient Boosting, and Multi-layer Feedforward Neural Networks.
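The full factorial design above can be sketched as follows. This is an illustrative reconstruction of the experiment grid, not the repository's actual code; variable names are assumptions.

```python
from itertools import product

# Hypothetical experiment grid mirroring the factors described above.
sample_sizes = [200, 500, 1000, 2000, 5000]
complexity_levels = [1, 2, 3, 4, 5]
imbalance_levels = [0.5, 0, -0.5, -1, -1.5]
seeds = range(5)  # 5 independent runs per configuration

configs = list(product(sample_sizes, complexity_levels, imbalance_levels, seeds))
print(len(configs))  # 5 * 5 * 5 * 5 = 625 configurations
```

Each of the 625 configurations is then evaluated with the 10 H2O AutoML models.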
For each configuration, a dataset of the corresponding sample size is generated and split into a test set and two training sets:
- A balanced training set (baseline), serving as an ideal reference.
- An imbalanced training set, varying according to both sample size and imbalance level. A level of 0.5 indicates balance, while –1.5 represents strong imbalance.
Each dataset includes 5 Gaussian features and 5 non-linear transformations of them. The target variable
$Y \sim \mathcal{N}(\mu, \sigma)$ is constructed using five selected features. At complexity level 1, these are the original Gaussian variables (yielding a simple regression task). At level 5, only the non-linear features are used, making the task highly complex. Intermediate levels (2–4) use a mix of both.
Here is the pseudo-algorithm used to simulate a synthetic dataset with controlled complexity.
We simulate a synthetic dataset to evaluate model performance under different levels of functional complexity. Let:
- $n$: the number of observations
- $c \in \{1, 2, 3, 4, 5\}$: the complexity level
- All features $X_j$ are normalized using min-max scaling (denoted $\tilde{X}_j$)
Each observation is composed of 5 Gaussian features $X_1, \dots, X_5$ and 5 non-linear transformations of them. The final dataset contains these 10 features together with the target variable $Y$.
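The generator described above can be sketched as follows. The exact non-linear transforms, target coefficients, and the complexity-to-feature mapping are assumptions (and min-max normalization is omitted for brevity); only the overall structure, 5 Gaussian plus 5 non-linear features with the target built from 5 selected features, follows the text.

```python
import numpy as np

def simulate_dataset(n, complexity, seed=0):
    # Sketch of the synthetic generator; details are illustrative.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 5))                            # 5 Gaussian features
    Z = np.column_stack([np.sin(X[:, 0]), X[:, 1] ** 2,    # 5 assumed non-linear
                         np.exp(X[:, 2] / 2),              # transformations
                         np.abs(X[:, 3]), X[:, 0] * X[:, 4]])
    # Assumed mapping: complexity c sets how many non-linear features enter
    # the target (c = 1 -> none, c = 5 -> all five).
    k = {1: 0, 2: 1, 3: 3, 4: 4, 5: 5}[complexity]
    selected = np.column_stack([X[:, :5 - k], Z[:, :k]])   # 5 selected features
    y = selected.sum(axis=1) + rng.normal(scale=0.5, size=n)  # Gaussian target
    return np.column_stack([X, Z]), y                      # 10 features, target
```

For example, `simulate_dataset(1000, 1)` yields a simple linear task, while `simulate_dataset(1000, 5)` builds the target from the non-linear features only.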
The level of imbalance in the training sample is managed by a train/test splitting procedure controlled by an imbalance force. This procedure creates train/test subsets from a dataset while allowing control of the imbalance in the training subset through importance weighting based on the target variable.
- `data`: input dataset, including both the target and the features
- `test-size` ∈ (0, 1): proportion of the dataset allocated to the test set
- `train-size` ∈ (0, 1): proportion allocated to the training subset (optional)
- `w-test`: sampling distribution for the test set (default: uniform)
- `imbForce` ≥ 0: controls how much importance is given to rare values in the training set
- `np-seed`: seed for reproducibility
Let $n$ be the number of observations. Define:

$n_{\text{test}} = \text{round}(n \times \text{test-size})$

Then draw $n_{\text{test}}$ indices from the dataset using the sampling weights $w_{\text{test}}$:

$\text{Test indices} \sim \text{Multinomial}(n_{\text{test}}, w_{\text{test}})$
The test set is then formed from the selected indices, and the remaining data is assigned to the training pool.
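A minimal sketch of this step, assuming uniform `w-test` weights and drawing indices without replacement so that the test set and the remaining training pool are disjoint (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, test_size = 1000, 0.2
n_test = round(n * test_size)
w_test = np.full(n, 1 / n)                      # uniform sampling weights
test_idx = rng.choice(n, size=n_test, replace=False, p=w_test)
pool = np.setdiff1d(np.arange(n), test_idx)     # remaining training pool
```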
If train-size is provided, two training subsets are built:
To emphasize rare values, a relevance-based sampling distribution is computed via the IR_weighting function:
IR Weight Function: For a target value y, the weight is defined as:
$w(y) = \frac{1}{\hat{f}(y)^{\alpha}} \Big/ \sum_{i=1}^{n} \frac{1}{\hat{f}(y_i)^{\alpha}}$

where $\hat{f}(y)$ is the kernel density estimate (KDE) of the target variable and $\alpha$ controls the strength of emphasis on rare values.
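A sketch of this weight function, using a plain Gaussian KDE with Silverman's bandwidth as a stand-in for $\hat{f}$ (the repository's exact estimator may differ):

```python
import numpy as np

def IR_weighting(y, alpha=1.0):
    # Compute normalized relevance weights w(y_i) proportional to 1 / f_hat(y_i)^alpha.
    y = np.asarray(y, dtype=float)
    h = 1.06 * y.std() * len(y) ** (-1 / 5)          # Silverman's rule of thumb
    diffs = (y[:, None] - y[None, :]) / h
    f_hat = np.exp(-0.5 * diffs ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    w = f_hat ** (-alpha)                            # emphasize low-density values
    return w / w.sum()                               # normalized sampling weights
```

With $\alpha > 0$, rare (low-density) target values receive larger sampling weights; $\alpha = 0$ recovers uniform weights.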
Then, the training sample is drawn from this relevance-weighted distribution:
$\text{Imbalanced train indices} \sim \text{Multinomial}(n_{\text{train}}, w_{\text{imb}})$
A second sample of equal size is drawn using uniform weights (the same as $w_{\text{test}}$):

$\text{Balanced train indices} \sim \text{Multinomial}(n_{\text{train}}, w_{\text{test}})$
The function returns the following objects:
- `X-train`: training set sampled using the relevance-weighted distribution
- `X-test`: randomly sampled test set
- `X-bal`: baseline training set sampled using the uniform distribution
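The whole procedure can be sketched end to end as below. Function and argument names are illustrative, and a histogram-based inverse-density weight stands in for the KDE relevance weighting described above; this is a sketch of the protocol, not the repository's implementation.

```python
import numpy as np

def imbalanced_split(X, y, test_size=0.2, imb_force=1.0, seed=0):
    # Sketch: uniform test draw, relevance-weighted imbalanced train set,
    # and a uniform balanced baseline train set of the same size.
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = round(n * test_size)
    test_idx = rng.choice(n, size=n_test, replace=False)   # uniform test draw
    pool = np.setdiff1d(np.arange(n), test_idx)            # training pool
    # Inverse-density weights on the pool: w proportional to f_hat(y)^(-imb_force)
    hist, edges = np.histogram(y[pool], bins=20, density=True)
    f_hat = hist[np.clip(np.digitize(y[pool], edges[1:-1]), 0, len(hist) - 1)]
    w_imb = (f_hat + 1e-12) ** (-imb_force)
    w_imb /= w_imb.sum()
    n_train = len(pool)
    train_idx = rng.choice(pool, size=n_train, replace=True, p=w_imb)  # X-train
    bal_idx = rng.choice(pool, size=n_train, replace=True)             # X-bal
    return (X[train_idx], X[test_idx], X[bal_idx],
            y[train_idx], y[test_idx], y[bal_idx])
```

A positive `imb_force` oversamples rare target values, while negative values (as in the imbalance levels listed above) concentrate the training set on dense regions of the target, producing an imbalanced sample.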