Fix the error proportion in regression models #34

Open
SanGoku95 wants to merge 1 commit into dataiku-research:main from SanGoku95:patch-1

SanGoku95 commented Dec 19, 2024

Problem

The attribute self.error_tree.estimator_.tree_.value does not hold the error count when the analyzed model is a regression model. Instead, it holds normalized class proportions, so get_error_leaf_summary computes an incorrect n_errors for regression tasks.

Reproduction of error

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from mealy.error_analyzer import ErrorAnalyzer

# Fetch California Housing Dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y_true = housing.target

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y_true, test_size=0.3, random_state=42)

# Fit Primary Model
primary_model = DecisionTreeRegressor(max_depth=2, random_state=42)
primary_model.fit(X_train, y_train)

# Initialize ErrorAnalyzer
error_analyzer = ErrorAnalyzer(primary_model, feature_names=housing.feature_names, param_grid={"max_depth": [2]}, random_state=42)
error_analyzer.epsilon = 0.5
error_analyzer.fit(X_test, y_test)

# Validate Issue: Inspect a Node
leaf_id = 2
n_errors = error_analyzer.get_error_leaf_summary(leaf_id)[0]["n_errors"]

# Manual Verification
y_pred = primary_model.predict(X_test)
node_samples = error_analyzer.error_tree.estimator_.apply(X_test) == leaf_id
node_errors = (np.abs(y_test[node_samples] - y_pred[node_samples]) > error_analyzer.epsilon).sum()

# The assertion highlights the discrepancy
assert n_errors != node_errors  # n_errors from the library is incorrect for regression tasks

Root Cause

For regression tasks, self.error_tree.estimator_.tree_.value holds per-class proportions for each node, not absolute counts. This behavior is appropriate for classification tasks but not for regression: n_errors needs an absolute count, so reading the proportion directly yields an incorrect value.
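To make the proportions-versus-counts distinction concrete, here is a minimal sketch on toy data (not taken from mealy or from this PR's diff). It fits a plain scikit-learn DecisionTreeClassifier, recomputes the per-leaf class counts by hand, and compares them with tree_.value. Since tree_.value may hold either proportions or counts depending on the installed scikit-learn version, the check rescales each row instead of assuming one representation. The toy arrays and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (assumption): one feature, labels 1 = "error", 0 = "correct".
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
tree = clf.tree_

# Leaf index reached by each sample.
leaf_of_sample = clf.apply(X)

manual_counts = {}
rescaled_counts = {}
for leaf_id in np.unique(leaf_of_sample):
    in_leaf = leaf_of_sample == leaf_id
    # Ground truth: count the labels of the samples that fall in this leaf.
    manual_counts[leaf_id] = np.bincount(y[in_leaf], minlength=2)
    # What the tree stores for this node: proportions or counts,
    # depending on the scikit-learn version.
    stored = tree.value[leaf_id, 0, :]
    # Normalizing and scaling by the node size gives counts in both cases.
    rescaled_counts[leaf_id] = stored / stored.sum() * tree.n_node_samples[leaf_id]
    print(leaf_id, "stored:", stored, "manual counts:", manual_counts[leaf_id])
```

If the printed stored values sum to 1 for each leaf, the installed version stores proportions, and reading them as counts reproduces the discrepancy described above.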

Solution

The fix involves correctly handling regression-specific logic by:

  1. Multiplying the normalized proportion (self.error_tree.estimator_.tree_.value) by the total number of samples in the node (self.error_tree.estimator_.tree_.n_node_samples).
  2. Ensuring consistent behavior for both regression and classification models.
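The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual change in this PR: leaf_error_count is a hypothetical helper, the arrays stand in for tree_.value and tree_.n_node_samples, and samples are assumed to be unweighted (with sample weights, tree_.weighted_n_node_samples would likely be the better multiplier). Normalizing each row before scaling makes the result the same whether the input holds proportions or counts.

```python
import numpy as np


def leaf_error_count(value, n_node_samples, leaf_id, error_class_idx):
    """Absolute number of error samples in a node (hypothetical helper).

    value: array shaped like tree_.value, i.e. (n_nodes, n_outputs, n_classes)
    n_node_samples: array shaped like tree_.n_node_samples, i.e. (n_nodes,)
    """
    class_values = np.asarray(value, dtype=float)[leaf_id, 0, :]
    # Normalize first so the result does not depend on whether `value`
    # stores proportions or absolute counts.
    proportions = class_values / class_values.sum()
    # Step 1 of the fix: proportion times the number of samples in the node.
    return int(round(proportions[error_class_idx] * n_node_samples[leaf_id]))


# Stand-in arrays (assumption): a root node and two leaves, two classes.
n_node_samples = np.array([8, 3, 5])
value_as_proportions = np.array([[[0.5, 0.5]], [[1.0, 0.0]], [[0.2, 0.8]]])
value_as_counts = np.array([[[4.0, 4.0]], [[3.0, 0.0]], [[1.0, 4.0]]])

errors_from_proportions = leaf_error_count(value_as_proportions, n_node_samples, 2, 1)
errors_from_counts = leaf_error_count(value_as_counts, n_node_samples, 2, 1)
```

Both calls return the same count for the same node, which is the consistency the second step asks for.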

AgatheG left a comment

@du-phan Could you merge this one?

du-phan commented May 14, 2025

Unfortunately I don't have merge rights

SanGoku95 commented

@simonamaggio

AgatheG commented Jun 26, 2025

@instanceofme

SanGoku95 requested a review from AgatheG August 18, 2025 08:51

SanGoku95 commented

@instanceofme
