Fix the error proportion in regression models by SanGoku95 · Pull Request #34 · dataiku-research/mealy

SanGoku95 · 2024-12-19T14:28:18Z

Problem

The attribute self.error_tree.estimator_.tree_.value does not hold the error count when the analyzed model is a regression model. Instead, it holds normalized class proportions, which causes get_error_leaf_summary to compute an incorrect n_errors for regression tasks.

Reproduction of error

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from mealy.error_analyzer import ErrorAnalyzer

# Fetch California Housing Dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y_true = housing.target

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y_true, test_size=0.3, random_state=42)

# Fit Primary Model
primary_model = DecisionTreeRegressor(max_depth=2, random_state=42)
primary_model.fit(X_train, y_train)

# Initialize ErrorAnalyzer
error_analyzer = ErrorAnalyzer(primary_model, feature_names=housing.feature_names, param_grid={"max_depth": [2]}, random_state=42)
error_analyzer.epsilon = 0.5
error_analyzer.fit(X_test, y_test)

# Validate Issue: Inspect a Node
leaf_id = 2
n_errors = error_analyzer.get_error_leaf_summary(leaf_id)[0]["n_errors"]

# Manual Verification
y_pred = primary_model.predict(X_test)
node_samples = error_analyzer.error_tree.estimator_.apply(X_test) == leaf_id
node_errors = (np.abs(y_test[node_samples] - y_pred[node_samples]) > error_analyzer.epsilon).sum()

# The assertion highlights the discrepancy
assert n_errors != node_errors  # n_errors from the library is incorrect for regression tasks

Root Cause

For regression tasks, self.error_tree.estimator_.tree_.value holds per-node class proportions, not absolute sample counts. This behavior is appropriate for classification tasks but not for regression, where reading the values as counts makes n_errors incorrect.

Solution

The fix handles the regression case explicitly by:

- Multiplying the normalized proportion (self.error_tree.estimator_.tree_.value) by the total number of samples in the node (self.error_tree.estimator_.tree_.n_node_samples).
- Ensuring consistent behavior for both regression and classification models.
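
As a rough illustration of the multiplication described above (not the actual patch in this PR), the sketch below converts per-node proportions into per-node counts using plain NumPy. The function name class_counts_from_proportions and the example arrays are hypothetical; the arrays only mimic the layout of tree_.value (node_count, n_outputs, n_classes) and tree_.n_node_samples (node_count,). Each row is re-normalized first, so the result is the same whether the input holds proportions or raw counts.

```python
import numpy as np


def class_counts_from_proportions(value, n_node_samples):
    """Hypothetical helper: turn per-node class proportions into class counts.

    value          -- array shaped (node_count, n_outputs, n_classes),
                      laid out like tree_.value
    n_node_samples -- array shaped (node_count,), like tree_.n_node_samples
    """
    value = np.asarray(value, dtype=float)
    n_node_samples = np.asarray(n_node_samples)

    # Re-normalize each node so the input may be proportions or raw counts.
    proportions = value / value.sum(axis=2, keepdims=True)

    # Scale each node's proportions by the number of samples in that node.
    counts = proportions * n_node_samples[:, None, None]

    # Round to the nearest integer to absorb floating-point noise.
    return np.rint(counts).astype(int)


# Made-up values for a three-node tree with two classes per node.
value = np.array([[[0.6, 0.4]], [[0.9, 0.1]], [[0.3, 0.7]]])
n_node_samples = np.array([100, 50, 50])

counts = class_counts_from_proportions(value, n_node_samples)
print(counts[:, 0, :])
```

Which column corresponds to the error class depends on how the error tree labels its classes, so that index is left to the caller here. The sketch also assumes unweighted samples; with sample weights, tree_.weighted_n_node_samples would presumably be the appropriate multiplier.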

SanGoku95 · 2024-12-30T09:49:48Z

@du-phan @simonamaggio @AgatheG

AgatheG

@du-phan Could you merge this one?

du-phan · 2025-05-14T07:57:12Z

Unfortunately I don't have merge rights

SanGoku95 · 2025-06-26T13:47:16Z

@simonamaggio

AgatheG · 2025-06-26T13:54:57Z

@instanceofme

SanGoku95 · 2025-08-18T08:55:14Z

@instanceofme

fix the error proportion in regression models

dd36748

AgatheG approved these changes May 13, 2025


SanGoku95 requested a review from AgatheG August 18, 2025 08:51

SanGoku95 wants to merge 1 commit into dataiku-research:main from SanGoku95:patch-1
