Merge pull request #14 from artefactory/dev_paper_submission

Abdoulaye-SAKHO · web-flow · commit 25da5861f68c · 2025-11-15T18:13:42.000+01:00
Add last modifications on paper before submission
diff --git a/README.md b/README.md
@@ -42,8 +42,7 @@ pip install woodtapper
 
 **From this repository, within a pip/conda/mamba environment (python=3.12)**:
 ```bash
-pip install -r requirements.txt
-pip install -e '.[dev]'
+pip install -e '.[dev,docs]'
 ```
 
 ## 🌿 WoodTapper RulesExtraction module
diff --git a/docs/2_tutorials_example_exp.md b/docs/2_tutorials_example_exp.md
@@ -54,5 +54,5 @@ RF.fit(X_train, y_train)
 
 # Load an existing RandomForestClassifier into the explainer
 RFExplained = RandomForestClassifierExplained.load_forest(RandomForestClassifierExplained,RF,X_train,y_train)
-X_explain, y_explain = RFExplained.explanation(X_test)
+Xy_explain = RFExplained.explanation(X_test)
 ```
diff --git a/docs/installation.md b/docs/installation.md
@@ -16,11 +16,7 @@ git clone https://github.com/artefactory/woodtapper.git
 ```
 And install the required packages into your environment (conda, mamba or pip):
 ```bash
-pip install -r requirements.txt
-```
-Then run the following command from the repository root directory :
-```
-pip install -e .[dev]
+pip install -e '.[dev,docs]'
 ```
 
 ## Dependencies
diff --git a/paper/images/rules-titanic-py-final.pdf b/paper/images/rules-titanic-py-final.pdf
diff --git a/paper/paper.md b/paper/paper.md
@@ -1,5 +1,5 @@
 ---
-title: 'WoodTapper: a Python package for tapping decision tree ensembles'
+title: 'WoodTapper: a Python package for explaining decision tree ensembles'
 tags:
   - Python
   - Machine Learning
@@ -55,28 +55,28 @@ In a tree $\mathcal{T}$, we denote the path of successive splits from the root n
 $$
     \mathcal{P} = \{(j_k,r_k,s_k), k=1, \dots, K\},
 $$
-where $K$ is the path length, $j_k$ is the selected feature at depth $k$, $r_k$ the selected splitting position along $X^{(j_k)}$ and $s_k$ the corresponding sign (either $\leq$ corresponding to the left node or $>$ corresponding to the right node).
-Thus, each path defines a hyperrectangle in the input space, denoted $\hat{H}(\mathcal{P}) \subset \mathbb{R}^p$. Hence, each path can be associated with a rule function $\hat{g}_{\mathcal{D},\mathcal{P}}$, that returns the mean of $Y$ from the training sample inside and outside of $\hat{H}(\mathcal{P})$:
+where $K$ is the path length, $j_k \in \{1, \dots,p\}$ is the selected feature at depth $k$, $r_k \in \mathbb{R}$ the selected splitting position along $x^{(j_k)}$ and $s_k$ the corresponding sign (either $\leq$ corresponding to the left node or $>$ corresponding to the right node).
+Thus, each path defines a hyperrectangle in the input space, denoted $\hat{H}(\mathcal{P}) \subset \mathbb{R}^p$. Hence, each path can be associated with a rule function $\hat{g}_{\mathcal{P}}$, that returns the mean of $Y$ from the training sample inside and outside of $\hat{H}(\mathcal{P})$:
 $$
     \hat{g}_{\mathcal{P}}(x) =
     \begin{cases}
         \frac{\sum_{i=1}^{n}y_i \mathbb{I}_{\{x_i \in \hat{H}(\mathcal{P})\}}}{\sum_{i=1}^{n} \mathbb{I}_{\{x_i \in \hat{H}(\mathcal{P})\}}}  \text{ if } x \in \hat{H}(\mathcal{P})\\
         \frac{\sum_{i=1}^{n}y_i \mathbb{I}_{\{x_i \not\in \hat{H}(\mathcal{P})\}}}{\sum_{i=1}^{n} \mathbb{I}_{\{x_i \not\in \hat{H}(\mathcal{P})\}}}  \text{ otherwise }.
     \end{cases}
 $$
-We suppose we have a set of trees $\{\mathcal{T}_m, m=1, \dots, M \}$ from a random forest, each grown with randomness $\Theta_m$. For a path $\mathcal{P}$, we estimate the rule probability $p\left(\mathcal{P}\right)$ via Monte-Carlo sampling with $\hat{p}$,
+We suppose we have a set of trees $\{\mathcal{T}_m, m=1, \dots, M \}$ from a tree ensemble procedure, each grown with randomness $\Theta_m$. We denote by $\Pi$ the set of all possible paths from $\{\mathcal{T}_m, m=1, \dots, M \}$. For a path $\mathcal{P} \in \Pi$, we estimate the rule probability $p\left(\mathcal{P}\right)$ via Monte-Carlo sampling with $\hat{p}\left(\mathcal{P}\right)$:
 $$
-    \hat{p}_{}\left(\mathcal{P}\right) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}_{\{\mathcal{P} \in \mathcal{T}(\Theta_m,\mathcal{D}_n)\}},
+    \hat{p}\left(\mathcal{P}\right) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}_{\{\mathcal{P} \in \mathcal{T}(\Theta_m,\mathcal{D}_n)\}},
 $$
-which corresponds to the probability that the path $\mathcal{P}$ belongs to the set of trees $\{\mathcal{T}_m, m=1, \dots, M \}$. We denote by $\Pi$ the set of extracted rules from $\{\mathcal{T}_m, m=1, \dots, M \}$.
+which corresponds to the empirical probability that the path $\mathcal{P} \in \Pi$ belongs to the set of trees $\{\mathcal{T}_m, m=1, \dots, M \}$.
 
 The set of final rules is $\{\hat{g}_{\mathcal{P}}, \mathcal{P} \in  \hat{\mathcal{P}}_{p_0}\}$ where $\hat{\mathcal{P}}_{p_0} = \left\{ \mathcal{P} \in \Pi, \, \hat{p}(\mathcal{P}) > p_0\right\}$ with $p_0 \in [0,1)$. The finals rules are aggregated as follows for building the final estimator:
 $$
     \hat{\eta}_{p_0}(x) = \frac{1}{|\hat{\mathcal{P}}_{p_0}|} \sum_{\mathcal{P} \in \hat{\mathcal{P}}_{p_0}} \hat{g}_{\mathcal{P}}(x).
 $$
 
 So far, we have focused on binary classification for clarity.
-We also implemented SIRUS for regression, where final rules are aggregated using weights learned via ridge regression. Our implementation extends SIRUS to multiclass classification (not available in the original R version) as well as regression. It also leverages scikit-learn's implementations for tree-based models fitting.
+We also implemented the rule extractor for regression, where final rules are aggregated using weights learned via ridge regression. Our implementation extends SIRUS, i.e. rules extracted from a random forest, to multiclass classification (not available in the original R version). Finally, our implementation leverages scikit-learn's implementations for fitting tree-based models.
 
 ## Implementation and running time
 WoodTapper adheres to the scikit-learn [@pedregosa2011scikit] estimator interface, providing familiar methods such as $fit$, $predict$, and $get\_params$. This design enables smooth integration with existing workflows involving pipelines, cross-validation, and model selection (see Table \ref{tab:comparison}).
@@ -126,7 +126,7 @@ We compare the rules produced by the original SIRUS (R) and our Python implement
 ## Formulation
 
 The $\texttt{ExampleExplanation}$ module of WoodTapper is independent of the rule extraction module and provides an example-based explainability.
-It enables tree-based models to identify the $l \in \mathbb{N}$ most similar training samples to $x$, using the similarity measure induced by random forests [@breiman2001random;@grf].
+It enables tree-based models to identify the $l \in \mathbb{N}$ most similar training samples to $x$, using the similarity measure induced by generalized random forests [@breiman2001random;@grf].
 For a new sample $x$ with unknown label and $\mathcal{T}_m$ a decision tree, let $\mathcal{L}_m(x)$ denote the set of training samples that share the same leaf as $x$ in tree $\mathcal{T}_m$ for $m = 1, \dots, M$.
 Letting $w(x,x_i)$ be the similarity between $x$ and $x_i$, we have
 $$
@@ -138,7 +138,7 @@ Finally, the $l$ training samples with the highest $w(x,x_i)$ values, along with
 The $\textit{skgrf}$ [@skgrf] package is an interface for using the R implementation of generalized random forest in Python. $\textit{skgrf}$ has a specific number of classifiers for specifics learning tasks (causal inference, quantile regression,...). For each task, the user can compute the kernel weights, which are equivalent to our leaf frequency match introduce above. Thus, we aim at comparing the kernel weights derivation from $\textit{skgrf}$ to our $\texttt{ExampleExplanation}$ module. We stress on the fact that our $\texttt{ExampleExplanation}$ is designed for usual tree-based models such as random forest of extra trees and not specifically in a context of causal inference or quantile regression. Thus, the tree building (splitting criterion) of our forest are different from the ones from $\textit{skgrf}$.
 
 ## Implementation and running time
-As for SIRUS, our Python implementation of $\texttt{ExampleExplanation}$ adheres to the scikit-learn interface. Our $\texttt{ExampleExplanation}$ module is agnostic to the underlying tree ensemble, and can be used with random forests or extra trees (\ref{tab:comparison-grf}). For each ensemble type, a subclass inherits both the original scikit-learn class and our implemented class. The standard $\texttt{fit}$ and $\texttt{predict}$ methods remain unchanged, while an additional $\texttt{explain}$ method provides example-based explanations for new samples. This allows users to train and predict using standard scikit-learn workflows, while enabling access to $\texttt{ExampleExplanation}$ for interpretability analyses. We also have imlemented a method to load an already trained tree-basedd model into an $\texttt{ExampleExplanation}$ classifier.
+As for SIRUS, our Python implementation of $\texttt{ExampleExplanation}$ adheres to the scikit-learn interface. Our $\texttt{ExampleExplanation}$ module is agnostic to the underlying tree ensemble, and can be used with random forests or extra trees (\ref{tab:comparison-grf}). The standard $\texttt{fit}$ and $\texttt{predict}$ methods remain unchanged, while an additional $\texttt{explain}$ method provides example-based explanations for new samples. This allows users to train and predict using standard scikit-learn workflows, while enabling access to $\texttt{ExampleExplanation}$ for interpretability analyses. We also have implemented a method to load an already trained tree-based model into an $\texttt{ExampleExplanation}$ classifier.
 
 : **Comparison of GRF weight computations in several Python packages.**\label{tab:comparison-grf}
 
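Note on the `paper/paper.md` formulation hunk: the estimator $\hat{p}(\mathcal{P})$, the rule function $\hat{g}_{\mathcal{P}}$ and the aggregate $\hat{\eta}_{p_0}$ can be sketched in a few lines of plain Python. This is only an illustration of the formulas, not WoodTapper's implementation. The function names, the `"L"`/`"R"` sign encoding for $\leq$/$>$, the representation of a tree as a set of paths, and the NaN returned for an empty region are all assumptions made for the example.

```python
from collections import Counter

# A path is a tuple of splits (j, r, s): feature index j, threshold r,
# and sign s, where "L" means x[j] <= r and "R" means x[j] > r (assumed encoding).

def in_hyperrectangle(x, path):
    """True if x satisfies every split of the path."""
    return all((x[j] <= r) if s == "L" else (x[j] > r) for j, r, s in path)

def rule_probabilities(trees):
    """p_hat(P): fraction of the M trees that contain path P.

    `trees` is a list of collections of paths, one collection per tree.
    """
    counts = Counter(path for tree in trees for path in set(tree))
    return {path: count / len(trees) for path, count in counts.items()}

def make_rule(path, X, y):
    """g_hat_P: mean of y inside the hyperrectangle if x falls in it, else mean outside."""
    inside = [yi for xi, yi in zip(X, y) if in_hyperrectangle(xi, path)]
    outside = [yi for xi, yi in zip(X, y) if not in_hyperrectangle(xi, path)]
    # The formula is undefined for an empty region; NaN is an assumption here.
    mean_in = sum(inside) / len(inside) if inside else float("nan")
    mean_out = sum(outside) / len(outside) if outside else float("nan")
    return lambda x: mean_in if in_hyperrectangle(x, path) else mean_out

def aggregate(x, trees, X, y, p0):
    """eta_hat_{p0}(x): plain average of the rules whose p_hat exceeds p0."""
    p_hat = rule_probabilities(trees)
    selected = [path for path, prob in p_hat.items() if prob > p0]
    if not selected:
        raise ValueError("no path has estimated probability above p0")
    rules = [make_rule(path, X, y) for path in selected]
    return sum(g(x) for g in rules) / len(rules)
```

Raising `p0` keeps fewer, more frequently occurring paths, which is the stability/size trade-off the threshold controls in the text.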
diff --git a/pyproject.toml b/pyproject.toml
@@ -8,7 +8,7 @@ authors = [
     {name = "Abdoulaye SAKHO", email = "abdoulaye7020@gmail.com"},
     {name = "artefactory", email = "abdoulaye.sakho@artefact.com"},
 ]
-version = "0.0.11"
+version = "0.0.12"
 description = "A Python toolbox for interpretable and explainable tree ensembles."
 readme = "README.md"
 license = "MIT"
diff --git a/woodtapper/extract_rules/visualization.py b/woodtapper/extract_rules/visualization.py
@@ -8,34 +8,38 @@
 
 
 def show_rules(
-    RulesExtractorModel, max_rules=9, target_class_index=1, is_regression=False
+    RulesExtractorModel,
+    max_rules=9,
+    target_class_index=1,
+    is_regression=False,
+    value_mappings=None,
 ):
     """
-    Display the rules in a structured format, showing the conditions and associated probabilities for a specified target class.
+    Display the rules in a structured format.
+
     Parameters
     ----------
     RulesExtractorModel : object
-        The fitted rules extraction model containing the rules and probabilities.
-    max_rules : int, optional (default=9)
-        The maximum number of rules to display.
-    target_class_index : int, optional (default=1)
-        The index of the target class for which to display probabilities.
-    list_indices_features_bin : list of int, optional (default=None)
-        List of feature indices that are binary (0/1) for special formatting.
-    Returns
-    ----------
-    None
-    1. Validate the presence of necessary attributes in the model.
-    2. Extract rules and their associated probabilities.
-    3. Format and display the rules in a tabular format.
-    4. Include estimated average rates for the specified target class.
-    5. Handle feature names for better readability, using provided mappings if available.
-    6. Adjust formatting for binary features if specified.
-    7. Ensure that the display is clear and informative, with appropriate headers and alignment.
-    8. If the model lacks the required attributes, print an error message and exit.
-    9. If there are no rules to display, print a corresponding message and exit.
-    10. Calculate and display the estimated average probability for the target class based on 'else' clauses.
-    11. Print the rules along with their conditions, 'then' probabilities, and 'else' probabilities in a structured table.
+        Fitted rules extraction model.
+    max_rules : int, default=9
+        Max number of rules to display.
+    target_class_index : int, default=1
+        Class index whose probability to show (classification).
+    is_regression : bool, default=False
+        Switch to regression formatting.
+    value_mappings : dict, optional
+        {
+            <feature_index or feature_name>: {
+                <raw_value>: <display_string>,
+                ...
+            },
+            ...
+        }
+        For binary features with both 0 and 1 mapped, rules become:
+            FeatureName is <mapped_1>   (if sign_internal == "R")
+            FeatureName is <mapped_0>   (if sign_internal == "L")
+        (Instead of using negations.)
+
     """
     if (
         not hasattr(RulesExtractorModel, "rules_")
@@ -51,7 +55,10 @@ def show_rules(
         raise ValueError(
             "For regression, model must have 'list_probas_by_rules_without_coefficients' attribute."
         )
-    list_indices_features_bin = RulesExtractorModel._list_categorical_indexes
+
+    list_indices_features_bin = getattr(
+        RulesExtractorModel, "_list_categorical_indexes", None
+    )
 
     rules_all = RulesExtractorModel.rules_
     if is_regression:
@@ -76,27 +83,19 @@ def show_rules(
             "No rules to display. try to increase the number of rules extracted or check model fitting."
         )
 
-    # Attempt to build/use feature mapping
+    # Feature name mapping
     feature_mapping = None
-    if hasattr(
-        RulesExtractorModel, "feature_names_in_"
-    ):  # Standard scikit-learn attribute
-        # Create a mapping from index to name if feature_names_in_ is a list
+    if hasattr(RulesExtractorModel, "feature_names_in_"):
         feature_mapping = {
             i: name for i, name in enumerate(RulesExtractorModel.feature_names_in_)
         }
-    elif hasattr(
-        RulesExtractorModel, "feature_names_"
-    ):  # Custom attribute for feature names
+    elif hasattr(RulesExtractorModel, "feature_names_"):
         if isinstance(RulesExtractorModel.feature_names_, dict):
-            feature_mapping = (
-                RulesExtractorModel.feature_names_
-            )  # Assumes it's already index:name
+            feature_mapping = RulesExtractorModel.feature_names_
         elif isinstance(RulesExtractorModel.feature_names_, list):
             feature_mapping = {
                 i: name for i, name in enumerate(RulesExtractorModel.feature_names_)
             }
-    # If no mapping, column_name will default to using indices.
 
     base_ps_text = ""
     if not is_regression:
@@ -125,6 +124,51 @@ def show_rules(
     max_condition_len = 0
     condition_strings_for_rules = []
 
+    def _map_value(dim, dim_name, raw_val):
+        if value_mappings is None:
+            return None
+        candidates = [dim]
+        if dim_name is not None:
+            candidates.append(dim_name)
+        for c in candidates:
+            if c in value_mappings:
+                nested = value_mappings[c]
+                if raw_val in nested:
+                    return nested[raw_val]
+                if isinstance(raw_val, (float, np.floating)) and int(raw_val) in nested:
+                    return nested[int(raw_val)]
+        return None
+
+    def _format_binary_condition(dimension, column_name, sign_internal):
+        # Determine which side of binary (0 or 1) the rule represents.
+        positive_val = 1
+        negative_val = 0
+        # Try to map both
+        mapped_pos = _map_value(dimension, column_name, positive_val)
+        mapped_neg = _map_value(dimension, column_name, negative_val)
+
+        # If both mapped, choose directly
+        if mapped_pos is not None and mapped_neg is not None:
+            if sign_internal == "R":  # >
+                return f"{column_name} is {mapped_pos}"
+            else:  # "<=" side
+                return f"{column_name} is {mapped_neg}"
+        # If only one mapped
+        if mapped_pos is not None:
+            if sign_internal == "R":
+                return f"{column_name} is {mapped_pos}"
+            else:
+                return f"{column_name} is not {mapped_pos}"
+        if mapped_neg is not None:
+            if sign_internal == "L":
+                return f"{column_name} is {mapped_neg}"
+            else:
+                return f"{column_name} is not {mapped_neg}"
+
+        # Fallback numeric
+        raw_indicator = 0 if sign_internal == "L" else 1
+        return f"{column_name} is {raw_indicator}"
+
     for i in range(num_rules_to_show):
         current_rule_conditions = rules_all[i]
         condition_parts_str = []
@@ -133,31 +177,44 @@ def show_rules(
                 rule=current_rule_conditions[j]
             )
 
-            column_name = f"Feature[{dimension}]"  # Default if no mapping
+            column_name = f"Feature[{dimension}]"
             if feature_mapping and dimension in feature_mapping:
                 column_name = feature_mapping[dimension]
             elif (
                 feature_mapping
                 and isinstance(dimension, str)
                 and dimension in feature_mapping.values()
             ):
-                # If dimension is already a name that's in the mapping's values (less common for index)
                 column_name = dimension
-            if (
+
+            is_binary = (
                 list_indices_features_bin is not None
                 and dimension in list_indices_features_bin
-            ):
-                sign_display = "is"  # if sign_internal == "L" else "is not"
-                # treshold_display = str(treshold)
-                treshold_display = str(0) if sign_internal == "L" else str(1)
+            )
+
+            if is_binary:
+                condition_parts_str.append(
+                    _format_binary_condition(dimension, column_name, sign_internal)
+                )
             else:
                 sign_display = "<=" if sign_internal == "L" else ">"
+                if isinstance(treshold, float):
+                    treshold_display_raw = float(f"{treshold:.2f}")
+                else:
+                    treshold_display_raw = treshold
+                mapped = _map_value(dimension, column_name, treshold_display_raw)
                 treshold_display = (
-                    f"{treshold:.2f}" if isinstance(treshold, float) else str(treshold)
+                    mapped
+                    if mapped is not None
+                    else (
+                        f"{treshold:.2f}"
+                        if isinstance(treshold, float)
+                        else str(treshold)
+                    )
+                )
+                condition_parts_str.append(
+                    f"{column_name} {sign_display} {treshold_display}"
                 )
-            condition_parts_str.append(
-                f"{column_name} {sign_display} {treshold_display}"
-            )
 
         full_condition_str = " & ".join(condition_parts_str)
         condition_strings_for_rules.append(full_condition_str)
@@ -187,12 +244,10 @@ def show_rules(
             then_val_str = f"{p_s_if_true:.2f}"
             p_s_if_false = prob_if_false_list
             else_val_str = f"{p_s_if_false:.2f} | coeff={coefficients_all[i]:.2f}"
-
-        else:  # classification
+        else:
             if prob_if_true_list and len(prob_if_true_list) > target_class_index:
                 p_s_if_true = prob_if_true_list[target_class_index] * 100
                 then_val_str = f"{p_s_if_true:.0f}%"
-
             if prob_if_false_list and len(prob_if_false_list) > target_class_index:
                 p_s_if_false = prob_if_false_list[target_class_index] * 100
                 else_val_str = f"{p_s_if_false:.0f}%"
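
Note on the new `value_mappings` argument: the standalone sketch below restates the binary-feature branch added in this diff so its effect can be checked without a fitted model. It is not the package code. `lookup` and `format_binary_condition` are hypothetical names, the float-to-int fallback of `_map_value` is deliberately left out, and the `"Sex"` mapping is invented for illustration.

```python
def lookup(value_mappings, dim, dim_name, raw_val):
    """Return the display string for raw_val, trying the feature index then its name."""
    if value_mappings is None:
        return None
    for key in (dim, dim_name):
        if key is not None and key in value_mappings:
            nested = value_mappings[key]
            if raw_val in nested:
                return nested[raw_val]
    return None

def format_binary_condition(value_mappings, dim, dim_name, sign_internal):
    """Render a condition on a 0/1 feature; "R" is the > side, "L" the <= side."""
    mapped_pos = lookup(value_mappings, dim, dim_name, 1)
    mapped_neg = lookup(value_mappings, dim, dim_name, 0)
    if mapped_pos is not None and mapped_neg is not None:
        # Both values mapped: name the side directly, no negation needed.
        return f"{dim_name} is {mapped_pos if sign_internal == 'R' else mapped_neg}"
    if mapped_pos is not None:
        return f"{dim_name} is {'' if sign_internal == 'R' else 'not '}{mapped_pos}"
    if mapped_neg is not None:
        return f"{dim_name} is {'' if sign_internal == 'L' else 'not '}{mapped_neg}"
    # Nothing mapped: fall back to the raw 0/1 indicator.
    return f"{dim_name} is {0 if sign_internal == 'L' else 1}"

# Hypothetical mapping keyed by feature name; keying by feature index works the same way.
mappings = {"Sex": {0: "male", 1: "female"}}
print(format_binary_condition(mappings, 2, "Sex", "R"))
print(format_binary_condition(mappings, 2, "Sex", "L"))
```

With both values mapped, the two calls render `Sex is female` and `Sex is male`. With only one value mapped, the opposite side is rendered as a negation, for example `Sex is not female`.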
