Rashomon PDP

Quick and dirty exploration of ideas from [1].

NOTE: This was done just to satisfy my curiosity, so expect mistakes. I tried to make it as reproducible as possible, but because of setting max_runtime_secs and running on my old personal laptop from 2012, you might end up with different results.

UPDATE: I reran the experiments for all configurations and changed the transformation of X1 in scenario 3.

The first version used np.exp(np.cos(5*X1*np.pi)) as the X1 transformation in scenario 3, since the transformation wasn't specified in the paper.
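For concreteness, the first version's transformation can be sketched as part of a data-generation step. Only the np.exp(np.cos(5*X1*np.pi)) term and the error variances {1, 4, 9} come from this README; the uniform X1 range, the additive-noise form, and all names are my assumptions:

```python
import numpy as np

def generate_scenario3(n=1000, noise_var=1, seed=42):
    """Hypothetical sketch of scenario 3 data generation.

    Only the X1 transformation np.exp(np.cos(5*X1*np.pi)) and the
    error variances {1, 4, 9} are taken from this README; the uniform
    X1 range and the additive-noise setup are assumptions.
    """
    rng = np.random.default_rng(seed)
    X1 = rng.uniform(-1.0, 1.0, n)
    noise = rng.normal(0.0, np.sqrt(noise_var), n)
    y = np.exp(np.cos(5 * X1 * np.pi)) + noise
    return X1, y

X1, y = generate_scenario3(noise_var=4)
```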

Structure

  • utils.py - contains mean_distance_to_gt_pdp and get_rashomon_set
  • scenarios.py - contains the ground-truth functions used for data generation and ground-truth PDP calculation
  • results/X_Y_Z.ipynb - rendered notebooks; X = AutoML configuration [A...P], Y = scenario, Z = variance {1, 4, 9}
  • results/pdps/X/Y_Z/{COL}_{PDP_TYPE}.csv - raw PDP values; X = AutoML configuration [A...P], Y = scenario, Z = variance {1, 4, 9}
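As a rough idea of what get_rashomon_set does, here is a minimal sketch. The function name and $\varepsilon = 0.05$ come from this README; the leaderboard format and the relative-threshold definition of the Rashomon set (loss within $(1 + \varepsilon)$ of the best loss) are my assumptions:

```python
def get_rashomon_set(leaderboard, epsilon=0.05):
    """Return the ids of models whose loss is within (1 + epsilon)
    of the best loss (a common Rashomon-set definition; the exact
    definition used in utils.py may differ).

    leaderboard: list of (model_id, loss) pairs, lower loss is better.
    """
    best_loss = min(loss for _, loss in leaderboard)
    threshold = (1 + epsilon) * best_loss
    return [mid for mid, loss in leaderboard if loss <= threshold]

lb = [("GBM_1", 1.00), ("DRF_1", 1.04), ("GLM_1", 1.20)]
print(get_rashomon_set(lb))  # -> ['GBM_1', 'DRF_1']
```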

Experiment setting

I had some ideas about what could work better than the Rashomon PDP presented in [1]. None proved significantly better.

The Rashomon PDP selects models based solely on their performance. My idea was that diversity in model types might be too useful to discard: for example, tree-based models don't extrapolate well, and they can influence the combined PDP too much. This can be seen in the third quartile of the following plot, where the PDP of the best model is closer to the ground truth than the combination of multiple PDPs. In this plot the best model is also the best base model (no Stacked Ensemble was trained within the time limit).

PDP extrapolation
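The comparison above relies on two building blocks: averaging per-model PDPs into a combined PDP, and measuring the distance to the ground-truth PDP. The function name mean_distance_to_gt_pdp comes from this README; the mean-absolute-difference metric and the pointwise-average combination are my assumptions:

```python
import numpy as np

def mean_distance_to_gt_pdp(pdp, gt_pdp):
    """Mean absolute distance between an estimated PDP and the
    ground-truth PDP, evaluated on the same grid. The exact distance
    used in utils.py is not shown in the README; mean absolute
    difference is an assumption.
    """
    pdp, gt_pdp = np.asarray(pdp, float), np.asarray(gt_pdp, float)
    return float(np.mean(np.abs(pdp - gt_pdp)))

# Combining the PDPs of several models by pointwise averaging
# (made-up numbers; rows are models, columns are grid points):
pdps = np.array([[0.0, 1.0, 2.0],
                 [0.2, 1.2, 2.6]])
combined = pdps.mean(axis=0)  # -> [0.1, 1.1, 2.3]
```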

Here we can also see that using the "Best of Family" models doesn't perform well either. So I tried using the intersection of the models in the Rashomon set ($\varepsilon = 0.05$) and the "Best of Family" models.

So I tested the following scenarios:

  • Best Model
  • Best Base Model - excludes Stacked Ensembles
  • Best of Family - uses the best model from each family
  • Rashomon - uses the models from the Rashomon set
  • Rashomon intersected with Best of Family - uses only one model of each type within the Rashomon set
  • All Models - uses all models
  • Best Stacked Ensemble - a Stacked Ensemble already combines models, so it seemed logical to me to treat it as a separate scenario
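The "Rashomon intersected with Best of Family" variant can be sketched like this (the leaderboard format, the relative Rashomon threshold, and deriving the family from the model id prefix are all my assumptions, not the repo's actual implementation):

```python
def best_of_family_in_rashomon(leaderboard, epsilon=0.05):
    """Keep only the best model of each family among the models whose
    loss is within (1 + epsilon) of the best loss.

    leaderboard: list of (model_id, loss), lower is better. The family
    is assumed to be the algorithm prefix of the model id (e.g. 'GBM').
    """
    best_loss = min(loss for _, loss in leaderboard)
    rashomon = [(m, l) for m, l in leaderboard if l <= (1 + epsilon) * best_loss]
    best_per_family = {}
    for model_id, loss in sorted(rashomon, key=lambda t: t[1]):
        family = model_id.split("_")[0]
        best_per_family.setdefault(family, model_id)  # keep lowest-loss model
    return list(best_per_family.values())

lb = [("GBM_1", 1.00), ("GBM_2", 1.02), ("DRF_1", 1.04), ("GLM_1", 1.20)]
print(best_of_family_in_rashomon(lb))  # -> ['GBM_1', 'DRF_1']
```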

Knowing the inner workings of H2O-3 AutoML, I ran AutoML with nfolds=5, which forces the Stacked Ensembles to train their metalearners on cross-validation scores. The default (nfolds=-1) might end up training the Stacked Ensembles in blending mode.

Because of these potential extrapolation issues, I looked at the full range, the inner two quartiles, and the outer two quartiles separately, to see whether the relative performance of the methods differs between regions.
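The inner/outer split can be done with quantiles of the training feature. A minimal numpy sketch (the exact splitting logic in the notebooks may differ; all names here are illustrative):

```python
import numpy as np

def split_inner_outer(grid, x_train):
    """Boolean masks over a PDP grid: points inside the inner two
    quartiles of the training feature (between its 25th and 75th
    percentiles) vs. the outer two quartiles."""
    q1, q3 = np.quantile(x_train, [0.25, 0.75])
    inner = (grid >= q1) & (grid <= q3)
    return inner, ~inner

x_train = np.linspace(0.0, 1.0, 5)   # stand-in for a training feature
grid = np.linspace(-0.5, 1.5, 9)     # PDP grid extending past the data
inner, outer = split_inner_outer(grid, x_train)
```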

Results

Due to the time constraint (max_runtime_secs), Stacked Ensembles are not always present, so they might look much better than they really are. In my runs, Stacked Ensembles failed to train because of the time constraint in roughly a quarter of the cases.

Nevertheless, it seems plausible to me that Stacked Ensembles really could be this good, since they combine models to generalize better in a more sophisticated way than just picking models above some performance threshold or looking at their type. More evaluation results are in 03_evaluation.ipynb.

UPDATE: I also looked at performance at different noise levels. Generally speaking, Rashomon seems to behave better when there is less noise. This can be seen both by comparing the inner and outer halves of the PDPs and by comparing noise levels. In the inner half, models tend to agree, likely because of higher training-data density and therefore lower variance in the PDPs. At noise level 3, Rashomon is significantly worse than "Rashomon intersected with Best of Family".
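The "average ranks" plots below summarize this comparison: in each experiment the methods are ranked by their error, and the ranks are averaged across experiments. A generic sketch with made-up numbers (ties are ignored; scipy.stats.rankdata would handle them more carefully):

```python
import numpy as np

def average_ranks(scores):
    """scores: (n_experiments, n_methods) array of errors, lower is
    better. Returns the mean rank of each method (1 = best).
    Ties are broken by index order rather than averaged."""
    order = np.argsort(scores, axis=1)      # method indices, best first
    ranks = np.argsort(order, axis=1) + 1   # rank of each method per row
    return ranks.mean(axis=0)

# Made-up distances for 3 experiments x 2 methods:
scores = np.array([[0.1, 0.3],
                   [0.2, 0.1],
                   [0.4, 0.5]])
avg = average_ranks(scores)  # method 0 wins in 2 of 3 experiments
```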

Average ranks in whole region

Average ranks in inner quartiles

Average ranks in outer quartiles

Noise level == 1

Average ranks in whole region

Average ranks in inner quartiles

Average ranks in outer quartiles

Noise level == 2 (error variance = 4)

Average ranks in whole region

Average ranks in inner quartiles

Average ranks in outer quartiles

Noise level == 3 (error variance = 9)

Average ranks in whole region

Average ranks in inner quartiles

Average ranks in outer quartiles

References

[1] M. Cavus, J. N. van Rijn, and P. Biecek, "Quantifying Model Uncertainty with AutoML and Rashomon Partial Dependence Profiles: Enabling Trustworthy and Human-centered XAI", Inf Syst Front, Feb. 2026. https://doi.org/10.1007/s10796-026-10698-3
