Rashomon PDP

Quick and dirty exploration of ideas from [1].

NOTE: This is done just to satisfy my curiosity, so expect mistakes. I tried to make it as reproducible as possible, but because the runs are limited by max_runtime_secs and were executed on my old personal laptop from 2012, you might end up with different results.

UPDATE: I reran the experiments for all the configurations and I changed the transformation of X1 in scenario 3 to:

the correct version (latest on main)
a version more similar to the one in the paper: 0.08*np.where(x>=0, 0.5*np.tanh(x)+0.1*np.sqrt(np.abs(x)), -np.sqrt(np.abs(2*x))) + 0.85.

The first version used np.exp(np.cos(5*X1*np.pi)) as the X1 transformation in scenario 3, since the transformation wasn't specified in the paper.
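To make the two X1 transformations easier to compare, here is a minimal sketch that wraps the expressions quoted above in functions. The function names and the evaluation grid are hypothetical and chosen only for illustration; the actual definitions live in scenarios.py and may be organized differently.

```python
import numpy as np


def x1_transform_first(x):
    """First version of the scenario 3 X1 transformation (before the update)."""
    x = np.asarray(x, dtype=float)
    return np.exp(np.cos(5 * x * np.pi))


def x1_transform_paper_like(x):
    """Updated version, closer to the shape shown in the paper."""
    x = np.asarray(x, dtype=float)
    positive_branch = 0.5 * np.tanh(x) + 0.1 * np.sqrt(np.abs(x))
    negative_branch = -np.sqrt(np.abs(2 * x))
    return 0.08 * np.where(x >= 0, positive_branch, negative_branch) + 0.85


# Hypothetical grid, only used to print a few values side by side.
grid = np.linspace(-3, 3, 7)
for value, old, new in zip(grid, x1_transform_first(grid), x1_transform_paper_like(grid)):
    print(f"x={value:+.1f}  first={old:.3f}  paper-like={new:.3f}")
```

The first version oscillates quickly because of the cosine term, while the updated one is monotone on each side of zero.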

Structure

utils.py - contains mean_distance_to_gt_pdp and get_rashomon_set
scenarios.py - contains ground truth functions used for data generation and gt pdp calculation
results/X_Y_Z.ipynb - rendered notebooks; X = automl configuration [A...P], Y = scenario, Z = variance {1, 4, 9}
results/pdps/X/Y_Z/{COL}_{PDP_TYPE}.csv - raw PDP values; X = automl configuration [A...P], Y = scenario, Z = variance {1, 4, 9}

Experiment setting

I had some ideas about what could be better than the Rashomon PDP presented in [1]. None proved significantly better.

Rashomon PDP selects models solely based on their performance. My idea was that diversity in model types might be too useful to discard. For example, tree-based models don't extrapolate well, and they can influence the combined PDP too much. This can be seen in the 3rd quartile of the following plot: the PDP of the best model is closer to the ground truth than the combination of multiple PDPs. In this plot, the best model is also the best base model (no Stacked Ensemble was trained within the time limit).

Here we can also see that using "Best of Family" models doesn't perform well either. So I tried using the intersection of the models in the Rashomon set ($\varepsilon = 0.05$) and the "Best of Family" models.
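The extrapolation effect described above can be illustrated with a small sketch. It assumes that the combined PDP is the pointwise mean of the individual PDPs, which may differ from the exact aggregation in [1]. The two toy models are hypothetical: one is linear, the other is flat outside [-2, 2] to mimic how a tree-based model behaves beyond the range it has learned.

```python
import numpy as np


def partial_dependence(predict, X, col, grid):
    """Average prediction when column `col` is forced to each grid value."""
    X = np.asarray(X, dtype=float)
    pdp = np.empty(len(grid))
    for i, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, col] = value
        pdp[i] = predict(X_mod).mean()
    return pdp


def combine_pdps(pdps):
    """Pointwise mean of several PDPs computed on the same grid (assumed aggregation)."""
    return np.mean(np.vstack(pdps), axis=0)


def linear_model(X):
    return 2.0 * X[:, 0] + X[:, 1]


def clipped_model(X):
    # Flat outside [-2, 2], like a tree that cannot extrapolate.
    return 2.0 * np.clip(X[:, 0], -2, 2) + X[:, 1]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
grid = np.linspace(-4, 4, 9)

pdp_linear = partial_dependence(linear_model, X, 0, grid)
pdp_clipped = partial_dependence(clipped_model, X, 0, grid)
pdp_combined = combine_pdps([pdp_linear, pdp_clipped])
```

In the middle of the grid both PDPs agree, so the mean changes nothing. At the edges the clipped model pulls the combined PDP away from the linear one, which is the kind of distortion a single well-extrapolating model avoids.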

So I tested the following scenarios:

Best Model
Best Basemodel (excludes Stacked Ensemble)
Best of Family - uses best model from each family
Rashomon - uses models from Rashomon set
Rashomon intersected with Best of Family - uses only one model from each type within the Rashomon set
All Models - uses all models
Best Stacked Ensemble - SE already combines models, so it seems logical to me to evaluate it separately
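The selection strategies listed above can be sketched on a toy leaderboard. This is an illustration under several assumptions, not the code in utils.py: the leaderboard values are made up, the `family` column is hypothetical, the Rashomon set is taken to be the models whose loss is within a relative $\varepsilon$ of the best loss, and Stacked Ensembles are left out of every strategy except "Best Model" and "Best Stacked Ensemble".

```python
import pandas as pd


def rashomon_set(lb, metric="rmse", eps=0.05):
    """Models whose loss is within a relative eps of the best loss (assumed definition)."""
    best = lb[metric].min()
    return lb[lb[metric] <= best * (1 + eps)]


def best_of_family(lb, metric="rmse"):
    """Best model from each family present in the leaderboard."""
    return lb.loc[lb.groupby("family")[metric].idxmin()]


# Made-up leaderboard, for illustration only.
leaderboard = pd.DataFrame({
    "model_id": ["SE_1", "GBM_1", "GBM_2", "GLM_1", "DRF_1", "DL_1"],
    "family": ["StackedEnsemble", "GBM", "GBM", "GLM", "DRF", "DeepLearning"],
    "rmse": [1.00, 1.02, 1.04, 1.03, 1.20, 1.50],
})

is_se = leaderboard["family"] == "StackedEnsemble"
base = leaderboard[~is_se]  # assumption: base models only

selections = {
    "Best Model": leaderboard.nsmallest(1, "rmse"),
    "Best Basemodel": base.nsmallest(1, "rmse"),
    "Best of Family": best_of_family(base),
    "Rashomon": rashomon_set(base),
    "Rashomon intersected with Best of Family": best_of_family(rashomon_set(base)),
    "All Models": base,
    "Best Stacked Ensemble": leaderboard[is_se].nsmallest(1, "rmse"),
}

for name, selected in selections.items():
    print(f"{name}: {sorted(selected['model_id'])}")
```

Taking the best model per family inside the Rashomon set gives the same result as intersecting the Rashomon set with "Best of Family", because the best model of a family is in the Rashomon set whenever any model of that family is.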

Knowing the inner workings of H2O-3 AutoML, I ran the AutoML with nfolds=5 which forces Stacked Ensemble to use CV scores to train the metalearner. The default (nfolds=-1) might end up training Stacked Ensemble in blending mode.
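A minimal sketch of such an AutoML run is below. The `run_automl` helper, the file path, and the target name are hypothetical, and the `max_runtime_secs` and `seed` values are placeholders rather than the settings used in these experiments; only `nfolds=5` comes from the text above. The h2o imports sit inside the function so that the configuration can be inspected without a running H2O cluster.

```python
# Only nfolds=5 is taken from the text; the other values are placeholders.
AUTOML_KWARGS = {
    "nfolds": 5,              # force cross-validation so SE metalearners use CV predictions
    "max_runtime_secs": 300,  # placeholder time budget
    "seed": 42,               # placeholder seed
}


def run_automl(train_path, target):
    """Train H2O AutoML on a CSV file and return its leaderboard (hypothetical helper)."""
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    frame = h2o.import_file(train_path)
    aml = H2OAutoML(**AUTOML_KWARGS)
    aml.train(y=target, training_frame=frame)
    return aml.leaderboard


# Example call (requires h2o and a local dataset; path and target are hypothetical):
# leaderboard = run_automl("data/train.csv", "y")
```

A fixed seed helps, but with a time budget the set of trained models still depends on the speed of the machine, so leaderboards can differ between runs.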

Because of these potential extrapolation issues, I decided to look at the full range, the inner two quartiles, and the outer two quartiles to see whether the relative performance of the methods differs significantly between them.
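The region split can be sketched as follows. This is not the mean_distance_to_gt_pdp function from utils.py: the helper names are hypothetical, the distance is assumed to be a mean absolute difference, and the inner region is assumed to be the grid points between the first and third quartile of the observed feature values.

```python
import numpy as np


def region_masks(grid, feature_values):
    """Boolean masks over the PDP grid for the full range, inner half, and outer half."""
    q1, q3 = np.quantile(feature_values, [0.25, 0.75])
    inner = (grid >= q1) & (grid <= q3)
    return {"full": np.ones(len(grid), dtype=bool), "inner": inner, "outer": ~inner}


def mean_abs_distance(pdp, gt_pdp, mask):
    """Mean absolute distance between a PDP and the ground truth PDP on a region."""
    return float(np.mean(np.abs(pdp[mask] - gt_pdp[mask])))


# Toy example: the ground truth is linear, the model is flat outside [-2, 2].
feature_values = np.linspace(-5, 5, 1001)
grid = np.linspace(-5, 5, 11)
gt_pdp = 2.0 * grid
model_pdp = 2.0 * np.clip(grid, -2, 2)

masks = region_masks(grid, feature_values)
distances = {name: mean_abs_distance(model_pdp, gt_pdp, mask) for name, mask in masks.items()}
print(distances)
```

In this toy case the model matches the ground truth in the inner half, and the whole error comes from the outer half, which a single full-range number would blur.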

Results

Due to time constraints (max_runtime_secs), Stacked Ensembles are not always present, so they might look much better than they really are. In my case, Stacked Ensembles failed to train within the time limit in roughly 1/4 of the cases.

Nevertheless, it seems possible to me that Stacked Ensembles could be this good, since they combine models so that they generalize better, in a more sophisticated way than just picking models above some threshold or looking at their type. More evaluation results are in 03_evaluation.ipynb.

UPDATE: I also looked at performance at different noise levels. Generally speaking, it seems to me that Rashomon behaves better when there is less noise. This shows up both when comparing the inner and outer halves of the PDP and when comparing noise levels. In the inner half, models tend to agree, likely because the higher training data density lowers the variance in the PDPs. At noise level == 3, Rashomon is significantly worse than "Rashomon intersected with Best of Family".

Noise level == 1 (error variance = 1)

Noise level == 2 (error variance = 4)

Noise level == 3 (error variance = 9)

References

[1] M. Cavus, J. N. van Rijn, and P. Biecek, "Quantifying Model Uncertainty with AutoML and Rashomon Partial Dependence Profiles: Enabling Trustworthy and Human-centered XAI", Inf Syst Front, Feb. 2026, doi: 10.1007/s10796-026-10698-3. https://doi.org/10.1007/s10796-026-10698-3
