Add the feature contribution argument output as an option at predict #39
Fish-Soup wants to merge 6 commits into StatMixedML:master
Conversation
Thanks for opening the PR and for your interest in the project, very much appreciated! I'd need some time, though, to look into it in detail. May I ask you to also give an example of how to use and interpret it? That would help, thanks!
Hi, I added an example in the examples section. There is a lot more you can use the output for. At a very high level it provides SHAP-like information, but directly from LightGBM's internal calculations. When a distribution_arg is used, we can also use it to get the actual contribution to the final parameter value.
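For anyone wanting to see the mechanism, here is a minimal sketch of the underlying LightGBM behaviour the PR builds on, using plain lightgbm rather than this package's wrapper (the dataset and parameters are illustrative):

```python
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_regression

# Train a plain LightGBM regressor purely to illustrate pred_contrib=True.
X, y = make_regression(n_samples=200, n_features=5, random_state=1)
booster = lgb.train(
    {"objective": "regression", "verbose": -1},
    lgb.Dataset(X, label=y),
    num_boost_round=20,
)

# Shape: (n_samples, n_features + 1); the last column is the base (expected) value.
contribs = booster.predict(X, pred_contrib=True)

# Summing each row recovers the raw model output, SHAP-style.
raw = booster.predict(X, raw_score=True)
assert np.allclose(contribs.sum(axis=1), raw)
```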
I've added a little code to give the pandas columns a level name based on the pred_type argument. This is helpful when doing pandas operations like stack: for example, when pred_type="quantiles", the output columns will have the level name "quantiles", so pred_samples.stack("quantiles") creates a multi-index series. I've also changed the names for the multi-index with pred_type="contributions" from ["distribution_args", "FeatureContributions"] to ["parameters", "feature_contributions"] to align with the pred_type naming convention.
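A toy illustration of why the level name matters (the frame is built by hand here rather than coming from predict, and the quantile labels are made up):

```python
import numpy as np
import pandas as pd

# Columns carry the level name "quantiles", mimicking pred_type="quantiles" output.
pred_samples = pd.DataFrame(
    np.random.rand(3, 2),
    columns=pd.Index(["quant_0.05", "quant_0.95"], name="quantiles"),
)

# Because the level is named, it can be stacked by name into a multi-index Series.
long_form = pred_samples.stack("quantiles")
print(long_form)
```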
Thanks for your changes. I am currently occupied with the Hyper-Tree paper, so please do expect some delay in my review.
Is there anything I can do to help speed it along? The PR is unit tested and essentially just passes arguments to lightgbm.Booster and then reshapes the output.
Add the option to call lightgbm.Booster.predict(..., pred_contrib=True).
This generates an output whose number of columns equals the number of distribution arguments * (number of features + 1).
The output is converted to multi-index columns with two levels, distribution args and feature contributions (plus Constant), as sketched below.
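Roughly, the reshaping works like this (the parameter and feature names below are made up for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical sizes: 2 distribution parameters, 3 features (+1 Constant column each).
n_samples = 4
params = ["loc", "scale"]                      # assumed parameter names
features = ["f0", "f1", "f2", "Constant"]
flat = np.random.rand(n_samples, len(params) * len(features))

# Two-level columns: parameters x feature_contributions.
columns = pd.MultiIndex.from_product(
    [params, features], names=["parameters", "feature_contributions"]
)
contrib_df = pd.DataFrame(flat, columns=columns)

# Per-parameter raw totals: sum over the feature_contributions level.
totals = contrib_df.T.groupby(level="parameters").sum().T
```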
A unit test was added for pred_contributions, checking that summing all contributions and applying the response function gives the same result as predicting the parameters directly (see the sketch below).
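In spirit, the check looks something like this (function and argument names are illustrative, not the actual test code):

```python
import numpy as np

def check_contributions(contrib_df, pred_params, response_fns):
    """For each parameter, response_fn(sum of contributions) should equal
    the directly predicted parameter value."""
    for param, response_fn in response_fns.items():
        raw_total = contrib_df[param].sum(axis=1)  # feature contributions + Constant
        assert np.allclose(response_fn(raw_total), pred_params[param])

# e.g. check_contributions(contrib_df, pred_params, {"loc": lambda x: x, "scale": np.exp})
```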
I also noticed that during prediction, sampling is always applied even when the requested output does not require it, which must make predictions a little slower on larger data sets. I therefore moved the sampling code so it is only called when required (illustrated below).
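The gist of the change, as a hypothetical sketch (a Gaussian stands in for whatever distribution is fitted, and the names are not the actual ones from the diff):

```python
import numpy as np

def predict(pred_type: str, dist_params: np.ndarray, n_draws: int = 100):
    """Hypothetical sketch: draw samples only for outputs that need them."""
    if pred_type in ("samples", "quantiles"):
        rng = np.random.default_rng(0)
        # Gaussian draws from predicted (loc, scale) columns.
        draws = rng.normal(
            dist_params[:, [0]], dist_params[:, [1]], size=(len(dist_params), n_draws)
        )
        if pred_type == "samples":
            return draws
        return np.quantile(draws, [0.05, 0.95], axis=1).T
    # "parameters" and "contributions" never touch the sampler.
    return dist_params
```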