2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: kernelshap
Title: Kernel SHAP
Version: 0.8.1
Version: 0.9.0
Authors@R: c(
person("Michael", "Mayer", , "[email protected]", role = c("aut", "cre"),
comment = c(ORCID = "0009-0007-2540-9629")),
14 changes: 13 additions & 1 deletion NEWS.md
@@ -1,4 +1,16 @@
# kernelshap 0.8.1
# kernelshap 0.9.0

### Bug fix

With input from Mario Wuethrich, and from Ian Covert and his [repo](https://github.com/iancovert/shapley-regression),
we have fixed a bug in how `kernelshap()` calculates Kernel weights.

- The resulting differences are typically very small.
- Models with interactions of order up to two were unaffected.
- Exact Kernel SHAP now provides identical results to exact permutation SHAP.

Fixed in [#168](https://github.com/ModelOriented/kernelshap/pull/168), which also adds
unit tests against Python's "shap" package.
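
To see the agreement in action, here is a minimal sketch (toy data and prediction function, not from the package):

```r
library(kernelshap)

X <- data.frame(x1 = (1:100) / 100, x2 = log(1:100), x3 = sin(1:100))

# Toy prediction function with a three-way interaction (interactions of
# order > 2 are where the old weighting mattered)
pf <- function(model, newdata) newdata[, 1] * newdata[, 2] * newdata[, 3]

ks <- kernelshap(pf, head(X), bg_X = X, pred_fun = pf)  # exact Kernel SHAP
ps <- permshap(pf, head(X), bg_X = X, pred_fun = pf)    # exact permutation SHAP
# ks and ps now agree
```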

### API

30 changes: 9 additions & 21 deletions R/kernelshap.R
@@ -8,17 +8,11 @@
#' Otherwise, an almost exact hybrid algorithm combining exact calculations and
#' iterative paired sampling is used, see Details.
#'
#' Note that (exact) Kernel SHAP is only an approximation of (exact) permutation SHAP.
#' Thus, for up to eight features, we recommend [permshap()]. For more features,
#' [permshap()] tends to be inefficient compared to the optimized hybrid strategy
#' of Kernel SHAP.
#'
#' @details
#' The pure iterative Kernel SHAP sampling as in Covert and Lee (2021) works like this:
#'
#' 1. A binary "on-off" vector \eqn{z} is drawn from \eqn{\{0, 1\}^p}
#' such that its sum follows the SHAP Kernel weight distribution
#' (normalized to the range \eqn{\{1, \dots, p-1\}}).
#' 1. A binary "on-off" vector \eqn{z} is drawn from \eqn{\{0, 1\}^p} according to
#' a special weighting logic.
#' 2. For each \eqn{j} with \eqn{z_j = 1}, the \eqn{j}-th column of the
#' original background data is replaced by the corresponding feature value \eqn{x_j}
#' of the observation to be explained.
@@ -33,17 +27,13 @@
#'
#' This is repeated multiple times until convergence, see CL21 for details.
#'
#' A drawback of this strategy is that many (at least 75%) of the \eqn{z} vectors will
#' have \eqn{\sum z \in \{1, p-1\}}, producing many duplicates. Similarly, at least 92%
#' of the mass will be used for the \eqn{p(p+1)} possible vectors with
#' \eqn{\sum z \in \{1, 2, p-2, p-1\}}.
#' This inefficiency can be fixed by a hybrid strategy, combining exact calculations
#' with sampling.
#' To avoid the re-evaluation of identical coalition vectors, we have implemented
#' a hybrid strategy, combining exact calculations with sampling.
#'
#' The hybrid algorithm has two steps:
#' 1. Step 1 (exact part): There are \eqn{2p} different on-off vectors \eqn{z} with
#' \eqn{\sum z \in \{1, p-1\}}, covering a large proportion of the Kernel SHAP
#' distribution. The degree 1 hybrid will list those vectors and use them according
#' \eqn{\sum z \in \{1, p-1\}}.
#' The degree 1 hybrid will list those vectors and use them according
#' to their weights in the upcoming calculations. Depending on \eqn{p}, we can also go
#' a step further to a degree 2 hybrid by adding all \eqn{p(p-1)} vectors with
#' \eqn{\sum z \in \{2, p-2\}} to the process etc. The necessary predictions are
@@ -96,12 +86,10 @@
#' worse than the hybrid strategy and should therefore only be used for
#' studying properties of the Kernel SHAP algorithm.
#' - `1`: Uses all \eqn{2p} on-off vectors \eqn{z} with \eqn{\sum z \in \{1, p-1\}}
#' for the exact part, which covers at least 75% of the mass of the Kernel weight
#' distribution. The remaining mass is covered by random sampling.
#' for the exact part. The remaining mass is covered by random sampling.
#' - `2`: Uses all \eqn{p(p+1)} on-off vectors \eqn{z} with
#' \eqn{\sum z \in \{1, 2, p-2, p-1\}}. This covers at least 92% of the mass of the
#' Kernel weight distribution. The remaining mass is covered by sampling.
#' Convergence usually happens in the minimal possible number of iterations of two.
#' \eqn{\sum z \in \{1, 2, p-2, p-1\}}. The remaining mass is covered by sampling.
#' Convergence usually happens very fast.
#' - `k>2`: Uses all on-off vectors with
#' \eqn{\sum z \in \{1, \dots, k, p-k, \dots, p-1\}}.
#' @param m Even number of on-off vectors sampled during one iteration.
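
As a sketch of steps 1-3 above (all names hypothetical, not package code), a sampled coalition vector could be evaluated like this:

```r
# Estimate v(z) for one on-off vector z (logical, length p), a 1-row
# observation x, background data bg, and a prediction function pred()
eval_coalition <- function(z, x, bg, pred) {
  masked <- bg
  masked[, z] <- x[rep(1L, nrow(bg)), z, drop = FALSE]  # switch "on" features to x
  mean(pred(masked))  # averaging predictions estimates v(z)
}
```

These v(z) estimates then enter the constrained weighted least-squares step that produces the SHAP values.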
2 changes: 1 addition & 1 deletion R/permshap.R
@@ -32,7 +32,7 @@
#' Furthermore, the 2p on-off vectors with sum(z) in {1, p-1} are evaluated only once,
#' similar to the degree 1 hybrid in [kernelshap()] (but covering less weight).
#'
#' @param exact If `TRUE`, the algorithm will produce exact SHAP values
#' @param exact If `TRUE`, the algorithm produces exact SHAP values
#' with respect to the background data.
#' The default is `TRUE` for up to eight features, and `FALSE` otherwise.
#' @param low_memory If `FALSE` (default up to p = 15), the algorithm evaluates p
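
For illustration, a hypothetical call that forces the exact algorithm (objects `fit`, `X`, and `bg_X` assumed to exist):

```r
# Exact permutation SHAP, even for more than eight features
ps <- permshap(fit, X, bg_X = bg_X, exact = TRUE)
```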
76 changes: 34 additions & 42 deletions R/utils_kernelshap.R
@@ -107,8 +107,7 @@ solver <- function(A, b, constraint) {
# to Kernel SHAP weights -> (m x p) matrix.
# The argument S can be used to restrict the range of sum(z).
sample_Z <- function(p, m, feature_names, S = 1:(p - 1L)) {
# First draw s = sum(z) according to Kernel weights (renormalized to sum 1)
probs <- kernel_weights(p, S = S)
probs <- kernel_weights(p, per_coalition_size = TRUE, S = S)
N <- S[sample.int(length(S), m, replace = TRUE, prob = probs)]

# Then, conditional on that number, set random positions of z to 1
@@ -144,8 +143,8 @@ input_sampling <- function(p, m, deg, feature_names) {
S <- (deg + 1L):(p - deg - 1L)
Z <- sample_Z(p = p, m = m / 2, feature_names = feature_names, S = S)
Z <- rbind(Z, !Z)
w_total <- if (deg == 0L) 1 else 1 - 2 * sum(kernel_weights(p)[seq_len(deg)])
w <- w_total / m
w <- if (deg == 0L) 1 else 1 - prop_exact(p, deg = deg)
w <- w / m
list(Z = Z, w = rep.int(w, m), A = crossprod(Z) * w)
}

@@ -159,33 +158,10 @@
input_exact <- function(p, feature_names) {
Z <- exact_Z(p, feature_names = feature_names)
Z <- Z[2L:(nrow(Z) - 1L), , drop = FALSE]
# Each Kernel weight(j) is divided by the number of vectors z having sum(z) = j
w <- kernel_weights(p) / choose(p, 1:(p - 1L))
list(Z = Z, w = w[rowSums(Z)], A = exact_A(p, feature_names = feature_names))
}

#' Exact Matrix A
#'
#' Internal function that calculates exact A.
#' Notice the difference from the off-diagonals in the Supplement of
#' Covert and Lee (2021). Credits to David Watson for figuring out the correct formula,
#' see our discussions in https://github.com/ModelOriented/kernelshap/issues/22
#'
#' @noRd
#' @keywords internal
#'
#' @param p Number of features.
#' @param feature_names Feature names.
#' @returns A (p x p) matrix.
exact_A <- function(p, feature_names) {
S <- 1:(p - 1L)
c_pr <- S * (S - 1) / p / (p - 1)
off_diag <- sum(kernel_weights(p) * c_pr)
A <- matrix(
data = off_diag, nrow = p, ncol = p, dimnames = list(feature_names, feature_names)
)
diag(A) <- 0.5
A
kw <- kernel_weights(p, per_coalition_size = FALSE) # Kernel weights for all subsets
w <- kw[rowSums(Z)] # Corresponding weight for each row in Z
w <- w / sum(w)
list(Z = Z, w = w, A = crossprod(Z, w * Z))
}
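
As a small worked check of this weighting (a sketch, not package code): for p = 3, all six proper coalitions end up with the same weight:

```r
p <- 3
S <- 1:(p - 1)
probs <- (p - 1) / (choose(p, S) * S * (p - S))  # per-mask Kernel weight by coalition size
sizes <- c(1, 1, 1, 2, 2, 2)                     # sizes of the six proper masks
w <- probs[sizes]
w / sum(w)  # 1/6 for every mask
```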

# List all length p vectors z with sum(z) in {k, p - k}
@@ -227,23 +203,39 @@ input_partly_exact <- function(p, deg, feature_names) {
stop("p must be >=2*deg")
}

kw <- kernel_weights(p)
Z <- w <- vector("list", deg)
kw <- kernel_weights(p, per_coalition_size = FALSE)

Z <- vector("list", deg)
for (k in seq_len(deg)) {
Z[[k]] <- partly_exact_Z(p, k = k, feature_names = feature_names)
n <- nrow(Z[[k]])
w_tot <- kw[k] * (2 - (p == 2L * k))
w[[k]] <- rep.int(w_tot / n, n)
}
w <- unlist(w, recursive = FALSE, use.names = FALSE)
Z <- do.call(rbind, Z)

w <- kw[rowSums(Z)]
w_target <- prop_exact(p, deg = deg) # How much of total weight to spend here
w <- w / sum(w) * w_target
list(Z = Z, w = w, A = crossprod(Z, w * Z))
}

# Kernel weights normalized to a non-empty subset S of {1, ..., p-1}
kernel_weights <- function(p, S = seq_len(p - 1L)) {
probs <- (p - 1L) / (choose(p, S) * S * (p - S))
probs / sum(probs)
# Kernel weight distribution
#
# `per_coalition_size = TRUE` is required, e.g., when one wants to sample random masks
# according to the Kernel SHAP distribution: pick a coalition size as per
# these weights, then randomly place the "on" positions. `per_coalition_size = FALSE`
# gives the weights used when all masks have been enumerated and each mask is
# weighted according to its number of "on" positions.
kernel_weights <- function(p, per_coalition_size, S = seq_len(p - 1L)) {
const <- if (per_coalition_size) 1 else choose(p, S)
probs <- (p - 1) / (const * S * (p - S)) # could drop the numerator
return(probs / sum(probs))
}

# How much Kernel SHAP weight do coalitions of size
# {1, ..., deg} or {p-deg, ..., p-1} have?
prop_exact <- function(p, deg) {
if (deg == 0) {
return(0)
}
w <- kernel_weights(p, per_coalition_size = TRUE)
w_total <- 2 * sum(w[seq_len(deg)]) - w[deg] * (p == 2 * deg)
return(w_total)
}
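
A short usage sketch of the two helpers above (assuming they are in scope):

```r
p <- 10
kernel_weights(p, per_coalition_size = TRUE)   # P(sum(z) = s) for s = 1, ..., p - 1
kernel_weights(p, per_coalition_size = FALSE)  # weight of a single mask of size s
prop_exact(p, deg = 1)  # share of total weight covered exactly by the degree 1 hybrid
```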
23 changes: 10 additions & 13 deletions README.md
@@ -15,20 +15,18 @@

The package contains three functions to crunch SHAP values:

- **`permshap()`**: Permutation SHAP algorithm of [1]. Recommended for models with up to 8 features, or if you don't trust Kernel SHAP. Both exact and sampling versions are available.
- **`kernelshap()`**: Kernel SHAP algorithm of [2] and [3]. Recommended for models with more than 8 features. Both exact and (pseudo-exact) sampling versions are available.
- **`permshap()`**: Permutation SHAP algorithm of [1]. Both exact and sampling versions are available.
- **`kernelshap()`**: Kernel SHAP algorithm of [2] and [3]. Both exact and (pseudo-exact) sampling versions are available.
- **`additive_shap()`**: For *additive models* fitted via `lm()`, `glm()`, `mgcv::gam()`, `mgcv::bam()`, `gam::gam()`, `survival::coxph()`, or `survival::survreg()`. Exponentially faster than the model-agnostic options above, and recommended if possible.

To explain your model, select an explanation dataset `X` (up to 1000 rows from the training data, feature columns only) and apply the recommended function. Use {shapviz} to visualize the resulting SHAP values.
To explain your model, select an explanation dataset `X` (up to 1000 rows from the training data, feature columns only). Use {shapviz} to visualize the resulting SHAP values.

**Remarks on `permshap()` and `kernelshap()`**

- Both algorithms need a representative background dataset `bg_X` to calculate marginal means (up to 500 rows from the training data). In cases with a natural "off" value (like MNIST digits), this can also be a single row with all values set to the off value. If unspecified, 200 rows are randomly sampled from `X`.
- Exact Kernel SHAP is an approximation to exact permutation SHAP. Since exact calculations are usually sufficiently fast for up to eight features, we recommend `permshap()` in this case. With more features, `kernelshap()` switches to a comparably fast, almost exact algorithm with faster convergence than the sampling version of permutation SHAP.
That is why we recommend `kernelshap()` in this case.
- For models with interactions of order up to two, SHAP values of permutation SHAP and Kernel SHAP agree,
and the implemented sampling versions provide the same results as the exact versions.
In the presence of interactions of order three or higher, this is no longer the case.
- Exact Kernel SHAP gives identical results to exact permutation SHAP. Both algorithms are fast up to 8 features.
With more features, `kernelshap()` switches to an almost exact algorithm with faster convergence than the sampling version of permutation SHAP.
- For models with interactions of order up to two, the sampling versions provide the same results as the exact versions.
- For additive models, `permshap()` and `kernelshap()` give the same results as `additive_shap()`,
provided the full training data is used as background data.
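
For example, a background dataset could be selected like this (sketch; `train` stands in for your training data):

```r
# Up to 500 representative rows from the training data
bg_X <- train[sample(nrow(train), 500), ]
```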

@@ -89,13 +87,12 @@ ps
[1,] 1.1913247 0.09005467 -0.13430720 0.000682593
[2,] -0.4931989 -0.11724773 0.09868921 0.028563613

# Kernel SHAP gives very slightly different values because the model contains
# interactions of order > 2:
# Indeed, Kernel SHAP gives the same:
ks <- kernelshap(fit, X, bg_X = bg_X)
ks
# log_carat clarity color cut
# [1,] 1.1911791 0.0900462 -0.13531648 0.001845958
# [2,] -0.4927482 -0.1168517 0.09815062 0.028255442
log_carat clarity color cut
[1,] 1.1913247 0.09005467 -0.13430720 0.000682593
[2,] -0.4931989 -0.11724773 0.09868921 0.028563613

# 4) Analyze with {shapviz}
ps <- shapviz(ps)
63 changes: 63 additions & 0 deletions backlog/compare_with_python2.R
@@ -0,0 +1,63 @@
library(kernelshap)

n <- 100

X <- data.frame(
x1 = (1:n) / 100,
x2 = log(1:n),
x3 = sqrt(1:n),
x4 = sin(1:n),
x5 = ((1:n) / 100)^2,
x6 = cos(1:n)
)
head(X)

# Prediction function independent of a fitted model:
# four-way interaction plus two additive terms
pf <- function(model, newdata) {
x <- newdata
x[, 1] * x[, 2] * x[, 3] * x[, 4] + x[, 5] + x[, 6]
}
ks <- kernelshap(pf, head(X), bg_X = X, pred_fun = pf)
ks # -1.196216 -1.241848 -0.9567848 3.879420 -0.33825 0.5456252
es <- permshap(pf, head(X), bg_X = X, pred_fun = pf)
es # -1.196216 -1.241848 -0.9567848 3.879420 -0.33825 0.5456252

set.seed(10)
kss <- kernelshap(
pf,
head(X, 1),
bg_X = X,
pred_fun = pf,
hybrid_degree = 0,
exact = FALSE,
m = 9000,
max_iter = 100,
tol = 0.0005
)
kss # -1.198078 -1.246508 -0.9580638 3.877532 -0.3241824 0.541247

set.seed(2)
ksh <- kernelshap(
pf,
head(X, 1),
bg_X = X,
pred_fun = pf,
hybrid_degree = 1,
exact = FALSE,
max_iter = 10000,
tol = 0.0005
)
ksh # -1.191981 -1.240656 -0.9516264 3.86776 -0.3342143 0.5426642

set.seed(1)
ksh2 <- kernelshap(
pf,
head(X, 1),
bg_X = X,
pred_fun = pf,
hybrid_degree = 2,
exact = FALSE,
m = 10000,
max_iter = 10000,
tol = 0.0001
)
ksh2 # 1.195976 -1.241107 -0.9565121 3.878891 -0.3384621 0.5451118
21 changes: 12 additions & 9 deletions cran-comments.md
@@ -1,9 +1,13 @@
# kernelshap 0.8.0
# kernelshap 0.9.0

Dear CRAN team
We have found a bug in the weighting logic of Kernel SHAP.

The package (finally!) contains a sampling version of permutation SHAP. In contrast
to other implementations, it iterates until convergence, and standard errors are provided.
This update comes with a fix that has been tested against two other implementations.

I am aware that the last release of {kernelshap} was not long ago, but I would still love to see
this fixed before the (well-deserved) summer break.

Thanks a lot!

## Checks

@@ -18,9 +22,8 @@ R Under development (unstable) (2025-07-05 r88387 ucrt)

### Revdep OK

survex 1.2.0  ── E: 0 | W: 0 | N: 0
XAItest 1.0.1 ── E: 1 | W: 0 | N: 0
SEMdeep 1.0.0 ── E: 1 | W: 1 | N: 0

OK: 3
BROKEN: 0