@@ -118,6 +118,34 @@ is implemented, it is important not to reuse `Utility` objects for different
 datasets. You can read more about :ref:`caching setup` in the installation guide
 and the documentation of the :mod:`pydvl.utils.caching` module.
 
+Using custom scorers
+^^^^^^^^^^^^^^^^^^^^
+
+The `scoring` argument of :class:`~pydvl.utils.utility.Utility` can be used to
+specify a custom :class:`~pydvl.utils.utility.Scorer` object. This is a simple
+wrapper for a callable that takes a model and test data and returns a score.
+
+More importantly, the object provides information about the range of the score,
+which is used by some methods to estimate the number of samples necessary, and
+about what default value to use when the model fails to train.
+
+.. note::
+    The most important property of a `Scorer` is its default value. Because many
+    models will fail to fit on small subsets of the data, it is important to
+    provide a sensible default value for the score.
+
+It is possible to skip the construction of the :class:`~pydvl.utils.utility.Scorer`
+when constructing the `Utility` object. The two following calls are equivalent:
+
+.. code-block:: python
+
+    utility = Utility(
+        model, dataset, "explained_variance", score_range=(-np.inf, 1), default_score=0.0
+    )
+    utility = Utility(
+        model, dataset, Scorer("explained_variance", range=(-np.inf, 1), default=0.0)
+    )
+
 Learning the utility
 ^^^^^^^^^^^^^^^^^^^^
 
@@ -174,7 +202,7 @@ definitions, but other methods are typically preferable.
     values = naive_loo(utility)
 
 The return value of all valuation functions is an object of type
-:class:`~pydvl.value.results.ValuationResult`. This can be iterated over,
+:class:`~pydvl.value.result.ValuationResult`. This can be iterated over,
 indexed with integers, slices and Iterables, as well as converted to a
 `pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_.
 
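As an illustration of working with the converted results, here is a small sketch using a hand-built DataFrame with made-up values in place of real `ValuationResult.to_dataframe()` output (running an actual valuation requires a fitted model and data):

```python
import pandas as pd

# Made-up values standing in for ValuationResult.to_dataframe() output.
df = pd.DataFrame(
    {"value": [0.35, -0.02, 0.11]},
    index=["x_0", "x_1", "x_2"],  # one row per data point
)

# Rank data points from most to least valuable.
ranked = df.sort_values("value", ascending=False)
print(ranked.index.tolist())  # → ['x_0', 'x_2', 'x_1']
```

The same sorting and slicing works on the real output, since it is an ordinary DataFrame.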
@@ -217,11 +245,11 @@ v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}}
     values = compute_shapley_values(utility, mode="combinatorial_exact")
     df = values.to_dataframe(column='value')
 
-We convert the return value to a
+We can convert the return value to a
 `pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_
 and name the column with the results as `value`. Please refer to the
 documentation in :mod:`pydvl.value.shapley` and
-:class:`~pydvl.value.results.ValuationResult` for more information.
+:class:`~pydvl.value.result.ValuationResult` for more information.
 
 Monte Carlo Combinatorial Shapley
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -240,12 +268,19 @@ same pattern:
     model = ...
     data = Dataset(...)
     utility = Utility(model, data)
-    values = compute_shapley_values(utility, mode="combinatorial_montecarlo")
+    values = compute_shapley_values(
+        utility, mode="combinatorial_montecarlo", done=MaxUpdates(1000)
+    )
     df = values.to_dataframe(column='cmc')
 
 The DataFrames returned by most Monte Carlo methods will contain approximate
 standard errors as an additional column, in this case named `cmc_stderr`.
 
+Note the use of the object :class:`~pydvl.value.stopping.MaxUpdates` as the
+stopping condition. This is an instance of
+:class:`~pydvl.value.stopping.StoppingCriterion`. Other examples are
+:class:`~pydvl.value.stopping.MaxTime` and :class:`~pydvl.value.stopping.StandardError`.
+
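To make the mechanism concrete, here is a toy stand-in for a criterion like `MaxUpdates`. This is purely illustrative, not pyDVL's actual implementation:

```python
class ToyMaxUpdates:
    """Illustrative sketch only: stop after a fixed number of updates."""

    def __init__(self, n_updates: int):
        self.n_updates = n_updates
        self._seen = 0

    def __call__(self) -> bool:
        # The valuation loop checks the criterion once per value update and
        # stops as soon as it returns True.
        self._seen += 1
        return self._seen >= self.n_updates


done = ToyMaxUpdates(3)
print([done() for _ in range(3)])  # → [False, False, True]
```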
 
 Owen sampling
 ^^^^^^^^^^^^^
@@ -281,6 +316,10 @@ sampling, and its variant *Antithetic Owen Sampling* in the documentation for the
 function doing the work behind the scenes:
 :func:`~pydvl.value.shapley.montecarlo.owen_sampling_shapley`.
 
+Note that in this case we do not pass a
+:class:`~pydvl.value.stopping.StoppingCriterion` to the function, but instead
+the number of iterations and the maximum number of samples to use in the
+integration.
 
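The integration idea can be sketched in a few lines. This is a toy estimator, not pyDVL's implementation; the parameter names merely mirror the quantities mentioned above, and the utility here is a trivial set function so that the exact answer is known:

```python
import random


def toy_owen_shapley(u, n, max_q=20, n_samples=50, seed=0):
    """Toy Owen-sampling Shapley estimator (illustrative sketch only).

    For each point i, approximates the integral over q in [0, 1] of the
    expected marginal contribution of i, sampling subsets of the other
    points with inclusion probability q on a grid of max_q values.
    """
    rng = random.Random(seed)
    qs = [k / (max_q - 1) for k in range(max_q)]  # grid over [0, 1]
    values = []
    for i in range(n):
        total = 0.0
        for q in qs:
            for _ in range(n_samples):
                s = {j for j in range(n) if j != i and rng.random() < q}
                total += u(s | {i}) - u(s)  # marginal contribution of i
        values.append(total / (max_q * n_samples))
    return values


# With utility = coalition size, every marginal contribution is exactly 1.
print(toy_owen_shapley(len, n=3, max_q=5, n_samples=10))  # → [1.0, 1.0, 1.0]
```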
 Permutation Shapley
 ^^^^^^^^^^^^^^^^^^^
@@ -309,7 +348,7 @@ efficient enough to be useful in some applications.
     data = Dataset(...)
     utility = Utility(model, data)
     values = compute_shapley_values(
-        u=utility, mode="truncated_montecarlo", n_iterations=100
+        u=utility, mode="truncated_montecarlo", done=MaxUpdates(1000)
     )
 
 
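The idea behind truncated permutation sampling can be sketched as follows. This is a toy version, not pyDVL's implementation; the truncation rule (stop scanning a permutation once the utility is within a relative tolerance of the full-set utility) is a simplified stand-in for the real heuristic:

```python
import random


def toy_truncated_shapley(u, n, n_permutations=100, rtol=0.01, seed=0):
    """Toy truncated permutation-sampling Shapley estimator (sketch only)."""
    rng = random.Random(seed)
    full = u(set(range(n)))
    values = [0.0] * n
    for _ in range(n_permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        s, prev = set(), u(set())
        for i in perm:
            # Truncation: once the utility is close to that of the full set,
            # remaining marginal contributions are treated as zero.
            if abs(prev - full) <= rtol * abs(full):
                break
            s.add(i)
            curr = u(s)
            values[i] += curr - prev  # marginal contribution of i
            prev = curr
    return [v / n_permutations for v in values]


# With utility = coalition size, every point gets value exactly 1.
print(toy_truncated_shapley(len, n=3, n_permutations=10))  # → [1.0, 1.0, 1.0]
```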
@@ -358,14 +397,15 @@
 but we don't advocate its use because of the speed and memory cost. Despite
 our best efforts, the number of samples required in practice for convergence
 can be several orders of magnitude worse than with e.g. Truncated Monte Carlo.
+Additionally, the CSP can sometimes turn out to be infeasible.
 
 Usage follows the same pattern as every other Shapley method, but with the
-addition of an ``eps`` parameter required for the solution of the CSP. It should
-be the same value used to compute the minimum number of samples required. This
-can be done with :func:`~pydvl.value.shapley.gt.num_samples_eps_delta`, but note
-that the number returned will be huge! In practice, fewer samples can be enough,
-but the actual number will strongly depend on the utility, in particular its
-variance.
+addition of an ``epsilon`` parameter required for the solution of the CSP. It
+should be the same value used to compute the minimum number of samples required.
+This can be done with :func:`~pydvl.value.shapley.gt.num_samples_eps_delta`, but
+note that the number returned will be huge! In practice, fewer samples can be
+enough, but the actual number will strongly depend on the utility, in particular
+its variance.
 
 .. code-block:: python
 
@@ -459,29 +499,18 @@ Monte Carlo Least Core
 Because the number of subsets $S \subseteq D \setminus \{x_i\}$ is
 $2^{|D|-1}$, one typically must resort to approximations.
 
-The simplest approximation consists of two relaxations of the Least Core
-(:footcite:t:`yan_if_2021`):
-
-- Further relaxing the coalitional rationality property by
-  a constant value $\epsilon > 0$:
-
-  $$
-  \sum_{x_i\in S} v_u(x_i) + e + \epsilon \geq u(S)
-  $$
-
-- Using a fraction of all subsets instead of all possible subsets.
-
-Combined, this gives us the $(\epsilon, \delta)$-*probably approximate
-least core* that satisfies the following property:
+The simplest approximation consists in using a fraction of all subsets for the
+constraints. :footcite:t:`yan_if_2021` show that a quantity of order
+$\mathcal{O}((n - \log \Delta) / \delta^2)$ is enough to obtain a so-called
+$\delta$-*approximate least core* with high probability, i.e. the following
+property holds with probability $1-\Delta$ over the choice of subsets:
 
 $$
-P_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} + \epsilon \geq u(S)\right]
-\geq 1 - \delta
+\mathbb{P}_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} \geq u(S)\right]
+\geq 1 - \delta,
 $$
 
-Where $e^{*}$ is the optimal least core subsidy.
-
-With these relaxations, we obtain a polynomial running time.
+where $e^{*}$ is the optimal least core subsidy.
 
 .. code-block:: python
 
@@ -497,6 +526,28 @@ With these relaxations, we obtain a polynomial running time.
 
 ``n_iterations`` needs to be at least equal to the number of data points.
 
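As a rough illustration of the sample bound above, one can evaluate it numerically. The helper below is hypothetical and takes the constant hidden in the $\mathcal{O}(\cdot)$ to be 1, so it gives only a ballpark figure, not a guarantee:

```python
import math


def approx_least_core_samples(n: int, delta: float, failure: float) -> int:
    """Ballpark number of sampled subsets for a delta-approximate least core.

    Evaluates (n - log(failure)) / delta**2, i.e. the order of growth quoted
    above with the hidden constant taken to be 1 (hypothetical helper, for
    illustration only).
    """
    return math.ceil((n - math.log(failure)) / delta**2)


# 100 data points, accuracy delta = 0.1, failure probability 1%.
print(approx_least_core_samples(100, 0.1, 0.01))  # → 10461
```

This is why, in practice, one simply picks `n_iterations` as large as the budget allows, subject to the minimum of one constraint per data point.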
+Because computing the Least Core values requires the solution of a linear and a
+quadratic problem *after* computing all the utility values, we offer the
+possibility of splitting the latter from the former. This is useful when running
+multiple experiments: use
+:func:`~pydvl.value.least_core.montecarlo.mclc_prepare_problem` to prepare a
+list of problems to solve, then solve them in parallel with
+:func:`~pydvl.value.least_core.common.lc_solve_problems`.
+
+.. code-block:: python
+
+    from pydvl.utils import Dataset, Utility
+    from pydvl.value.least_core import mclc_prepare_problem, lc_solve_problems
+    model = ...
+    dataset = Dataset(...)
+    n_iterations = ...
+    utility = Utility(model, dataset)
+    n_experiments = 10
+    problems = [mclc_prepare_problem(utility, n_iterations=n_iterations)
+                for _ in range(n_experiments)]
+    values = lc_solve_problems(problems)
+
+
 Other methods
 =============
 
@@ -528,7 +579,7 @@ nature of every (non-trivial) ML problem can have an effect:
 
 pyDVL offers a dedicated :func:`function composition
 <pydvl.utils.types.compose_score>` for scorer functions which can be used to
-squash a score. The following is defined in module :mod:`~pydvl.utils.numeric`:
+squash a score. The following is defined in module :mod:`~pydvl.utils.scorer`:
 
 .. code-block:: python
 