Commit 4f18ec6

Merge branch 'feature/scorer' into fix/missing-tests
# Conflicts:
#   CHANGELOG.md
2 parents 2ee2059 + fa87e70 commit 4f18ec6

File tree

8 files changed

+353
-177
lines changed

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -13,6 +13,9 @@
   [PR #252](https://github.com/appliedAI-Initiative/pyDVL/pull/250)
 - Operations on `ValuationResult` and `Status` and cleanup
   [PR #248](https://github.com/appliedAI-Initiative/pyDVL/pull/248)
+- Splitting of problem preparation and solution in Least Core computation.
+  Umbrella function for LC methods.
+  [PR #257](https://github.com/appliedAI-Initiative/pyDVL/pull/257)
 - **Bug fix and minor improvements**: Fixes bug in TMCS with remote Ray cluster,
   raises an error for dummy sequential parallel backend with TMCS, clones model
   inside `Utility` before fitting by default, with flag `clone_before_fit`

docs/30-data-valuation.rst

Lines changed: 30 additions & 19 deletions
@@ -499,29 +499,18 @@ Monte Carlo Least Core
 Because the number of subsets $S \subseteq D \setminus \{x_i\}$ is
 $2^{ | D | - 1 }$, one typically must resort to approximations.
 
-The simplest approximation consists of two relaxations of the Least Core
-(:footcite:t:`yan_if_2021`):
-
-- Further relaxing the coalitional rationality property by
-  a constant value $\epsilon > 0$:
-
-  $$
-  \sum_{x_i\in S} v_u(x_i) + e + \epsilon \geq u(S)
-  $$
-
-- Using a fraction of all subsets instead of all possible subsets.
-
-Combined, this gives us the $(\epsilon, \delta)$-*probably approx-
-imate least core* that satisfies the following property:
+The simplest approximation consists in using a fraction of all subsets for the
+constraints. :footcite:t:`yan_if_2021` show that a quantity of order
+$\mathcal{O}((n - \log \Delta ) / \delta^2)$ is enough to obtain a so-called
+$\delta$-*approximate least core* with high probability. I.e. the following
+property holds with probability $1-\Delta$ over the choice of subsets:
 
 $$
-P_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} + \epsilon \geq u(S)\right]
-\geq 1 - \delta
+\mathbb{P}_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} \geq u(S)\right]
+\geq 1 - \delta,
 $$
 
-Where $e^{*}$ is the optimal least core subsidy.
-
-With these relaxations, we obtain a polynomial running time.
+where $e^{*}$ is the optimal least core subsidy.
 
 .. code-block:: python
 
@@ -537,6 +526,28 @@ With these relaxations, we obtain a polynomial running time.
 
 ``n_iterations`` needs to be at least equal to the number of data points.
 
+Because computing the Least Core values requires the solution of a linear and a
+quadratic problem *after* computing all the utility values, we offer the
+possibility of splitting the latter from the former. This is useful when running
+multiple experiments: use
+:func:`~pydvl.value.least_core.montecarlo.mclc_prepare_problem` to prepare a
+list of problems to solve, then solve them in parallel with
+:func:`~pydvl.value.least_core.common.lc_solve_problems`.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value.least_core import mclc_prepare_problem, lc_solve_problems
+   model = ...
+   dataset = Dataset(...)
+   n_iterations = ...
+   utility = Utility(model=model, data=dataset)
+   n_experiments = 10
+   problems = [mclc_prepare_problem(utility, n_iterations=n_iterations)
+       for _ in range(n_experiments)]
+   values = lc_solve_problems(problems, utility, algorithm="montecarlo_least_core")
+
+
 Other methods
 =============
 
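
Note on the documentation change above (not part of the commit): the revised text gives only the order of the number of subsets, $\mathcal{O}((n - \log \Delta) / \delta^2)$. A minimal sketch of turning that order into a starting value for ``n_iterations`` follows. The helper name ``suggested_n_iterations`` and the constant factor ``c`` are hypothetical; the bound fixes the order, not the constant, so ``c`` has to be chosen by the user. The lower limit of ``n`` reflects the remark that ``n_iterations`` needs to be at least the number of data points.

```python
import math


def suggested_n_iterations(
    n: int, delta: float, big_delta: float, c: float = 1.0
) -> int:
    """Return a subset count of order (n - log(big_delta)) / delta**2.

    n:         number of data points
    delta:     allowed fraction of violated constraints, in (0, 1]
    big_delta: failure probability over the choice of subsets, in (0, 1]
    c:         hypothetical constant factor (an assumption, not from the paper)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < delta <= 1:
        raise ValueError("delta must be in (0, 1]")
    if not 0 < big_delta <= 1:
        raise ValueError("big_delta must be in (0, 1]")
    order = (n - math.log(big_delta)) / delta**2
    # Never go below n, the minimum stated in the documentation.
    return max(n, math.ceil(c * order))
```

Halving ``delta`` roughly quadruples the suggested count, while tightening ``big_delta`` only adds a logarithmic term.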

src/pydvl/value/least_core/__init__.py

Lines changed: 83 additions & 2 deletions
@@ -4,6 +4,87 @@
 This package holds all routines for the computation of Least Core data values.
 
 Please refer to :ref:`data valuation` for an overview.
+
+In addition to the standard interface via
+:func:`~pydvl.value.least_core.compute_least_core_values`, because computing the
+Least Core values requires the solution of a linear and a quadratic problem
+*after* computing all the utility values, there is the possibility of performing
+each step separately. This is useful when running multiple experiments: use
+:func:`~pydvl.value.least_core.naive.lc_prepare_problem` or
+:func:`~pydvl.value.least_core.montecarlo.mclc_prepare_problem` to prepare a
+list of problems to solve, then solve them in parallel with
+:func:`~pydvl.value.least_core.common.lc_solve_problems`.
+
+Note that :func:`~pydvl.value.least_core.montecarlo.mclc_prepare_problem` is
+parallelized itself, so preparing the problems should be done in sequence in this
+case. The solution of the linear systems can then be done in parallel.
+
 """
-from .montecarlo import *
-from .naive import *
+from enum import Enum
+from typing import Optional
+
+from pydvl.utils.utility import Utility
+from pydvl.value.least_core.montecarlo import *
+from pydvl.value.least_core.naive import *
+from pydvl.value.result import ValuationResult
+
+
+class LeastCoreMode(Enum):
+    """Available Least Core algorithms."""
+
+    MonteCarlo = "montecarlo"
+    Exact = "exact"
+
+
+def compute_least_core_values(
+    u: Utility,
+    *,
+    n_jobs: int = 1,
+    n_iterations: Optional[int] = None,
+    mode: LeastCoreMode = LeastCoreMode.MonteCarlo,
+    **kwargs,
+) -> ValuationResult:
+    """Umbrella method to compute Least Core values with any of the available
+    algorithms.
+
+    See :ref:`data valuation` for an overview.
+
+    The following algorithms are available. Note that the exact method can only
+    work with very small datasets and is thus intended only for testing.
+
+    - ``exact``: uses the complete powerset of the training set for the constraints.
+      Implemented in :func:`~pydvl.value.least_core.naive.exact_least_core`.
+    - ``montecarlo``: uses the approximate Monte Carlo Least Core algorithm.
+      Implemented in :func:`~pydvl.value.least_core.montecarlo.montecarlo_least_core`.
+
+    :param u: Utility object with model, data, and scoring function
+    :param n_jobs: Number of jobs to run in parallel. Only used for Monte Carlo
+        Least Core.
+    :param n_iterations: Number of subsets to sample and evaluate the utility on.
+        Only used for Monte Carlo Least Core.
+    :param mode: Algorithm to use. See :class:`LeastCoreMode` for available
+        options.
+    :param kwargs: Additional keyword arguments passed to the solver.
+
+    :return: ValuationResult object with the computed values.
+
+    .. versionadded:: 0.4.1
+    """
+    progress: bool = kwargs.pop("progress", False)
+
+    if mode == LeastCoreMode.MonteCarlo:
+        # TODO fix progress showing and maybe_progress in remote case
+        progress = False
+        if n_iterations is None:
+            raise ValueError("n_iterations cannot be None for Monte Carlo Least Core")
+        return montecarlo_least_core(
+            u=u,
+            n_iterations=n_iterations,
+            n_jobs=n_jobs,
+            progress=progress,
+            options=kwargs,
+        )
+    elif mode == LeastCoreMode.Exact:
+        return exact_least_core(u=u, progress=progress, options=kwargs)
+
+    raise ValueError(f"Invalid value encountered in {mode=}")
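
Note on the umbrella function above (not part of the commit): the dispatch pattern can be tried in isolation with a toy stand-in. In the sketch below, ``Mode``, ``compute``, and the two ``_toy_*`` functions are hypothetical names that mimic only the control flow of ``compute_least_core_values``; they do not call pyDVL.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    """Toy counterpart of LeastCoreMode (hypothetical)."""

    MonteCarlo = "montecarlo"
    Exact = "exact"


def _toy_montecarlo(n_iterations: int) -> str:
    # Stand-in for the sampling-based solver.
    return f"montecarlo with {n_iterations} subsets"


def _toy_exact() -> str:
    # Stand-in for the solver that uses the full powerset.
    return "exact"


def compute(mode: Mode = Mode.MonteCarlo, n_iterations: Optional[int] = None) -> str:
    """Dispatch on the enum member, validating mode-specific arguments."""
    if mode == Mode.MonteCarlo:
        if n_iterations is None:
            raise ValueError("n_iterations cannot be None for Monte Carlo")
        return _toy_montecarlo(n_iterations)
    elif mode == Mode.Exact:
        return _toy_exact()
    # Reached when `mode` is not a member of Mode, e.g. a plain string.
    raise ValueError(f"Invalid value encountered in {mode=}")
```

With a plain ``Enum``, a string such as ``"exact"`` does not compare equal to ``Mode.Exact``, so it falls through to the final ``ValueError``. A caller holding a string can convert it first with ``Mode("exact")``.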

src/pydvl/value/least_core/_common.py renamed to src/pydvl/value/least_core/common.py

Lines changed: 135 additions & 10 deletions
@@ -1,26 +1,156 @@
+import itertools
 import logging
 import warnings
-from typing import Optional, Tuple
+from typing import List, NamedTuple, Optional, Sequence, Tuple
 
 import cvxpy as cp
 import numpy as np
 from numpy.typing import NDArray
 
+from pydvl.utils import MapReduceJob, ParallelConfig, Status, Utility
+from pydvl.value import ValuationResult
+
 __all__ = [
     "_solve_least_core_linear_program",
     "_solve_egalitarian_least_core_quadratic_program",
+    "lc_solve_problem",
+    "lc_solve_problems",
+    "LeastCoreProblem",
 ]
 
 logger = logging.getLogger(__name__)
 
+LeastCoreProblem = NamedTuple(
+    "LeastCoreProblem",
+    [("utility_values", NDArray[np.float_]), ("A_lb", NDArray[np.float_])],
+)
+
+
+def lc_solve_problem(
+    problem: LeastCoreProblem, *, u: Utility, algorithm: str, **options
+) -> ValuationResult:
+    """Solves a linear problem prepared by :func:`mclc_prepare_problem`.
+    Useful for parallel execution of multiple experiments by running this as a
+    remote task.
+
+    See :func:`~pydvl.value.least_core.naive.exact_least_core` or
+    :func:`~pydvl.value.least_core.montecarlo.montecarlo_least_core` for
+    argument descriptions.
+    """
+    if options is None:
+        options = {}
+    n = len(u.data)
+
+    if np.any(np.isnan(problem.utility_values)):
+        warnings.warn(
+            f"Calculation returned "
+            f"{np.sum(np.isnan(problem.utility_values))} NaN "
+            f"values out of {problem.utility_values.size}",
+            RuntimeWarning,
+        )
+
+    logger.debug("Removing possible duplicate values in lower bound array")
+    b_lb = problem.utility_values
+    A_lb, unique_indices = np.unique(problem.A_lb, return_index=True, axis=0)
+    b_lb = b_lb[unique_indices]
+
+    logger.debug("Building equality constraint")
+    A_eq = np.ones((1, n))
+    # We might have already computed the total utility. That's the index of the
+    # row in A_lb with all ones.
+    total_utility_index = np.where(A_lb.sum(axis=1) == n)[0]
+    if len(total_utility_index) == 0:
+        b_eq = np.array([u(u.data.indices)])
+    else:
+        b_eq = b_lb[total_utility_index]
+
+    _, subsidy = _solve_least_core_linear_program(
+        A_eq=A_eq, b_eq=b_eq, A_lb=A_lb, b_lb=b_lb, **options
+    )
+
+    values: Optional[NDArray[np.float_]]
+
+    if subsidy is None:
+        logger.debug("No values were found")
+        status = Status.Failed
+        values = np.empty(n)
+        values[:] = np.nan
+        subsidy = np.nan
+    else:
+        values = _solve_egalitarian_least_core_quadratic_program(
+            subsidy,
+            A_eq=A_eq,
+            b_eq=b_eq,
+            A_lb=A_lb,
+            b_lb=b_lb,
+            **options,
+        )
+
+        if values is None:
+            logger.debug("No values were found")
+            status = Status.Failed
+            values = np.empty(n)
+            values[:] = np.nan
+            subsidy = np.nan
+        else:
+            status = Status.Converged
+
+    return ValuationResult(
+        algorithm=algorithm,
+        status=status,
+        values=values,
+        subsidy=subsidy,
+        stderr=None,
+        data_names=u.data.data_names,
+    )
+
+
+def lc_solve_problems(
+    problems: Sequence[LeastCoreProblem],
+    u: Utility,
+    algorithm: str,
+    config: ParallelConfig = ParallelConfig(),
+    n_jobs: int = 1,
+    **options,
+) -> List[ValuationResult]:
+    """Solves a list of linear problems in parallel.
+
+    :param u: Utility.
+    :param problems: Least Core problems to solve, as returned by
+        :func:`~pydvl.value.least_core.montecarlo.mclc_prepare_problem`.
+    :param algorithm: Name of the valuation algorithm.
+    :param config: Object configuring parallel computation, with cluster
+        address, number of cpus, etc.
+    :param n_jobs: Number of parallel jobs to run.
+    :param options: Additional options to pass to the solver.
+    :return: List of solutions.
+    """
+
+    def _map_func(
+        problems: List[LeastCoreProblem], *args, **kwargs
+    ) -> List[ValuationResult]:
+        return [lc_solve_problem(p, *args, **kwargs) for p in problems]
+
+    map_reduce_job: MapReduceJob[
+        "LeastCoreProblem", "List[ValuationResult]"
+    ] = MapReduceJob(
+        inputs=problems,
+        map_func=_map_func,
+        map_kwargs=dict(u=u, algorithm=algorithm, **options),
+        reduce_func=lambda x: list(itertools.chain(*x)),
+        config=config,
+        n_jobs=n_jobs,
+    )
+    solutions = map_reduce_job()
+
+    return solutions
+
 
 def _solve_least_core_linear_program(
     A_eq: NDArray[np.float_],
     b_eq: NDArray[np.float_],
     A_lb: NDArray[np.float_],
     b_lb: NDArray[np.float_],
-    *,
-    epsilon: float = 0.0,
     **options,
 ) -> Tuple[Optional[NDArray[np.float_]], Optional[float]]:
     """Solves the Least Core's linear program using cvxpy.
@@ -46,7 +176,6 @@ def _solve_least_core_linear_program(
         coefficients of a linear inequality constraint on ``x``.
     :param b_lb: The inequality constraint vector. Each element represents a
         lower bound on the corresponding value of ``A_lb @ x``.
-    :param epsilon: Relaxation value by which the subset utility is decreased.
     :param options: Keyword arguments that will be used to select a solver
         and to configure it. For all possible options, refer to `cvxpy's documentation
         <https://www.cvxpy.org/tutorial/advanced/index.html#setting-solver-options>`_
@@ -57,13 +186,12 @@ def _solve_least_core_linear_program(
 
     x = cp.Variable(n_variables)
     e = cp.Variable()
-    epsilon_parameter = cp.Parameter(name="epsilon", nonneg=True, value=epsilon)
 
     objective = cp.Minimize(e)
     constraints = [
         e >= 0,
         A_eq @ x == b_eq,
-        (A_lb @ x + e * np.ones(len(A_lb))) >= (b_lb - epsilon_parameter),
+        (A_lb @ x + e * np.ones(len(A_lb))) >= b_lb,
     ]
     problem = cp.Problem(objective, constraints)
 

@@ -110,7 +238,6 @@ def _solve_egalitarian_least_core_quadratic_program(
110238
b_eq: NDArray[np.float_],
111239
A_lb: NDArray[np.float_],
112240
b_lb: NDArray[np.float_],
113-
epsilon: float = 0.0,
114241
**options,
115242
) -> Optional[NDArray[np.float_]]:
116243
"""Solves the egalitarian Least Core's quadratic program using cvxopt.
@@ -137,7 +264,6 @@ def _solve_egalitarian_least_core_quadratic_program(
         coefficients of a linear inequality constraint on ``x``.
     :param b_lb: The inequality constraint vector. Each element represents a
         lower bound on the corresponding value of ``A_lb @ x``.
-    :param epsilon: Relaxation value by which the subset utility is decreased.
     :param options: Keyword arguments that will be used to select a solver
         and to configure it. Refer to the following page for all possible options:
         https://www.cvxpy.org/tutorial/advanced/index.html#setting-solver-options
@@ -150,12 +276,11 @@ def _solve_egalitarian_least_core_quadratic_program(
     n_variables = A_eq.shape[1]
 
     x = cp.Variable(n_variables)
-    epsilon_parameter = cp.Parameter(name="epsilon", nonneg=True, value=epsilon)
 
     objective = cp.Minimize(cp.norm2(x))
     constraints = [
         A_eq @ x == b_eq,
-        (A_lb @ x + subsidy * np.ones(len(A_lb))) >= (b_lb - epsilon_parameter),
+        (A_lb @ x + subsidy * np.ones(len(A_lb))) >= b_lb,
     ]
     problem = cp.Problem(objective, constraints)
 