aai-institute
diff --git a/‎CHANGELOG.md‎
Lines changed: 3 additions & 0 deletions b/‎CHANGELOG.md‎
diff --git a/‎docs/30-data-valuation.rst‎
Lines changed: 30 additions & 19 deletions b/‎docs/30-data-valuation.rst‎
diff --git a/‎src/pydvl/value/least_core/__init__.py‎
Lines changed: 83 additions & 2 deletions b/‎src/pydvl/value/least_core/__init__.py‎
diff --git a/‎src/pydvl/value/least_core/_common.py‎ renamed to ‎src/pydvl/value/least_core/common.py‎
Lines changed: 135 additions & 10 deletions b/‎src/pydvl/value/least_core/_common.py‎ renamed to ‎src/pydvl/value/least_core/common.py‎
@@ -13,6 +13,9 @@
   [PR #252](https://github.com/appliedAI-Initiative/pyDVL/pull/250)
 - Operations on `ValuationResult` and `Status` and cleanup
   [PR #248](https://github.com/appliedAI-Initiative/pyDVL/pull/248)
+- Splitting of problem preparation and solution in Least Core computation.
+  Umbrella function for LC methods.
+  [PR #257](https://github.com/appliedAI-Initiative/pyDVL/pull/257)
 - **Bug fix and minor improvements**: Fixes bug in TMCS with remote Ray cluster,
   raises an error for dummy sequential parallel backend with TMCS, clones model
   inside `Utility` before fitting by default, with flag `clone_before_fit` 
 
@@ -499,29 +499,18 @@ Monte Carlo Least Core
 Because the number of subsets $S \subseteq D \setminus \{x_i\}$ is
 $2^{ | D | - 1 }$, one typically must resort to approximations.
 
-The simplest approximation consists of two relaxations of the Least Core
-(:footcite:t:`yan_if_2021`):
-
-- Further relaxing the coalitional rationality property by
-  a constant value $\epsilon > 0$:
-
-  $$
-  \sum_{x_i\in S} v_u(x_i) + e + \epsilon \geq u(S)
-  $$
-
-- Using a fraction of all subsets instead of all possible subsets.
-
-Combined, this gives us the $(\epsilon, \delta)$-*probably approx-
-imate least core* that satisfies the following property:
+The simplest approximation consists of using only a fraction of all subsets for
+the constraints. :footcite:t:`yan_if_2021` show that a number of subsets of order
+$\mathcal{O}((n - \log \Delta ) / \delta^2)$, with $n = |D|$, is enough to obtain
+a so-called $\delta$-*approximate least core* with high probability. That is, the
+following property holds with probability $1-\Delta$ over the choice of subsets:
 
 $$
-P_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} + \epsilon \geq u(S)\right]
-\geq 1 - \delta
+\mathbb{P}_{S\sim D}\left[\sum_{x_i\in S} v_u(x_i) + e^{*} \geq u(S)\right]
+\geq 1 - \delta,
 $$
 
-Where $e^{*}$ is the optimal least core subsidy.
-
-With these relaxations, we obtain a polynomial running time.
+where $e^{*}$ is the optimal least core subsidy.
 
 .. code-block:: python
 
@@ -537,6 +526,28 @@ With these relaxations, we obtain a polynomial running time.
 
    ``n_iterations`` needs to be at least equal to the number of data points.
 
+Because computing the Least Core values requires the solution of a linear and a
+quadratic problem *after* computing all the utility values, we offer the
+possibility of separating the computation of the utilities from the solution of
+the optimization problems. This is useful when running multiple experiments: use
+:func:`~pydvl.value.least_core.montecarlo.mclc_prepare_problem` to prepare a
+list of problems to solve, then solve them in parallel with
+:func:`~pydvl.value.least_core.common.lc_solve_problems`.
+
+.. code-block:: python
+
+   from pydvl.utils import Dataset, Utility
+   from pydvl.value.least_core import mclc_prepare_problem, lc_solve_problems
+   model = ...
+   dataset = Dataset(...)
+   n_iterations = ...
+   utility = Utility(model, dataset)
+   n_experiments = 10
+   # The utilities are evaluated here, once per experiment
+   problems = [mclc_prepare_problem(utility, n_iterations=n_iterations)
+               for _ in range(n_experiments)]
+   # The solver also needs the utility and a name for the algorithm
+   values = lc_solve_problems(
+       problems, u=utility, algorithm="montecarlo_least_core"
+   )
+
+
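A rough feel for the sample size quoted in the hunk above can be had with the sketch below. It only evaluates the expression $(n - \log \Delta) / \delta^2$; the function name `subsets_needed` and the constant `c` are hypothetical, since the bound is stated only up to a constant factor.

```python
import math


def subsets_needed(n: int, delta: float, Delta: float, c: float = 1.0) -> int:
    """Evaluate c * (n - log(Delta)) / delta**2, rounded up.

    `c` is a placeholder: the O(.) bound does not fix the constant.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not (0 < delta < 1 and 0 < Delta < 1):
        raise ValueError("delta and Delta must lie in (0, 1)")
    return math.ceil(c * (n - math.log(Delta)) / delta**2)


# The expression grows linearly in n, while the number of subsets
# in the exact problem doubles with every additional data point.
for n in (100, 200, 400):
    print(n, subsets_needed(n, delta=0.1, Delta=0.05))
```

For very small `n` the expression can exceed the total number of subsets, in which case sampling brings no saving over the exact method.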
 Other methods
 =============
 
 
@@ -4,6 +4,87 @@
 This package holds all routines for the computation of Least Core data values.
 
 Please refer to :ref:`data valuation` for an overview.
+
+In addition to the standard interface via
+:func:`~pydvl.value.least_core.compute_least_core_values`, each step of the
+computation can be performed separately. This is possible because computing the
+Least Core values requires the solution of a linear and a quadratic problem
+*after* computing all the utility values. It is useful when running multiple
+experiments: use :func:`~pydvl.value.least_core.naive.lc_prepare_problem` or
+:func:`~pydvl.value.least_core.montecarlo.mclc_prepare_problem` to prepare a
+list of problems to solve, then solve them in parallel with
+:func:`~pydvl.value.least_core.common.lc_solve_problems`.
+
+Note that :func:`~pydvl.value.least_core.montecarlo.mclc_prepare_problem` is
+parallelized itself, so in this case the problems should be prepared one after
+another. The optimization problems can then be solved in parallel.
+
 """
-from .montecarlo import *
-from .naive import *
+from enum import Enum
+from typing import Optional
+
+from pydvl.utils.utility import Utility
+from pydvl.value.least_core.montecarlo import *
+from pydvl.value.least_core.naive import *
+from pydvl.value.result import ValuationResult
+
+
+class LeastCoreMode(Enum):
+    """Available Least Core algorithms."""
+
+    MonteCarlo = "montecarlo"
+    Exact = "exact"
+
+
+def compute_least_core_values(
+    u: Utility,
+    *,
+    n_jobs: int = 1,
+    n_iterations: Optional[int] = None,
+    mode: LeastCoreMode = LeastCoreMode.MonteCarlo,
+    **kwargs,
+) -> ValuationResult:
+    """Umbrella method to compute Least Core values with any of the available
+    algorithms.
+
+    See :ref:`data valuation` for an overview.
+
+    The following algorithms are available. Note that the exact method can only
+    work with very small datasets and is thus intended only for testing.
+
+    - ``exact``: uses the complete powerset of the training set for the constraints.
+      Implemented in :func:`~pydvl.value.least_core.naive.exact_least_core`.
+    - ``montecarlo``:  uses the approximate Monte Carlo Least Core algorithm.
+      Implemented in :func:`~pydvl.value.least_core.montecarlo.montecarlo_least_core`.
+
+    :param u: Utility object with model, data, and scoring function
+    :param n_jobs: Number of jobs to run in parallel. Only used for Monte Carlo
+        Least Core.
+    :param n_iterations: Number of subsets to sample and evaluate the utility on.
+        Only used for Monte Carlo Least Core.
+    :param mode: Algorithm to use. See :class:`LeastCoreMode` for available
+        options.
+    :param kwargs: Additional keyword arguments passed to the solver.
+
+    :return: ValuationResult object with the computed values.
+
+    .. versionadded:: 0.4.1
+    """
+    progress: bool = kwargs.pop("progress", False)
+
+    if mode == LeastCoreMode.MonteCarlo:
+        # TODO fix progress showing and maybe_progress in remote case
+        progress = False
+        if n_iterations is None:
+            raise ValueError("n_iterations cannot be None for Monte Carlo Least Core")
+        return montecarlo_least_core(
+            u=u,
+            n_iterations=n_iterations,
+            n_jobs=n_jobs,
+            progress=progress,
+            options=kwargs,
+        )
+    elif mode == LeastCoreMode.Exact:
+        return exact_least_core(u=u, progress=progress, options=kwargs)
+
+    raise ValueError(f"Invalid value encountered in {mode=}")
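The dispatch pattern of the umbrella function above can be tried in isolation with the following sketch. `Mode`, `compute` and the two `_fake_*` solvers are hypothetical stand-ins that return strings; they are not pyDVL code and only mirror the control flow of the hunk.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    """Stand-in for LeastCoreMode."""

    MonteCarlo = "montecarlo"
    Exact = "exact"


def _fake_montecarlo(n_iterations: int, **options) -> str:
    # Placeholder for the Monte Carlo solver
    return f"montecarlo({n_iterations})"


def _fake_exact(**options) -> str:
    # Placeholder for the exact solver
    return "exact"


def compute(
    mode: Mode = Mode.MonteCarlo, *, n_iterations: Optional[int] = None, **options
) -> str:
    if mode == Mode.MonteCarlo:
        # Only the sampling-based method needs a number of iterations
        if n_iterations is None:
            raise ValueError("n_iterations cannot be None for Monte Carlo")
        return _fake_montecarlo(n_iterations, **options)
    if mode == Mode.Exact:
        return _fake_exact(**options)
    raise ValueError(f"Invalid value encountered in {mode=}")


# Enum members can be looked up by value, e.g. from a config file
print(compute(Mode("exact")))
print(compute(n_iterations=5))
```

Because the enum values are plain strings, a mode read from configuration can be converted with `Mode(value)`, which raises `ValueError` for unknown names before any solver is called.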
@@ -1,26 +1,156 @@
+import itertools
 import logging
 import warnings
-from typing import Optional, Tuple
+from typing import List, NamedTuple, Optional, Sequence, Tuple
 
 import cvxpy as cp
 import numpy as np
 from numpy.typing import NDArray
 
+from pydvl.utils import MapReduceJob, ParallelConfig, Status, Utility
+from pydvl.value import ValuationResult
+
 __all__ = [
     "_solve_least_core_linear_program",
     "_solve_egalitarian_least_core_quadratic_program",
+    "lc_solve_problem",
+    "lc_solve_problems",
+    "LeastCoreProblem",
 ]
 
 logger = logging.getLogger(__name__)
 
+LeastCoreProblem = NamedTuple(
+    "LeastCoreProblem",
+    [("utility_values", NDArray[np.float_]), ("A_lb", NDArray[np.float_])],
+)
+
+
+def lc_solve_problem(
+    problem: LeastCoreProblem, *, u: Utility, algorithm: str, **options
+) -> ValuationResult:
+    """Solves a linear problem prepared by :func:`mclc_prepare_problem`.
+    Useful for parallel execution of multiple experiments by running this as a
+    remote task.
+
+    See :func:`~pydvl.value.least_core.naive.exact_least_core` or
+    :func:`~pydvl.value.least_core.montecarlo.montecarlo_least_core` for
+    argument descriptions.
+    """
+    if options is None:
+        options = {}
+    n = len(u.data)
+
+    if np.any(np.isnan(problem.utility_values)):
+        warnings.warn(
+            f"Calculation returned "
+            f"{np.sum(np.isnan(problem.utility_values))} NaN "
+            f"values out of {problem.utility_values.size}",
+            RuntimeWarning,
+        )
+
+    logger.debug("Removing possible duplicate values in lower bound array")
+    b_lb = problem.utility_values
+    A_lb, unique_indices = np.unique(problem.A_lb, return_index=True, axis=0)
+    b_lb = b_lb[unique_indices]
+
+    logger.debug("Building equality constraint")
+    A_eq = np.ones((1, n))
+    # We might have already computed the total utility. That's the index of the
+    # row in A_lb with all ones.
+    total_utility_index = np.where(A_lb.sum(axis=1) == n)[0]
+    if len(total_utility_index) == 0:
+        b_eq = np.array([u(u.data.indices)])
+    else:
+        b_eq = b_lb[total_utility_index]
+
+    _, subsidy = _solve_least_core_linear_program(
+        A_eq=A_eq, b_eq=b_eq, A_lb=A_lb, b_lb=b_lb, **options
+    )
+
+    values: Optional[NDArray[np.float_]]
+
+    if subsidy is None:
+        logger.debug("No values were found")
+        status = Status.Failed
+        values = np.empty(n)
+        values[:] = np.nan
+        subsidy = np.nan
+    else:
+        values = _solve_egalitarian_least_core_quadratic_program(
+            subsidy,
+            A_eq=A_eq,
+            b_eq=b_eq,
+            A_lb=A_lb,
+            b_lb=b_lb,
+            **options,
+        )
+
+        if values is None:
+            logger.debug("No values were found")
+            status = Status.Failed
+            values = np.empty(n)
+            values[:] = np.nan
+            subsidy = np.nan
+        else:
+            status = Status.Converged
+
+    return ValuationResult(
+        algorithm=algorithm,
+        status=status,
+        values=values,
+        subsidy=subsidy,
+        stderr=None,
+        data_names=u.data.data_names,
+    )
+
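The deduplication step in `lc_solve_problem` relies on `np.unique` returning, for each unique row, the index of its first occurrence, so that the utility vector can be reordered consistently. A minimal sketch on toy data (the matrix and utilities below are made up for illustration):

```python
import numpy as np

# Each row is the indicator vector of a sampled subset of a 2-point dataset.
A_lb = np.array([[1, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
# Utility of each sampled subset; rows 0 and 2 are the same subset.
b_lb = np.array([0.3, 0.2, 0.3, 1.0])

# Unique rows come back sorted, so b_lb must be re-indexed to stay aligned.
A_unique, unique_indices = np.unique(A_lb, return_index=True, axis=0)
b_unique = b_lb[unique_indices]

# The grand coalition is the row with all ones, i.e. its sum equals n.
n = A_lb.shape[1]
total_utility_index = np.where(A_unique.sum(axis=1) == n)[0]
b_eq = b_unique[total_utility_index]

print(A_unique)
print(b_unique)
print(b_eq)
```

Note that the rows of `A_unique` are in sorted order, not in sampling order, which is why indexing `b_lb` with `unique_indices` is required rather than optional.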
+
+def lc_solve_problems(
+    problems: Sequence[LeastCoreProblem],
+    u: Utility,
+    algorithm: str,
+    config: ParallelConfig = ParallelConfig(),
+    n_jobs: int = 1,
+    **options,
+) -> List[ValuationResult]:
+    """Solves a list of linear problems in parallel.
+
+    :param problems: Least Core problems to solve, as returned by
+        :func:`~pydvl.value.least_core.montecarlo.mclc_prepare_problem`.
+    :param u: Utility.
+    :param algorithm: Name of the valuation algorithm.
+    :param config: Object configuring parallel computation, with cluster
+        address, number of cpus, etc.
+    :param n_jobs: Number of parallel jobs to run.
+    :param options: Additional options to pass to the solver.
+    :return: List of solutions.
+    """
+
+    def _map_func(
+        problems: List[LeastCoreProblem], *args, **kwargs
+    ) -> List[ValuationResult]:
+        return [lc_solve_problem(p, *args, **kwargs) for p in problems]
+
+    map_reduce_job: MapReduceJob[
+        "LeastCoreProblem", "List[ValuationResult]"
+    ] = MapReduceJob(
+        inputs=problems,
+        map_func=_map_func,
+        map_kwargs=dict(u=u, algorithm=algorithm, **options),
+        reduce_func=lambda x: list(itertools.chain(*x)),
+        config=config,
+        n_jobs=n_jobs,
+    )
+    solutions = map_reduce_job()
+
+    return solutions
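The map and reduce steps of `lc_solve_problems` follow a common pattern: split the inputs into batches, solve each batch, and flatten the per-batch lists with `itertools.chain`. The sketch below reproduces that pattern with the standard library only. It is an illustration under assumptions and not how `MapReduceJob` is implemented; `chunk` and `solve_all` are hypothetical names, and a thread pool stands in for the parallel backend.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunk(items: Sequence[T], n_chunks: int) -> List[Sequence[T]]:
    """Split items into at most n_chunks contiguous, nearly equal parts."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, rest = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rest else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def solve_all(
    problems: Sequence[T], solve: Callable[[T], R], n_jobs: int = 1
) -> List[R]:
    def map_func(batch: Sequence[T]) -> List[R]:
        # Each worker returns a list with one solution per problem
        return [solve(p) for p in batch]

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        partial_results = list(pool.map(map_func, chunk(problems, n_jobs)))
    # Reduce step: flatten the list of lists, preserving input order
    return list(itertools.chain(*partial_results))


print(solve_all(list(range(7)), lambda p: p * p, n_jobs=3))
```

Since `pool.map` yields results in submission order and the batches are contiguous, the flattened list matches the order of the input problems.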
+
 
 def _solve_least_core_linear_program(
     A_eq: NDArray[np.float_],
     b_eq: NDArray[np.float_],
     A_lb: NDArray[np.float_],
     b_lb: NDArray[np.float_],
-    *,
-    epsilon: float = 0.0,
     **options,
 ) -> Tuple[Optional[NDArray[np.float_]], Optional[float]]:
     """Solves the Least Core's linear program using cvxopt.
@@ -46,7 +176,6 @@ def _solve_least_core_linear_program(
         coefficients of a linear inequality constraint on ``x``.
     :param b_lb: The inequality constraint vector. Each element represents a
         lower bound on the corresponding value of ``A_lb @ x``.
-    :param epsilon: Relaxation value by which the subset utility is decreased.
     :param options: Keyword arguments that will be used to select a solver
         and to configure it. For all possible options, refer to `cvxpy's documentation
         <https://www.cvxpy.org/tutorial/advanced/index.html#setting-solver-options>`_
@@ -57,13 +186,12 @@ def _solve_least_core_linear_program(
 
     x = cp.Variable(n_variables)
     e = cp.Variable()
-    epsilon_parameter = cp.Parameter(name="epsilon", nonneg=True, value=epsilon)
 
     objective = cp.Minimize(e)
     constraints = [
         e >= 0,
         A_eq @ x == b_eq,
-        (A_lb @ x + e * np.ones(len(A_lb))) >= (b_lb - epsilon_parameter),
+        (A_lb @ x + e * np.ones(len(A_lb))) >= b_lb,
     ]
     problem = cp.Problem(objective, constraints)
 
@@ -110,7 +238,6 @@ def _solve_egalitarian_least_core_quadratic_program(
     b_eq: NDArray[np.float_],
     A_lb: NDArray[np.float_],
     b_lb: NDArray[np.float_],
-    epsilon: float = 0.0,
     **options,
 ) -> Optional[NDArray[np.float_]]:
     """Solves the egalitarian Least Core's quadratic program using cvxopt.
@@ -137,7 +264,6 @@ def _solve_egalitarian_least_core_quadratic_program(
         coefficients of a linear inequality constraint on ``x``.
     :param b_lb: The inequality constraint vector. Each element represents a
         lower bound on the corresponding value of ``A_lb @ x``.
-    :param epsilon: Relaxation value by which the subset utility is decreased.
     :param options: Keyword arguments that will be used to select a solver
         and to configure it. Refer to the following page for all possible options:
         https://www.cvxpy.org/tutorial/advanced/index.html#setting-solver-options
@@ -150,12 +276,11 @@ def _solve_egalitarian_least_core_quadratic_program(
     n_variables = A_eq.shape[1]
 
     x = cp.Variable(n_variables)
-    epsilon_parameter = cp.Parameter(name="epsilon", nonneg=True, value=epsilon)
 
     objective = cp.Minimize(cp.norm2(x))
     constraints = [
         A_eq @ x == b_eq,
-        (A_lb @ x + subsidy * np.ones(len(A_lb))) >= (b_lb - epsilon_parameter),
+        (A_lb @ x + subsidy * np.ones(len(A_lb))) >= b_lb,
     ]
     problem = cp.Problem(objective, constraints)
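With the `epsilon` relaxation removed, both programs share the constraints `A_eq @ x == b_eq` and `A_lb @ x + e >= b_lb`. The sketch below checks these constraints for a candidate solution and computes the smallest subsidy for a *fixed* allocation. It does not solve the linear program (which optimizes `x` and `e` jointly), and the helper names `min_subsidy` and `is_feasible` are hypothetical.

```python
from typing import Sequence


def _dot(row: Sequence[float], x: Sequence[float]) -> float:
    return sum(a * v for a, v in zip(row, x))


def min_subsidy(
    x: Sequence[float], A_lb: Sequence[Sequence[float]], b_lb: Sequence[float]
) -> float:
    """Smallest e >= 0 such that A_lb @ x + e >= b_lb, for a fixed x."""
    deficits = [b - _dot(row, x) for row, b in zip(A_lb, b_lb)]
    return max(0.0, max(deficits, default=0.0))


def is_feasible(
    x: Sequence[float],
    e: float,
    A_lb: Sequence[Sequence[float]],
    b_lb: Sequence[float],
    total_utility: float,
    tol: float = 1e-9,
) -> bool:
    # Equality constraint: the values sum to the utility of the full dataset
    efficient = abs(sum(x) - total_utility) <= tol
    # Inequality constraints: every subset receives at least its utility minus e
    rational = all(_dot(row, x) + e >= b - tol for row, b in zip(A_lb, b_lb))
    return e >= 0 and efficient and rational


# Toy game with two points: u({1}) = u({2}) = 0.6 and u({1, 2}) = 1.0
A_lb = [[1, 0], [0, 1], [1, 1]]
b_lb = [0.6, 0.6, 1.0]
x = [0.5, 0.5]

e = min_subsidy(x, A_lb, b_lb)
print(e, is_feasible(x, e, A_lb, b_lb, total_utility=1.0))
print(is_feasible(x, 0.0, A_lb, b_lb, total_utility=1.0))
```

In this toy game the core is empty: adding the two singleton constraints gives `1 + 2e >= 1.2`, so no allocation is feasible with a subsidy below 0.1.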