Commit d40e8e2
Merge pull request #549 from aai-institute/feature/simplify-parallel-backend-config
Simplify parallel backend config
2 parents 03fd179 + 61f69c1

File tree

30 files changed: +789 −336 lines
CHANGELOG.md

Lines changed: 2 additions & 0 deletions

@@ -36,6 +36,8 @@
 - Documentation improvements and cleanup
   [PR #521](https://github.com/aai-institute/pyDVL/pull/521),
   [PR #522](https://github.com/aai-institute/pyDVL/pull/522)
+- Simplified parallel backend configuration
+  [PR #549](https://github.com/aai-institute/pyDVL/pull/549)
 
 ## 0.8.1 - 🆕 🏗 New method and notebook, Games with exact shapley values, bug fixes and cleanup

docs/getting-started/advanced-usage.md

Lines changed: 83 additions & 7 deletions
@@ -16,7 +16,7 @@ keep in mind when using pyDVL namely Parallelization and Caching.
 pyDVL uses parallelization to scale and speed up computations. It does so
 using one of Dask, Ray or Joblib. The first is used in
 the [influence][pydvl.influence] package whereas the other two
-are used in the [value][pydvl.value] package. 
+are used in the [value][pydvl.value] package.
 
 ### Data valuation
 
@@ -37,6 +37,33 @@ and to provide a running cluster (or run ray in local mode).
 if the re-training only happens on a subset of the data. This means that you
 should make sure that each worker has enough memory to handle the whole dataset.
 
+We use backend classes for both joblib and ray as well as two types
+of executors for the different algorithms: the first uses a map-reduce pattern as seen in
+the [MapReduceJob][pydvl.parallel.map_reduce.MapReduceJob] class
+and the second implements the futures executor interface from [concurrent.futures][].
+
+As a convenience, you can also instantiate a parallel backend class
+by using the [init_parallel_backend][pydvl.parallel.init_parallel_backend]
+function:
+
+```python
+from pydvl.parallel import init_parallel_backend
+parallel_backend = init_parallel_backend(backend_name="joblib")
+```
+
+!!! info
+
+    The executor classes are not meant to be instantiated and used by users
+    of pyDVL. They are used internally as part of the computations of the
+    different methods.
+
+!!! danger "Deprecation notice"
+
+    We are currently planning to deprecate
+    [MapReduceJob][pydvl.parallel.map_reduce.MapReduceJob] in favour of the
+    futures executor interface because it allows for more diverse computation
+    patterns with interruptions.
+
 #### Joblib
 
 Please follow the instructions in Joblib's documentation
@@ -48,19 +75,24 @@ to compute exact shapley values you would use:
 
 ```python
 import joblib
-from pydvl.parallel import ParallelConfig
+from pydvl.parallel import JoblibParallelBackend
 from pydvl.value.shapley import combinatorial_exact_shapley
 from pydvl.utils.utility import Utility
 
-config = ParallelConfig(backend="joblib")
+parallel_backend = JoblibParallelBackend()
 u = Utility(...)
 
 with joblib.parallel_config(backend="loky", verbose=100):
-    combinatorial_exact_shapley(u, config=config)
+    values = combinatorial_exact_shapley(u, parallel_backend=parallel_backend)
 ```
 
 #### Ray
 
+!!! warning "Additional dependencies"
+
+    The Ray parallel backend requires optional dependencies.
+    See [Extras][installation-extras] for more information.
+
 Please follow the instructions in Ray's documentation to
 [set up a remote cluster](https://docs.ray.io/en/latest/cluster/key-concepts.html).
 You could alternatively use a local cluster and in that case you don't have to set
@@ -90,14 +122,58 @@ To use the ray parallel backend to compute exact shapley values you would use:
 
 ```python
 import ray
-from pydvl.parallel import ParallelConfig
+from pydvl.parallel import RayParallelBackend
 from pydvl.value.shapley import combinatorial_exact_shapley
 from pydvl.utils.utility import Utility
 
 ray.init()
-config = ParallelConfig(backend="ray")
+parallel_backend = RayParallelBackend()
 u = Utility(...)
-combinatorial_exact_shapley(u, config=config)
+values = combinatorial_exact_shapley(u, parallel_backend=parallel_backend)
+```
+
+#### Futures executor
+
+For the futures executor interface, we have implemented an executor
+class for ray in [RayExecutor][pydvl.parallel.futures.ray.RayExecutor]
+and rely on joblib's loky [get_reusable_executor][loky.get_reusable_executor]
+function to instantiate an executor for local parallelization.
+
+They are both compatible with the builtin
+[ThreadPoolExecutor][concurrent.futures.ThreadPoolExecutor]
+and [ProcessPoolExecutor][concurrent.futures.ProcessPoolExecutor]
+classes.
+
+```pycon
+>>> from joblib.externals.loky import _ReusablePoolExecutor
+>>> from pydvl.parallel import JoblibParallelBackend
+>>> parallel_backend = JoblibParallelBackend()
+>>> with parallel_backend.executor() as executor:
+...     results = list(executor.map(lambda x: x + 1, range(3)))
+...
+>>> results
+[1, 2, 3]
+```
+
+#### Map-reduce
+
+The map-reduce interface is older and more limited in the patterns
+it allows us to use.
+
+To reproduce the previous example using
+[MapReduceJob][pydvl.parallel.map_reduce.MapReduceJob], we would use:
+
+```pycon
+>>> from pydvl.parallel import JoblibParallelBackend, MapReduceJob
+>>> parallel_backend = JoblibParallelBackend()
+>>> map_reduce_job = MapReduceJob(
+...     list(range(3)),
+...     map_func=lambda x: x[0] + 1,
+...     parallel_backend=parallel_backend,
+... )
+>>> results = map_reduce_job()
+>>> results
+[1, 2, 3]
 ```
 
 ### Influence functions
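Since pyDVL's backend executors follow the standard [concurrent.futures][] interface, the submit/map pattern shown in the diff above can be tried without pyDVL installed. A minimal stdlib-only sketch using `ThreadPoolExecutor` (any compliant executor behaves the same way):

```python
from concurrent.futures import ThreadPoolExecutor

# Any Executor implementation supports the same submit()/map() pattern
# used by the backend executors described in the docs above.
with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() returns a Future whose result is retrieved lazily
    future = executor.submit(lambda x: x + 1, 1)
    assert future.result() == 2
    # map() applies the callable to each item, preserving input order
    results = list(executor.map(lambda x: x + 1, range(3)))

print(results)  # [1, 2, 3]
```

Swapping in a pyDVL backend executor (or `ProcessPoolExecutor`) requires no change to the calling code, which is the point of targeting the `concurrent.futures` interface.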

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -108,6 +108,7 @@ plugins:
 - https://pytorch.org/docs/stable/objects.inv
 - https://pymemcache.readthedocs.io/en/latest/objects.inv
 - https://joblib.readthedocs.io/en/stable/objects.inv
+- https://loky.readthedocs.io/en/stable/objects.inv
 - https://docs.dask.org/en/latest/objects.inv
 - https://distributed.dask.org/en/latest/objects.inv
 - https://docs.ray.io/en/latest/objects.inv

src/pydvl/parallel/__init__.py

Lines changed: 27 additions & 16 deletions
@@ -1,37 +1,48 @@
 """
 This module provides a common interface to parallelization backends. The list of
-supported backends is [here][pydvl.parallel.backends]. Backends can be
-selected with the `backend` argument of an instance of
-[ParallelConfig][pydvl.utils.config.ParallelConfig], as seen in the examples
-below.
+supported backends is [here][pydvl.parallel.backends]. Backends should be
+instantiated directly and passed to the respective valuation method.
 
-We use [executors][concurrent.futures.Executor] to submit tasks in parallel. The
-basic high-level pattern is
+We use executors that implement the [Executor][concurrent.futures.Executor]
+interface to submit tasks in parallel.
+The basic high-level pattern is:
 
 ```python
-from pydvl.parallel import init_executor, ParallelConfig
+from pydvl.parallel import JoblibParallelBackend
 
-config = ParallelConfig(backend="ray")
-with init_executor(max_workers=1, config=config) as executor:
+parallel_backend = JoblibParallelBackend()
+with parallel_backend.executor(max_workers=2) as executor:
     future = executor.submit(lambda x: x + 1, 1)
     result = future.result()
     assert result == 2
 ```
 
-Running a map-reduce job is also easy:
+Running a map-style job is also easy:
 
 ```python
-from pydvl.parallel import init_executor, ParallelConfig
+from pydvl.parallel import JoblibParallelBackend
 
-config = ParallelConfig(backend="joblib")
-with init_executor(config=config) as executor:
+parallel_backend = JoblibParallelBackend()
+with parallel_backend.executor(max_workers=2) as executor:
     results = list(executor.map(lambda x: x + 1, range(5)))
     assert results == [1, 2, 3, 4, 5]
 ```
-
+!!! tip "Passing large objects"
+    When running tasks which accept heavy inputs, it is important
+    to first use `put()` on the object and use the returned reference
+    as argument to the callable within `submit()`. For example:
+    ```python
+    u_ref = parallel_backend.put(u)
+    ...
+    executor.submit(task, utility=u_ref)
+    ```
+    Note that `task()` does not need to be changed in any way:
+    the backend will `get()` the object and pass it to the function
+    upon invocation.
 There is an alternative map-reduce implementation
 [MapReduceJob][pydvl.parallel.map_reduce.MapReduceJob] which internally
-uses joblib's higher level API with `Parallel()`
+uses joblib's higher level API with `Parallel()` which then indirectly also
+supports the use of Dask and Ray.
 """
 # HACK to avoid circular imports
 from ..utils.types import *  # pylint: disable=wrong-import-order

@@ -41,5 +52,5 @@
 from .futures import *
 from .map_reduce import *
 
-if len(BaseParallelBackend.BACKENDS) == 0:
+if len(ParallelBackend.BACKENDS) == 0:
     raise ImportError("No parallel backend found. Please install ray or joblib.")
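The `put()` tip in the docstring above can be illustrated without pyDVL. The following toy sketch shows the idea of handing a task a reference that the backend resolves before invocation; `SimpleBackend` and `Ref` are hypothetical stand-ins written for this example, not pyDVL classes:

```python
from concurrent.futures import ThreadPoolExecutor


class Ref:
    """Opaque handle returned by put(), standing in for a shared-storage reference."""
    def __init__(self, obj):
        self._obj = obj


class SimpleBackend:
    def put(self, obj):
        # A real backend would place obj in shared storage once,
        # instead of serializing it for every submitted task.
        return Ref(obj)

    @staticmethod
    def get(maybe_ref):
        # Resolve a reference back to the object; pass plain values through.
        return maybe_ref._obj if isinstance(maybe_ref, Ref) else maybe_ref


backend = SimpleBackend()


def task(utility):
    # The task body is unchanged: it receives the already-resolved object.
    return len(utility)


data_ref = backend.put([1, 2, 3, 4])
with ThreadPoolExecutor(max_workers=2) as executor:
    # The wrapper resolves the reference before calling the task.
    future = executor.submit(lambda r: task(backend.get(r)), data_ref)
    print(future.result())  # 4
```

The saving comes from `put()` happening once, while each `submit()` ships only the cheap reference.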
