Commit 8e0d851

Chahan Kropf (chahank), emanuel-schmid, and peanutfun authored
Update multiprocessing in unsequa (#763)
* Update pandas iteritems to items
* Change pd append to concat
* Fix minimal chunksize at 1
* Remove unneeded parallel pool chunksize argument
* Remove global matplotlib styles
* Remove deprecated tot_value from unsequa
* Add chunked version of parallel computing
* Remove deprecated numpy vstack of objects
* Feature/order samples unsequa (#766)
* Allow to set loglevel
* Add method to sort samples
* Add advanced examples for unsequa
* Remove logging control
* Remove unnecessary output prints
* Update CHANGELOG.md

---------

Co-authored-by: Chahan Kropf <[email protected]>
Co-authored-by: emanuel-schmid <[email protected]>
Co-authored-by: Lukas Riedel <[email protected]>
1 parent ef410e1 commit 8e0d851
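The pandas 2.0 renames listed in the commit message amount to the following (a minimal old-vs-new sketch, not code from this commit):

    import pandas as pd

    s = pd.Series([1, 2])
    df = pd.DataFrame({"a": [1]})

    # pandas 2.0 removed Series.iteritems(); use items() instead
    pairs = list(s.items())                        # was: list(s.iteritems())

    # pandas 2.0 removed DataFrame.append(); use pd.concat() instead
    df2 = pd.concat([df, df], ignore_index=True)   # was: df.append(df)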

File tree

8 files changed: +2152 -1678 lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -30,6 +30,7 @@ Removed:
 - Added method `Exposures.centroids_total_value` to replace the functionality of `Exposures.affected_total_value`. This method is temporary and deprecated. [#702](https://github.com/CLIMADA-project/climada_python/pull/702)
 - New method `climada.util.api_client.Client.purge_cache`: utility function to remove outdated files from the local file system to free disk space. ([#737](https://github.com/CLIMADA-project/climada_python/pull/737))
+- Added advanced examples in the unsequa tutorial for coupled input variables and for efficiently loading multiple large files [#766](https://github.com/CLIMADA-project/climada_python/pull/766)

 ### Changed

@@ -57,6 +58,7 @@ Removed:
 - `list_dataset_infos` from `climada.util.api_client.Client`: the `properties` argument, a `dict`, can now have `None` as values. Before, only strings and lists of strings were allowed. Setting a particular property to `None` triggers a search for datasets where this property is not assigned. [#752](https://github.com/CLIMADA-project/climada_python/pull/752)
 - Reduce memory requirements of `TropCyclone.from_tracks` [#749](https://github.com/CLIMADA-project/climada_python/pull/749)
 - Support for different wind speed and pressure units in `TCTracks` when running `TropCyclone.from_tracks` [#749](https://github.com/CLIMADA-project/climada_python/pull/749)
+- Changed the parallel package from Pathos to Multiprocess in the unsequa module [#763](https://github.com/CLIMADA-project/climada_python/pull/763)

 ### Fixed

@@ -65,6 +67,9 @@ Removed:
 - Correctly handle assertion errors in `Centroids.values_from_vector_files` and fix the associated test [#768](https://github.com/CLIMADA-project/climada_python/pull/768/)
 - Text in `Forecast` class plots can now be adjusted [#769](https://github.com/CLIMADA-project/climada_python/issues/769)
 - `Impact.impact_at_reg` now supports impact matrices where all entries are zero [#773](https://github.com/CLIMADA-project/climada_python/pull/773)
+- Upgrade pathos 0.3.0 -> 0.3.1 (issue [#761](https://github.com/CLIMADA-project/climada_python/issues/761), for the unsequa module [#763](https://github.com/CLIMADA-project/climada_python/pull/763))
+- Fix bugs with pandas 2.0 (`iteritems` -> `items`, `append` -> `concat`) (fixes issue [#700](https://github.com/CLIMADA-project/climada_python/issues/700) for the unsequa module) [#763](https://github.com/CLIMADA-project/climada_python/pull/763)
+- Remove matplotlib styles in the unsequa module (fixes issue [#758](https://github.com/CLIMADA-project/climada_python/issues/758)) [#763](https://github.com/CLIMADA-project/climada_python/pull/763)

 ### Deprecated

@@ -77,6 +82,7 @@ Removed:
 - `Centroids.set_raster_from_pix_bounds` [#721](https://github.com/CLIMADA-project/climada_python/pull/721)
 - `requirements/env_developer.yml` environment specs. Use 'extra' requirements when installing the Python package instead [#712](https://github.com/CLIMADA-project/climada_python/pull/712)
 - `Impact.tag` attribute. This change is not backwards-compatible with respect to the files written and read by the `Impact` class [#743](https://github.com/CLIMADA-project/climada_python/pull/743)
+- `impact.tot_value` attribute removed from the unsequa module [#763](https://github.com/CLIMADA-project/climada_python/pull/763)

 ## v3.3.2

climada/engine/unsequa/calc_base.py

Lines changed: 166 additions & 4 deletions
@@ -21,6 +21,7 @@

 import logging
 import copy
+import itertools

 import datetime as dt

@@ -47,6 +48,31 @@ class Calc():
         Names of the required uncertainty variables.
     _metric_names : tuple(str)
         Names of the output metrics.
+
+    Notes
+    -----
+    Parallelization logic: to compute the uncertainty, users may specify
+    a number N of processes on which to perform the computations in
+    parallel. Since the computation for each individual sample of the
+    input parameters is independent of the others, we implemented a
+    simple distribution over the processes:
+
+    1. The samples are divided into N equal sub-sample chunks.
+    2. Each chunk of samples is sent as a whole to one process.
+
+    Hence, this is equivalent to the user running the computation N
+    times, once for each sub-sample. Note that for each process, all
+    the input variables must be copied once, and hence each parallel
+    process requires roughly the same amount of memory as if a single
+    process were used.
+
+    This approach differs from the usual parallelization strategy
+    (where individual samples are distributed), because each sample
+    requires the entire input data. With this method, copying data
+    between processes is reduced to a minimum.
+
+    Parallelization is currently not available for the sensitivity
+    computation, as the current implementation of the SALib library
+    requires all samples simultaneously.
     """

     _input_var_names = ()
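A minimal sketch of this chunked strategy (hypothetical `model_run` worker and toy data, not the actual unsequa implementation):

    import numpy as np
    import pandas as pd
    import multiprocess as mp  # the pool package this PR switches to

    def model_run(chunk):
        # Placeholder for evaluating the model on every sample in a chunk
        return [row["a"] + row["b"] for _, row in chunk.iterrows()]

    if __name__ == "__main__":
        samples = pd.DataFrame(np.random.rand(100, 2), columns=["a", "b"])
        processes = 4
        # one chunk per process, as described in the Notes above
        chunksize = int(np.ceil(samples.shape[0] / processes))
        chunks = [
            samples.iloc[pos:pos + chunksize]
            for pos in range(0, len(samples), chunksize)
        ]
        with mp.Pool(processes) as pool:
            results_per_chunk = pool.map(model_run, chunks)
        # flatten one result list per chunk into one flat list of results
        results = [res for chunk_res in results_per_chunk for res in chunk_res]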
@@ -126,7 +152,7 @@ def distr_dict(self):
         distr_dict.update(input_var.distr_dict)
         return distr_dict

-    def est_comp_time(self, n_samples, time_one_run, pool=None):
+    def est_comp_time(self, n_samples, time_one_run, processes=None):
         """
         Estimate the computation time
@@ -154,8 +180,7 @@ def est_comp_time(self, n_samples, time_one_run, pool=None):
             "\n If computation cannot be reduced, consider using"
             " a surrogate model https://www.uqlab.com/", time_one_run)

-        ncpus = pool.ncpus if pool else 1
-        total_time = n_samples * time_one_run / ncpus
+        total_time = n_samples * time_one_run / processes
         LOGGER.info("\n\nEstimated computation time: %s\n",
                     dt.timedelta(seconds=total_time))

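The estimate is plain arithmetic; for example (values made up):

    # 1024 samples at 0.1 s per run, spread over 4 processes:
    # total_time = 1024 * 0.1 / 4 = 25.6 s
    import datetime as dt
    print(dt.timedelta(seconds=1024 * 0.1 / 4))  # 0:00:25.600000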
@@ -354,11 +379,118 @@ def sensitivity(self, unc_output, sensitivity_method = 'sobol',

         return sens_output

+def _multiprocess_chunksize(samples_df, processes):
+    """Divide the samples into chunks for multiprocess computing
+
+    The goal is to send each process an equal number of samples to
+    process. This makes the parallel processing analogous to running
+    the uncertainty assessment independently on each process for a
+    subset of the samples, instead of distributing individual samples
+    to the processes dynamically. Hence, all the heavy input variables
+    are copied/sent to each process only once.
+
+    Parameters
+    ----------
+    samples_df : pd.DataFrame
+        samples dataframe
+    processes : int
+        number of processes
+
+    Returns
+    -------
+    int
+        the number of samples in each chunk
+    """
+    return np.ceil(
+        samples_df.shape[0] / processes
+    ).astype(int)
+
+def _transpose_chunked_data(metrics):
+    """Transpose the output metrics lists from one list per
+    chunk of samples to one list per output metric
+
+    [ [x1, [y1, z1]], [x2, [y2, z2]] ] ->
+    [ [x1, x2], [[y1, z1], [y2, z2]] ]
+
+    Parameters
+    ----------
+    metrics : list
+        list of lists as returned by the uncertainty mappings
+
+    Returns
+    -------
+    list
+        list of climada output uncertainty
+
+    See Also
+    --------
+    calc_impact._map_impact_calc
+        map for impact uncertainty
+    calc_cost_benefits._map_costben_calc
+        map for cost benefit uncertainty
+    """
+    return [
+        list(itertools.chain.from_iterable(x))
+        for x in zip(*metrics)
+    ]
+
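A concrete (made-up) illustration of this transposition, with two chunks that each return two metric lists:

    import itertools

    chunk_results = [
        [[1, 2], [[10, 20], [30, 40]]],  # metrics from chunk 1
        [[3],    [[50, 60]]],            # metrics from chunk 2
    ]
    # zip(*chunk_results) pairs the chunks metric-wise:
    # ([1, 2], [3]) and ([[10, 20], [30, 40]], [[50, 60]])
    transposed = [
        list(itertools.chain.from_iterable(x))
        for x in zip(*chunk_results)
    ]
    # -> [[1, 2, 3], [[10, 20], [30, 40], [50, 60]]]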
+def _sample_parallel_iterator(samples, chunksize, **kwargs):
+    """
+    Make an iterator over chunks of samples
+    with repeated kwargs for each chunk.
+
+    Parameters
+    ----------
+    samples : pd.DataFrame
+        Dataframe of samples
+    chunksize : int
+        Number of samples per chunk
+    **kwargs : arguments to repeat
+        Arguments to repeat for parallel computations
+
+    Returns
+    -------
+    iterator
+        suitable for methods _map_impact_calc and _map_costben_calc
+    """
+    def _chunker(df, size):
+        """
+        Divide the dataframe into chunks of `size` rows
+        """
+        for pos in range(0, len(df), size):
+            yield df.iloc[pos:pos + size]
+
+    return zip(
+        _chunker(samples, chunksize),
+        *(itertools.repeat(item) for item in kwargs.values())
+    )
+

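A hedged usage sketch, assuming `_sample_parallel_iterator` as defined above (the `exp`/`haz` placeholders stand in for the heavy input variables repeated for every chunk):

    import itertools
    import pandas as pd

    samples = pd.DataFrame({"x": range(10)})
    iterator = _sample_parallel_iterator(samples, chunksize=4, exp="exp", haz="haz")
    # Each element is (chunk, "exp", "haz"); the chunks have 4, 4 and 2 rows.
    # With a pool this would be pool.starmap(worker, iterator);
    # itertools.starmap is the sequential equivalent:
    sizes = list(itertools.starmap(lambda chunk, exp, haz: len(chunk), iterator))
    # -> [4, 4, 2]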
 def _calc_sens_df(method, problem_sa, sensitivity_kwargs, param_labels, X, unc_df):
+    """Compute the sensitivity indices
+
+    Parameters
+    ----------
+    method : str
+        SALib sensitivity method name
+    problem_sa : dict
+        dictionary for the sensitivity method for SALib
+    sensitivity_kwargs : kwargs
+        passed on to SALib.method.analyze
+    param_labels : list(str)
+        list of names of the uncertainty input parameters
+    X : numpy.ndarray
+        array of input parameter samples
+    unc_df : DataFrame
+        Dataframe containing the uncertainty values
+
+    Returns
+    -------
+    DataFrame
+        Values of the sensitivity indices
+    """
     sens_first_order_dict = {}
     sens_second_order_dict = {}
-    for (submetric_name, metric_unc) in unc_df.iteritems():
+    for (submetric_name, metric_unc) in unc_df.items():
         Y = metric_unc.to_numpy()
         if X is not None:
             sens_indices = method.analyze(problem_sa, X, Y,
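For context, a generic SALib sketch of the analyze-call pattern used here (standalone toy example with the Sobol method, not unsequa code):

    import numpy as np
    from SALib.analyze import sobol
    from SALib.sample import saltelli

    problem = {
        "num_vars": 2,
        "names": ["x1", "x2"],
        "bounds": [[0.0, 1.0], [0.0, 1.0]],
    }
    X = saltelli.sample(problem, 256)   # input parameter samples
    Y = X[:, 0] + 2.0 * X[:, 1]         # toy model output
    sens_indices = sobol.analyze(problem, Y)
    print(sens_indices["S1"])           # first order index per parameter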
@@ -404,6 +536,21 @@ def _calc_sens_df(method, problem_sa, sensitivity_kwargs, param_labels, X, unc_df):


 def _si_param_first(param_labels, sens_indices):
+    """Extract the first order sensitivity indices from SALib output
+
+    Parameters
+    ----------
+    param_labels : list(str)
+        names of the uncertainty input parameters
+    sens_indices : dict
+        sensitivity indices dictionary as produced by SALib
+
+    Returns
+    -------
+    si_names_first_order, param_names_first_order : list, list
+        Names of the first order sensitivity indices for all input
+        parameters, and parameter names for each sensitivity index
+    """
     n_params = len(param_labels)

     si_name_first_order_list = [
@@ -421,6 +568,21 @@ def _si_param_first(param_labels, sens_indices):


 def _si_param_second(param_labels, sens_indices):
+    """Extract the second order sensitivity indices
+
+    Parameters
+    ----------
+    param_labels : list(str)
+        names of the uncertainty input parameters
+    sens_indices : dict
+        sensitivity indices dictionary as produced by SALib
+
+    Returns
+    -------
+    si_names_second_order, param_names_second_order, param_names_second_order_2 : list, list, list
+        Names of the second order sensitivity indices for all input
+        parameters, and pairs of parameter names for each 2nd order
+        sensitivity index
+    """
     n_params = len(param_labels)
     si_name_second_order_list = [
         key
