
Commit 593ebf0

peanutfun, emanuel-schmid, tovogt, timschmi95, and Schmid Timo authored
Add classes for calibrating average ensemble and ensemble of tragedies (#1048)
* Initial draft for calibration from scipy.optimize
* Draft for impact function calibration
* Add first unit tests of calibration module
* ci: Add bayesian-optimization during Jenkins build
* Add __init__.py for util/calibrate/test module
* Add climada.util.calibrate.test module to test discovery
* Add unit and integration tests, update code base
* Start documenting new calibrate module
* Actually add the integration test
* Add some documentation
* commit PLEASE CLEAN UP
* Add more docstrings and simplify imports through __init__
* Add separate Output classes for each optimizer
* Restructure calibration module
* Add tutorial on impact function calibration
* Update tutorial
* Remove hazard event selection from calibrate.Input
* Update calibration tutorial
* Update climada/util/calibrate/bayesian_optimizer.py: use negative cost function as target function in BayesianOptimizer (Co-authored-by: Thomas Vogt <[email protected]>)
* Separate computing cost from transforming impact objects
* Add evaluator for calibration output
* Add TestBayesianOptimizer test to test loader
* Update code, docs, and tutorial
* Update tutorial
* Add option to adjust data frame alignment
* add seaborn
* Add function to plot Impf variability of calibration (#791) (Co-authored-by: Schmid Timo <[email protected]>, Lukas Riedel <[email protected]>)
* Improve alignment and handling of NaNs
* Add seaborn to dependencies
* Split tests into multiple files, finish up
* Move impact transform and align to Input
* Use MultiIndex in parameter space dataframe
* Update tutorial
* Fix requirements for calibration module: seaborn and bayes_opt
* Remove plot_impf_set function and improve exception type
* Add tests for OutputEvaluator
* Fix name of bayes_opt package on PyPI
* Make sure latest seaborn is installed on Jenkins
* Remove unused function definition
* Fix linter issues and remove unused code
* Add BayesianOptimizerOutputEvaluator
* Fix typo in tutorial
* Add GNU license header to new files
* Update CHANGELOG.md
* edit authors.md
* Fix a bug in parameter space plot
* Add ensemble calibrators
* Fix issue with constraints and allow_duplicate_points
* Fix bug in to_input_var
* Increment random seed and update output handling
* Make EnsembleOptimizer run in parallel
* Add docstring to EnsembleOptimizer.run
* Add suggestions from code review
* Rename 'true' parameter to 'data' in 'target_func' method.
* Remove setting random_state in run function during tests.
* Add explicit haz_type to tests to avoid warnings.
* Update cross-calibration module
* Add option to read and write EnsembleOptimizerOutput from/to HDF5.
* Add plotting functions for EnsembleOptimizerOutput.
* Streamline EnsembleOptimizer with generator methods.
* Add unit tests for cross calibration (many tests still missing)
* Avoid errors when storing event names in cross calibration output
* Add iteration controller for BayesianOptimizer
* Fix calibrate module init and add first controller tests
* Deepcopy optimizer run kwargs and add hazard intensity histogram to shiny plot
* Fix handling of instance maximum for constrained optimization
* Add tests for BayesianOptimizerController and fix verbosity
* Add test for plotting parameter space
* Update integration tests for BayesianOptimizer: include controller
* Add explanation of BayesianOptimizerController to tutorial
* add preliminary plot_category function
* Add option to reduce tragedy ensemble size
* Add option to store and load calibration results
* Update plots and tutorial
* Fix annotations, docstrings, some linter issues
* Add tests for cross_calibrate.py
* Add test execution to test_cross_calibrate.py
* Update plots, tests, docstrings, docs
* Rename cross_calibrate.py to ensemble.py
* Improve plots, add 'replace' parameter to AverageEnsembleOptimizer
* Add ensemble calibration to tutorial
* Remove custom changes to Jenkins script
* Beautify plot
* Update calibration tutorial
* Use absolute imports for referring out of this module
* Update docs and links in tutorial
* Add `data_weights` to calibration Input
* Use weights for sampling with replacement in AverageEnsembleOptimizer.
* Update tests
* Add equality comparison to EnsembleOptimizerOutput
* Update CHANGELOG.md
* Update CHANGELOG.md
* Comply to abstractmethod interface in BayesianOptimizer
* Fix linter issue where builtin 'input' was shadowed
* Suggest that cost functions take ndarrays
* Update tutorial
* Make cost functions consume numpy arrays
* Update docs
* Update tutorial
* remove useless tqdm artefacts
* Update climada/util/calibrate/ensemble.py (Co-authored-by: Emanuel Schmid <[email protected]>)

---------

Co-authored-by: emanuel-schmid <[email protected]>
Co-authored-by: Thomas Vogt <[email protected]>
Co-authored-by: Timo Schmid <[email protected]>
Co-authored-by: Schmid Timo <[email protected]>
Co-authored-by: Emanuel Schmid <[email protected]>
1 parent a9f9e6a commit 593ebf0

File tree

13 files changed: +2710 -453 lines

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -30,6 +30,7 @@ Removed:
 - Added instructions to install Climada petals on Euler cluster in `doc.guide.Guide_Euler.ipynb` [#1029](https://github.com/CLIMADA-project/climada_python/pull/1029)
 - Added util methods to handle crs coordinates consistently: `is_geo_coords`, `check_if_geo_coords`, `get_crs_unit`, `estimate_matching_threshold`, `degree_to_km`, and `km_to_degree` [#1080](https://github.com/CLIMADA-project/climada_python/pull/1080)
 - `ImpactFunc` and `ImpactFuncSet` now support equality comparisons via `==` [#1027](https://github.com/CLIMADA-project/climada_python/pull/1027)
+- Calibration of impact function ensembles in `climada.util.calibrate` [#1048](https://github.com/CLIMADA-project/climada_python/pull/1048)
 - Added optional `attrs` parameter to `Exposures.from_raster` method to set additional object properties through the method's `Exposures.__init__` call.

 ### Changed
@@ -45,6 +46,7 @@ geographic coordinates as input (e.g. `util.coordinates.dist_to_coast`, `util.co
 - World Bank indicator data is now downloaded directly from their API via the function `download_world_bank_indicator`, instead of relying on the `pandas-datareader` package [#1033](https://github.com/CLIMADA-project/climada_python/pull/1033)
 - `Exposures.write_hdf5` pickles geometry data in WKB format, which is faster and more sustainable. [#1051](https://github.com/CLIMADA-project/climada_python/pull/1051)
 - The online documentation has been completely overhauled, now uses PyData theme: [#977](https://github.com/CLIMADA-project/climada_python/pull/977)
+- `Input` to impact function calibration tasks now supports adding weights to the data [#1048](https://github.com/CLIMADA-project/climada_python/pull/1048)
 - Add `climada.hazard.xarray` module with helper structures for reading Hazard objects from `xarray` data [#1063](https://github.com/CLIMADA-project/climada_python/pull/1063)
 - The output of the `impact_yearset` was changed to only contain attributes corresponding to the yearly impact set. The application of the correction factor and the frequency of the resulting yearly impact object are corrected. [#1075](https://github.com/CLIMADA-project/climada_python/pull/1075)
 - `util.coordinates.get_resolution` always returns positive values, regardless of how the input coordinates' order [#1080](https://github.com/CLIMADA-project/climada_python/pull/1080).

climada/test/test_util_calibrate.py

Lines changed: 3 additions & 3 deletions
@@ -176,7 +176,7 @@ def test_single(self):
             init_points=10, n_iter=20, max_iterations=1
         )
         optimizer = BayesianOptimizer(self.input, random_state=1)
-        output = optimizer.run(controller)
+        output = optimizer.run(controller=controller)

         # Check result (low accuracy)
         self.assertAlmostEqual(output.params["slope"], 1.0, places=2)
@@ -210,7 +210,7 @@ def test_multiple_constrained(self):
         controller = BayesianOptimizerController.from_input(
             self.input, sampling_base=5, max_iterations=3
         )
-        output = optimizer.run(controller)
+        output = optimizer.run(controller=controller)

         # Check results (low accuracy)
         self.assertEqual(output.p_space.dim, 2)
@@ -246,7 +246,7 @@ def test_plots(self):
         controller = BayesianOptimizerController.from_input(
             self.input, max_iterations=1
         )
-        output = optimizer.run(controller)
+        output = optimizer.run(controller=controller)

         output_eval = OutputEvaluator(self.input, output)
         output_eval.impf_set.plot()

climada/util/calibrate/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -26,4 +26,10 @@
     BayesianOptimizerOutputEvaluator,
     select_best,
 )
+from .cost_func import mse, msle
+from .ensemble import (
+    AverageEnsembleOptimizer,
+    EnsembleOptimizerOutput,
+    TragedyEnsembleOptimizer,
+)
 from .scipy_optimizer import ScipyMinimizeOptimizer
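
For orientation, the `__init__.py` additions above re-export the new cost functions and ensemble optimizers from the package root; a minimal sketch, assuming a CLIMADA installation that includes this commit:

```python
# Names newly importable from climada.util.calibrate after this commit.
from climada.util.calibrate import (
    AverageEnsembleOptimizer,
    EnsembleOptimizerOutput,
    TragedyEnsembleOptimizer,
    mse,
    msle,
)
```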

climada/util/calibrate/base.py

Lines changed: 64 additions & 13 deletions
@@ -47,9 +47,9 @@ class Input:
         Hazard object to compute impacts from
     exposure : climada.Exposures
         Exposures object to compute impacts from
-    data : pandas.Dataframe
+    data : pandas.DataFrame
         The data to compare computed impacts to. Index: Event IDs matching the IDs of
-        ``hazard``. Columns: Arbitrary columns. NaN values in the data frame have
+        :py:attr:`hazard`. Columns: Arbitrary columns. NaN values in the data frame have
         special meaning: Corresponding impact values computed by the model are ignored
         in the calibration.
     impact_func_creator : Callable
@@ -64,8 +64,11 @@ class Input:
     cost_func : Callable
         Function that takes two ``pandas.Dataframe`` objects and returns the scalar
         "cost" between them. The optimization algorithm will try to minimize this
-        number. The first argument is the true/correct values (:py:attr:`data`), and the
-        second argument is the estimated/predicted values.
+        number. The first argument is the true/correct values (:py:attr:`data`), the
+        second argument is the estimated/predicted values, and the third argument is the
+        :py:attr:`data_weights`. The cost function is intended to operate on
+        ``numpy.ndarray`` objects.
+        Dataframes are transformed using :py:attr:`df_to_numpy`.
     bounds : Mapping (str, {Bounds, tuple(float, float)}), optional
         The bounds for the parameters. Keys: parameter names. Values:
         ``scipy.minimize.Bounds`` instance or tuple of minimum and maximum value.
@@ -85,6 +88,16 @@ class Input:
         :py:attr:`data`, insert this value. Defaults to NaN, in which case the impact
         from the model is ignored. Set this to zero to explicitly calibrate to zero
         impacts in these cases.
+    df_to_numpy : Callable, optional
+        A function that transforms a pandas.DataFrame into a numpy.ndarray to be
+        inserted into the :py:attr:`cost_func`. By default, this will flatten the data
+        frame.
+    data_weights : pandas.DataFrame, optional
+        Weights for each entry in :py:attr:`data`. Must have the exact same index and
+        columns. If ``None``, the weights will be ignored (equivalent to the same weight
+        for each event).
+    missing_weights_value : float, optional
+        Same as :py:attr:`missing_data_value`, but for :py:attr:`data_weights`.
     assign_centroids : bool, optional
         If ``True`` (default), assign the hazard centroids to the exposure when this
         object is created.
@@ -95,14 +108,19 @@ class Input:
     data: pd.DataFrame
     impact_func_creator: Callable[..., ImpactFuncSet]
     impact_to_dataframe: Callable[[Impact], pd.DataFrame]
-    cost_func: Callable[[pd.DataFrame, pd.DataFrame], Number]
+    cost_func: Callable[[np.ndarray, np.ndarray, np.ndarray | None], Number]
     bounds: Optional[Mapping[str, Union[Bounds, Tuple[Number, Number]]]] = None
     constraints: Optional[Union[ConstraintType, list[ConstraintType]]] = None
     impact_calc_kwds: Mapping[str, Any] = field(
         default_factory=lambda: {"assign_centroids": False}
     )
     missing_data_value: float = np.nan
-    assign_centroids: InitVar[bool] = True
+    df_to_numpy: Callable[[pd.DataFrame], np.ndarray] = (
+        lambda df: df.to_numpy().flatten()
+    )
+    data_weights: pd.DataFrame | None = field(default=None, kw_only=True)
+    missing_weights_value: float = field(default=0.0, kw_only=True)
+    assign_centroids: InitVar[bool] = field(default=True, kw_only=True)

     def __post_init__(self, assign_centroids):
         """Prepare input data"""
@@ -115,6 +133,17 @@ def __post_init__(self, assign_centroids):
             )
             raise TypeError("'data' must be a pandas.DataFrame")

+        if self.data_weights is not None:
+            try:
+                pd.testing.assert_index_equal(self.data.index, self.data_weights.index)
+                pd.testing.assert_index_equal(
+                    self.data.columns, self.data_weights.columns
+                )
+            except AssertionError as err:
+                raise ValueError(
+                    "'data_weights' must have exact same index and columns as 'data'"
+                ) from err
+
         if assign_centroids:
             self.exposure.assign_centroids(self.hazard)

@@ -413,26 +442,30 @@ class Optimizer(ABC):

     input: Input

-    def _target_func(self, data: pd.DataFrame, predicted: pd.DataFrame) -> Number:
+    def _target_func(
+        self, data: np.ndarray, predicted: np.ndarray, weights: np.ndarray | None
+    ) -> Number:
         """Target function for the optimizer

         The default version of this function simply returns the value of the cost
         function evaluated on the arguments.

         Parameters
         ----------
-        data : pandas.DataFrame
+        data : nd.ndarray
             The reference data used for calibration. By default, this is
             :py:attr:`Input.data`.
-        predicted : pandas.DataFrame
+        predicted : nd.ndarray
             The impact predicted by the data calibration after it has been transformed
             into a dataframe by :py:attr:`Input.impact_to_dataframe`.
+        weights : nd.ndarray
+            The relative weight for each data/entry pair.

         Returns
         -------
         The value of the target function for the optimizer.
         """
-        return self.input.cost_func(data, predicted)
+        return self.input.cost_func(data, predicted, weights)

     def _kwargs_to_impact_func_creator(self, *_, **kwargs) -> Dict[str, Any]:
         """Define how the parameters to :py:meth:`_opt_func` must be transformed
@@ -484,11 +517,29 @@ def _opt_func(self, *args, **kwargs) -> Number:
             hazard=self.input.hazard,
         ).impact(**self.input.impact_calc_kwds)

-        # Transform to DataFrame, align, and compute target function
+        # Transform to DataFrame and align
         data_aligned, impact_df_aligned = self.input.impact_to_aligned_df(
-            impact, fillna=0
+            impact, fillna=0.0
+        )
+
+        # Align weights
+        weights_aligned = None
+        if self.input.data_weights is not None:
+            weights_aligned, _ = self.input.data_weights.align(
+                data_aligned,
+                axis=None,
+                join="right",
+                copy=True,
+                fill_value=self.input.missing_weights_value,
+            )
+            weights_aligned = self.input.df_to_numpy(weights_aligned)
+
+        # Compute target function
+        return self._target_func(
+            self.input.df_to_numpy(data_aligned),
+            self.input.df_to_numpy(impact_df_aligned),
+            weights_aligned,
         )
-        return self._target_func(data_aligned, impact_df_aligned)

     @abstractmethod
     def run(self, **opt_kwargs) -> Output:
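
The `Input` changes above establish a weighted, array-based cost-function contract: `cost_func` now receives three `numpy.ndarray` arguments (data, predicted, weights or `None`), data frames are converted with `df_to_numpy` (flattening by default), and `data_weights` must share index and columns with `data`. Below is a hedged sketch of how such an `Input` might be set up; `my_hazard`, `my_exposure`, `impf_from_params`, and `impact_to_df` are hypothetical placeholders, not part of this commit, and `Input` is assumed to be re-exported from `climada.util.calibrate`.

```python
# Hedged sketch only: `my_hazard`, `my_exposure`, `impf_from_params`, and
# `impact_to_df` are hypothetical placeholders the user must supply.
import pandas as pd

from climada.util.calibrate import Input, mse  # `Input` assumed to be re-exported

obs = pd.DataFrame({"impact": [1.0e6, 2.5e6]}, index=[1, 2])  # observed impacts per event ID
weights = pd.DataFrame({"impact": [1.0, 2.0]}, index=[1, 2])  # same index/columns as `obs`

calib_input = Input(
    hazard=my_hazard,                      # climada.Hazard with matching event IDs
    exposure=my_exposure,                  # climada.entity.Exposures
    data=obs,
    impact_func_creator=impf_from_params,  # Callable(**params) -> ImpactFuncSet
    impact_to_dataframe=impact_to_df,      # Callable(Impact) -> pd.DataFrame
    cost_func=mse,                         # called as mse(data, predicted, weights)
    data_weights=weights,                  # keyword-only; mismatched index raises ValueError
)
```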

climada/util/calibrate/bayesian_optimizer.py

Lines changed: 18 additions & 6 deletions
@@ -616,11 +616,13 @@ def __post_init__(self, random_state, allow_duplicate_points, bayes_opt_kwds):
             **bayes_opt_kwds,
         )

-    def _target_func(self, data: pd.DataFrame, predicted: pd.DataFrame) -> Number:
+    def _target_func(
+        self, data: np.ndarray, predicted: np.ndarray, weights: np.ndarray | None
+    ) -> Number:
         """Invert the cost function because BayesianOptimization maximizes the target"""
-        return -self.input.cost_func(data, predicted)
+        return -self.input.cost_func(data, predicted, weights)

-    def run(self, controller: BayesianOptimizerController) -> BayesianOptimizerOutput:
+    def run(self, **opt_kwargs) -> BayesianOptimizerOutput:
         """Execute the optimization

         ``BayesianOptimization`` *maximizes* a target function. Therefore, this class
@@ -631,15 +633,25 @@ def run(self, controller: BayesianOptimizerController) -> BayesianOptimizerOutpu
         ----------
         controller : BayesianOptimizerController
             The controller instance used to set the optimization iteration parameters.
-        opt_kwargs
-            Further keyword arguments passed to ``BayesianOptimization.maximize``.
+        kwargs
+            Further keyword arguments passed to ``BayesianOptimization.maximize``. Note
+            that some arguments are also provided by
+            :py:meth:`BayesianOptimizerController.optimizer_params`.

         Returns
         -------
         output : BayesianOptimizerOutput
             Optimization output. :py:attr:`BayesianOptimizerOutput.p_space` stores data
             on the sampled parameter space.
         """
+        # Take the controller
+        try:
+            controller = opt_kwargs.pop("controller")
+        except KeyError as err:
+            raise RuntimeError(
+                "BayesianOptimizer.run requires 'controller' as keyword argument"
+            ) from err
+
         # Register the controller
         for event in (Events.OPTIMIZATION_STEP, Events.OPTIMIZATION_END):
             self.optimizer.subscribe(event, controller)
@@ -660,7 +672,7 @@ def run(self, controller: BayesianOptimizerController) -> BayesianOptimizerOutpu
         while controller.iterations < controller.max_iterations:
             try:
                 LOGGER.info(f"Optimization iteration: {controller.iterations}")
-                self.optimizer.maximize(**controller.optimizer_params())
+                self.optimizer.maximize(**controller.optimizer_params(), **opt_kwargs)
             except StopEarly:
                 # Start a new iteration
                 continue
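
With this change, `BayesianOptimizer.run` accepts only keyword arguments, pops `controller` from them (raising `RuntimeError` if it is missing), and forwards the remaining keywords to `BayesianOptimization.maximize`. A sketch of the updated call pattern, mirroring the adjusted integration tests; `calib_input` refers to the hypothetical `Input` from the earlier sketch:

```python
# Sketch of the new keyword-only call pattern; `calib_input` is hypothetical.
from climada.util.calibrate import BayesianOptimizer, BayesianOptimizerController

controller = BayesianOptimizerController.from_input(calib_input, max_iterations=1)
optimizer = BayesianOptimizer(calib_input, random_state=1)

output = optimizer.run(controller=controller)  # positional run(controller) no longer works
# calling optimizer.run() without `controller` raises RuntimeError
print(output.params)
```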

climada/util/calibrate/cost_func.py

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+"""
+This file is part of CLIMADA.
+
+Copyright (C) 2017 ETH Zurich, CLIMADA contributors listed in AUTHORS.
+
+CLIMADA is free software: you can redistribute it and/or modify it under the
+terms of the GNU General Public License as published by the Free
+Software Foundation, version 3.
+
+CLIMADA is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License along
+with CLIMADA. If not, see <https://www.gnu.org/licenses/>.
+
+---
+Cost functions for impact function calibration module
+"""
+
+import numpy as np
+from sklearn.metrics import mean_squared_error, mean_squared_log_error
+
+
+def mse(data: np.ndarray, predicted: np.ndarray, weights: np.ndarray | None) -> float:
+    """Weighted mean squared error
+
+    See
+    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
+    """
+    return mean_squared_error(data, predicted, sample_weight=weights)
+
+
+def msle(data: np.ndarray, predicted: np.ndarray, weights: np.ndarray | None) -> float:
+    """Weighted mean squared logarithmic error
+
+    See
+    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html
+    """
+    return mean_squared_log_error(data, predicted, sample_weight=weights)
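
The two new cost functions are thin wrappers around scikit-learn metrics and accept an optional sample-weight array, matching the three-argument cost-function contract used by `Optimizer._target_func`. A small, self-contained usage sketch:

```python
# Runnable sketch of the new cost functions, with and without weights.
import numpy as np

from climada.util.calibrate import mse, msle

data = np.array([1.0, 2.0, 4.0])
predicted = np.array([1.5, 2.5, 3.0])
weights = np.array([1.0, 1.0, 2.0])

print(mse(data, predicted, None))      # unweighted mean squared error
print(mse(data, predicted, weights))   # weighted: heavier entries contribute more
print(msle(data, predicted, weights))  # weighted mean squared log error (non-negative inputs only)
```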
