
Commit 9d5fe33

Merge pull request #48 from ESMValGroup/episode05_preprocessor
Episode05 Preprocessors
2 parents c815aa0 + 3717977 commit 9d5fe33


3 files changed

+347
-15
lines changed

_episodes/05-preprocessor.md

Lines changed: 347 additions & 0 deletions
@@ -0,0 +1,347 @@
1+
---
2+
title: "Working with preprocessors"
3+
teaching: 20
4+
exercises: 20
5+
questions:
6+
- "How do I set up a preprocessor?"
7+
- "Can I use different preprocessors for different variables?"
8+
- "Can I use different datasets for different variables?"
9+
- "How can I combine different preprocessor functions?"
10+
objectives:
11+
- "Create a recipe with multiple preprocessors"
12+
- "Use different preprocessors for different variables"
13+
- "Run a recipe with variables from different datasets"
14+
keypoints:
15+
- "A recipe can work with different preprocessors at the same time."
16+
- "The setting additional_datasets can be used to add a different dataset."
17+
- "Variable groups are useful for defining different settings for different variables."
18+
---
19+
20+
## Preprocessors: What are they and how do they work?
21+
22+
Preprocessing is the process of performing a set of modular operations on the data before applying diagnostics or metrics. In the figure below you see the preprocessor functions in the light blue box on the right.
23+
24+
![figure showing ESMValTool architecture]({{ page.root }}/fig/esmvaltool_architecture.png)
25+
26+
Under the hood, each preprocessor is a modular Python function that receives an Iris cube and, in some cases, additional arguments. It applies a mathematical or computational transformation to the input cube and returns the processed cube.
27+
28+
## Inspect the example preprocessor
29+
30+
Each preprocessor section includes a preprocessor name, a list of preprocessor steps to be executed and any arguments needed by the preprocessor steps.
31+
32+
~~~yaml
preprocessors:
  prep_timeseries:
    annual_statistics:
      operator: mean
~~~
38+
39+
For instance, the 'annual_statistics' preprocessor with the 'operator: mean' argument receives an Iris cube, computes the mean over each year of data in the cube, and returns the processed cube.
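
To make the "data in, processed data out" idea concrete, here is a minimal sketch in plain Python. It is not ESMValCore code: the list of `(year, value)` pairs is a hypothetical stand-in for an Iris cube, and `annual_mean` is an illustrative name rather than the real `annual_statistics` implementation.

```python
from collections import defaultdict
from statistics import mean


def annual_mean(monthly_values):
    """Average (year, value) pairs into one value per year.

    Toy stand-in for `annual_statistics` with `operator: mean`; the real
    preprocessor works on Iris cubes, not on lists of tuples.
    """
    by_year = defaultdict(list)
    for year, value in monthly_values:
        by_year[year].append(value)
    # Same shape of interface as a preprocessor: data in, processed data out.
    return {year: mean(values) for year, values in sorted(by_year.items())}


# Hypothetical temperatures (degC), three values per year.
monthly = [(2000, 10.0), (2000, 12.0), (2000, 14.0),
           (2001, 11.0), (2001, 13.0), (2001, 15.0)]
annual = annual_mean(monthly)
print(annual)  # {2000: 12.0, 2001: 13.0}
```

Because the function only depends on its input and arguments, it can be chained with other functions of the same shape, which is the property the recipe relies on.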
40+
41+
You can combine any of the preprocessors listed in the [documentation](https://docs.esmvaltool.org/projects/esmvalcore/en/latest/recipe/preprocessor.html). Because all preprocessors share a standardised interface, they can be chained together like Lego blocks, so almost any sequence of preprocessing operations can be built from ESMValTool preprocessors.
42+
43+
> ## The 'custom_order' setting
44+
>
45+
>If you do not want your preprocessors to be applied in the [default order](https://docs.esmvaltool.org/projects/ESMValCore/en/latest/api/esmvalcore.preprocessor.html), then add the following line to your preprocessor chain:
46+
>
47+
>~~~
48+
> custom_order: true
49+
>~~~
50+
>
51+
>The default preprocessor order is listed in the [ESMValCore preprocessor API page](https://docs.esmvaltool.org/projects/ESMValCore/en/latest/api/esmvalcore.preprocessor.html).
52+
>
53+
> Also note that preprocessor operations are not always commutative, meaning that the order of operations matters. For instance, suppose you run two preprocessors: 'extract_volume' to extract the top 100m of the ocean and 'volume_statistics' to calculate the volume-weighted mean of the data. The result will differ depending on which of the two runs first. In fact, the 'extract_volume' preprocessor will fail if you try to run it on a 2D dataset.
54+
>
55+
> Changing the order of preprocessors can also speed up your processing. For instance, if you want to extract a two-dimensional layer from a 3D field and re-grid it, the layer extraction should be done first. If you did it the other way around, then the regridding function would be applied to all the layers of your 3D cube and it would take much more time.
56+
{: .callout}
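
The effect of ordering can be illustrated with a small numerical sketch. This is plain Python under stated assumptions, not ESMValCore code: the ocean column, its layer thicknesses and temperatures are made up, and `extract_top` and `weighted_mean` are hypothetical stand-ins for 'extract_volume' and 'volume_statistics'.

```python
def extract_top(layers, max_depth):
    """Keep only the layers whose lower bound is at or above max_depth (m)."""
    return [layer for layer in layers if layer["bottom"] <= max_depth]


def weighted_mean(layers):
    """Thickness-weighted mean temperature over the given layers."""
    total = sum(layer["thickness"] for layer in layers)
    return sum(layer["temp"] * layer["thickness"] for layer in layers) / total


# Hypothetical ocean column: two 50 m layers near the surface, one 900 m layer below.
column = [
    {"bottom": 50, "thickness": 50, "temp": 20.0},
    {"bottom": 100, "thickness": 50, "temp": 18.0},
    {"bottom": 1000, "thickness": 900, "temp": 4.0},
]

top_100m_mean = weighted_mean(extract_top(column, 100))  # extract, then average
full_column_mean = weighted_mean(column)                 # average without extracting
print(top_100m_mean, full_column_mean)  # 19.0 5.5
```

Once the weighted mean has been taken, the result is a single number with no depth information left, so there is nothing to extract from afterwards. This mirrors why running the extraction on data that has already lost its depth dimension cannot work.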
57+
58+
Some preprocessor modules are always applied and do not need to be included explicitly in recipes. These are the preprocessors that load the data, apply any fixes and save the data file afterwards.
59+
60+
> ## Exercise: Adding more preprocessor steps
61+
>
62+
> Edit the [example recipe](LINK to episode #4): first change the variable to thetao, then add preprocessors to average over the latitude and longitude dimensions, and finally average over the depth. Now run the recipe.
63+
>
64+
>> ## Solution
65+
>>
66+
>>~~~yaml
>> preprocessors:
>>   prep_timeseries:
>>     annual_statistics:
>>       operator: mean
>>     area_statistics:
>>       operator: mean
>>     depth_integration:
>>~~~
75+
>>
76+
>{: .solution}
77+
{: .challenge}
78+
79+
## Using different preprocessors for different variables
80+
81+
You can also define several preprocessors, each in its own section with a different name. In the variable section, you then select the preprocessor to apply by its name.
82+
83+
> ## Example
84+
>~~~yaml
> preprocessors:
>   prep_timeseries_1:
>     annual_statistics:
>       operator: mean
>   prep_timeseries_2:
>     annual_statistics:
>       operator: mean
>     area_statistics:
>       operator: mean
>     depth_integration:
> ---
> diagnostics:
>   # --------------------------------------------------
>   # Time series diagnostics
>   # --------------------------------------------------
>   diag_timeseries_temperature_1:
>     description: simple_time_series
>     variables:
>       timeseries_variable:
>         short_name: thetaoga
>         preprocessor: prep_timeseries_1
>     scripts:
>       timeseries_diag:
>         script: ocean/diagnostic_timeseries.py
>
>   diag_timeseries_temperature_2:
>     description: simple_time_series
>     variables:
>       timeseries_variable:
>         short_name: thetao
>         preprocessor: prep_timeseries_2
>     scripts:
>       timeseries_diag:
>         script: ocean/diagnostic_timeseries.py
>~~~
120+
{: .solution}
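
The link between a variable and its preprocessor is just a name lookup. The sketch below mirrors the example above as a plain Python dictionary and resolves, for each variable, which preprocessor steps it would receive. It is an illustration only: the helper name `preprocessor_for_each_variable` is hypothetical, and ESMValTool performs its own, more thorough recipe checks.

```python
# The recipe from the example above, written as a plain dictionary.
recipe = {
    "preprocessors": {
        "prep_timeseries_1": {"annual_statistics": {"operator": "mean"}},
        "prep_timeseries_2": {
            "annual_statistics": {"operator": "mean"},
            "area_statistics": {"operator": "mean"},
            "depth_integration": None,
        },
    },
    "diagnostics": {
        "diag_timeseries_temperature_1": {
            "variables": {
                "timeseries_variable": {
                    "short_name": "thetaoga",
                    "preprocessor": "prep_timeseries_1",
                }
            }
        },
        "diag_timeseries_temperature_2": {
            "variables": {
                "timeseries_variable": {
                    "short_name": "thetao",
                    "preprocessor": "prep_timeseries_2",
                }
            }
        },
    },
}


def preprocessor_for_each_variable(recipe):
    """Map (diagnostic, variable) to the step names of its preprocessor."""
    lookup = {}
    for diag_name, diag in recipe["diagnostics"].items():
        for var_name, var in diag["variables"].items():
            # Raises KeyError if the variable names a preprocessor that is not defined.
            steps = recipe["preprocessors"][var["preprocessor"]]
            lookup[(diag_name, var_name)] = list(steps)
    return lookup


steps_by_variable = preprocessor_for_each_variable(recipe)
for key, steps in steps_by_variable.items():
    print(key, steps)
```

Note that both diagnostics use the same variable key, `timeseries_variable`, but receive different steps because each one names a different preprocessor.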
121+
122+
>## Challenge: How to write a recipe with multiple preprocessors
123+
> We now know that a recipe can have more than one diagnostic, variable or preprocessor. As we saw in the examples so far, preprocessor steps can be grouped under a single user-defined name, and a recipe can contain more than one such preprocessor group. Write two different preprocessors: one to regrid the data to a 1x1 resolution, and a second to mask out sea and ice grid cells before regridding to the same resolution. In the second case, make sure the masking is performed before the regridding (hint: custom order your operations). Now use the two preprocessors in different diagnostics within the same recipe. You may use any variable(s) of your choice. Once you succeed, try to add new datasets to the same recipe. Placeholders for the different components are provided below:
124+
>
125+
>> ## Recipe
126+
>>
127+
>>~~~yaml
>>
>> datasets:
>>   - {dataset: UKESM1-0-LL, project: CMIP6, exp: historical,
>>      ensemble: r1i1p1f2} #single dataset as an example
>>
>> preprocessors:
>>   prep_map: #preprocessor to just regrid data
>>     #fill preprocessor details here
>>
>>   prep_map_land: #preprocessor to mask grid cells and then regrid
>>     #fill preprocessor details here including ordering
>>
>> diagnostics:
>>   # --------------------------------------------------
>>   # Two Simple diagnostics that illustrate the use of
>>   # different preprocessors
>>   # --------------------------------------------------
>>   diag_simple_plot:
>>     description: # preprocess a variable for a simple 2D plot
>>     variables:
>>       # put your variable of choice here
>>       # apply the first preprocessor i.e. name your preprocessor
>>       # edit the following 4 lines for mip, grid and time
>>       # based on your variable choice
>>         mip: Amon
>>         grid: gn #can change for variables from the same model
>>         start_year: 1970
>>         end_year: 2000
>>     scripts: null #no scripts called
>>   diag_land_only_plot: #second diagnostic
>>     description: #preprocess a variable for a 2D land only plot
>>     variables:
>>       # include a variable and information
>>       # as in the previous diagnostic and
>>       # include your second preprocessor (masking and regridding)
>>     scripts: null # no scripts
>>~~~
165+
>>
166+
>{: .solution}
167+
>
168+
>> ## Solution:
169+
>>
170+
>> Here is one solution to complete the challenge above using two different preprocessors
171+
>>
172+
>>~~~yaml
>>
>> datasets:
>>   - {dataset: UKESM1-0-LL, project: CMIP6, exp: historical,
>>      ensemble: r1i1p1f2} #single dataset as an example
>>
>> preprocessors:
>>   prep_map:
>>     regrid: #apply the preprocessor to regrid
>>       target_grid: 1x1 # target resolution
>>       scheme: linear #how to interpolate for regridding
>>
>>   prep_map_land:
>>     custom_order: true #ensure that given order of preprocessing is followed
>>     mask_landsea: #apply a mask
>>       mask_out: sea #mask out sea grid cells
>>     regrid: # now apply the preprocessor to regrid
>>       target_grid: 1x1 # target resolution
>>       scheme: linear #how to interpolate for regridding
>>
>> diagnostics:
>>   # --------------------------------------------------
>>   # Two Simple diagnostics that illustrate the use of
>>   # different preprocessors
>>   # --------------------------------------------------
>>   diag_simple_plot:
>>     description: # preprocess a variable for a simple 2D plot
>>     variables:
>>       tas: # surface temperature
>>         preprocessor: prep_map
>>         mip: Amon
>>         grid: gn #can change for variables from the same model
>>         start_year: 1970
>>         end_year: 2000
>>     scripts: null
>>
>>   diag_land_only_plot:
>>     description: #preprocess a variable for a 2D land only plot
>>     variables:
>>       tas: # surface temperature
>>         preprocessor: prep_map_land
>>         mip: Amon
>>         grid: gn #can change for variables from the same model
>>         start_year: 1970
>>         end_year: 2000
>>     scripts: null
>>~~~
219+
>>
220+
> {: .solution}
221+
{: .challenge}
222+
223+
## Adding different datasets for different variables
224+
225+
Sometimes, we may want to include specific datasets only for certain variables. An example is when we use observations for two different variables in a diagnostic. While the CMIP dataset details may be the same for both variables, the observational datasets will likely differ. It is therefore useful to know how to include different datasets for different variables. Here is an example of a simple preprocessor and diagnostic setup for that:
226+
227+
> ## Example
228+
>~~~yaml
>
> datasets:
>   - {dataset: UKESM1-0-LL, project: CMIP6, exp: historical,
>      ensemble: r1i1p1f2} #common to both variables discussed below
>
> preprocessors:
>   prep_regrid: # regrid to get all data to the same resolution
>     regrid: #apply the preprocessor to regrid
>       target_grid: 2.5x2.5 # target resolution
>       scheme: linear #how to interpolate for regridding
>
> diagnostics:
>   # --------------------------------------------------
>   # Simple diagnostic to illustrate use of different
>   # datasets for different variables
>   # --------------------------------------------------
>   diag_diff_datasets:
>     description: diff_datasets_for_vars
>     variables:
>       pr: #first variable is precipitation
>         preprocessor: prep_regrid
>         mip: Amon
>         grid: gn #can change for variables from the same model
>         start_year: 1970
>         end_year: 2000 # start and end years for a 30 year period,
>                        # we assume this is common and exists for all
>                        # model and obs data
>         additional_datasets:
>           - {dataset: GPCP-SG, project: obs4mips, level: L3,
>              version: v2.2, tier: 1} #dataset specific to this variable
>
>       tas: #second variable is surface temperature
>         preprocessor: prep_regrid
>         mip: Amon
>         grid: gn #can change for variables from the same model
>         start_year: 1970 #some 30 year period
>         end_year: 2000
>         additional_datasets:
>           - {dataset: HadCRUT4, project: OBS, type: ground,
>              version: 1, tier: 2} #dataset specific to the temperature variable
>
>     scripts: null
>~~~
272+
>
273+
{: .solution}
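
One way to read the example above: each variable ends up with the recipe-level `datasets` plus its own `additional_datasets`. The sketch below spells that out in plain Python. It assumes this simple "concatenate the two lists" behaviour for illustration; the helper `datasets_for` is hypothetical, and the dataset entries are shortened to the keys needed here.

```python
# Recipe-level datasets: shared by every variable.
recipe_datasets = [
    {"dataset": "UKESM1-0-LL", "project": "CMIP6"},
]

# Variable-level additions, shortened from the example above.
variables = {
    "pr": {"additional_datasets": [{"dataset": "GPCP-SG", "project": "obs4mips"}]},
    "tas": {"additional_datasets": [{"dataset": "HadCRUT4", "project": "OBS"}]},
}


def datasets_for(variable):
    """Shared datasets first, then the ones specific to this variable."""
    return recipe_datasets + variable.get("additional_datasets", [])


names = {
    name: [entry["dataset"] for entry in datasets_for(variable)]
    for name, variable in variables.items()
}
print(names)
# {'pr': ['UKESM1-0-LL', 'GPCP-SG'], 'tas': ['UKESM1-0-LL', 'HadCRUT4']}
```

The model dataset appears for both variables, while each observational dataset is attached only to the variable it was listed under.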
274+
275+
## Creating variable groups
276+
277+
Variable grouping can be used to preprocess different clusters of data for the same variable. For instance, the example below illustrates how we can compute separate multimodel means for CMIP5 and CMIP6 data given the same variable. We can also preprocess observational data for evaluation.
278+
279+
> ## Example
280+
>~~~yaml
>
>preprocessors:
>  prep_mmm:
>    custom_order: true
>    regrid:
>      target_grid: 2.5 x 2.5
>      scheme: linear
>    multi_model_statistics:
>      span: full
>      statistics: [mean, median]
>
>  prep_obs:
>    mask_landsea:
>      mask_out: sea
>    regrid:
>      target_grid: 2.5 x 2.5
>      scheme: linear
>
># note that there is no field called datasets anymore
># note how multiple ensembles are added by using (1:4)
>cmip5_datasets: &cmip5_datasets
>  - {dataset: CanESM2, ensemble: "r(1:4)i1p1", project: CMIP5}
>  - {dataset: MPI-ESM-LR, ensemble: "r(1:2)i1p1", project: CMIP5}
>
>cmip6_datasets: &cmip6_datasets
>  - {dataset: UKESM1-0-LL, ensemble: "r(1:4)i1p1f2", grid: gn, project: CMIP6}
>  - {dataset: CanESM5, ensemble: "r(1:4)i1p2f1", grid: gn, project: CMIP6}
>
>diagnostics:
>
>  diag_variable_groups:
>    description: Demonstrate the use of variable groups.
>    variables:
>      tas_cmip5: &variable_settings # need a key name for the grouping
>        short_name: tas # specify variable to look for
>        preprocessor: prep_mmm
>        mip: Amon
>        exp: historical
>        start_year: 2000
>        end_year: 2005
>        tag: TAS_CMIP5 #tag is optional if you are using these settings just once
>        additional_datasets: *cmip5_datasets
>      tas_obs:
>        <<: *variable_settings
>        preprocessor: prep_obs
>        tag: TAS_OBS
>        additional_datasets:
>          - {dataset: HadCRUT4, project: OBS, type: ground, version: 1, tier: 2}
>      tas_cmip6:
>        <<: *variable_settings
>        tag: TAS_CMIP6
>        additional_datasets: *cmip6_datasets #nothing changes from cmip5 except the data set
>    scripts: null
>~~~
335+
>
336+
{: .solution}
337+
338+
You should see the variables grouped into separate subdirectories under your output preproc directory. In your diagnostic, the different groups can be accessed through the variable_group field, whose value is the key name of the group, such as tas_cmip5, tas_cmip6 or tas_obs.
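
Inside a diagnostic, selecting by group amounts to collecting the input records that share the same `variable_group` value. The sketch below shows the idea in plain Python. The metadata records and file names are made up for illustration, and `group_by` is a hypothetical helper rather than an ESMValTool function.

```python
from collections import defaultdict

# Hypothetical metadata records, one per preprocessed file.
metadata = [
    {"dataset": "CanESM2", "variable_group": "tas_cmip5", "filename": "canesm2_tas.nc"},
    {"dataset": "MPI-ESM-LR", "variable_group": "tas_cmip5", "filename": "mpi_tas.nc"},
    {"dataset": "UKESM1-0-LL", "variable_group": "tas_cmip6", "filename": "ukesm_tas.nc"},
    {"dataset": "HadCRUT4", "variable_group": "tas_obs", "filename": "hadcrut4_tas.nc"},
]


def group_by(records, key):
    """Collect records into lists keyed by the value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


groups = group_by(metadata, "variable_group")
cmip5_models = [record["dataset"] for record in groups["tas_cmip5"]]
print(cmip5_models)  # ['CanESM2', 'MPI-ESM-LR']
```

With the records grouped this way, a diagnostic can loop over `tas_cmip5`, `tas_cmip6` and `tas_obs` separately, for example to plot each multimodel mean against the observations.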
339+
>
340+
> ## How to find what CMIP data is available?
341+
>
342+
> [CMIP5](https://pcmdi.llnl.gov/mips/cmip5/index.html) and [CMIP6](https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html) data obey the [CF-conventions](http://cfconventions.org/). Available variables can be found under the [CMIP5 data request](https://pcmdi.llnl.gov/mips/cmip5/docs/standard_output.pdf?id=28) and the [CMIP6 Data Request](http://clipc-services.ceda.ac.uk/dreq/index.html).
343+
>
344+
> CMIP data is widely available via the Earth System Grid Federation ([ESGF](https://esgf.llnl.gov/)) and is accessible to users either via download from the ESGF portal or through the ESGF data nodes hosted by large computing facilities (such as [CEDA-Jasmin](https://esgf-index1.ceda.ac.uk/) and [DKRZ](https://esgf-data.dkrz.de/)). The ESGF also hosts observations for Model Intercomparison Projects (obs4MIPs) and reanalysis data (ana4MIPs).
345+
>
346+
> A full list of all CMIP named variables is available here: [http://clipc-services.ceda.ac.uk/dreq/index/CMORvar.html](http://clipc-services.ceda.ac.uk/dreq/index/CMORvar.html).
347+
{: .callout}

_episodes/05-working-with-recipes.md

Lines changed: 0 additions & 15 deletions
This file was deleted.

fig/esmvaltool_architecture.png

113 KB
