Commit 4b04e61

committed
first try on preprocessor episode
1 parent aeeca75 commit 4b04e61

File tree

4 files changed

+287
-32
lines changed

_episodes/05-preprocessor.md

Lines changed: 287 additions & 0 deletions
@@ -0,0 +1,287 @@
1+
---
2+
title: "Working with preprocessors"
3+
teaching: 20?
4+
exercises: 15?
5+
questions:
6+
- "How do I set up a preprocessor?"
7+
- "Can I use different preprocessors for different variables?"
8+
- "Can I use different datasets for different variables?"
9+
objectives:
10+
- "Create a recipe with multiple preprocessors"
11+
- "Use different preprocessors for different variables"
12+
- "Run a recipe with variables from different datasets"
13+
keypoints:
14+
- "A recipe can run different preprocessors at the same time."
15+
- "The setting additional_datasets can be used to add a different dataset."
16+
- "Variable groups are useful for defining different settings for different variables."
17+
---
18+
19+
## Preprocessors: What are they and how do they work?
20+
21+
Preprocessing is the process of performing a set of modular operations on the data before applying diagnostics or metrics. In the figure below you see the preprocessor functions in the light blue box on the right.
22+
23+
![figure showing ESMValTool architecture]({{ page.root }}/fig/esmvaltool_architecture.png)
24+
25+
Under the hood, each preprocessor is a modular Python function that receives an Iris cube and, in some cases, additional arguments. The function applies a mathematical or computational transformation to the input cube and returns the processed cube.
26+
27+
## Inspect the example preprocessor
28+
29+
Each preprocessor section includes a preprocessor name, a list of preprocessor steps to be executed and any arguments needed by the preprocessor steps.
30+
31+
>~~~YAML
> preprocessors:
>   prep_timeseries:
>     annual_statistics:
>       operator: mean
>~~~
37+
{: .source}
38+
39+
For instance, the 'annual_statistics' preprocessor with the 'operator: mean' argument receives an Iris cube, takes the average over each year of data in the cube, and returns the processed cube.
40+
41+
You can use any of the preprocessor steps listed in the [documentation](https://docs.esmvaltool.org/projects/esmvalcore/en/latest/recipe/preprocessor.html). The standardised interface between the preprocessors allows them to be combined modularly, like Lego blocks, so almost any conceivable order of preprocessing operations can be performed with ESMValTool preprocessors.
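
As a minimal sketch of how steps combine, the fragment below chains two steps under one preprocessor name. The name `prep_tropics_annual` is hypothetical and the box limits are arbitrary; the argument names of `extract_region` are written as we understand them from the ESMValCore preprocessor documentation, so check them against your installed version before relying on them.

```yaml
preprocessors:
  prep_tropics_annual:        # hypothetical preprocessor name
    extract_region:           # keep only a latitude-longitude box
      start_longitude: 0.
      end_longitude: 360.
      start_latitude: -30.    # arbitrary limits chosen for illustration
      end_latitude: 30.
    annual_statistics:        # average each year of data
      operator: mean
```

Each step keeps its own arguments indented beneath it, so steps can be added or removed without touching the others.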
42+
43+
> ## The 'custom order' command.
44+
>
45+
>If you do not want your preprocessors to be applied in the [default order](https://docs.esmvaltool.org/projects/ESMValCore/en/latest/api/esmvalcore.preprocessor.html), then add the following line to your preprocessor chain:
46+
>
47+
>~~~
48+
> custom_order: true
49+
>~~~
50+
>
51+
>The default preprocessor order is listed in the [ESMValCore preprocessor API page](https://docs.esmvaltool.org/projects/ESMValCore/en/latest/api/esmvalcore.preprocessor.html).
52+
>
53+
> Also note that preprocessor operations aren't always commutative, meaning that the order of operations matters. For instance, if you run the preprocessor 'extract_volume' to extract the top 100 m of the ocean and 'volume_statistics' to calculate the volume-weighted mean of the data, your result will depend on which of the two preprocessors runs first. In fact, the 'extract_volume' preprocessor will fail if you try to run it on a 2D dataset.
54+
>
55+
> Changing the order of preprocessors can also speed up your processing. For instance, if you want to extract a two-dimensional layer from a 3D field and re-grid it, the layer extraction should be done first. If you did it the other way around, then the regridding function would be applied to all the layers of your 3D cube and it would take much more time.
56+
{: .callout}
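
The ordering point in the callout above can be sketched as a complete preprocessor section. This is an illustration under assumptions, not a tested recipe: the name `prep_top_100m_mean` is hypothetical, and the `z_min` and `z_max` arguments of `extract_volume` are assumed to be depths in metres, as we understand them from the ESMValCore documentation. Depending on your ESMValCore version, `volume_statistics` may need further settings, such as cell-volume information.

```yaml
preprocessors:
  prep_top_100m_mean:       # hypothetical preprocessor name
    custom_order: true      # run the steps in the order written below
    extract_volume:         # first keep only the top 100 m of the ocean
      z_min: 0.
      z_max: 100.
    volume_statistics:      # then take the volume-weighted mean
      operator: mean
```

Swapping the two steps in this section would ask for the mean first and the depth selection second, which is the situation the callout warns about.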
57+
58+
Some preprocessor stages are always applied and do not need to be included explicitly in recipes. These are the stages that load the data, apply the fixes, and save the data file afterwards.
59+
60+
> ## Exercise: Adding more preprocessor steps
61+
>
62+
> Edit the [example recipe](LINK to episode #4) to change the variable to `thetao` and to add preprocessors that take the average over the latitude and longitude dimensions and the average over the depth. Then run the recipe.
63+
>
64+
>> ## Solution
65+
>>
66+
>>~~~YAML
>> preprocessors:
>>   prep_timeseries:
>>     annual_statistics:
>>       operator: mean
>>     area_statistics:
>>       operator: mean
>>     depth_integration:
>>~~~
75+
>>{: .source}
76+
>{: .solution}
77+
{: .challenge}
78+
79+
## Using different preprocessors for different variables
80+
81+
You can define several preprocessors by giving each preprocessor section its own name. In the variable section, you then name the preprocessor that should be applied to that variable.
82+
83+
> ## Example
84+
>~~~YAML
> preprocessors:
>   prep_timeseries_1:
>     annual_statistics:
>       operator: mean
>   prep_timeseries_2:
>     annual_statistics:
>       operator: mean
>     area_statistics:
>       operator: mean
>     depth_integration:
>
> diagnostics:
>   # --------------------------------------------------
>   # Time series diagnostics
>   # --------------------------------------------------
>   diag_timeseries_temperature_1:
>     description: simple_time_series
>     variables:
>       timeseries_variable:
>         short_name: thetaoga
>         preprocessor: prep_timeseries_1
>     scripts:
>       timeseries_diag:
>         script: ocean/diagnostic_timeseries.py
>
>   diag_timeseries_temperature_2:
>     description: simple_time_series
>     variables:
>       timeseries_variable:
>         short_name: thetao
>         preprocessor: prep_timeseries_2
>     scripts:
>       timeseries_diag:
>         script: ocean/diagnostic_timeseries.py
>~~~
120+
>{: .source}
121+
{: .solution}
122+
123+
>## Challenge : How to write a recipe with multiple preprocessors
124+
> We now know that a recipe can have more than one diagnostic, variable or preprocessor. As the examples so far show, preprocessor steps can be grouped under a single user-defined name, and a recipe can contain more than one such group. Write two different preprocessors: one that regrids the data to a 1x1 resolution, and a second that masks out sea and ice grid cells before regridding to the same resolution. In the second case, make sure the masking is performed before the regridding (hint: use a custom order for your operations). Now use the two preprocessors in different diagnostics within the same recipe. You may use any variable(s) of your choice. Once you succeed, try to add new datasets to the same recipe. Placeholders for the different components are provided below:
125+
>
126+
>> ## Recipe
127+
>>
128+
>>~~~YAML
>>
>> datasets:
>>   - {dataset: UKESM1-0-LL, project: CMIP6, exp: historical,
>>      ensemble: r1i1p1f2}  # single dataset as an example
>>
>> preprocessors:
>>   prep_map:  # preprocessor to just regrid data
>>     # fill preprocessor details here
>>
>>   prep_map_land:  # preprocessor to mask grid cells and then regrid
>>     # fill preprocessor details here including ordering
>>
>> diagnostics:
>>   # --------------------------------------------------
>>   # Two Simple diagnostics that illustrate the use of
>>   # different preprocessors
>>   # --------------------------------------------------
>>   diag_simple_plot:
>>     description:  # preprocess a variable for a simple 2D plot
>>     variables:
>>       # put your variable of choice here
>>       # apply the first preprocessor i.e. name your preprocessor
>>       # edit the following 4 lines for mip, grid and time
>>       # based on your variable choice
>>         mip: Amon
>>         grid: gn  # can change for variables from the same model
>>         start_year: 1970
>>         end_year: 2000
>>     scripts: null  # no scripts called
>>   diag_land_only_plot:  # second diagnostic
>>     description:  # preprocess a variable for a 2D land only plot
>>     variables:
>>       # include a variable and information
>>       # as in the previous diagnostic and
>>       # include your second preprocessor (masking and regridding)
>>     scripts: null  # no scripts
>>~~~
166+
>>{: .source}
167+
>{: .solution}
168+
>
169+
>> ## Solution:
170+
>>
171+
>> Here is one possible way to use two different preprocessors including a
172+
>> group of preprocessors on different variables.
173+
>>
174+
>>~~~YAML
>>
>> datasets:
>>   - {dataset: UKESM1-0-LL, project: CMIP6, exp: historical,
>>      ensemble: r1i1p1f2}  # single dataset as an example
>>
>> preprocessors:
>>   prep_map:
>>     regrid:  # apply the preprocessor to regrid
>>       target_grid: 1x1  # target resolution
>>       scheme: linear  # how to interpolate for regridding
>>
>>   prep_map_land:
>>     custom_order: true  # apply the steps in the order given below
>>     mask_landsea:  # apply a mask
>>       mask_out: sea  # mask out sea grid cells
>>     regrid:  # now apply the preprocessor to regrid
>>       target_grid: 1x1  # target resolution
>>       scheme: linear  # how to interpolate for regridding
>>
>> diagnostics:
>>   # --------------------------------------------------
>>   # Two Simple diagnostics that illustrate the use of
>>   # different preprocessors
>>   # --------------------------------------------------
>>   diag_simple_plot:
>>     description:  # preprocess a variable for a simple 2D plot
>>     variables:
>>       tas:  # near-surface air temperature
>>         preprocessor: prep_map
>>         mip: Amon
>>         grid: gn  # can change for variables from the same model
>>         start_year: 1970
>>         end_year: 2000
>>         additional_datasets:
>>     scripts: null
>>
>>   diag_land_only_plot:
>>     description:  # preprocess a variable for a 2D land only plot
>>     variables:
>>       tas:  # near-surface air temperature
>>         preprocessor: prep_map_land
>>         mip: Amon
>>         grid: gn  # can change for variables from the same model
>>         start_year: 1970
>>         end_year: 2000
>>         additional_datasets:
>>     scripts: null
>>~~~
223+
>> {: .source}
224+
> {: .solution}
225+
{: .challenge}
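
For the last step of the challenge, adding new datasets, the empty `additional_datasets:` key in the solution can be filled with one or more dataset entries. The fragment below is a sketch only: the extra entry (`BCC-ESM1`, ensemble `r1i1p1f1`) is chosen for illustration, and we assume that this model's data for the chosen variable and period are available on your system. Replace it with a dataset you can access.

```yaml
diagnostics:
  diag_simple_plot:
    description: simple 2D plot with one extra dataset
    variables:
      tas:
        preprocessor: prep_map   # defined in the solution above
        mip: Amon
        grid: gn
        start_year: 1970
        end_year: 2000
        additional_datasets:
          # illustrative entry; swap in a dataset available to you
          - {dataset: BCC-ESM1, project: CMIP6, exp: historical,
             ensemble: r1i1p1f1}
    scripts: null
```

Entries listed here are used for this variable in addition to those in the top-level `datasets` section.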
226+
227+
## Adding different datasets for different variables
228+
229+
Sometimes we may want to include specific datasets for certain variables only. An example is a diagnostic that uses observations for two different variables: the CMIP dataset details may be common to both variables, but the observational datasets will likely differ. Here is an example of a simple preprocessor and diagnostic setup for that case:
230+
231+
> ## Example
232+
>~~~YAML
>
> datasets:
>   - {dataset: UKESM1-0-LL, project: CMIP6, exp: historical,
>      ensemble: r1i1p1f2}  # common to both variables discussed below
>
> preprocessors:
>   prep_regrid:  # regrid to get all data to the same resolution
>     regrid:  # apply the preprocessor to regrid
>       target_grid: 2.5x2.5  # target resolution
>       scheme: linear  # how to interpolate for regridding
>
> diagnostics:
>   # --------------------------------------------------
>   # Simple diagnostic to illustrate use of different
>   # datasets for different variables
>   # --------------------------------------------------
>   diag_diff_datasets:
>     description: diff_datasets_for_vars
>     variables:
>       pr:  # first variable is precipitation
>         preprocessor: prep_regrid
>         mip: Amon
>         grid: gn  # can change for variables from the same model
>         start_year: 1970
>         end_year: 2000  # start and end years for a 30 year period,
>                         # we assume this is common and exists for all
>                         # model and obs data
>         additional_datasets:
>           - {dataset: GPCP-SG, project: obs4mips, level: L3,
>              version: v2.2, tier: 1}  # dataset specific to this variable
>
>       tas:  # second variable is surface temperature
>         preprocessor: prep_regrid
>         mip: Amon
>         grid: gn  # can change for variables from the same model
>         start_year: 1970  # some 30 year period
>         end_year: 2000
>         additional_datasets:
>           - {dataset: HadCRUT4, project: OBS, type: ground,
>              version: 1, tier: 2}  # dataset specific to the temperature var
>
>     scripts: null
>~~~
276+
>{: .source}
277+
{: .solution}
278+
279+
> ## How to find what CMIP data is available?
280+
>
281+
> [CMIP5](https://pcmdi.llnl.gov/mips/cmip5/index.html) and [CMIP6](https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html) data obey the [CF-conventions](http://cfconventions.org/). Available variables can be found in the [CMIP5 data request](https://pcmdi.llnl.gov/mips/cmip5/docs/standard_output.pdf?id=28) and the [CMIP6 Data Request](http://clipc-services.ceda.ac.uk/dreq/index.html).
282+
>
283+
> CMIP data is widely available via the Earth System Grid Federation ([ESGF](https://esgf.llnl.gov/)) and is accessible to users either via download from the ESGF portal or through the ESGF data nodes hosted by large computing facilities (such as [CEDA-Jasmin](https://esgf-index1.ceda.ac.uk/), [DKRZ](https://esgf-data.dkrz.de/), etc.). The ESGF also hosts observations for Model Intercomparison Projects (obs4MIPs) and reanalysis data (ana4MIPs).
284+
>
285+
> A full list of all CMIP named variables is available here: [http://clipc-services.ceda.ac.uk/dreq/index/CMORvar.html](http://clipc-services.ceda.ac.uk/dreq/index/CMORvar.html).
286+
{: .callout}
287+

_episodes/05-working-with-recipes.md

Lines changed: 0 additions & 15 deletions
This file was deleted.

_episodes/preprocessor.md

Lines changed: 0 additions & 17 deletions
This file was deleted.

fig/esmvaltool_architecture.png

113 KB
