Commit 49dc092

Start of Dask tutorial
1 parent dead7a3 commit 49dc092

File tree

2 files changed: +234 -0 lines changed

_episodes/11-dask-configuration.md

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
---
title: "Dask Configuration"
teaching: 10
exercises: 10
compatibility: ESMValCore v2.10.0

questions:
- What is the [Dask](https://www.dask.org/) configuration file and how should I use it?

objectives:
- Understand the contents of the dask.yml file
- Prepare a personalized dask.yml file
- Configure ESMValCore to use these settings

keypoints:
- The ``dask.yml`` file tells ESMValCore how to configure Dask.
- "``client`` can be used to connect to an already running Dask cluster."
- "``cluster`` can be used to start a new Dask cluster for each run."
- "The Dask default scheduler can be configured by editing the files in ~/.config/dask."

---

## The Dask configuration file

The preprocessor functions in ESMValCore use the
[Iris](https://scitools-iris.readthedocs.io) library, which in turn uses Dask
Arrays to be able to process datasets that are larger than the available memory.
It is not necessary to understand exactly how these work to use the ESMValTool,
but if you are interested there is a
[Dask Array Tutorial](https://tutorial.dask.org/02_array.html) as well as a
[guide to "Lazy Data"](https://scitools-iris.readthedocs.io/en/stable/userguide/real_and_lazy_data.html)
available. Lazy data is the term the Iris library uses for Dask Arrays.

The most important concept to understand when using Dask Arrays is that of a
Dask "worker". With Dask, computations are run in parallel by Python
processes or threads called "workers". These could be on the
same machine that you are running ESMValTool on, or they could be on one or
more other computers. Dask workers typically require 2 to 4 gigabytes of
memory (RAM) each. In order to avoid running out of memory, it is important
to use only as many workers as your computer(s) have memory for. ESMValCore
(or Dask) provides configuration files where you can configure the number of
workers.

In order to distribute the computations over the workers, Dask makes use of a
"scheduler". There are two different schedulers available. The default
scheduler can be a good choice for smaller computations that can run
on a single computer, while the scheduler provided by the Dask Distributed
package is more suitable for larger computations.

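The default scheduler is not configured through ESMValCore but through Dask's
own configuration files in ``~/.config/dask/``. A minimal sketch, assuming the
standard Dask configuration keys ``scheduler`` and ``num_workers`` (the file
name is arbitrary and the values are only examples):

```yaml
# Content of e.g. ~/.config/dask/dask.yml (sketch)
scheduler: threads  # use the default threaded scheduler
num_workers: 4      # number of threads the default scheduler may use
```

The Distributed scheduler, which is what the rest of this episode uses, is
configured through the ``dask.yml`` file described below.
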
> ## On using ``max_parallel_tasks``
>
> In the config-user.yml file, there is a setting called ``max_parallel_tasks``
> (see the example below). With the Dask Distributed scheduler, all the tasks
> running in parallel can use the same workers, but with the default scheduler
> each task will start its own workers. For recipes that process large
> datasets, it is usually beneficial to set ``max_parallel_tasks: 1``, while
> for recipes that process many small datasets it can be beneficial to
> increase this number.
>
{: .callout}
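
For reference, ``max_parallel_tasks`` is a top-level setting in the user
configuration file. A minimal sketch, assuming the default location of that
file (the value shown is only an example):

```yaml
# In ~/.esmvaltool/config-user.yml
max_parallel_tasks: 1  # run one recipe task at a time; increase for many small datasets
```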

## Starting a Dask distributed cluster

Let's start the tutorial by configuring ESMValCore so it runs its
computations on a small Dask cluster with a single worker.

We use a text editor called ``nano`` to edit the configuration file:

~~~bash
nano ~/.esmvaltool/dask.yml
~~~

Any other editor can be used, e.g. many systems have ``vi`` available.

This file contains the settings for:

- Starting a new cluster of Dask workers
- Or alternatively: connecting to an existing cluster of Dask workers

Add the following content to the file ``~/.esmvaltool/dask.yml``:

```yaml
cluster:
  type: distributed.LocalCluster
  n_workers: 1
  threads_per_worker: 2
  memory_limit: 4GiB
```

This tells ESMValCore to start a cluster of one worker that can use 4
gigabytes (GiB) of memory and run computations using 2 threads. For a more
extensive description of the available arguments and their values, see
[``distributed.LocalCluster``](https://distributed.dask.org/en/stable/api.html#distributed.LocalCluster).
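
If you would rather connect to a cluster that is already running (the
``client`` option from the keypoints), the file would contain a ``client``
section instead. A minimal sketch, assuming the entries are passed on to
``distributed.Client`` and using a placeholder scheduler address:

```yaml
client:
  address: "tcp://127.0.0.1:8786"  # replace with the address of your own Dask scheduler
```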

To see this configuration in action, we will run a version of
[recipe_easy_ipcc.yml](https://docs.esmvaltool.org/en/latest/recipes/recipe_examples.html)
with just two datasets. Download the recipe
[here](../files/recipe_easy_ipcc_short.yml) and run it with the command:

~~~bash
esmvaltool run recipe_easy_ipcc_short.yml
~~~

After finding and downloading all the required input files, this will start
the Dask scheduler and workers required for processing the data. A message that
looks like this will appear on the screen:

```
2024-05-29 12:52:38,858 UTC [107445] INFO Dask dashboard: http://127.0.0.1:8787/status
```

Open the Dashboard link in a browser to see the Dask Dashboard website.
When the recipe has finished running, the Dashboard website will stop working.
The top left panel shows the memory use of each of the workers, the panel on
the right shows one row for each thread that is doing work, and the panel at
the bottom shows the progress.

> ## Explore what happens if workers do not have enough memory
>
> Reduce the amount of memory that the workers are allowed to use to 2GiB and
> run the recipe again. Note that the bars representing the memory use turn
> orange as the worker reaches the maximum amount of memory it is
> allowed to use and starts 'spilling' (writing data temporarily) to disk.
> The red blocks in the top right panel represent time spent reading/writing
> to disk.
>
>> ## Solution
>>
>> We use the `memory_limit` entry in the `~/.esmvaltool/dask.yml` file to set
>> the amount of memory the worker is allowed to use to 2 gigabytes:
>>```yaml
>> cluster:
>>   type: distributed.LocalCluster
>>   n_workers: 1
>>   threads_per_worker: 2
>>   memory_limit: 2GiB
>>```
>>
> {: .solution}
{: .challenge}

> ## Tune the configuration to your own computer
>
> Look at how much memory you have available on your machine (run the command
> ``grep MemTotal /proc/meminfo`` on Linux), set the ``memory_limit`` back to
> 4 GiB, and increase the number of Dask workers so that they use the total
> amount of memory available minus a few gigabytes for your other work.
>
>> ## Solution
>>
>> For example, if your computer has 16 GiB of memory, it can comfortably use
>> 12 GiB of memory for Dask workers, so you can start 3 workers with 4 GiB
>> of memory each.
>> Use the `n_workers` entry in the `~/.esmvaltool/dask.yml` file to set the
>> number of workers to 3.
>>```yaml
>> cluster:
>>   type: distributed.LocalCluster
>>   n_workers: 3
>>   threads_per_worker: 2
>>   memory_limit: 4GiB
>>```
>>
> {: .solution}
{: .challenge}

{% include links.md %}

files/recipe_easy_ipcc_short.yml

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
documentation:
  title: Easy IPCC
  description: Reproduce part of IPCC AR6 figure 9.3a.
  references:
    - fox-kemper21ipcc
  authors:
    - kalverla_peter
    - andela_bouwe
  maintainer:
    - andela_bouwe

preprocessors:
  easy_ipcc:
    custom_order: true
    anomalies:
      period: month
      reference:
        start_year: 1950
        start_month: 1
        start_day: 1
        end_year: 1979
        end_month: 12
        end_day: 31
    area_statistics:
      operator: mean
    annual_statistics:
      operator: mean
    convert_units:
      units: 'degrees_C'
    ensemble_statistics:
      statistics:
        - operator: mean
    multi_model_statistics:
      statistics:
        - operator: mean
        - operator: percentile
          percent: 17
        - operator: percentile
          percent: 83
      span: full
      keep_input_datasets: false
      ignore_scalar_coords: true

diagnostics:
  AR6_Figure_9.3:
    variables:
      tos_ssp585:
        short_name: tos
        exp: ['historical', 'ssp585']
        project: CMIP6
        mip: Omon
        preprocessor: easy_ipcc
        timerange: '1850/2100'
      tos_ssp126:
        short_name: tos
        exp: ['historical', 'ssp126']
        project: CMIP6
        mip: Omon
        timerange: '1850/2100'
        preprocessor: easy_ipcc
    scripts:
      Figure_9.3a:
        script: examples/make_plot.py

datasets:
  - {dataset: ACCESS-CM2, ensemble: r1i1p1f1, grid: gn}
  - {dataset: CESM2, ensemble: r4i1p1f1, grid: gn}
