To see this configuration in action, we will run a version
of [recipe_easy_ipcc.yml](https://docs.esmvaltool.org/en/latest/recipes/recipe_examples.html) with just two datasets. This recipe takes a few minutes to run, once you have the data available. Download
the recipe [here](../files/recipe_easy_ipcc_short.yml) and run it
with the command:
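
```
esmvaltool run recipe_easy_ipcc_short.yml
```
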
Open the Dashboard link in a browser to see the Dask Dashboard website.
When the recipe has finished running, the Dashboard website will stop working.
The top left panel shows the memory use of each of the workers, the panel on the
right shows one row for each thread that is doing work, and the panel at the
bottom shows the progress of all the work that the scheduler has currently been
asked to do.

> ## Explore what happens if workers do not have enough memory
>
> Reduce the amount of memory that the workers are allowed to use to 2 GiB and
> run the recipe again. Watch what happens.
>
>> ## Solution
>>
>> We use the `memory_limit` entry in the `~/.esmvaltool/dask.yml` file to set the
>> amount of memory allowed to 2 GiB:
>>```yaml
>> cluster:
>>   type: distributed.LocalCluster
>>   n_workers: 1
>>   threads_per_worker: 2
>>   memory_limit: 2GiB
>>```
>> Note that the bars representing the memory use turn
>> orange as the worker reaches the maximum amount of memory it is
>> allowed to use and it starts 'spilling' (writing data temporarily) to disk.
>> The red blocks in the top right panel represent time spent reading/writing
>> to disk. While 2 GiB per worker may be enough in other cases, it is apparently
>> not enough for this recipe.
>>
> {: .solution}
{: .challenge}

> ## Tune the configuration to your own computer
>
> Look at how much memory you have available on your machine (e.g. by running
> the command ``grep MemTotal /proc/meminfo`` on Linux), set the
> ``memory_limit`` back to 4 GiB per worker, and increase the number of Dask
> workers so that they use the total amount available minus a few gigabytes for
> your other work. Run the recipe again and notice that it completes faster.
>
>> ## Solution
>>
>> For example, if your computer has 16 GiB of memory and you do not have too
>> many other programs running, it can use 12 GiB of memory for Dask workers,
>> so you can start 3 workers with 4 GiB of memory each.
>>
>> Use the `n_workers` entry in the `~/.esmvaltool/dask.yml` file to set the
>> number of workers to 3:
>>```yaml
>> cluster:
>>   type: distributed.LocalCluster
>>   n_workers: 3
>>   threads_per_worker: 2
>>   memory_limit: 4GiB
>>```
>> and run the recipe again with the command ``esmvaltool run recipe_easy_ipcc_short.yml``.
>> The time it took to run the recipe is printed to the screen.
>>
> {: .solution}
{: .challenge}

## Using an existing Dask Distributed cluster

In some cases, it can be useful to start the Dask Distributed cluster before
running the ``esmvaltool`` command. For example, you may want to keep the
Dashboard available for further investigation after the recipe has completed,
or you may be working from a Jupyter notebook environment (see
[dask-labextension](https://github.com/dask/dask-labextension)).
To use a cluster that was started in some other way, the following configuration
can be used in ``~/.esmvaltool/dask.yml``:

```yaml
client:
  address: "tcp://127.0.0.1:33041"
```
where the address depends on the Dask cluster. Code to start a
[``distributed.LocalCluster``](https://distributed.dask.org/en/stable/api.html#distributed.LocalCluster) that automatically scales between 0 and 2 workers, depending on demand, could look like this:

```python
from time import sleep

from distributed import LocalCluster

if __name__ == '__main__':  # Remove this line when running from a Jupyter notebook
    cluster = LocalCluster(
        threads_per_worker=2,
        memory_limit='4GiB',
    )
    cluster.adapt(minimum=0, maximum=2)

    # Print connection information
    print(f"Connect to the Dask Dashboard by opening {cluster.dashboard_link} in a browser.")
    print("Add the following text to ~/.esmvaltool/dask.yml to connect to the cluster:")
    print("client:")
    print(f'  address: "{cluster.scheduler_address}"')

    # When running this as a Python script, the next two lines keep the cluster
    # running for an hour.
    hour = 3600  # seconds
    sleep(1 * hour)

    # Stop the cluster when you are done with it.
    cluster.close()
```

> ## Start a cluster and use it
>
> Copy the Python code above into a file called ``start_dask_cluster.py`` (or
> into a Jupyter notebook if you prefer) and start the cluster using the command
> ``python start_dask_cluster.py``. Edit the ``~/.esmvaltool/dask.yml`` file so
> ESMValCore can connect to the cluster. Run the recipe again and notice that the
> Dashboard remains available after the recipe completes.
>
>> ## Solution
>>
>> If the script printed
>> ```
>> Connect to the Dask Dashboard by opening http://127.0.0.1:8787/status in a browser.
>> Add the following text to ~/.esmvaltool/dask.yml to connect to the cluster:
>> client:
>>   address: "tcp://127.0.0.1:34827"
>> ```
>> to the screen, edit the file ``~/.esmvaltool/dask.yml`` so it contains the
>> lines
>> ```yaml
>> client:
>>   address: "tcp://127.0.0.1:34827"
>> ```
>> Then open the link "http://127.0.0.1:8787/status" in your browser and
>> run the recipe again with the command ``esmvaltool run recipe_easy_ipcc_short.yml``.
> {: .solution}
{: .challenge}

When running from a Jupyter notebook, don't forget to `close()` the cluster
when you are done with it, especially when you are running on an HPC facility
(see below), to avoid wasting compute hours that you are not using.

## Using the Dask default scheduler

It is recommended to use the Distributed scheduler explained above for
processing larger amounts of data. However, in many cases the default scheduler
is good enough. Note that it does not provide a Dashboard, so it is less
instructive; that is why we did not use it earlier in this tutorial.

To use the default scheduler, comment out all the contents of
``~/.esmvaltool/dask.yml`` and create a file in ``~/.config/dask``, e.g.
``~/.config/dask/default.yml`` (the exact filename does not matter), with the
contents:
```yaml
scheduler: threads
num_workers: 4
```
to set the number of workers to 4. The ``scheduler`` can also be set to
``synchronous``. In that case it will use a single thread, which may be useful
for debugging.

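For example, a minimal sketch of such a debugging configuration (using the same ``~/.config/dask/default.yml`` file as above) could be:

```yaml
scheduler: synchronous
```
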
> ## Use the default scheduler
>
> Follow the instructions above to use the default scheduler and run the recipe
> again. To keep track of the amount of memory used by the process, you can
> start the ``top`` command in another terminal. The amount of memory is shown
> in the ``RES`` column.
>
>> ## Solution
>>
>> The recipe runs a bit faster with this configuration and you may have seen
>> a memory use of around 5 GB.
>>
> {: .solution}
{: .challenge}

## Optional: Using dask_jobqueue to run a Dask Cluster on an HPC system

The [``dask_jobqueue``](https://jobqueue.dask.org) package provides functionality
to start Dask Distributed clusters on High Performance Computing (HPC) or
High Throughput Computing (HTC) systems. This section is optional and only
relevant if you have access to such a system. For example, on the Levante HPC
system the following ``~/.esmvaltool/dask.yml`` configuration could be used:

```yaml
cluster:
  type: dask_jobqueue.SLURMCluster  # Levante uses SLURM as a job scheduler
  queue: compute  # SLURM partition name
  account: bk1088  # SLURM account name
  cores: 128  # number of CPU cores per SLURM job
  memory: 240GiB  # amount of memory per SLURM job
  processes: 64  # number of Dask workers per SLURM job
  interface: ib0  # use the infiniband network interface for communication
  local_directory: "/scratch/username/dask-tmp"  # directory for spilling to disk
  n_workers: 64  # total number of workers to start
```

In this example we use the popular SLURM scheduler, but other schedulers are also
supported; see [this list](https://jobqueue.dask.org/en/latest/api.html).

In the above example, ESMValCore will start 64 Dask workers
(with 128 / 64 = 2 threads each), and for that it will need to launch a single
SLURM batch job on the ``compute`` partition. If you set ``n_workers`` to e.g.
256, it will launch 4 SLURM batch jobs, each starting 64 workers, for a
total of 4 x 64 = 256 workers. In the above configuration, each worker is
allowed to use 240 GiB per job / 64 workers per job = ~4 GiB per worker.
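
If you prefer to start such a cluster yourself and let ESMValCore connect to it via the ``client`` setting described earlier, a minimal sketch using ``dask_jobqueue`` directly could look like the code below. The queue, account, and scratch path mirror the Levante example above and are assumptions for other systems.

```python
from time import sleep

from dask_jobqueue import SLURMCluster

if __name__ == '__main__':
    # These settings mirror the ~/.esmvaltool/dask.yml example above.
    cluster = SLURMCluster(
        queue='compute',
        account='bk1088',
        cores=128,
        memory='240GiB',
        processes=64,
        interface='ib0',
        local_directory='/scratch/username/dask-tmp',
    )
    # Submit a single SLURM batch job, i.e. start 64 workers.
    cluster.scale(jobs=1)

    # Print the address to use in the `client` section of ~/.esmvaltool/dask.yml.
    print("client:")
    print(f'  address: "{cluster.scheduler_address}"')

    # Keep the cluster running for an hour, then stop it.
    sleep(3600)
    cluster.close()
```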

It is important to read the documentation about your HPC system and answer
questions such as

- Which batch scheduler does my HPC system use?
- How many CPU cores are available per node (a computer in an HPC system)?
- How much memory is available for use per node?
- What is the fastest network interface (infiniband is much faster than ethernet)?
- What path should I use for storing temporary files on the nodes (try to avoid slower network storage if possible)?
- Which computing queue has the best availability?
- Can I use part of a node or do I need to use the full node?
  - If you are always charged for using the full node, asking for only part of a node is wasteful of computational resources.
  - If you can ask for part of a node, make sure the amount of memory you request matches the number of CPU cores if possible, or you will be charged for a larger fraction of the node.

in order to find the optimal configuration for your situation.

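Exactly how to answer these questions differs per system, but as a rough sketch (the commands below are assumptions that hold on most SLURM-based Linux systems), you could start with:

```
sinfo                        # list the available partitions (queues) and their nodes
grep MemTotal /proc/meminfo  # total memory of the node you are logged in to
nproc                        # number of CPU cores of the node you are logged in to
```
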
> ## Tune the configuration to your HPC system
>
> Answer the questions above and create a ``~/.esmvaltool/dask.yml`` file that
> matches your situation. To benefit from using an HPC system, you will probably
> need to run a larger recipe than the example we have used so far. You could
> try the full version of that recipe (``esmvaltool run examples/recipe_easy_ipcc.yml``)
> or use your own recipe. To understand performance, you may want
> to experiment with different configurations.
>
>> ## Solution
>>
>> The best configuration depends on the HPC system that you are using.
>> Discuss your answer with the instructor and the class if possible. If you are
>> taking this course by yourself, you can have a look at the [Dask configuration examples in the ESMValCore documentation](https://docs.esmvaltool.org/projects/ESMValCore/en/latest/quickstart/configure.html#dask-distributed-configuration).
>>
> {: .solution}
{: .challenge}