|
94 | 94 | "- How much faster was using threads over a single thread? Why does this differ from the optimal speedup?\n", |
95 | 95 | "- Why is the multiprocessing scheduler so much slower here?\n", |
96 | 96 | "\n", |
| 97 | + "The `threaded` scheduler is a fine choice for working with large datasets out-of-core on a single machine, as long as the functions being used release the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) most of the time. NumPy and pandas release the GIL in most places, so the `threaded` scheduler is the default for `dask.array` and `dask.dataframe`. The distributed scheduler, perhaps with `processes=False`, will also work well for these workloads on a single machine.\n", |
97 | 98 | "\n", |
98 | | - "For single-machine use, the threaded and multiprocessing schedulers are fine choices. They are solid, mature and performant, and require absolutely no set-up. As a rule of thumb, threaded will work well when the functions called release the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock), whereas multiprocessing will always have a slower start-up time and suffer where a lot of communication is required between tasks. The *number* of workers is, in general, also important.\n", |
99 | | - "\n" |
100 | | - ] |
101 | | - }, |
102 | | - { |
103 | | - "cell_type": "markdown", |
104 | | - "metadata": {}, |
105 | | - "source": [ |
106 | | - "For scaling out work across a cluster, the distributed scheduler is required. Indeed, this is now generally preferred for all work, because it gives you additional monitoring information not available in the other schedulers. (Some of this monitoring is also available with an explicit progress bar and profiler, see [here](https://docs.dask.org/en/latest/diagnostics-local.html).)" |
| 99 | + "For workloads that do hold the GIL, as is common with `dask.bag` and custom code wrapped with `dask.delayed`, we recommend using the distributed scheduler, even on a single machine. Generally speaking, it's more intelligent and provides better diagnostics than the `processes` scheduler.\n", |
| 100 | + "\n", |
| 101 | + "https://docs.dask.org/en/latest/scheduling.html provides some additional details on choosing a scheduler.\n", |
| 102 | + "\n", |
| 103 | + "For scaling out work across a cluster, the distributed scheduler is required." |
107 | 104 | ] |
108 | 105 | }, |
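| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "A minimal sketch of that single-machine recommendation (the next section sets a client up for real):\n", |
| | + "\n", |
| | + "```python\n", |
| | + "from dask.distributed import Client\n", |
| | + "\n", |
| | + "# With no address, Client() starts a local scheduler and workers;\n", |
| | + "# processes=False keeps everything in this process, avoiding copies\n", |
| | + "client = Client(processes=False)\n", |
| | + "```" |
| | + ] |
| | + }, |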
109 | 106 | { |
|
147 | 144 | "cell_type": "markdown", |
148 | 145 | "metadata": {}, |
149 | 146 | "source": [ |
150 | | - "Be sure to click the `Dashboard` link to open up the diagnostics dashboard.\n", |
151 | | - "\n", |
152 | | - "\n", |
| 147 | + "If you aren't in jupyterlab and using the `dask-labextension`, be sure to click the `Dashboard` link to open up the diagnostics dashboard.\n", |
153 | 148 | "\n", |
154 | 149 | "## Executing with the distributed client" |
155 | 150 | ] |
|
299 | 294 | "# Average departure delay per day-of-week\n", |
300 | 295 | "_ = df.groupby(df.Date.dt.dayofweek).DepDelay.mean().compute()" |
301 | 296 | ] |
| 297 | + }, |
| 298 | + { |
| 299 | + "cell_type": "code", |
| 300 | + "execution_count": null, |
| 301 | + "metadata": {}, |
| 302 | + "outputs": [], |
| 303 | + "source": [ |
| 304 | + "client.shutdown()" |
| 305 | + ] |
302 | 306 | } |
303 | 307 | ], |
304 | 308 | "metadata": { |
|