Commit c215585: Add Overview (#254)

1 parent 38522a1 commit c215585
6 files changed (+1300, -83 lines changed)

00_overview.ipynb

Lines changed: 205 additions & 62 deletions
@@ -4,100 +4,207 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "<img src=\"http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg\"\n",
-    "     align=\"right\"\n",
-    "     width=\"30%\"\n",
-    "     alt=\"Dask logo\\\">\n",
+    "# Welcome to the Dask Tutorial\n",
     "\n",
-    "# Introduction\n",
+    "<img src=\"https://docs.dask.org/en/latest/_images/dask_horizontal.svg\" align=\"right\" width=\"30%\" alt=\"Dask logo\">\n",
     "\n",
-    "Welcome to the Dask Tutorial.\n",
+    "Dask is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.\n",
     "\n",
-    "Dask is a parallel computing library that scales the existing Python ecosystem. This tutorial will introduce Dask and parallel data analysis more generally.\n",
+    "Dask can scale up to your full laptop capacity and out to a cloud cluster."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## An example Dask computation\n",
+    "\n",
+    "In the following lines of code, we read the NYC taxi cab data from 2015 and find the mean tip amount. Don't worry about the code; this is just a quick demonstration. We'll go over all of it in the next notebook. :)\n",
+    "\n",
+    "**Note for learners:** This might be heavy for Binder.\n",
     "\n",
-    "Dask can scale down to your laptop and up to a cluster. Here, we'll use an environment you setup on your laptop to analyze medium sized datasets in parallel locally."
+    "**Note for instructors:** Don't forget to open the Dask Dashboard!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import dask.dataframe as dd\n",
+    "from dask.distributed import Client"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client = Client()\n",
+    "client"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ddf = dd.read_parquet(\n",
+    "    \"s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet\",\n",
+    "    columns=[\"passenger_count\", \"tip_amount\"],\n",
+    "    storage_options={\"anon\": True},\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "result = ddf.groupby(\"passenger_count\").tip_amount.mean().compute()\n",
+    "result"
     ]
    },
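Read back-to-back, the four new code cells above amount to the following short script (a sketch assembled from the diff; the S3 path, columns, and options are exactly as committed):

    import dask.dataframe as dd
    from dask.distributed import Client

    # Start a local cluster; displaying `client` in the notebook links to the dashboard
    client = Client()

    # Lazily point a Dask DataFrame at the public 2015 NYC taxi data on S3
    ddf = dd.read_parquet(
        "s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet",
        columns=["passenger_count", "tip_amount"],
        storage_options={"anon": True},
    )

    # Nothing is read until .compute() triggers the parallel work
    result = ddf.groupby("passenger_count").tip_amount.mean().compute()
    print(result)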
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Overview"
+    "## What is [Dask](https://www.dask.org/)?\n",
+    "\n",
+    "There are many parts to \"Dask\" the project:\n",
+    "* Collections/API, also known as the \"core library\"\n",
+    "* Distributed -- to create clusters\n",
+    "* Integrations and broader ecosystem\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Dask provides multi-core and distributed parallel execution on larger-than-memory datasets.\n",
+    "### Dask Collections\n",
     "\n",
-    "We can think of Dask at a high and a low level\n",
+    "Dask provides **multi-core** and **distributed+parallel** execution on **larger-than-memory** datasets.\n",
     "\n",
-    "* **High-level collections:** Dask provides high-level Array, Bag, and DataFrame\n",
-    "  collections that mimic NumPy, lists, and Pandas but can operate in parallel on\n",
-    "  datasets that don't fit into memory. Dask's high-level collections are\n",
-    "  alternatives to NumPy and Pandas for large datasets.\n",
-    "* **Low-level schedulers:** Dask provides dynamic task schedulers that\n",
-    "  execute task graphs in parallel. These execution engines power the\n",
-    "  high-level collections mentioned above but can also power custom,\n",
-    "  user-defined workloads. These schedulers are low-latency (around 1ms) and\n",
-    "  work hard to run computations in a small memory footprint. Dask's\n",
-    "  schedulers are an alternative to direct use of `threading` or\n",
-    "  `multiprocessing` libraries in complex cases or other task scheduling\n",
-    "  systems like `Luigi` or `IPython parallel`.\n",
+    "We can think of Dask's APIs (also called collections) at a high and a low level:\n",
     "\n",
-    "Different users operate at different levels but it is useful to understand\n",
-    "both.\n",
+    "<center>\n",
+    "<img src=\"images/high_vs_low_level_coll_analogy.png\" width=\"75%\" alt=\"High vs Low level clothes analogy\">\n",
+    "</center>\n",
     "\n",
-    "The Dask [use cases](https://stories.dask.org/en/latest/) provides a number of sample workflows where Dask should be a good fit."
+    "* **High-level collections:** Dask provides high-level Array, Bag, and DataFrame\n",
+    "  collections that mimic NumPy, lists, and pandas but can operate in parallel on\n",
+    "  datasets that don't fit into memory.\n",
+    "* **Low-level collections:** Dask also provides low-level Delayed and Futures\n",
+    "  collections that give you finer control to build custom parallel and distributed computations."
    ]
   },
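As a minimal sketch of the two levels described above (illustrative code, not taken from the tutorial):

    import dask
    import dask.array as da

    # High level: a NumPy-like array split into 1000x1000 chunks;
    # the sum runs chunk-by-chunk in parallel
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    total = x.sum().compute()

    # Low level: dask.delayed wraps plain Python functions into a lazy task graph
    @dask.delayed
    def inc(i):
        return i + 1

    @dask.delayed
    def add(a, b):
        return a + b

    result = add(inc(1), inc(2)).compute()  # executes the tasks, in parallel where possible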
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Prepare"
+    "### Dask Cluster\n",
+    "\n",
+    "Most of the time when you are using Dask, you will be using a distributed scheduler, which exists in the context of a Dask cluster. The Dask cluster is structured as:\n",
+    "\n",
+    "<center>\n",
+    "<img src=\"images/distributed-overview.png\" width=\"75%\" alt=\"Distributed overview\">\n",
+    "</center>"
    ]
   },
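A minimal sketch of creating such a cluster locally (the worker counts are illustrative, not from the tutorial):

    from dask.distributed import Client, LocalCluster

    # A LocalCluster is the simplest Dask cluster: a scheduler plus workers, all on this machine
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    client = Client(cluster)

    # The diagnostics dashboard the instructor note above refers to
    print(client.dashboard_link)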
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "You should clone this repository: \n",
+    "### Dask Ecosystem\n",
     "\n",
-    "    git clone http://github.com/dask/dask-tutorial\n",
+    "In addition to the core Dask library and its distributed scheduler, the Dask ecosystem connects several additional initiatives, including:\n",
     "\n",
-    "The included file `environment.yml` in the `binder` subdirectory contains a list of all of the packages needed to run this tutorial. To install them using `conda`, you can do\n",
+    "- Dask-ML (parallel scikit-learn-style API)\n",
+    "- Dask-image\n",
+    "- Dask-cuDF\n",
+    "- Dask-sql\n",
+    "- Dask-snowflake\n",
+    "- Dask-mongo\n",
+    "- Dask-bigquery\n",
     "\n",
-    "    conda env create -f binder/environment.yml\n",
-    "    conda activate dask-tutorial\n",
-    "    jupyter labextension install @jupyter-widgets/jupyterlab-manager\n",
-    "    jupyter labextension install @bokeh/jupyter_bokeh\n",
-    "    \n",
-    "Do this *before* running this notebook."
+    "Community libraries that have built-in Dask integrations, like:\n",
+    "\n",
+    "- Xarray\n",
+    "- XGBoost\n",
+    "- Prefect\n",
+    "- Airflow\n",
+    "\n",
+    "Dask deployment libraries:\n",
+    "\n",
+    "- Dask-kubernetes\n",
+    "- Dask-YARN\n",
+    "- Dask-gateway\n",
+    "- Dask-cloudprovider\n",
+    "- jobqueue\n",
+    "\n",
+    "When we talk about the Dask project, we include all of these efforts as part of the community."
    ]
   },
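As one concrete taste of these integrations, Xarray can back its variables with Dask arrays via its `chunks=` argument (a sketch; "air_temperature.nc" is a hypothetical example file):

    import xarray as xr

    # chunks= makes every variable a lazy, chunked Dask array
    ds = xr.open_dataset("air_temperature.nc", chunks={"time": 100})

    monthly = ds.groupby("time.month").mean()  # still lazy
    monthly_values = monthly.compute()         # runs in parallel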
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Links"
+    "## Dask Use Cases\n",
+    "\n",
+    "Dask is used in multiple fields, such as:\n",
+    "\n",
+    "* Geospatial\n",
+    "* Finance\n",
+    "* Astrophysics\n",
+    "* Microbiology\n",
+    "* Environmental science\n",
+    "\n",
+    "Check out the Dask [use cases](https://stories.dask.org/en/latest/) page, which provides a number of sample workflows."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "* Reference\n",
-    "  * [Docs](https://dask.org/)\n",
-    "  * [Examples](https://examples.dask.org/)\n",
-    "  * [Code](https://github.com/dask/dask/)\n",
-    "  * [Blog](https://blog.dask.org/)\n",
-    "* Ask for help\n",
-    "  * [`dask`](http://stackoverflow.com/questions/tagged/dask) tag on Stack Overflow, for usage questions\n",
-    "  * [github issues](https://github.com/dask/dask/issues/new) for bug reports and feature requests\n",
-    "  * [discourse forum](https://dask.discourse.group/) for general, non-bug, questions and discussion\n",
-    "  * Attend a live tutorial"
+    "## Prepare"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "#### 1. Clone this repository\n",
+    "\n",
+    "    git clone http://github.com/dask/dask-tutorial\n",
+    "\n",
+    "and then install the necessary packages.\n",
+    "There are three different ways to achieve this; pick the one that best suits you, and ***only pick one option***.\n",
+    "They are, in order of preference:\n",
+    "\n",
+    "#### 2a) Create a conda environment (preferred)\n",
+    "\n",
+    "In the main repo directory:\n",
+    "\n",
+    "    conda env create -f binder/environment.yml\n",
+    "    conda activate dask-tutorial\n",
+    "\n",
+    "#### 2b) Install into an existing environment\n",
+    "\n",
+    "You will need the following core libraries:\n",
+    "\n",
+    "    conda install -c conda-forge ipycytoscape jupyterlab python-graphviz matplotlib zarr xarray pooch scipy dask distributed dask-labextension\n",
+    "\n",
+    "Note that these options will alter your existing environment, potentially changing the versions of packages you already\n",
+    "have installed."
    ]
   },
   {
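Whichever install option you choose, a quick sanity check that the core pieces are importable (a sketch, not part of the committed instructions):

    # Confirm the core libraries import and report their versions
    import dask
    import distributed

    print("dask:", dask.__version__)
    print("distributed:", distributed.__version__)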
@@ -108,26 +215,36 @@
     "\n",
     "Each section is a Jupyter notebook. There's a mixture of text, code, and exercises.\n",
     "\n",
+    "0. [Overview](00_overview.ipynb) - Dask's place in the universe.\n",
+    "\n",
+    "1. [Dataframe](01_dataframe.ipynb) - parallelized operations on many pandas dataframes spread across your cluster.\n",
+    "\n",
+    "2. [Array](02_array.ipynb) - blocked NumPy-like functionality with a collection of NumPy arrays spread across your cluster.\n",
+    "\n",
+    "3. [Delayed](03_dask.delayed.ipynb) - the single-function way to parallelize general Python code.\n",
+    "\n",
+    "4. [Deployment/Distributed](04_distributed.ipynb) - Dask's scheduler for clusters, with details of how to view the UI.\n",
+    "\n",
+    "5. [Distributed Futures](05_futures.ipynb) - non-blocking results that compute asynchronously.\n",
+    "\n",
+    "6. Conclusion\n",
+    "\n",
     "If you haven't used JupyterLab, it's similar to the Jupyter Notebook. If you haven't used the Notebook, the quick intro is:\n",
     "\n",
     "1. There are two modes: command and edit\n",
     "2. From command mode, press `Enter` to edit a cell (like this markdown cell)\n",
     "3. From edit mode, press `Esc` to change to command mode\n",
     "4. Press `shift+enter` to execute a cell and move to the next cell.\n",
     "\n",
-    "The toolbar has commands for executing, converting, and creating cells.\n",
-    "\n",
-    "The layout of the tutorial will be as follows:\n",
-    "- Foundations: an explanation of what Dask is, how it works, and how to use lower-level primitives to set up computations. Casual users may wish to skip this section, although we consider it useful knowledge for all users.\n",
-    "- Distributed: information on running Dask on the distributed scheduler, which enables scale-up to distributed settings and enhanced monitoring of task operations. The distributed scheduler is now generally the recommended engine for executing task work, even on single workstations or laptops.\n",
-    "- Collections: convenient abstractions giving a familiar feel to big data\n",
-    "  - bag: Python iterators with a functional paradigm, such as found in func/iter-tools and toolz - generalize lists/generators to big data; this will seem very familiar to users of PySpark's [RDD](http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.RDD)\n",
-    "  - array: massive multi-dimensional numerical data, with Numpy functionality\n",
-    "  - dataframes: massive tabular data, with Pandas functionality\n",
-    "\n",
-    "Whereas there is a wealth of information in the documentation, linked above, here we aim to give practical advice to aid your understanding and application of Dask in everyday situations. This means that you should not expect every feature of Dask to be covered, but the examples hopefully are similar to the kinds of work-flows that you have in mind.\n",
-    "\n",
-    "## Exercise: Print `Hello, world!`\n",
+    "The toolbar has commands for executing, converting, and creating cells."
+   ]
+  },
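Notebook 3 in the outline above calls Delayed "the single-function way to parallelize general Python code"; a minimal illustrative taste of what that means (not taken from the notebook):

    import dask

    @dask.delayed
    def double(x):
        return 2 * x

    # Each call just records a task; dask.compute() runs them in parallel
    lazy_results = [double(i) for i in range(10)]
    results = dask.compute(*lazy_results)
    print(sum(results))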
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Exercise: Print `Hello, world!`\n",
     "Each notebook will have exercises for you to solve. You'll be given a blank or partially completed cell, followed by a hidden cell with a solution. For example:\n",
     "\n",
     "\n",
@@ -157,12 +274,38 @@
    "metadata": {
     "jupyter": {
      "source_hidden": true
-    }
+    },
+    "tags": []
    },
    "outputs": [],
    "source": [
     "print(\"Hello, world!\")"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Useful Links"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "* Reference\n",
+    "  * [Docs](https://dask.org/)\n",
+    "  * [Examples](https://examples.dask.org/)\n",
+    "  * [Code](https://github.com/dask/dask/)\n",
+    "  * [Blog](https://blog.dask.org/)\n",
+    "* Ask for help\n",
+    "  * [`dask`](http://stackoverflow.com/questions/tagged/dask) tag on Stack Overflow, for usage questions\n",
+    "  * [GitHub issues](https://github.com/dask/dask/issues/new) for bug reports and feature requests\n",
+    "  * [Discourse forum](https://dask.discourse.group/) for general, non-bug questions and discussion\n",
+    "  * Attend a live tutorial"
+   ]
   }
  ],
  "metadata": {
@@ -182,7 +325,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.10"
+   "version": "3.9.13"
   }
 },
 "nbformat": 4,

README.md

Lines changed: 1 addition & 21 deletions

@@ -6,27 +6,7 @@ This tutorial was last given at SciPy 2020 which was a virtual conference.
 [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dask/dask-tutorial/main?urlpath=lab)
 [![Build Status](https://github.com/dask/dask-tutorial/workflows/CI/badge.svg)](https://github.com/dask/dask-tutorial/actions?query=workflow%3ACI)
 
-Dask provides multi-core execution on larger-than-memory datasets.
-
-We can think of dask at a high and a low level
-
-* **High-level collections:** Dask provides high-level Array, Bag, and DataFrame
-  collections that mimic NumPy, lists, and Pandas but can operate in parallel on
-  datasets that don't fit into main memory. Dask's high-level collections are
-  alternatives to NumPy and Pandas for large datasets.
-* **Low-level schedulers:** Dask provides dynamic task schedulers that
-  execute task graphs in parallel. These execution engines power the
-  high-level collections mentioned above but can also power custom,
-  user-defined workloads. These schedulers are low-latency (around 1ms) and
-  work hard to run computations in a small memory footprint. Dask's
-  schedulers are an alternative to direct use of `threading` or
-  `multiprocessing` libraries in complex cases or other task scheduling
-  systems like `Luigi` or `IPython parallel`.
-
-Different users operate at different levels but it is useful to understand
-both. This tutorial will interleave between high-level use of `dask.array` and
-`dask.dataframe` (even sections) and low-level use of dask graphs and
-schedulers (odd sections.)
+Dask is a parallel and distributed computing library that scales the existing Python and PyData ecosystem. Dask can scale up to your full laptop capacity and out to a cloud cluster.
 
 ## Prepare
binder/environment.yml

Lines changed: 2 additions & 0 deletions

@@ -15,6 +15,8 @@ dependencies:
   - pip
   - python-graphviz
   - ipycytoscape
+  - pyarrow
+  - s3fs
   - zarr
   - pooch
   - xarray
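The two new dependencies line up with the notebook's new example: pyarrow provides the Parquet reader and s3fs provides the s3:// filesystem that `dd.read_parquet` relies on. A quick check that both are present (a sketch, not part of the commit):

    # Both imports must succeed for dd.read_parquet("s3://...") to work
    import pyarrow
    import s3fs

    print("pyarrow:", pyarrow.__version__)
    print("s3fs:", s3fs.__version__)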

images/distributed-overview.png

98.5 KB (new binary file)

images/high_vs_low_level_coll_analogy.png

123 KB (new binary file)
