Commit a49c62c

wip
1 parent 3fa3ecd commit a49c62c

File tree: 6 files changed, +620 −42 lines changed

01_dask.delayed.ipynb

Lines changed: 4 additions & 16 deletions
@@ -251,11 +251,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"jupyter": {
-"source_hidden": true
-}
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "results = []\n",
@@ -343,11 +339,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"jupyter": {
-"source_hidden": true
-}
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "results = []\n",
@@ -649,11 +641,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"jupyter": {
-"source_hidden": true
-}
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "# This is just one possible solution, there are\n",
@@ -717,7 +705,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Close the Client\n",
+"## Shutdown\n",
 "\n",
 "Before moving on to the next exercise, make sure to close your client or stop this kernel."
 ]
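The renamed "Shutdown" cell asks readers to close their client before moving on. As a minimal sketch (the `n_workers` and `processes` settings here are illustrative, not taken from the notebook), closing a local `dask.distributed` client looks like:

```python
from dask.distributed import Client

# start a small in-process local cluster; settings are illustrative only
client = Client(n_workers=2, processes=False)

# ... run computations ...

# release the workers and scheduler before moving to the next notebook
client.close()
```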

01x_lazy.ipynb

Lines changed: 6 additions & 9 deletions
@@ -35,11 +35,12 @@
 "As Python programmers, you probably already perform certain *tricks* to enable computation of larger-than-memory datasets, parallel execution or delayed/background execution. Perhaps with this phrasing, it is not clear what we mean, but a few examples should make things clearer. The point of Dask is to make simple things easy and complex things possible!\n",
 "\n",
 "Aside from the [detailed introduction](http://dask.pydata.org/en/latest/), we can summarize the basics of Dask as follows:\n",
+"\n",
 "- process data that doesn't fit into memory by breaking it into blocks and specifying task chains\n",
 "- parallelize execution of tasks across cores and even nodes of a cluster\n",
-"- move computation to the data rather than the other way around, to minimize communication overheads\n",
+"- move computation to the data rather than the other way around, to minimize communication overhead\n",
 "\n",
-"All of this allows you to get the most out of your computation resources, but program in a way that is very familiar: for-loops to build basic tasks, Python iterators, and the Numpy (array) and Pandas (dataframe) functions for multi-dimensional or tabular data, respectively.\n",
+"All of this allows you to get the most out of your computation resources, but program in a way that is very familiar: for-loops to build basic tasks, Python iterators, and the NumPy (array) and Pandas (dataframe) functions for multi-dimensional or tabular data, respectively.\n",
 "\n",
 "The remainder of this notebook will take you through the first of these programming paradigms. This is more detail than some users will want, who can skip ahead to the iterator, array and dataframe sections; but there will be some data processing tasks that don't easily fit into those abstractions and need to fall back to the methods here.\n",
 "\n",
@@ -81,7 +82,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Here we have used the delayed annotation to show that we want these functions to operate lazily - to save the set of inputs and execute only on demand. `dask.delayed` is also a function which can do this, without the annotation, leaving the original function unchanged, e.g., \n",
+"Here we have used the delayed annotation to show that we want these functions to operate lazily to save the set of inputs and execute only on demand. `dask.delayed` is also a function which can do this, without the annotation, leaving the original function unchanged, e.g.,\n",
 "```python\n",
 " delayed_inc = delayed(inc)\n",
 "```"
@@ -146,7 +147,7 @@
 "\n",
 "By building a specification of the calculation we want to carry out before executing anything, we can pass the specification to an *execution engine* for evaluation. In the case of Dask, this execution engine could be running on many nodes of a cluster, so you have access to the full number of CPU cores and memory across all the machines. Dask will intelligently execute your calculation with care for minimizing the amount of data held in memory, while parallelizing over the tasks that make up a graph. Notice that in the animated diagram below, where four workers are processing the (simple) graph, execution progresses vertically up the branches first, so that intermediate results can be expunged before moving onto a new branch.\n",
 "\n",
-"With `delayed` and normal pythonic looped code, very complex graphs can be built up and passed on to Dask for execution. See a nice example of [simulated complex ETL](http://matthewrocklin.com/blog/work/2017/01/24/dask-custom) work flow.\n",
+"With `delayed` and normal pythonic looped code, very complex graphs can be built up and passed on to Dask for execution. See a nice example of [simulated complex ETL](https://blog.dask.org/2017/01/24/dask-custom) work flow.\n",
 "\n",
 "![this](images/grid_search_schedule.gif)"
 ]
@@ -251,11 +252,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"jupyter": {
-"source_hidden": true
-}
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "## verbose version\n",
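The prose edited above describes `delayed` both as an annotation (decorator) and as a plain wrapping function. A minimal runnable sketch of both forms (the `inc`/`add` function names follow the notebook's naming; the concrete arguments are illustrative):

```python
from dask import delayed

@delayed
def inc(x):
    # annotated form: calls to inc() now build tasks instead of running
    return x + 1

def add(x, y):
    return x + y

# function form: wraps add without changing the original function
delayed_add = delayed(add)

# this builds a task graph lazily; nothing has executed yet
total = delayed_add(inc(1), inc(2))

# execution happens only on demand
result = total.compute()  # -> 5
```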

03_array.ipynb

Lines changed: 15 additions & 6 deletions
@@ -210,7 +210,8 @@
 "outputs": [],
 "source": [
 "import dask.array as da\n",
-"x = da.from_array(dset, chunks=(1000000,))"
+"x = da.from_array(dset, chunks=(1000000,))\n",
+"x"
 ]
 },
 {
@@ -554,7 +555,7 @@
 "import matplotlib.pyplot as plt\n",
 "\n",
 "fig = plt.figure(figsize=(16, 8))\n",
-"plt.imshow(dsets[0][::4, ::4], cmap='RdBu_r')"
+"plt.imshow(dsets[0][::4, ::4], cmap='RdBu_r');"
 ]
 },
 {
@@ -644,12 +645,16 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"jupyter": {
+"source_hidden": true
+}
+},
 "outputs": [],
 "source": [
 "result = x.mean(axis=0)\n",
 "fig = plt.figure(figsize=(16, 8))\n",
-"plt.imshow(result, cmap='RdBu_r')"
+"plt.imshow(result, cmap='RdBu_r');"
 ]
 },
 {
@@ -669,12 +674,16 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {},
+"metadata": {
+"jupyter": {
+"source_hidden": true
+}
+},
 "outputs": [],
 "source": [
 "result = x[0] - x.mean(axis=0)\n",
 "fig = plt.figure(figsize=(16, 8))\n",
-"plt.imshow(result, cmap='RdBu_r');"
+"plt.imshow(result, cmap='RdBu_r');"
 ]
 },
 {
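In the diff above, `dset` is the notebook's HDF5 dataset. Substituting a NumPy array in its place (an assumption made purely for self-containment), the `chunks=(1000000,)` call splits the data into fixed-size blocks that Dask reduces one task at a time:

```python
import numpy as np
import dask.array as da

# stand-in for the notebook's HDF5 dataset `dset`
data = np.arange(4_000_000, dtype="float64")

# split into four 1,000,000-element blocks, as in the diff
x = da.from_array(data, chunks=(1_000_000,))

print(x.chunks)            # block sizes along each axis
total = x.sum().compute()  # the reduction runs one task per block
```

Returning `x` as the last expression of a notebook cell (the change made in the first hunk) displays Dask's HTML summary of the chunked array, which is why the diff adds that bare `"x"` line.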

05_distributed.ipynb

Lines changed: 2 additions & 1 deletion
@@ -80,7 +80,8 @@
 "for sch in ['threading', 'processes', 'sync']:\n",
 " t0 = time.time()\n",
 " _ = largest_delay.compute(scheduler=sch)\n",
-" print(sch, time.time() - t0)"
+" t1 = time.time()\n",
+" print(f\"{sch:>10}, {t1 - t0:0.4f}\")"
 ]
 },
 {
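The timing loop above compares Dask's single-machine schedulers on the notebook's `largest_delay` graph. A self-contained sketch of the same pattern (the `slow_inc` task and sleep duration are illustrative stand-ins, and `processes` is omitted here to keep the example light):

```python
import time
from dask import delayed

@delayed
def slow_inc(x):
    time.sleep(0.1)  # simulate I/O-like work that releases the GIL
    return x + 1

# eight independent tasks feeding one sum, like the notebook's graph
total = delayed(sum)([slow_inc(i) for i in range(8)])

for sch in ["sync", "threading"]:
    t0 = time.time()
    result = total.compute(scheduler=sch)
    t1 = time.time()
    # same right-aligned format as the edited print line
    print(f"{sch:>10}, {t1 - t0:0.4f}")
```

Because the sleeps release the GIL, the `threading` scheduler overlaps them, while `sync` runs every task sequentially in the main thread.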
