
Commit 4690e84

Merge pull request #141 from TomAugspurger/use-distributed
Use distributed scheduler
2 parents 0005cf3 + af36d42

13 files changed (+162, -61 lines)

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ data/weather-big
 data/myfile.hdf5
 data/flightjson
 data/nycflights
+data/myfile.zarr
 profile.html
 log
 .idea/

01_dask.delayed.ipynb

Lines changed: 39 additions & 3 deletions
@@ -16,6 +16,24 @@
 "This is a simple way to use `dask` to parallelize existing codebases or build [complex systems](http://matthewrocklin.com/blog/work/2018/02/09/credit-models-with-dask). This will also help us to develop an understanding for later sections."
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"As we'll see in the [distributed scheduler notebook](05_distributed.ipynb), Dask has several ways of executing code in parallel. We'll use the distributed scheduler by creating a `dask.distributed.Client`. For now, this will provide us with some nice diagnostics. We'll talk about schedulers in depth later."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"from dask.distributed import Client\n",
+"\n",
+"client = Client()"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -117,7 +135,7 @@
 "outputs": [],
 "source": [
 "%%time\n",
-"# This actually runs our computation using a local thread pool\n",
+"# This actually runs our computation using a local process pool\n",
 "\n",
 "z.compute()"
 ]
@@ -625,6 +643,24 @@
 "- Experiment with delaying the call to `sum`. What does the graph look like if `sum` is delayed? What does the graph look like if it isn't?\n",
 "- Can you think of any reason why you'd want to do the reduction one way over the other?"
 ]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"# Close the Client\n",
+"\n",
+"Before moving on to the next exercise, make sure to close your client or stop this kernel."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"client.close()"
+]
 }
 ],
 "metadata": {
@@ -643,9 +679,9 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.7"
+"version": "3.7.3"
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }
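
Taken together, the changes to this notebook follow one pattern: open a `dask.distributed.Client` at the top, compute as usual, close it at the end. A minimal runnable sketch of that pattern; the `inc`/`add` functions are illustrative stand-ins, not part of this diff:

```python
# Minimal sketch of the pattern this diff adds to the notebook:
# start a local distributed Client, compute a delayed graph, close it.
from dask import delayed
from dask.distributed import Client

@delayed
def inc(x):
    # Illustrative slow function; @delayed defers its execution.
    return x + 1

@delayed
def add(x, y):
    return x + y

if __name__ == "__main__":  # guard needed because Client() spawns worker processes
    # Client() with no arguments starts a local cluster and registers
    # itself as the default scheduler, so compute() below runs on it
    # and shows up in the diagnostics dashboard.
    client = Client()

    z = add(inc(1), inc(2))  # builds a task graph; nothing runs yet
    print(z.compute())       # executes the graph on the cluster -> 5

    # Mirrors the notebook's new closing cell.
    client.close()
```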

02_bag.ipynb

Lines changed: 32 additions & 9 deletions
@@ -35,6 +35,24 @@
 "* [Bag API](http://dask.pydata.org/en/latest/bag-api.html)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Again, we'll use the distributed scheduler. Schedulers will be explained in depth [later](05_distributed.ipynb)."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"from dask.distributed import Client\n",
+"\n",
+"client = Client()"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -422,9 +440,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"scrolled": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "%%time\n",
@@ -552,9 +568,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"scrolled": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "def denormalize(record):\n",
@@ -612,12 +626,21 @@
 " a normalised dataframe."
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Shutdown"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
-"source": []
+"source": [
+"client.shutdown()"
+]
 }
 ],
 "metadata": {
@@ -637,9 +660,9 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.7"
+"version": "3.7.3"
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }
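
This notebook ends with `client.shutdown()` rather than the `client.close()` used in 01: `shutdown()` also stops the scheduler and workers, while `close()` only disconnects the client (though closing a client that started its own local cluster tears that cluster down too). A hedged sketch of the bag workflow under a distributed client; the word-count bag is illustrative and unrelated to the notebook's dataset:

```python
import dask.bag as db
from dask.distributed import Client

if __name__ == "__main__":
    client = Client()  # bag operations now run on the local cluster's workers

    # Illustrative data split across two partitions.
    b = db.from_sequence(["apple", "banana", "apple", "cherry"], npartitions=2)
    print(dict(b.frequencies().compute()))  # {'apple': 2, 'banana': 1, 'cherry': 1}

    # shutdown() stops the scheduler and workers as well as this client.
    client.shutdown()
```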

03_array.ipynb

Lines changed: 32 additions & 11 deletions
@@ -31,6 +31,17 @@
 "* [API reference](http://dask.readthedocs.io/en/latest/array-api.html)"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"from dask.distributed import Client\n",
+"\n",
+"client = Client(processes=False)"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -292,9 +303,7 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "source": [
 "Does this match your result from before?"
 ]
@@ -505,6 +514,13 @@
 "dsets[0]"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": []
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -757,9 +773,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"scrolled": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "%time potential(cluster)"
@@ -846,9 +860,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"scrolled": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "e = potential_dask(dcluster)\n",
@@ -881,6 +893,15 @@
 " functions, like ``np.full_like`` have not been implemented purely out of\n",
 " laziness. These would make excellent community contributions."
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"client.shutdown()"
+]
 }
 ],
 "metadata": {
"metadata": {
@@ -900,9 +921,9 @@
900921
"name": "python",
901922
"nbconvert_exporter": "python",
902923
"pygments_lexer": "ipython3",
903-
"version": "3.6.7"
924+
"version": "3.7.3"
904925
}
905926
},
906927
"nbformat": 4,
907-
"nbformat_minor": 1
928+
"nbformat_minor": 4
908929
}
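
Unlike the other notebooks, this one creates its client with `Client(processes=False)`, which keeps the scheduler and workers as threads inside the current process. That suits `dask.array`: NumPy releases the GIL, and in-process workers avoid serializing large blocks between processes. A small sketch under those assumptions; the random-mean computation is illustrative:

```python
import dask.array as da
from dask.distributed import Client

# Threads in this process: no inter-process copies of array blocks,
# and no __main__ guard is needed since nothing is spawned.
client = Client(processes=False)

x = da.random.random((5_000, 5_000), chunks=(1_000, 1_000))
print(x.mean().compute())  # computed chunk-by-chunk on worker threads; ~0.5

client.shutdown()
```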

04_dataframe.ipynb

Lines changed: 23 additions & 3 deletions
@@ -46,6 +46,17 @@
 "## Setup"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"from dask.distributed import Client\n",
+"\n",
+"client = Client()"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -60,7 +71,7 @@
 "outputs": [],
 "source": [
 "from prep import accounts_csvs\n",
-"accounts_csvs(3, 1000000, 500)\n",
+"accounts_csvs()\n",
 "\n",
 "import os\n",
 "import dask\n",
@@ -775,6 +786,15 @@
 " * From any set of functions creating sub dataframes via ``dd.from_delayed``.\n",
 " * Dask.bag: ``mybag.to_dataframe(columns=[...])``"
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"client.shutdown()"
+]
 }
 ],
 "metadata": {
@@ -794,9 +814,9 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.7"
+"version": "3.7.3"
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }
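
The setup change calls `accounts_csvs()` with no arguments, so the sizes previously passed as `(3, 1000000, 500)` presumably now default inside `prep.py`. The surrounding pattern is again client, compute, shutdown; a self-contained sketch using a hypothetical miniature CSV in place of the tutorial's generated data:

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

if __name__ == "__main__":
    client = Client()

    # Hypothetical stand-in for the tutorial's accounts CSVs.
    pd.DataFrame({"id": [1, 2, 1], "amount": [100, 200, 300]}).to_csv(
        "accounts.0.csv", index=False
    )

    df = dd.read_csv("accounts.*.csv")  # one partition per matching file
    print(df.groupby("id").amount.sum().compute())

    client.shutdown()
```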

05_distributed.ipynb

Lines changed: 16 additions & 12 deletions
@@ -94,16 +94,13 @@
 "- How much faster was using threads over a single thread? Why does this differ from the optimal speedup?\n",
 "- Why is the multiprocessing scheduler so much slower here?\n",
 "\n",
+"The `threaded` scheduler is a fine choice for working with large datasets out-of-core on a single machine, as long as the functions being used release the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) most of the time. NumPy and pandas release the GIL in most places, so the `threaded` scheduler is the default for `dask.array` and `dask.dataframe`. The distributed scheduler, perhaps with `processes=False`, will also work well for these workloads on a single machine.\n",
 "\n",
-"For single-machine use, the threaded and multiprocessing schedulers are fine choices. They are solid, mature and performant, and require absolutely no set-up. As a rule of thumb, threaded will work well when the functions called release the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock), whereas multiprocessing will always have a slower start-up time and suffer where a lot of communication is required between tasks. The *number* of workers is, in general, also important.\n",
-"\n"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"For scaling out work across a cluster, the distributed scheduler is required. Indeed, this is now generally preferred for all work, because it gives you additional monitoring information not available in the other schedulers. (Some of this monitoring is also available with an explicit progress bar and profiler, see [here](https://docs.dask.org/en/latest/diagnostics-local.html).)"
+"For workloads that do hold the GIL, as is common with `dask.bag` and custom code wrapped with `dask.delayed`, we recommend using the distributed scheduler, even on a single machine. Generally speaking, it's more intelligent and provides better diagnostics than the `processes` scheduler.\n",
+"\n",
+"https://docs.dask.org/en/latest/scheduling.html provides some additional details on choosing a scheduler.\n",
+"\n",
+"For scaling out work across a cluster, the distributed scheduler is required."
 ]
 },
 {
@@ -147,9 +144,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Be sure to click the `Dashboard` link to open up the diagnostics dashboard.\n",
-"\n",
-"\n",
+"If you aren't in jupyterlab and using the `dask-labextension`, be sure to click the `Dashboard` link to open up the diagnostics dashboard.\n",
 "\n",
 "## Executing with the distributed client"
@@ -299,6 +294,15 @@
 "# Average departure delay per day-of-week\n",
 "_ = df.groupby(df.Date.dt.dayofweek).DepDelay.mean().compute()"
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"client.shutdown()"
+]
 }
 ],
 "metadata": {

06_distributed_advanced.ipynb

Lines changed: 2 additions & 2 deletions
@@ -655,9 +655,9 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.2"
+"version": "3.7.3"
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }

binder/environment.yml

Lines changed: 1 addition & 0 deletions
@@ -30,3 +30,4 @@ dependencies:
 - ipywidgets>=7.5
 - cachey
 - python-graphviz
+- zarr
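
zarr joins the binder environment, and `data/myfile.zarr` joins `.gitignore`, presumably to support a chunked-storage example in the array notebook. A hedged sketch of the dask/zarr round trip; the path matches the newly ignored file, and the `overwrite=True` keyword is assumed so the write can be rerun:

```python
import os
import dask.array as da

os.makedirs("data", exist_ok=True)

x = da.random.random((1_000, 1_000), chunks=(250, 250))

# Each dask chunk maps onto a zarr chunk, so the write runs in parallel.
x.to_zarr("data/myfile.zarr", overwrite=True)

# Reading back yields a dask array whose chunking comes from the store.
y = da.from_zarr("data/myfile.zarr")
print(y.chunks)
```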
