Add more about retries in futures notebook (#258)

jsignell · web-flow · commit 0472d3cc55bc · 2022-07-10T12:33:58.000-04:00
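
For reviewers who want the retry flow at a glance, here is a minimal sketch of the pattern the new notebook cells walk through: submit, inspect for failures, retry by hand, then retry automatically. It is an illustration only. It uses the standard library's `concurrent.futures` (which the notebook says the Dask futures interface derives from) rather than a Dask `Client`. The `with_retries` helper and the `calls` counter are hypothetical names, and `flaky_inc` is made deterministic here (multiples of 3 fail on their first attempt) instead of failing at random as it does in the notebook.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

calls = Counter()  # how many times each input has been attempted


def flaky_inc(i):
    """Deterministic stand-in for a flaky task: multiples of 3 fail on their first attempt."""
    calls[i] += 1
    if i % 3 == 0 and calls[i] == 1:
        raise ValueError("You hit the error!")
    return i + 1


def with_retries(func, retries):
    """Hypothetical helper: wrap func so it is re-run up to `retries` times after a failure."""
    def wrapper(*args):
        for attempt in range(retries + 1):
            try:
                return func(*args)
            except ValueError:
                if attempt == retries:
                    raise  # out of retries, surface the error
    return wrapper


with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns immediately; a failure is stored on the future, not raised here.
    futures = [pool.submit(flaky_inc, i) for i in range(10)]
    failed = [i for i, f in enumerate(futures) if f.exception() is not None]

    # Manual retry: resubmit only the inputs whose futures failed.
    for i in failed:
        futures[i] = pool.submit(flaky_inc, i)
    manual_total = sum(f.result() for f in futures)

    # Automatic retry: the wrapper absorbs the first failure of each task.
    calls.clear()
    auto_total = sum(pool.map(with_retries(flaky_inc, retries=5), range(10)))
```

In the notebook itself the same two steps are expressed with Dask's own API: `future.retry()` for the manual case and the `retries=` keyword for the automatic one.
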
diff --git a/05_futures.ipynb b/05_futures.ipynb
@@ -11,10 +11,12 @@
     "\n",
     "# Futures - non-blocking distributed calculations\n",
     "\n",
-    "In the previous chapter, we showed that executing a calculation (created using delayed) with the distributed executor is identical to any other executor. However, we now have access to additional functionality, and control over what data is held in memory.\n",
+    "Submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. \n",
     "\n",
     "The `futures` interface (derived from the built-in `concurrent.futures`) provide fine-grained real-time execution for custom situations. We can submit individual functions for evaluation with one set of inputs, or evaluated over a sequence of inputs with `submit()` and `map()`. The call returns immediately, giving one or more *futures*, whose status begins as \"pending\" and later becomes \"finished\". There is no blocking of the local Python session.\n",
     "\n",
+    "This is the important difference between futures and delayed. Both support arbitrary task scheduling, but delayed is lazy (it only constructs a graph), whereas futures are eager: as soon as the inputs are ready and compute resources are free, the computation starts.\n",
+    "\n",
     "**Related Documentation**\n",
     "\n",
     "* [Futures documentation](https://docs.dask.org/en/latest/futures.html)\n",
@@ -183,9 +185,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### `client.compute`\n",
+    "## `client.compute`\n",
     "\n",
-    "Generally, any Dask operation that is executed using `.compute()` (or `dask.compute()` can be submitted for asynchronous execution using `client.compute()` instead, and this applies to all collections.\n",
+    "Generally, any Dask operation that is executed using `.compute()` or `dask.compute()` can be submitted for asynchronous execution using `client.compute()` instead.\n",
     "\n",
     "Here is an example from the delayed notebook:"
    ]
@@ -213,6 +215,13 @@
     "z = add(x, y)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So far we have a regular `dask.delayed` output. When we pass `z` to `client.compute` we get a future back and Dask starts evaluating the task graph. "
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -245,9 +254,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### `client.submit`\n",
+    "## `client.submit`\n",
     "\n",
-    "`client.submit` takes a function and arguments, pushes these to the cluster, returning a *Future* representing the result to be computed. The function is passed to a worker process for evaluation. This looks a lot like doing `client.compute()`, above, except now we are passing the function and arguments directly to the cluster."
+    "`client.submit` takes a function and arguments, pushes these to the cluster, returning a `Future` representing the result to be computed. The function is passed to a worker process for evaluation. This looks a lot like doing `client.compute()`, above, except now we are passing the function and arguments directly to the cluster."
    ]
   },
   {
@@ -290,11 +299,150 @@
     "\n",
     "Each future represents a result held, or being evaluated by the cluster. Thus we can control caching of intermediate values - when a future is no longer referenced, its value is forgotten. In the solution, above, futures are held for each of the function calls. These results would not need to be re-evaluated if we chose to submit more work that needed them.\n",
     "\n",
-    "We can explicitly pass data from our local session into the cluster using `client.scatter()`, but usually it is better to construct functions that do the loading of data within the workers themselves, so that there is no need to serialize and communicate the data. Most of the loading functions within Dask, such as `dd.read_csv`, work this way. Similarly, we normally don't want to `gather()` results that are too big in memory.\n",
+    "We can explicitly pass data from our local session into the cluster using `client.scatter()`, but usually it is better to construct functions that do the loading of data within the workers themselves, so that there is no need to serialize and communicate the data. Most of the loading functions within Dask, such as `dd.read_csv`, work this way. Similarly, we normally don't want to `gather()` results that are too big in memory."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example: Sporadically failing task\n",
+    "\n",
+    "Let's imagine a task that sometimes fails. You might encounter this when dealing with input data where sometimes a file is malformed, or maybe a request times out."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from random import random\n",
+    " \n",
+    "def flaky_inc(i):\n",
+    "    if random() < 0.2:\n",
+    "        raise ValueError(\"You hit the error!\")\n",
+    "    return i + 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you run this function over and over again, it will sometimes fail. \n",
+    "\n",
+    "```python\n",
+    ">>> flaky_inc(2)\n",
+    "---------------------------------------------------------------------------\n",
+    "ValueError                                Traceback (most recent call last)\n",
+    "Input In [65], in <cell line: 1>()\n",
+    "----> 1 flaky_inc(2)\n",
+    "\n",
+    "Input In [61], in flaky_inc(i)\n",
+    "      3 def flaky_inc(i):\n",
+    "      4     if random() < 0.2:\n",
+    "----> 5         raise ValueError(\"You hit the error!\")\n",
+    "      6     return i + 1\n",
+    "\n",
+    "ValueError: You hit the error!\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can run this function on a range of inputs using `client.map`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "futures = client.map(flaky_inc, range(10))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Notice that the cell returned immediately, even if some of the computations failed. We can inspect the futures one by one to find any whose status is `error`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for i, future in enumerate(futures):\n",
+    "    print(i, future.status)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
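+    "You can retry a specific failed future to try to get the task to complete successfully. The next cell retries the future at index 5; substitute the index of a future whose status is `error`:"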
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "futures[5].retry()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for i, future in enumerate(futures):\n",
+    "    print(i, future.status)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A more concise way to handle sporadic failures is to pass the `retries` keyword to `client.compute`, `client.submit`, or `client.map`.\n",
+    "\n",
+    "**Note**: In this example we also need to set `pure=False` to let Dask know that the arguments to the function do not totally determine the output."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "futures = client.map(flaky_inc, range(10), retries=5, pure=False)\n",
+    "future_z = client.submit(sum, futures)\n",
+    "future_z.result()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You will likely see a lot of warnings from the failed attempts, but with five retries per task the computation should eventually succeed."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Why use Futures?\n",
     "\n",
-    "The [full API](http://distributed.readthedocs.io/en/latest/api.html) of the distributed scheduler gives details of interacting with the cluster, which remember, can be on your local machine or possibly on a massive computational resource. \n",
+    "The futures API offers a work submission style that can easily emulate the map/reduce paradigm. If that paradigm is familiar to you, futures might be the simplest entry point into Dask.\n",
     "\n",
-    "The futures API offers a work submission style that can easily emulate the map/reduce paradigm (see `client.map()`) that may be familiar to many people. The intermediate results, represented by futures, can be passed to new tasks without having to bring the pull locally from the cluster, and new work can be assigned to work on the output of previous jobs that haven't even begun yet."
+    "The other big benefit of futures is that the intermediate results, represented by futures, can be passed to new tasks without having to pull data locally from the cluster. New operations can be set up to work on the output of previous jobs that haven't even begun yet."
    ]
   }
  ],