
Commit 877f21a
Merge pull request #157 from raybellwaves/setup-data
Move the prep data to the notebooks
2 parents: b385a20 + f894666

11 files changed: +114 additions, -78 deletions

00_overview.ipynb (3 additions, 17 deletions)

@@ -65,8 +65,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n",
-"You should clone this repository\n",
+"You should clone this repository: \n",
 "\n",
 " git clone http://github.com/dask/dask-tutorial\n",
 "\n",
@@ -75,20 +74,7 @@
 " conda env create -f binder/environment.yml\n",
 " conda activate dask-tutorial\n",
 " \n",
-"Do this *before* running this notebook\n",
-" \n",
-"Finally, run the following script to download and create data for analysis."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"# in directory dask-tutorial/\n",
-"# this takes a little while\n",
-"%run prep.py"
+"Do this *before* running this notebook."
 ]
 },
 {
@@ -194,7 +180,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,
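The change above removes the blanket `%run prep.py` cell from the overview and, in the notebooks that follow, replaces it with per-dataset calls such as `%run prep.py -d flights`. A minimal sketch of the dataset-selection CLI this implies is below; it is hypothetical (the dataset names match the flags used in the diffs, but the real prep.py in dask/dask-tutorial may structure this differently):

```python
# Hypothetical sketch of a prep.py-style dataset-selection CLI.
# The dataset names mirror the -d flags seen in the notebook diffs;
# the actual prep.py may differ.
import argparse

DATASETS = {
    "flights": lambda: print("creating flights data..."),
    "accounts": lambda: print("creating accounts data..."),
    "random": lambda: print("creating random array data..."),
    "weather": lambda: print("creating weather data..."),
}

def main(argv=None):
    parser = argparse.ArgumentParser(description="Create tutorial data")
    parser.add_argument("-d", "--dataset", choices=sorted(DATASETS),
                        help="create only this dataset")
    args = parser.parse_args(argv)
    if args.dataset:
        DATASETS[args.dataset]()          # create just the requested dataset
    else:
        for make in DATASETS.values():    # no flag: create everything
            make()
    return args.dataset

if __name__ == "__main__":
    main()
```

With a CLI like this, `%run prep.py -d flights` in a notebook only builds the data that notebook needs, instead of everything at once.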

01_dask.delayed.ipynb (4 additions, 4 deletions)

@@ -397,9 +397,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Prep data\n",
+"## Create data\n",
 "\n",
-"First, run this code to prep some data, if you have not already done so.\n",
+"Run this code to prep some data.\n",
 "\n",
 "This downloads and extracts some historical flight data for flights out of NYC between 1990 and 2000. The data is originally from [here](http://stat-computing.org/dataexpo/2009/the-data.html)."
 ]
@@ -410,7 +410,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%run prep.py"
+"%run prep.py -d flights"
 ]
 },
 {
@@ -736,7 +736,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,

01x_lazy.ipynb (10 additions, 1 deletion)

@@ -168,6 +168,15 @@
 "Consider reading three CSV files with `pd.read_csv` and then measuring their total length. We will consider how you would do this with ordinary Python code, then build a graph for this process using delayed, and finally execute this graph using Dask, for a handy speed-up factor of more than two (there are only three inputs to parallelize over)."
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d accounts"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -573,7 +582,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,

02_bag.ipynb (31 additions, 1 deletion)

@@ -37,6 +37,29 @@
 "* [Bag examples](https://examples.dask.org/bag.html)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Create data"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d accounts"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Setup"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -655,6 +678,13 @@
 "source": [
 "client.shutdown()"
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": []
 }
 ],
 "metadata": {
@@ -674,7 +704,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,

03_array.ipynb (30 additions, 27 deletions)

@@ -36,6 +36,29 @@
 "* [Array examples](https://examples.dask.org/array.html)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Create data"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d random"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Setup"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -67,23 +90,12 @@
 "We do exactly this with Python and NumPy in the following example:"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"**Create random dataset**"
-]
-},
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
-"# create data if it doesn't already exist\n",
-"from prep import random_array\n",
-"random_array() \n",
-"\n",
 "# Load data with h5py\n",
 "# this creates a pointer to the data, but does not actually load\n",
 "import h5py\n",
@@ -155,15 +167,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Compute the mean of the array"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
+"# Compute the mean of the array\n",
 "sums = []\n",
 "lengths = []\n",
 "for i in range(0, 1000000000, 1000000):\n",
@@ -173,7 +177,7 @@
 "\n",
 "total = sum(sums)\n",
 "length = sum(lengths)\n",
-"print(total / length)\n"
+"print(total / length)"
 ]
 },
 {
@@ -510,8 +514,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from prep import create_weather # Prep data if it doesn't exist\n",
-"create_weather()"
+"%run prep.py -d weather"
 ]
 },
 {
@@ -637,7 +640,7 @@
 },
 "outputs": [],
 "source": [
-"# complete the following\n",
+"# complete the following:\n",
 "fig = plt.figure(figsize=(16, 8))\n",
 "plt.imshow(..., cmap='RdBu_r')"
 ]
@@ -755,7 +758,7 @@
 "\n",
 "result = x[:, ::2, ::2]\n",
 "\n",
-"da.to_zarr(result, os.path.join('data', 'myfile.zarr'), overwrite=True)\n"
+"da.to_zarr(result, os.path.join('data', 'myfile.zarr'), overwrite=True)"
 ]
 },
 {
@@ -830,7 +833,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Notice that the most time consuming function is `distances`."
+"Notice that the most time consuming function is `distances`:"
 ]
 },
 {
@@ -973,7 +976,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,
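One of the hunks in 03_array.ipynb consolidates the notebook's blocked-mean example, which accumulates per-chunk sums and counts over an on-disk array and combines them at the end. The same pattern, sketched self-contained with a plain Python list standing in for the tutorial's HDF5 dataset:

```python
# Blocked mean, as in the notebook's example: accumulate per-chunk sums
# and counts, then combine. A plain Python list stands in for the on-disk
# random.hdf5 dataset used in the tutorial.
data = list(range(10_000))          # stand-in for the HDF5 dataset

sums = []
lengths = []
chunk = 1_000
for i in range(0, len(data), chunk):
    block = data[i:i + chunk]       # touch one chunk at a time
    sums.append(sum(block))
    lengths.append(len(block))

total = sum(sums)
length = sum(lengths)
print(total / length)               # 4999.5, the mean of 0..9999
```

Because only one chunk is held at a time, this works even when the full array does not fit in memory; the notebook's version reads each chunk from disk via h5py instead of slicing a list.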

05_distributed.ipynb (10 additions, 1 deletion)

@@ -49,6 +49,15 @@
 "Lets see the difference for the familiar case of the flights data"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d flights"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -323,7 +332,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,

06_distributed_advanced.ipynb (19 additions, 1 deletion)

@@ -237,6 +237,15 @@
 "Generally, any Dask operation that is executed using `.compute()` can be submitted for asynchronous execution using `c.compute()` instead, and this applies to all collections. Here is an example with the calculation previously seen in the Bag chapter. We have replaced the `.compute()` method there with the distributed client version, so, again, we could continue to submit more work (perhaps based on the result of the calculation), or, in the next cell, follow the progress of the computation. A similar progress-bar appears in the monitoring UI page."
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d accounts"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -305,6 +314,15 @@
 "In the example here, we repeat a calculation from the Array chapter - notice that each call to `compute()` is roughly the same speed, because the loading of the data is included every time."
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d random"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -675,7 +693,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,
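The first hunk above sits in a section explaining that `c.compute()` submits work asynchronously and returns a future, rather than blocking the way `.compute()` does. That submit-now, gather-later pattern can be sketched with the standard library's futures; this is an analogy only, since `dask.distributed` futures run on a cluster and carry a richer API:

```python
# Submit work asynchronously and gather the result later, analogous in
# spirit to dask.distributed's client.compute() returning a Future.
from concurrent.futures import ThreadPoolExecutor

def total_length(files):
    # stand-in for a real computation over several inputs
    return sum(len(f) for f in files)

with ThreadPoolExecutor() as pool:
    fut = pool.submit(total_length, ["abc", "de"])  # returns immediately
    # ... more work could be submitted here while the task runs ...
    result = fut.result()                           # block and gather
print(result)  # 5
```

As the notebook text notes, the advantage of the asynchronous form is that further work can be submitted (possibly depending on the pending result) before the first computation finishes.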

07_dataframe_storage.ipynb (3 additions, 11 deletions)

@@ -39,14 +39,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Setup"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"Create data if we don't have any"
+"## Create data"
 ]
 },
 {
@@ -55,8 +48,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from prep import accounts_csvs\n",
-"accounts_csvs()\n"
+"%run prep.py -d accounts"
 ]
 },
 {
@@ -389,7 +381,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,

Homework.ipynb (2 additions, 2 deletions)

@@ -61,7 +61,7 @@
 "\n",
 "* Use `dask.bag` to inspect the data\n",
 "* Combine `dask.bag` with `nltk` or `gensim` to perform textual analyis on the data\n",
-"* Reproduce the work of [Daniel Rodriguez](http://danielfrg.com/blog/2015/07/21/reproduceit-reddit-word-count-dask/) and see if you can improve upon his speeds when analyzing this data."
+"* Reproduce the work of [Daniel Rodriguez](https://extrapolations.dev/blog/2015/07/reproduceit-reddit-word-count-dask/) and see if you can improve upon his speeds when analyzing this data."
 ]
 },
 {
@@ -111,7 +111,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.1"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,
