
Commit 877f21a
Merge pull request #157 from raybellwaves/setup-data
Move the prep data to the notebooks
2 parents: b385a20 + f894666

11 files changed: +114 additions, -78 deletions

00_overview.ipynb (3 additions, 17 deletions)

@@ -65,8 +65,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n",
-"You should clone this repository\n",
+"You should clone this repository: \n",
 "\n",
 " git clone http://github.com/dask/dask-tutorial\n",
 "\n",
@@ -75,20 +74,7 @@
 " conda env create -f binder/environment.yml\n",
 " conda activate dask-tutorial\n",
 " \n",
-"Do this *before* running this notebook\n",
-" \n",
-"Finally, run the following script to download and create data for analysis."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"# in directory dask-tutorial/\n",
-"# this takes a little while\n",
-"%run prep.py"
+"Do this *before* running this notebook."
 ]
 },
 {
@@ -194,7 +180,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,
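The change above removes the blanket `%run prep.py` cell from the overview and, in the notebooks that follow, replaces it with per-dataset calls such as `%run prep.py -d flights`. A minimal sketch of the dataset-selection CLI this implies is below; it is hypothetical (the dataset names match the flags used in the diffs, but the real prep.py in dask/dask-tutorial may structure this differently):

```python
# Hypothetical sketch of a prep.py-style dataset-selection CLI.
# The dataset names mirror the -d flags seen in the notebook diffs;
# the actual prep.py may differ.
import argparse

DATASETS = {
    "flights": lambda: print("creating flights data..."),
    "accounts": lambda: print("creating accounts data..."),
    "random": lambda: print("creating random array data..."),
    "weather": lambda: print("creating weather data..."),
}

def main(argv=None):
    parser = argparse.ArgumentParser(description="Create tutorial data")
    parser.add_argument("-d", "--dataset", choices=sorted(DATASETS),
                        help="create only this dataset")
    args = parser.parse_args(argv)
    if args.dataset:
        DATASETS[args.dataset]()          # create just the requested dataset
    else:
        for make in DATASETS.values():    # no flag: create everything
            make()
    return args.dataset

if __name__ == "__main__":
    main()
```

With a CLI like this, `%run prep.py -d flights` in a notebook only builds the data that notebook needs, instead of everything at once.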

01_dask.delayed.ipynb (4 additions, 4 deletions)

@@ -397,9 +397,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Prep data\n",
+"## Create data\n",
 "\n",
-"First, run this code to prep some data, if you have not already done so.\n",
+"Run this code to prep some data.\n",
 "\n",
 "This downloads and extracts some historical flight data for flights out of NYC between 1990 and 2000. The data is originally from [here](http://stat-computing.org/dataexpo/2009/the-data.html)."
 ]
@@ -410,7 +410,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%run prep.py"
+"%run prep.py -d flights"
 ]
 },
 {
@@ -736,7 +736,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,

01x_lazy.ipynb (10 additions, 1 deletion)

@@ -168,6 +168,15 @@
 "Consider reading three CSV files with `pd.read_csv` and then measuring their total length. We will consider how you would do this with ordinary Python code, then build a graph for this process using delayed, and finally execute this graph using Dask, for a handy speed-up factor of more than two (there are only three inputs to parallelize over)."
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d accounts"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -573,7 +582,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,

02_bag.ipynb (31 additions, 1 deletion)

@@ -37,6 +37,29 @@
 "* [Bag examples](https://examples.dask.org/bag.html)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Create data"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d accounts"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Setup"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -655,6 +678,13 @@
 "source": [
 "client.shutdown()"
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": []
 }
 ],
 "metadata": {
@@ -674,7 +704,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,

03_array.ipynb (30 additions, 27 deletions)

@@ -36,6 +36,29 @@
 "* [Array examples](https://examples.dask.org/array.html)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Create data"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d random"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Setup"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -67,23 +90,12 @@
 "We do exactly this with Python and NumPy in the following example:"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"**Create random dataset**"
-]
-},
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
-"# create data if it doesn't already exist\n",
-"from prep import random_array\n",
-"random_array() \n",
-"\n",
 "# Load data with h5py\n",
 "# this creates a pointer to the data, but does not actually load\n",
 "import h5py\n",
@@ -155,15 +167,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Compute the mean of the array"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
+"# Compute the mean of the array\n",
 "sums = []\n",
 "lengths = []\n",
 "for i in range(0, 1000000000, 1000000):\n",
@@ -173,7 +177,7 @@
 "\n",
 "total = sum(sums)\n",
 "length = sum(lengths)\n",
-"print(total / length)\n"
+"print(total / length)"
 ]
 },
 {
@@ -510,8 +514,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from prep import create_weather # Prep data if it doesn't exist\n",
-"create_weather()"
+"%run prep.py -d weather"
 ]
 },
 {
@@ -637,7 +640,7 @@
 },
 "outputs": [],
 "source": [
-"# complete the following\n",
+"# complete the following:\n",
 "fig = plt.figure(figsize=(16, 8))\n",
 "plt.imshow(..., cmap='RdBu_r')"
 ]
@@ -755,7 +758,7 @@
 "\n",
 "result = x[:, ::2, ::2]\n",
 "\n",
-"da.to_zarr(result, os.path.join('data', 'myfile.zarr'), overwrite=True)\n"
+"da.to_zarr(result, os.path.join('data', 'myfile.zarr'), overwrite=True)"
 ]
 },
 {
@@ -830,7 +833,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Notice that the most time consuming function is `distances`."
+"Notice that the most time consuming function is `distances`:"
 ]
 },
 {
@@ -973,7 +976,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,
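One of the hunks in 03_array.ipynb consolidates the notebook's blocked-mean example, which accumulates per-chunk sums and counts over an on-disk array and combines them at the end. The same pattern, sketched self-contained with a plain Python list standing in for the tutorial's HDF5 dataset:

```python
# Blocked mean, as in the notebook's example: accumulate per-chunk sums
# and counts, then combine. A plain Python list stands in for the on-disk
# random.hdf5 dataset used in the tutorial.
data = list(range(10_000))          # stand-in for the HDF5 dataset

sums = []
lengths = []
chunk = 1_000
for i in range(0, len(data), chunk):
    block = data[i:i + chunk]       # touch one chunk at a time
    sums.append(sum(block))
    lengths.append(len(block))

total = sum(sums)
length = sum(lengths)
print(total / length)               # 4999.5, the mean of 0..9999
```

Because only one chunk is held at a time, this works even when the full array does not fit in memory; the notebook's version reads each chunk from disk via h5py instead of slicing a list.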

05_distributed.ipynb (10 additions, 1 deletion)

@@ -49,6 +49,15 @@
 "Lets see the difference for the familiar case of the flights data"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d flights"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -323,7 +332,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,

06_distributed_advanced.ipynb (19 additions, 1 deletion)

@@ -237,6 +237,15 @@
 "Generally, any Dask operation that is executed using `.compute()` can be submitted for asynchronous execution using `c.compute()` instead, and this applies to all collections. Here is an example with the calculation previously seen in the Bag chapter. We have replaced the `.compute()` method there with the distributed client version, so, again, we could continue to submit more work (perhaps based on the result of the calculation), or, in the next cell, follow the progress of the computation. A similar progress-bar appears in the monitoring UI page."
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d accounts"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -305,6 +314,15 @@
 "In the example here, we repeat a calculation from the Array chapter - notice that each call to `compute()` is roughly the same speed, because the loading of the data is included every time."
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"%run prep.py -d random"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -675,7 +693,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,
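The first hunk above sits in a section explaining that `c.compute()` submits work asynchronously and returns a future, rather than blocking the way `.compute()` does. That submit-now, gather-later pattern can be sketched with the standard library's futures; this is an analogy only, since `dask.distributed` futures run on a cluster and carry a richer API:

```python
# Submit work asynchronously and gather the result later, analogous in
# spirit to dask.distributed's client.compute() returning a Future.
from concurrent.futures import ThreadPoolExecutor

def total_length(files):
    # stand-in for a real computation over several inputs
    return sum(len(f) for f in files)

with ThreadPoolExecutor() as pool:
    fut = pool.submit(total_length, ["abc", "de"])  # returns immediately
    # ... more work could be submitted here while the task runs ...
    result = fut.result()                           # block and gather
print(result)  # 5
```

As the notebook text notes, the advantage of the asynchronous form is that further work can be submitted (possibly depending on the pending result) before the first computation finishes.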

07_dataframe_storage.ipynb (3 additions, 11 deletions)

@@ -39,14 +39,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Setup"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"Create data if we don't have any"
+"## Create data"
 ]
 },
 {
@@ -55,8 +48,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from prep import accounts_csvs\n",
-"accounts_csvs()\n"
+"%run prep.py -d accounts"
 ]
 },
 {
@@ -389,7 +381,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,

Homework.ipynb (2 additions, 2 deletions)

@@ -61,7 +61,7 @@
 "\n",
 "* Use `dask.bag` to inspect the data\n",
 "* Combine `dask.bag` with `nltk` or `gensim` to perform textual analyis on the data\n",
-"* Reproduce the work of [Daniel Rodriguez](http://danielfrg.com/blog/2015/07/21/reproduceit-reddit-word-count-dask/) and see if you can improve upon his speeds when analyzing this data."
+"* Reproduce the work of [Daniel Rodriguez](https://extrapolations.dev/blog/2015/07/reproduceit-reddit-word-count-dask/) and see if you can improve upon his speeds when analyzing this data."
 ]
 },
 {
@@ -111,7 +111,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.1"
+"version": "3.7.6"
 }
 },
 "nbformat": 4,
