Commit 38522a1

First pass at reading over notebooks (#253)
1 parent 2809508 commit 38522a1

5 files changed: 94 additions & 104 deletions


01_dataframe.ipynb

Lines changed: 56 additions & 79 deletions
@@ -12,17 +12,11 @@
 " alt=\"Dask logo\\\">\n",
 "\n",
 "\n",
-"# Dask DataFrame API\n",
+"# Dask DataFrame - parallelized pandas\n",
 "\n",
-"`dask.dataframe` API is the high-level collection that looks and feels like the pandas API, but for parallel and distributed workflows. At it's core, the `dask.dataframe` module implements a \"blocked parallel\" `DataFrame` object that. One Dask `DataFrame` is comprised of many in-memory pandas `DataFrame`s separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.\n",
+"Looks and feels like the pandas API, but for parallel and distributed workflows. \n",
 "\n",
-"<img src=\"https://docs.dask.org/en/stable/_images/dask-dataframe.svg\"\n",
-" width=\"30%\"\n",
-" alt=\"Dask DataFrame is compsed of pandas DataFrames\"/>\n",
-"\n",
-"The term \"Dask DataFrame\" is slightly overloaded. Depending on the context, it can refer to the API or the DataFrame object (data structure composed on multiple pandas DataFrames shown above). To avoid confusion, throughout this notebook:\n",
-"- `dask.dataframe` (note the all lowercase) refers to the API, and\n",
-"- `dask.dataframe.DataFrame` or simply `DataFrame` (note the CamelCase) refers to the object."
+"At its core, the `dask.dataframe` module implements a \"blocked parallel\" `DataFrame` object. One Dask `DataFrame` is composed of many in-memory pandas `DataFrame`s separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.\n"
 ]
 },
 {
@@ -31,6 +25,19 @@
 "tags": []
 },
 "source": [
+"<img src=\"https://docs.dask.org/en/stable/_images/dask-dataframe.svg\"\n",
+" align=\"right\"\n",
+" width=\"30%\"\n",
+" alt=\"Dask DataFrame is composed of pandas DataFrames\"/>\n",
+"\n",
+"**Related Documentation**\n",
+"\n",
+"* [DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)\n",
+"* [DataFrame screencast](https://youtu.be/AT2XtFehFSQ)\n",
+"* [DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)\n",
+"* [DataFrame examples](https://examples.dask.org/dataframe.html)\n",
+"* [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)\n",
+"\n",
 "## When to use `dask.dataframe`\n",
 "\n",
 "pandas is great for tabular datasets that fit in memory. A general rule of thumb for pandas is:\n",
@@ -43,20 +50,8 @@
 "\n",
 "Dask becomes useful when the datasets exceed the above rule.\n",
 "\n",
-"In this notebook, you will be working with the New York City Airline data. This dataset is only ~200MB, so that you can download it in a reasonable time, but `dask.dataframe` will scale to datasets **much** larger than memory."
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"**Related Documentation**\n",
-"\n",
-"* [DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)\n",
-"* [DataFrame screencast](https://youtu.be/AT2XtFehFSQ)\n",
-"* [DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)\n",
-"* [DataFrame examples](https://examples.dask.org/dataframe.html)\n",
-"* [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)"
+"In this notebook, you will be working with the New York City Airline data. This dataset is only ~200MB, so that you can download it in a reasonable time, but `dask.dataframe` will scale to datasets **much** larger than memory.\n",
+"\n"
 ]
 },
 {
@@ -136,13 +131,6 @@
 "Let's read an extract of flights in the USA across several years. This data is specific to flights out of the three airports in the New York City area."
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"The following filename includes a glob pattern `*`, so all files in the path matching that pattern will be read into the same Dask `DataFrame`."
-]
-},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -157,16 +145,13 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"By convention, we import `dask.dataframe` as `dd`, and call the corresponding Dask DataFrame `ddf`."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"import dask.dataframe as dd"
+"By convention, we import the module `dask.dataframe` as `dd`, and call the corresponding `DataFrame` object `ddf`.\n",
+"\n",
+"**Note**: The term \"Dask DataFrame\" is slightly overloaded. Depending on the context, it can refer to the module or the DataFrame object. To avoid confusion, throughout this notebook:\n",
+"- `dask.dataframe` (note the all lowercase) refers to the module, and\n",
+"- `DataFrame` (note the CamelCase) refers to the object.\n",
+"\n",
+"The following filename includes a glob pattern `*`, so all files in the path matching that pattern will be read into the same `DataFrame`."
 ]
 },
 {
@@ -175,16 +160,10 @@
 "metadata": {},
 "outputs": [],
 "source": [
+"import dask.dataframe as dd\n",
+"\n",
 "ddf = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),\n",
-" parse_dates={'Date': [0, 1, 2]})"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
+" parse_dates={'Date': [0, 1, 2]})\n",
 "ddf"
 ]
 },
@@ -201,7 +180,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Notice that the representation of the DataFrame object contains no data - Dask has just done enough to read the start of the first file, and infer the column names and dtypes."
+"Notice that the representation of the `DataFrame` object contains no data - Dask has just done enough to read the start of the first file, and infer the column names and dtypes."
 ]
 },
 {
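The lazy read described here is easy to see interactively. A minimal sketch, assuming the `data/nycflights` CSVs used in this notebook have been downloaded:

```python
import os
import dask.dataframe as dd

ddf = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),
                  parse_dates={'Date': [0, 1, 2]})

ddf.dtypes  # metadata only: column names and dtypes, no rows loaded
ddf.head()  # reads just enough of the first file to return five rows
```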
@@ -360,7 +339,7 @@
 "source": [
 "## Computations with `dask.dataframe`\n",
 "\n",
-"Let's compute the maximum of the `DepDelay` column.\n",
+"Let's compute the maximum flight delay.\n",
 "\n",
 "With just pandas, we would loop over each file to find the individual maximums, then find the final maximum over all the individual maximums.\n",
 "\n",
@@ -388,7 +367,8 @@
 "outputs": [],
 "source": [
 "%%time\n",
-"ddf.DepDelay.max().compute()"
+"result = ddf.DepDelay.max()\n",
+"result.compute()"
 ]
 },
 {
@@ -409,7 +389,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"You can view the underlying task graph using the `.visualize` method:"
+"You can view the underlying task graph using `.visualize()`:"
 ]
 },
 {
@@ -419,7 +399,7 @@
 "outputs": [],
 "source": [
 "# notice the parallelism\n",
-"ddf.DepDelay.max().visualize()"
+"result.visualize()"
 ]
 },
 {
@@ -428,7 +408,7 @@
 "source": [
 "## Exercises\n",
 "\n",
-"In this section you will do a few `dask.dataframe` computations. If you are comfortable with pandas then these should be familiar. You will have to think about when to call `compute`."
+"In this section you will do a few `dask.dataframe` computations. If you are comfortable with pandas then these should be familiar. You will have to think about when to call `.compute()`."
 ]
 },
 {
@@ -437,7 +417,7 @@
 "source": [
 "### 1. How many rows are in our dataset?\n",
 "\n",
-"If you aren't familiar with pandas, how would you check how many records are in a list of tuples?"
+"_Hint_: how would you check how many items are in a list?"
 ]
 },
 {
@@ -466,7 +446,7 @@
 "source": [
 "### 2. In total, how many non-canceled flights were taken?\n",
 "\n",
-"With pandas, you would use [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)."
+"_Hint_: use [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)."
 ]
 },
 {
@@ -493,9 +473,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### 3. In total, how many non-cancelled flights were taken from each airport?\n",
+"### 3. In total, how many non-canceled flights were taken from each airport?\n",
 "\n",
-"*Hint*: use [`groupby`](https://pandas.pydata.org/pandas-docs/stable/groupby.html)."
+"*Hint*: use [groupby](https://pandas.pydata.org/pandas-docs/stable/groupby.html)."
 ]
 },
 {
@@ -596,7 +576,11 @@
 },
 "outputs": [],
 "source": [
-"ddf[\"Distance\"].apply(lambda x: x+1).compute() # don't worry about the warning, we'll discuss in the next sections"
+"ddf[\"Distance\"].apply(lambda x: x+1).compute() # don't worry about the warning, we'll discuss in the next sections\n",
+"\n",
+"# OR\n",
+"\n",
+"(ddf[\"Distance\"] + 1).compute()"
 ]
 },
 {
@@ -605,9 +589,9 @@
 "source": [
 "## Sharing Intermediate Results\n",
 "\n",
-"When computing all of the above, we sometimes did the same operation more than once. For most operations, `dask.dataframe` hashes the arguments, allowing duplicate computations to be shared, and only computed once.\n",
+"When computing all of the above, we sometimes did the same operation more than once. For most operations, `dask.dataframe` stores the arguments, allowing duplicate computations to be shared and only computed once.\n",
 "\n",
-"For example, lets compute the mean and standard deviation for departure delay of all non-canceled flights. Since Dask operations are lazy, those values aren't the final results yet. They're just the recipe required to get the result.\n",
+"For example, let's compute the mean and standard deviation for departure delay of all non-canceled flights. Since Dask operations are lazy, those values aren't the final results yet. They're just the steps required to get the result.\n",
 "\n",
 "If you compute them with two calls to compute, there is no sharing of intermediate computations."
 ]
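A minimal sketch of the difference, assuming a boolean `Cancelled` column as in the exercises above; `dask.compute` evaluates several lazy results in one pass, so shared intermediates run only once:

```python
import dask

non_canceled = ddf[~ddf.Cancelled]
mean_delay = non_canceled.DepDelay.mean()
std_delay = non_canceled.DepDelay.std()

# two separate passes: the filter is recomputed for each call
mean_delay.compute()
std_delay.compute()

# one shared pass: the filter is computed only once
dask.compute(mean_delay, std_delay)
```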
@@ -697,11 +681,11 @@
 "tags": []
 },
 "source": [
-"### `persit`\n",
+"### `.persist()`\n",
 "\n",
-"While using a distributed scheduler (you will learn more about schedulers in the upcoming notebooks), which is the case in this notebook, you can keep some _data that you want to use often_ in the _distributed memory_. \n",
+"While using a distributed scheduler (you will learn more about schedulers in the upcoming notebooks), you can keep some _data that you want to use often_ in the _distributed memory_. \n",
 "\n",
-"`persist` computes the \"Futures\" ( and stores it in the same structure as your output. You can use persist with any data or computation that fits in memory."
+"`persist` generates \"Futures\" (more on this later as well) and stores them in the same structure as your output. You can use `persist` with any data or computation that fits in memory."
 ]
 },
 {
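A minimal sketch of `persist`, assuming a distributed client is running as elsewhere in this notebook and the boolean `Cancelled` column from earlier:

```python
# keep the filtered frame in (distributed) memory for repeated reuse
non_canceled = ddf[~ddf.Cancelled].persist()

# later computations start from the persisted partitions
non_canceled.DepDelay.mean().compute()
non_canceled.DepDelay.std().compute()
```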
@@ -783,14 +767,14 @@
 "\n",
 "Additionally, some important operations like `set_index` work, but are slower than in pandas because they include substantial shuffling of data, and may write out to disk.\n",
 "\n",
-"**So, what if you want to use some custom functions that aren't (or can't be) implemented for Dask DataFrame yet?**\n",
+"**What if you want to use some custom functions that aren't (or can't be) implemented for Dask DataFrame yet?**\n",
 "\n",
 "You can open an issue on the [Dask issue tracker](https://github.com/dask/dask/issues) to check how feasible the function could be to implement, and you can consider contributing this function to Dask.\n",
 "\n",
 "If it's a custom function or tricky to implement, `dask.dataframe` provides a few methods to make applying custom functions to Dask DataFrames easier:\n",
 "\n",
-"- [`map_partitions`](https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.map_partitions.html): to use a function on each partition (each pandas DataFrame) of the Dask DataFrame\n",
-"- [`map_overlap`](https://docs.dask.org/en/latest/generated/dask.dataframe.rolling.map_overlap.html): to use a function on each partition (each pandas DataFrame) of the Dask DataFrame, with some overlap between partitions\n",
+"- [`map_partitions`](https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.map_partitions.html): to run a function on each partition (each pandas DataFrame) of the Dask DataFrame\n",
+"- [`map_overlap`](https://docs.dask.org/en/latest/generated/dask.dataframe.rolling.map_overlap.html): to run a function on each partition (each pandas DataFrame) of the Dask DataFrame, with some rows shared between neighboring partitions\n",
 "- [`reduction`](https://docs.dask.org/en/latest/generated/dask.dataframe.Series.reduction.html): for custom row-wise reduction operations."
 ]
 },
@@ -829,7 +813,7 @@
 "\n",
 "meta = ddf.head()\n",
 "\n",
-"ddf = ddf.map_partitions(my_custom_function, True, meta=meta)"
+"ddf = ddf.map_partitions(my_custom_function, parameter_a=True, meta=meta)"
 ]
 },
 {
@@ -839,8 +823,8 @@
 "We suggest using Dask's `apply` and `map` functions when you can because they already use `map_partitions` internally.\n",
 "\n",
 "Using the correct `meta` is important here, because your output will depend on it. A few notes about `meta`:\n",
-"* Think of `meta` as a suggestion that Dask will while it is operating lazily and can not infer the output structure.\n",
-"* The best way to specify `meta` is using a small pandas DataFrame and Series that match the structure of your final output."
+"* Think of `meta` as a suggestion that Dask uses while it is operating lazily, since it _cannot infer the output structure_ on its own.\n",
+"* The best way to specify `meta` is using a small pandas DataFrame or Series that matches the structure of your final output."
 ]
 },
 {
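For illustration, a hedged sketch of specifying `meta` with a small pandas object; the `DepDelayHours` name and the minutes-to-hours conversion are hypothetical:

```python
import pandas as pd

# the expected output: a float64 Series named 'DepDelayHours'
meta = pd.Series(dtype='float64', name='DepDelayHours')

delay_hours = ddf.map_partitions(
    lambda df: (df.DepDelay / 60).rename('DepDelayHours'),
    meta=meta,
)
```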
@@ -890,13 +874,6 @@
 " \n",
 "Furthermore, the distributed scheduler (you will learn about it later) allows the same `dask.dataframe` expressions to be executed across a cluster. To enable massive \"big data\" processing, one could execute data ingestion functions such as `read_parquet`, where the data is held on storage accessible to every worker node (e.g., amazon's S3), and because most operations begin by selecting only some columns, transforming and filtering the data, only relatively small amounts of data need to be communicated between the machines."
 ]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
 }
 ],
 "metadata": {
@@ -916,7 +893,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.10"
+"version": "3.10.5"
 }
 },
 "nbformat": 4,

02_array.ipynb

Lines changed: 31 additions & 18 deletions
@@ -9,14 +9,20 @@
 " width=\"30%\"\n",
 " alt=\"Dask logo\\\">\n",
 "\n",
-"# Dask Arrays\n",
-"\n",
-"<img src=\"https://docs.dask.org/en/stable/_images/dask-array.svg\" width=\"40%\" align=\"right\">\n",
-"Dask array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms. \n",
+"# Dask Arrays - parallelized numpy\n",
+"Parallel, larger-than-memory, n-dimensional array using blocked algorithms. \n",
 "\n",
 "* **Parallel**: Uses all of the cores on your computer\n",
 "* **Larger-than-memory**: Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.\n",
-"* **Blocked Algorithms**: Perform large computations by performing many smaller computations.\n",
+"* **Blocked Algorithms**: Perform large computations by performing many smaller computations.\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"<img src=\"https://docs.dask.org/en/stable/_images/dask-array.svg\" width=\"40%\" align=\"right\">\n",
+"\n",
 "\n",
 "In other words, Dask Array implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. We coordinate these blocked algorithms using Dask graphs.\n",
 "\n",
@@ -715,26 +721,33 @@
 ]
 },
 {
-"cell_type": "code",
-"execution_count": null,
+"cell_type": "markdown",
 "metadata": {},
-"outputs": [],
 "source": [
-"client.shutdown()"
+"### Learn More \n",
+"\n",
+"Both xarray and zarr have their own tutorials that go into greater depth:\n",
+"\n",
+"* [Zarr tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html)\n",
+"* [Xarray tutorial](https://tutorial.xarray.dev/intro.html)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Useful links\n",
-" \n",
-"* [Array documentation](https://docs.dask.org/en/latest/array.html)\n",
-"* [Array screencast](https://youtu.be/9h_61hXCDuI)\n",
-"* [Array API](https://docs.dask.org/en/latest/array-api.html)\n",
-"* [Array examples](https://examples.dask.org/array.html)\n",
-"* [Zarr tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html)\n",
-"* [Xarray tutorial](https://tutorial.xarray.dev/intro.html)"
+"## Close your cluster\n",
+"\n",
+"It's good practice to close any Dask cluster you create:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"client.shutdown()"
 ]
 }
 ],
@@ -755,7 +768,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.10"
+"version": "3.10.5"
 }
 },
 "nbformat": 4,
