|
12 | 12 | "\n", |
13 | 13 | "# Dask DataFrames\n", |
14 | 14 | "\n", |
15 | | - "We finished Chapter 02 by building a parallel dataframe computation over a directory of CSV files using `dask.delayed`. In this section we use `dask.dataframe` to automatically build similiar computations, for the common case of tabular computations. Dask dataframes look and feel like Pandas dataframes but they run on the same infrastructure that powers `dask.delayed`.\n", |
| 15 | + "We finished Chapter 1 by building a parallel dataframe computation over a directory of CSV files using `dask.delayed`. In this section we use `dask.dataframe` to automatically build similiar computations, for the common case of tabular computations. Dask dataframes look and feel like Pandas dataframes but they run on the same infrastructure that powers `dask.delayed`.\n", |
16 | 16 | "\n", |
17 | 17 | "In this notebook we use the same airline data as before, but now rather than write for-loops we let `dask.dataframe` construct our computations for us. The `dask.dataframe.read_csv` function can take a globstring like `\"data/nycflights/*.csv\"` and build parallel computations on all of our data at once.\n", |
18 | 18 | "\n", |
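A minimal sketch of the globstring pattern described above (not part of this notebook diff; the `data/nycflights/*.csv` path and the `DepDelay` column are taken from the surrounding cells, and real data may need extra keyword arguments such as explicit dtypes):

```python
import dask.dataframe as dd

# Read every CSV matching the glob into a single lazy Dask DataFrame.
df = dd.read_csv("data/nycflights/*.csv")

# Operations only build a task graph; nothing runs until .compute() is called.
max_delay = df.DepDelay.max().compute()
print(max_delay)
```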
|
31 | 31 | "\n", |
32 | 32 | "**Related Documentation**\n", |
33 | 33 | "\n", |
34 | | - "* [Dask DataFrame documentation](http://dask.pydata.org/en/latest/dataframe.html)\n", |
35 | | - "* [Pandas documentation](http://pandas.pydata.org/)\n", |
| 34 | + "* [DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)\n", |
| 35 | + "* [DataFrame screencast](https://youtu.be/AT2XtFehFSQ)\n", |
| 36 | + "* [DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)\n", |
| 37 | + "* [DataFrame examples](https://examples.dask.org/dataframe.html)\n", |
| 38 | + "* [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)\n", |
36 | 39 | "\n", |
37 | 40 | "**Main Take-aways**\n", |
38 | 41 | "\n", |
39 | | - "1. Dask.dataframe should be familiar to Pandas users\n", |
40 | | - "2. The partitioning of dataframes is important for efficient queries" |
| 42 | + "1. Dask DataFrame should be familiar to Pandas users\n", |
| 43 | + "2. The partitioning of dataframes is important for efficient execution" |
41 | 44 | ] |
42 | 45 | }, |
43 | 46 | { |
|
55 | 58 | "source": [ |
56 | 59 | "from dask.distributed import Client\n", |
57 | 60 | "\n", |
58 | | - "client = Client()" |
| 61 | + "client = Client(n_workers=4)" |
59 | 62 | ] |
60 | 63 | }, |
61 | 64 | { |
|
76 | 79 | "\n", |
77 | 80 | "import os\n", |
78 | 81 | "import dask\n", |
79 | | - "filename = os.path.join('data', 'accounts.*.csv')" |
| 82 | + "filename = os.path.join('data', 'accounts.*.csv')\n", |
| 83 | + "filename" |
80 | 84 | ] |
81 | 85 | }, |
82 | 86 | { |
83 | 87 | "cell_type": "markdown", |
84 | 88 | "metadata": {}, |
85 | 89 | "source": [ |
86 | | - "This works just like `pandas.read_csv`, except on multiple csv files at once." |
87 | | - ] |
88 | | - }, |
89 | | - { |
90 | | - "cell_type": "code", |
91 | | - "execution_count": null, |
92 | | - "metadata": {}, |
93 | | - "outputs": [], |
94 | | - "source": [ |
95 | | - "filename" |
| 90 | + "Filename includes a glob pattern `*`, so all files in the path matching that pattern will be read into the same Dask DataFrame." |
96 | 91 | ] |
97 | 92 | }, |
98 | 93 | { |
|
103 | 98 | "source": [ |
104 | 99 | "import dask.dataframe as dd\n", |
105 | 100 | "df = dd.read_csv(filename)\n", |
106 | | - "# load and count number of rows\n", |
107 | 101 | "df.head()" |
108 | 102 | ] |
109 | 103 | }, |
|
113 | 107 | "metadata": {}, |
114 | 108 | "outputs": [], |
115 | 109 | "source": [ |
| 110 | + "# load and count number of rows\n", |
116 | 111 | "len(df)" |
117 | 112 | ] |
118 | 113 | }, |
|
150 | 145 | "cell_type": "markdown", |
151 | 146 | "metadata": {}, |
152 | 147 | "source": [ |
153 | | - "Notice that the respresentation of the dataframe object contains no data - Dask has just done enough to read the start of the first file, and infer the column names and types." |
| 148 | + "Notice that the respresentation of the dataframe object contains no data - Dask has just done enough to read the start of the first file, and infer the column names and dtypes." |
154 | 149 | ] |
155 | 150 | }, |
156 | 151 | { |
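A brief illustrative sketch of that laziness (assuming the `df` created from the accounts CSVs above; not part of the notebook diff):

```python
# Column names and dtypes were inferred from the start of the first file;
# no data has been loaded yet.
print(df.dtypes)

# The dataframe is split into partitions, roughly one per input CSV file.
print(df.npartitions)

# Only an explicit action such as len(), .head(), or .compute() reads the data.
print(len(df))
```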
|
317 | 312 | { |
318 | 313 | "cell_type": "code", |
319 | 314 | "execution_count": null, |
320 | | - "metadata": {}, |
| 315 | + "metadata": { |
| 316 | + "jupyter": { |
| 317 | + "source_hidden": true |
| 318 | + } |
| 319 | + }, |
321 | 320 | "outputs": [], |
322 | 321 | "source": [ |
323 | 322 | "len(df)" |
|
344 | 343 | { |
345 | 344 | "cell_type": "code", |
346 | 345 | "execution_count": null, |
347 | | - "metadata": {}, |
| 346 | + "metadata": { |
| 347 | + "jupyter": { |
| 348 | + "source_hidden": true |
| 349 | + } |
| 350 | + }, |
348 | 351 | "outputs": [], |
349 | 352 | "source": [ |
350 | 353 | "len(df[~df.Cancelled])" |
|
371 | 374 | { |
372 | 375 | "cell_type": "code", |
373 | 376 | "execution_count": null, |
374 | | - "metadata": {}, |
| 377 | + "metadata": { |
| 378 | + "jupyter": { |
| 379 | + "source_hidden": true |
| 380 | + } |
| 381 | + }, |
375 | 382 | "outputs": [], |
376 | 383 | "source": [ |
377 | 384 | "df[~df.Cancelled].groupby('Origin').Origin.count().compute()" |
|
398 | 405 | { |
399 | 406 | "cell_type": "code", |
400 | 407 | "execution_count": null, |
401 | | - "metadata": {}, |
| 408 | + "metadata": { |
| 409 | + "jupyter": { |
| 410 | + "source_hidden": true |
| 411 | + } |
| 412 | + }, |
402 | 413 | "outputs": [], |
403 | 414 | "source": [ |
404 | 415 | "df.groupby(\"Origin\").DepDelay.mean().compute()" |
|
423 | 434 | { |
424 | 435 | "cell_type": "code", |
425 | 436 | "execution_count": null, |
426 | | - "metadata": {}, |
| 437 | + "metadata": { |
| 438 | + "jupyter": { |
| 439 | + "source_hidden": true |
| 440 | + } |
| 441 | + }, |
427 | 442 | "outputs": [], |
428 | 443 | "source": [ |
429 | 444 | "df.groupby(\"DayOfWeek\").DepDelay.mean().compute()" |
|
518 | 533 | "source": [ |
519 | 534 | "Pandas is more mature and fully featured than `dask.dataframe`. If your data fits in memory then you should use Pandas. The `dask.dataframe` module gives you a limited `pandas` experience when you operate on datasets that don't fit comfortably in memory.\n", |
520 | 535 | "\n", |
521 | | - "During this tutorial we provide a small dataset consisting of a few CSV files. This dataset is 45MB on disk that expands to about 400MB in memory (the difference is caused by using `object` dtype for strings). This dataset is small enough that you would normally use Pandas.\n", |
| 536 | + "During this tutorial we provide a small dataset consisting of a few CSV files. This dataset is 45MB on disk that expands to about 400MB in memory. This dataset is small enough that you would normally use Pandas.\n", |
522 | 537 | "\n", |
523 | 538 | "We've chosen this size so that exercises finish quickly. Dask.dataframe only really becomes meaningful for problems significantly larger than this, when Pandas breaks with the dreaded \n", |
524 | 539 | "\n", |
|
763 | 778 | "cell_type": "markdown", |
764 | 779 | "metadata": {}, |
765 | 780 | "source": [ |
766 | | - "### What definitely works?" |
767 | | - ] |
768 | | - }, |
769 | | - { |
770 | | - "cell_type": "markdown", |
771 | | - "metadata": {}, |
772 | | - "source": [ |
773 | | - "* Trivially parallelizable operations (fast):\n", |
774 | | - " * Elementwise operations: ``df.x + df.y``\n", |
775 | | - " * Row-wise selections: ``df[df.x > 0]``\n", |
776 | | - " * Loc: ``df.loc[4.0:10.5]``\n", |
777 | | - " * Common aggregations: ``df.x.max()``\n", |
778 | | - " * Is in: ``df[df.x.isin([1, 2, 3])]``\n", |
779 | | - " * Datetime/string accessors: ``df.timestamp.month``\n", |
780 | | - "* Cleverly parallelizable operations (also fast):\n", |
781 | | - " * groupby-aggregate (with common aggregations): ``df.groupby(df.x).y.max()``\n", |
782 | | - " * value_counts: ``df.x.value_counts``\n", |
783 | | - " * Drop duplicates: ``df.x.drop_duplicates()``\n", |
784 | | - " * Join on index: ``dd.merge(df1, df2, left_index=True, right_index=True)``\n", |
785 | | - "* Operations requiring a shuffle (slow-ish, unless on index)\n", |
786 | | - " * Set index: ``df.set_index(df.x)``\n", |
787 | | - " * groupby-apply (with anything): ``df.groupby(df.x).apply(myfunc)``\n", |
788 | | - " * Join not on the index: ``pd.merge(df1, df2, on='name')``\n", |
789 | | - "* Ingest operations\n", |
790 | | - " * Files: ``dd.read_csv, dd.read_parquet, dd.read_json, dd.read_orc``, etc.\n", |
791 | | - " * Pandas: ``dd.from_pandas``\n", |
792 | | - " * Anything supporting numpy slicing: ``dd.from_array``\n", |
793 | | - " * From any set of functions creating sub dataframes via ``dd.from_delayed``.\n", |
794 | | - " * Dask.bag: ``mybag.to_dataframe(columns=[...])``" |
| 781 | + "## Learn More\n", |
| 782 | + "\n", |
| 783 | + "\n", |
| 784 | + "* [DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)\n", |
| 785 | + "* [DataFrame screencast](https://youtu.be/AT2XtFehFSQ)\n", |
| 786 | + "* [DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)\n", |
| 787 | + "* [DataFrame examples](https://examples.dask.org/dataframe.html)\n", |
| 788 | + "* [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)" |
795 | 789 | ] |
796 | 790 | }, |
797 | 791 | { |
|