# Dask
[Dask](https://github.com/dask/dask) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
In particular, we can use Dask DataFrame to scale up pandas workflows. Dask DataFrame parallelizes pandas to handle large tabular data. It closely mirrors the pandas API, making it simple to transition from testing on a single dataset to processing the full dataset. Dask is particularly effective with Parquet, the default format on Hugging Face Datasets, as it supports rich data types, efficient columnar filtering, and compression.
A good practical use case for Dask is running data processing or model inference on a dataset in a distributed manner. See, for example, Coiled's excellent blog post on [Scaling AI-Based Data Processing with Hugging Face + Dask](https://huggingface.co/blog/dask-scaling).
# Read and Write
Since Dask uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub.
First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:
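A minimal sketch (the repository name `username/my_dataset` below is a placeholder):

```python
from huggingface_hub import login

# Prompts for your Hugging Face token (alternatively, set the HF_TOKEN environment variable)
login()
```

Then you can write a Dask DataFrame to the Hub as Parquet and read it back, for example:

```python
import dask.dataframe as dd
import pandas as pd

# A small example DataFrame split into two partitions
df = dd.from_pandas(
    pd.DataFrame({"text": ["hello world", "hi there", "foo bar"]}),
    npartitions=2,
)

# Write the partitions as Parquet files to a dataset repo on the Hub
df.to_parquet("hf://datasets/username/my_dataset")

# Read them back lazily
df = dd.read_parquet("hf://datasets/username/my_dataset")
```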
For more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).
# Process data
To process a dataset in parallel using Dask, you can first define your data processing function for a pandas DataFrame or Series, and then use the Dask `map_partitions` function to apply this function to all the partitions of a dataset in parallel:
```python
import pandas as pd

def dummy_count_words(texts):
    # Count whitespace-separated words in each text
    return pd.Series([len(text.split(" ")) for text in texts])
```
In pandas you can use this function on a text column:
```python
# pandas API
df["num_words"] = dummy_count_words(df.text)
```
And in Dask you can run this function on every partition:
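A sketch of the equivalent call (the `meta` argument is explained just below):

```python
# Dask API: apply the function to every partition of the text column
df["num_words"] = df.text.map_partitions(dummy_count_words, meta=int)
```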
Note that you also need to provide `meta`, which describes the type of the pandas Series or DataFrame your function outputs. This is needed because Dask DataFrame uses a lazy API: since Dask only runs the data processing once `.compute()` is called, it needs the `meta` argument to know the type of the new column in the meantime.
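For instance, no data is processed until you materialize the result:

```python
# Triggers the actual computation and returns a pandas DataFrame
result = df.compute()
```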
# Predicate and Projection Pushdown
When reading Parquet data from Hugging Face, Dask automatically leverages the metadata in Parquet files to skip entire files or row groups if they are not needed. For example, if you apply a filter (predicate) on a Hugging Face Dataset in Parquet format, or if you select a subset of the columns (projection), Dask will read the metadata of the Parquet files to discard the parts that are not needed without downloading them.
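For example, a projection pushdown sketch (reusing the placeholder repository from above):

```python
import dask.dataframe as dd

# Only the "text" column is read from the Parquet files;
# the other columns are never downloaded
df = dd.read_parquet("hf://datasets/username/my_dataset", columns=["text"])
```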
This is possible thanks to the `dask-expr` package, which is generally installed by default with Dask. You can read more about `dask-expr` in its [introduction blog post](https://blog.dask.org/2023/08/25/dask-expr-introduction) and in this more recent [blog post on Dask optimizations](https://blog.dask.org/2024/05/30/dask-is-fast#optimizer).
For example, this subset of FineWeb-Edu contains many Parquet files. If you filter the dataset to keep only the text from recent CC dumps, Dask will skip most of the files and only download the data that matches the filter:
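A sketch of such a filter, assuming the dataset exposes a `dump` column naming the CC dump and that the sample path below matches the repository layout:

```python
import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT/*.parquet")

# Predicate pushdown: files and row groups whose "dump" statistics
# cannot match the filter are skipped without being downloaded
df = df[df.dump >= "CC-MAIN-2023"]
```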