Commit ebf4a99

update dask docs
1 parent 6484a58 commit ebf4a99

File tree

1 file changed: +45 -2 lines changed


docs/hub/datasets-dask.md

Lines changed: 45 additions & 2 deletions
@@ -1,7 +1,14 @@
# Dask

[Dask](https://github.com/dask/dask) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
-Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:
+
+In particular, we can use Dask DataFrame to scale up pandas workflows. Dask DataFrame parallelizes pandas to handle large tabular data. It closely mirrors the pandas API, making it simple to transition from testing on a single dataset to processing the full dataset. Dask is particularly effective with Parquet, the default format on Hugging Face Datasets, as it supports rich data types, efficient columnar filtering, and compression.
+
+A good practical use case for Dask is to run data processing or model inference on a dataset in a distributed manner. See, for example, the excellent blog post [Scaling AI-Based Data Processing with Hugging Face + Dask](https://huggingface.co/blog/dask-scaling) by Coiled.
+
+## Read and Write
+
+Since Dask uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:

First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:

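The login snippet itself falls outside this hunk; as a minimal sketch, one way to authenticate with the `huggingface_hub` client is:

```python
from huggingface_hub import login

# Prompts for a User Access Token from https://huggingface.co/settings/tokens
# (alternatively, pass token="hf_..." directly).
login()
```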
@@ -17,7 +24,8 @@ from huggingface_hub import HfApi
HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
```

-Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in Dask:
+Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in Dask.
+Dask DataFrame supports distributed writing to Parquet on Hugging Face, which uses commits to track dataset changes:

```python
import dask.dataframe as dd
@@ -30,6 +38,14 @@ df_valid.to_parquet("hf://datasets/username/my_dataset/validation")
df_test .to_parquet("hf://datasets/username/my_dataset/test")
```

+Since this creates one commit per file, it is recommended to squash the history after the upload:
+
+```python
+from huggingface_hub import HfApi
+
+HfApi().super_squash_history(repo_id="username/my_dataset", repo_type="dataset")
+```
+
This creates a dataset repository `username/my_dataset` containing your Dask dataset in Parquet format.
You can reload it later:

@@ -45,3 +61,30 @@ df_test = dd.read_parquet("hf://datasets/username/my_dataset/test")
```

For more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).
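Because the data is stored as Parquet, reloading can also take advantage of the columnar filtering mentioned earlier: `dd.read_parquet` accepts `columns` (and `filters`) arguments so that only the required data is read. A small sketch, assuming the dataset has a `text` column as in the processing example below:

```python
import dask.dataframe as dd

# Only the "text" column is read; Parquet's columnar layout lets Dask skip the rest.
df_train = dd.read_parquet(
    "hf://datasets/username/my_dataset/train",
    columns=["text"],
)
```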
+
+## Process data
+
+To process a dataset in parallel using Dask, you can first define your data processing function for a pandas DataFrame or Series, and then use the Dask `map_partitions` function to apply this function to all the partitions of a dataset in parallel:
+
+```python
+import pandas as pd
+
+def dummy_count_words(texts):
+    return pd.Series([len(text.split(" ")) for text in texts])
+```
+
+In pandas you can use this function on a text column:
+
+```python
+# pandas API
+df["num_words"] = dummy_count_words(df.text)
+```
+
+And in Dask you can run this function on every partition:
+
+```python
+# Dask API: run the function on every partition
+df["num_words"] = df.text.map_partitions(dummy_count_words, meta=int)
+```
+
+Note that you also need to provide `meta`, which is the type of the pandas Series or DataFrame in the output of your function.
+This is needed because Dask DataFrame is a lazy API: since Dask only runs the data processing once `.compute()` is called, it needs
+the `meta` argument to know the type of the new column in the meantime.

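To make the laziness described above concrete, here is a rough end-to-end sketch (reusing `dummy_count_words` and the placeholder repository from the examples above); nothing is read or computed until `.compute()` is called:

```python
import dask.dataframe as dd
import pandas as pd

def dummy_count_words(texts):
    return pd.Series([len(text.split(" ")) for text in texts])

# Building the task graph is cheap: no data is downloaded and no function runs yet.
df = dd.read_parquet("hf://datasets/username/my_dataset/train")
df["num_words"] = df.text.map_partitions(dummy_count_words, meta=int)

# Only now does Dask read the partitions and apply the function in parallel,
# returning a regular pandas DataFrame.
result = df.compute()
```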