Commit dc5f8e9

lhoestq and davanstrien authored
Apply suggestions from code review
Co-authored-by: Daniel van Strien <[email protected]>
1 parent af4604e

File tree: 1 file changed (+3, -3)

docs/hub/datasets-dask.md: 3 additions & 3 deletions
@@ -4,7 +4,7 @@
 
 In particular, we can use Dask DataFrame to scale up pandas workflows. Dask DataFrame parallelizes pandas to handle large tabular data. It closely mirrors the pandas API, making it simple to transition from testing on a single dataset to processing the full dataset. Dask is particularly effective with Parquet, the default format on Hugging Face Datasets, as it supports rich data types, efficient columnar filtering, and compression.
 
-A good practical use-case for Dask is to run data processing or model inference on a dataset in a distributed manner. See for example the excellent blog post on [Scaling AI-Based Data Processing with Hugging Face + Dask](https://huggingface.co/blog/dask-scaling) by Coiled.
+A good practical use case for Dask is running data processing or model inference on a dataset in a distributed manner. See, for example, Coiled's excellent blog post on [Scaling AI-Based Data Processing with Hugging Face + Dask](https://huggingface.co/blog/dask-scaling).
 
 # Read and Write
 
@@ -86,7 +86,7 @@ df["num_words"] = df.text.map_partitions(dummy_count_words, meta=int)
 ```
 
 Note that you also need to provide `meta` which is the type of the pandas Series or DataFrame in the output of your function.
-This is needed because Dask DataFrame is a lazy API. Since Dask will only run the data processing once `.compute()` is called, it needs
+This is needed because Dask DataFrame uses a lazy API. Since Dask will only run the data processing once `.compute()` is called, it needs
 the `meta` argument to know the type of the new column in the meantime.
 
 # Predicate and Projection Pushdown
@@ -103,7 +103,7 @@ import dask.dataframe as dd
 df = dd.read_parquet("hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT/*.parquet")
 
 # Dask will skip the files or row groups that don't
-# match rhe query without downloading them.
+# match the query without downloading them.
 df = df[df.dump >= "CC-MAIN-2023"]
 ```
 

0 comments