Commit 31ed970

add column and filter pushdown
1 parent 8c76816 commit 31ed970

File tree

1 file changed: +26 -0


docs/hub/datasets-dask.md

Lines changed: 26 additions & 0 deletions
@@ -88,3 +88,29 @@ df["num_words"] = df.text.map_partitions(dummy_count_words, meta=int)
Note that you also need to provide `meta`, which is the type of the pandas Series or DataFrame in the output of your function. This is needed because Dask DataFrame is a lazy API: since Dask will only run the data processing once `.compute()` is called, it needs the `meta` argument to know the type of the new column in the meantime.
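For instance, if your function returns a whole DataFrame instead of a Series, `meta` can be an empty pandas DataFrame that describes the output schema. Here is a minimal sketch, assuming a `df` with a `text` column as above (the `add_text_length` helper and its dtypes are hypothetical):

```python
import pandas as pd

# Hypothetical helper that adds a column with the length of each text.
def add_text_length(partition: pd.DataFrame) -> pd.DataFrame:
    partition = partition.copy()
    partition["text_length"] = partition["text"].str.len()
    return partition

# An empty DataFrame works as `meta`: it tells Dask the column
# names and dtypes of the output before anything is computed.
meta = pd.DataFrame({
    "text": pd.Series(dtype="object"),
    "text_length": pd.Series(dtype="int64"),
})
df = df.map_partitions(add_text_length, meta=meta)
```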
# Column and Filter Pushdown
When reading Parquet data from Hugging Face, Dask automatically leverages the metadata in Parquet files to skip entire files or row groups if they are not needed. For example, if you apply a filter on a Hugging Face Dataset in Parquet format or if you select a subset of the columns, Dask will read the metadata of the Parquet files to discard the parts that are not needed without downloading them.
This is possible thanks to the `dask-expr` package, which is generally installed by default with Dask.
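To check whether query planning is active in your environment, one option (assuming a Dask version that exposes the `dataframe.query-planning` config key) is to inspect the Dask config:

```python
import dask

# True (or None, meaning the default) when dask-expr query planning
# is enabled; False if it was explicitly disabled.
print(dask.config.get("dataframe.query-planning", None))
```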
For example, this subset of FineWeb-Edu contains many Parquet files. If you filter the dataset to keep only the text from recent CC dumps, Dask will skip most of the files and only download the data that matches the filter:
```python
import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT/*.parquet")

# Dask will skip the files or row groups that don't
# match the query without downloading them.
df = df[df.dump >= "CC-MAIN-2023"]
```
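If you prefer, the same filter can be pushed down explicitly at read time with the `filters` argument of `read_parquet`. This is a sketch of the equivalent query, assuming the same dataset path as above:

```python
import dask.dataframe as dd

# Equivalent to the filter above: the predicate is handed to the
# Parquet reader so non-matching files and row groups are skipped.
df = dd.read_parquet(
    "hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT/*.parquet",
    filters=[("dump", ">=", "CC-MAIN-2023")],
)
```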
Dask will also read only the required columns for your computation and skip the rest. This is useful when you want to manipulate a subset of the columns or for analytics:
```python
# Dask will only download the 'dump' and 'token_count' columns
# needed for the computation and skip the others.
df.token_count.mean().compute()
```
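Similarly, if you already know which columns you need, you can select them at read time with the `columns` argument of `read_parquet`, so the others are never fetched. A minimal sketch under the same assumptions:

```python
import dask.dataframe as dd

# Only the 'dump' and 'token_count' columns are read;
# the other columns are skipped at the Parquet level.
df = dd.read_parquet(
    "hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT/*.parquet",
    columns=["dump", "token_count"],
)
```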
