Commit 7ac13a5

explain column skipping better
1 parent d320b77 commit 7ac13a5

1 file changed: +4 -2 lines changed


docs/hub/datasets-dask.md

Lines changed: 4 additions & 2 deletions
@@ -107,10 +107,12 @@ df = dd.read_parquet("hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT/*.parq
 df = df[df.dump >= "CC-MAIN-2023"]
 ```

-Dask will also read only the required columns for your computation and skip the rest. This is useful when you want to manipulate a subset of the columns or for analytics:
+Dask will also read only the required columns for your computation and skip the rest.
+For example if you drop a column late in your code, it will not bother to load it early on in the pipeline if it's not needed.
+This is useful when you want to manipulate a subset of the columns or for analytics:

 ```python
 # Dask will download the 'dump' and 'token_count' needed
-# for the computation and skip the other columns.
+# for the filtering and computation and skip the other columns.
 df.token_count.mean().compute()
 ```
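
The column skipping described in this change can be illustrated with a toy model of a lazy pipeline. This is only a sketch of the general idea (working backwards from the final result to decide what to read), not Dask's actual optimizer: the `LazyTable` class and its methods are hypothetical, and the `text` and `url` column names are assumed for illustration. Only `dump` and `token_count` come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyTable:
    """Hypothetical lazy table: records operations instead of running them."""

    available: tuple                   # columns present in the source file
    filters: tuple = ()                # columns referenced by filter steps
    dropped: frozenset = frozenset()   # columns removed from the output

    def filter_on(self, column):
        # Record that a filter reads this column; nothing is loaded yet.
        return LazyTable(self.available, self.filters + (column,), self.dropped)

    def drop(self, column):
        # Record that this column is not part of the output.
        return LazyTable(self.available, self.filters, self.dropped | {column})

    def required_columns(self, output=None):
        """Return the columns that must be read: outputs plus filter inputs."""
        if output is None:
            output = [c for c in self.available if c not in self.dropped]
        needed = set(output) | set(self.filters)
        # Keep the source file's column order.
        return [c for c in self.available if c in needed]


# 'text' and 'url' are assumed column names, used only for illustration.
source = LazyTable(available=("text", "url", "dump", "token_count"))

# Filter on 'dump' first, then drop 'text' late in the pipeline.
plan = source.filter_on("dump").drop("text")

# 'text' is never needed, so it is not in the list of columns to read.
print(plan.required_columns())

# Computing a statistic on 'token_count' needs only that column
# plus the column used by the filter.
print(plan.required_columns(output=["token_count"]))
```

In this model, a column that is dropped late is never requested from the source, unless an earlier filter reads it, which mirrors the behaviour the updated wording describes.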
