docs/hub/datasets-dask.md: 33 additions & 0 deletions
@@ -71,6 +71,13 @@ def dummy_count_words(texts):
    return pd.Series([len(text.split(" ")) for text in texts])
```

or a similar function using pandas string methods (faster):

```python
def dummy_count_words(texts):
    return texts.str.count(" ")
```
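
As a rough illustration of how the two variants behave, here is a minimal sketch on a toy pandas Series. The names `count_words_loop` and `count_spaces_vectorized` and the sample strings are illustrative assumptions, not part of the guide. The two functions are not strictly equivalent: `str.count(" ")` counts space characters, so its result is one less than `len(text.split(" "))` for every string.

```python
import pandas as pd


def count_words_loop(texts):
    # Python-level loop: number of space-separated pieces in each string.
    return pd.Series([len(text.split(" ")) for text in texts])


def count_spaces_vectorized(texts):
    # Vectorized pandas string method: number of space characters in each string.
    return texts.str.count(" ")


# Toy data, assumed for illustration only.
texts = pd.Series(["hello world", "one two three", "single"])

loop_counts = count_words_loop(texts)
vectorized_counts = count_spaces_vectorized(texts)

print(loop_counts.tolist())
print(vectorized_counts.tolist())
```

If an exact word count matters, adding 1 to the vectorized result makes it match the loop version for these inputs.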

In pandas you can use this function on a text column:

```python
@@ -116,3 +123,29 @@ This is useful when you want to manipulate a subset of the columns or for analyt
# for the filtering and computation and skip the other columns.
df.token_count.mean().compute()
```

## Client

Most features in `dask` are optimized for a cluster or a local `Client` to launch the parallel computations:

```python
import dask.dataframe as dd
from distributed import Client

if __name__ == "__main__":  # needed for creating new processes
    client = Client()
    df = dd.read_parquet(...)
    ...
```

For local usage, the `Client` uses a Dask `LocalCluster` with multiprocessing by default. You can also configure the multiprocessing of `LocalCluster` manually.

Note that if you use the default threaded scheduler locally without `Client`, a DataFrame can become slower after certain operations (more details [here](https://github.com/dask/dask-expr/issues/1181)).

Find more information on setting up a local or cloud cluster in the [Deploying Dask documentation](https://docs.dask.org/en/latest/deploying.html).
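
To illustrate the manual `LocalCluster` configuration mentioned above, here is a minimal sketch, assuming the `distributed` package is installed. The helper name `start_local_client` and the worker and thread counts are hypothetical values chosen for illustration, not recommendations from this page.

```python
from distributed import Client, LocalCluster


def start_local_client(n_workers=2, threads_per_worker=1, processes=True):
    """Start a LocalCluster with an explicit layout and connect a Client to it.

    The default values here are illustrative assumptions, not tuned settings.
    """
    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
        processes=processes,  # True: one process per worker; False: threads only
    )
    return cluster, Client(cluster)


if __name__ == "__main__":  # needed for creating new processes
    cluster, client = start_local_client()
    try:
        print(client)
    finally:
        # Shut down the client first, then the cluster it was connected to.
        client.close()
        cluster.close()
```

Passing the cluster object to `Client` keeps the worker layout explicit, which may help when the defaults do not match the machine.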