@@ -146,6 +146,68 @@ julia> fetch(r)
(v = 5500,)
```

+ ## `mapreduce` usage
+
+ The `mapreduce` operation helps make full use of `OnlineStats`.
+ It lets you transform each row into the required format before applying the reduce function.
+ Because no intermediate `map` step allocates a full column, this should save a considerable amount of memory.
+
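+ The memory argument can be illustrated with a minimal plain-Julia sketch.
+ It uses only Base `map`, `reduce` and `mapreduce` on a made-up vector of rows (the name `rows` and its values are assumptions for illustration), not Dagger itself.
+
+ ```julia
+ # Hypothetical in-memory rows standing in for table rows.
+ rows = [(a = i % 3, b = i / 100) for i in 1:100]
+
+ # Two-step version: `map` first materializes a 100-element intermediate vector.
+ intermediate = map(r -> r.a + r.b, rows)
+ two_step = reduce(+, intermediate; init = 0.0)
+
+ # Fused version: each row is transformed and folded in without
+ # storing the transformed column.
+ fused = mapreduce(r -> r.a + r.b, +, rows; init = 0.0)
+
+ @assert two_step ≈ fused
+ ```
+
+ Both versions compute the same value; only the fused one skips the intermediate allocation.
+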
+ ```julia
+ julia> using Dagger, OnlineStats
+
+ julia> d1 = DTable((a=collect(1:100).%3, b=rand(100)), 25);
+
+ julia> fetch(Dagger.mapreduce(sum, fit!, d1, init = Mean()))
+ Mean: n=100 | value=1.50573
+
+ julia> gg = GroupBy(Int, Mean());
+
+ julia> fetch(Dagger.mapreduce(x -> (x.a, x.b), fit!, d1, init=gg))
+ GroupBy: Int64 => Mean
+ ├─ 1
+ │  └─ Mean: n=34 | value=0.491379
+ ├─ 2
+ │  └─ Mean: n=33 | value=0.555258
+ └─ 0
+    └─ Mean: n=33 | value=0.470984
+
+ julia> d2 = DTable((;a1=abs.(rand(Int, 100).%2), [Symbol("a$(i)") => rand(100) for i in 2:3]...), 25);
+
+ julia> gb = GroupBy(Int, Group([Series(Mean(), Variance(), Extrema()) for _ in 1:3]...));
+
+ julia> fetch(Dagger.mapreduce(r -> (r.a1, tuple(r...)), fit!, d2, init = gb))
+ GroupBy: Int64 => Group
+ ├─ 1
+ │  └─ Group
+ │     ├─ Series
+ │     │  ├─ Mean: n=57 | value=1.0
+ │     │  ├─ Variance: n=57 | value=0.0
+ │     │  └─ Extrema: n=57 | value=(min = 1.0, max = 1.0, nmin = 57, nmax = 57)
+ │     ├─ Series
+ │     │  ├─ Mean: n=57 | value=0.540256
+ │     │  ├─ Variance: n=57 | value=0.0767802
+ │     │  └─ Extrema: n=57 | value=(min = 0.0132545, max = 0.996059, nmin = 1, nmax = 1)
+ │     └─ Series
+ │        ├─ Mean: n=57 | value=0.536187
+ │        ├─ Variance: n=57 | value=0.0981499
+ │        └─ Extrema: n=57 | value=(min = 0.0112471, max = 0.991461, nmin = 1, nmax = 1)
+ └─ 0
+    └─ Group
+       ├─ Series
+       │  ├─ Mean: n=43 | value=0.0
+       │  ├─ Variance: n=43 | value=0.0
+       │  └─ Extrema: n=43 | value=(min = 0.0, max = 0.0, nmin = 43, nmax = 43)
+       ├─ Series
+       │  ├─ Mean: n=43 | value=0.459732
+       │  ├─ Variance: n=43 | value=0.0911548
+       │  └─ Extrema: n=43 | value=(min = 0.000925526, max = 0.962072, nmin = 1, nmax = 1)
+       └─ Series
+          ├─ Mean: n=43 | value=0.490613
+          ├─ Variance: n=43 | value=0.0850503
+          └─ Extrema: n=43 | value=(min = 0.0450505, max = 0.981091, nmin = 1, nmax = 1)
+ ```
+
+
# Dagger.groupby interface

A `DTable` can be grouped, which results in the creation of a `GDTable`.
@@ -319,4 +381,18 @@ julia> pp(innerjoin(dt, d2, on=:a))
3, 3, -3
4, 4, -4
5, 5, -5
- ```
+ ```
+
+ # DataFrames.jl minilanguage and operations support (experimental)
+
+ Support for the `DataFrames.jl` minilanguage and operations is planned for the `DTable`
+ to enable a seamless transition between in-memory and distributed data processing.
+
+ As of today, `select` is available, with more operations to come.
+
+ The goal is to produce exactly the same output as DataFrames when given the same `args`.
+ Even though the output should be the same, the `DTable` may require changes to user input to achieve optimal distributed performance.
+
+ One known tactic is to avoid functions that need access to a full column at once.
+ Prefer `ByRow` equivalents or `reduce` instead.
+ A complete performance guide is expected to become part of the documentation at some point.
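+
+ The difference between row-wise and whole-column functions can be sketched on an ordinary in-memory `DataFrame`.
+ This is only an illustration under assumptions: it uses DataFrames.jl directly rather than a `DTable`, and the column names and values are made up.
+
+ ```julia
+ using DataFrames, Statistics
+
+ # Hypothetical example data.
+ df = DataFrame(a = 1:4, b = [10.0, 20.0, 30.0, 40.0])
+
+ # Row-wise: every output value depends only on its own row,
+ # so the work can in principle be split across partitions.
+ rowwise = select(df, :a => ByRow(x -> 2x) => :a2)
+
+ # Whole-column: the function receives the entire column `b` at once.
+ wholecol = select(df, :b => (col -> col .- mean(col)) => :b_centered)
+
+ # The mean itself can be written as a reduction instead of a whole-column call.
+ b_mean = reduce(+, df.b) / nrow(df)
+ ```
+
+ The row-wise and `reduce` forms are the shapes the advice above points toward; how each performs on a `DTable` is not shown here.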