Commit 0bb388b

add missing docs
1 parent 183dabd commit 0bb388b

1 file changed: +77 -1 lines changed

docs/src/dtable.md

@@ -146,6 +146,68 @@ julia> fetch(r)
(v = 5500,)
```

## `mapreduce` usage

The `mapreduce` operation is helpful for making full use of `OnlineStats`.
It lets you transform each row into the required format before applying the reduce function.
As a result, it should use considerably less memory than a separate `map` step, which would allocate a full intermediate column.

```julia
julia> using Dagger, OnlineStats

julia> d1 = DTable((a=collect(1:100).%3, b=rand(100)), 25);

julia> fetch(Dagger.mapreduce(sum, fit!, d1, init = Mean()))
Mean: n=100 | value=1.50573

julia> gg = GroupBy(Int, Mean());

julia> fetch(Dagger.mapreduce(x -> (x.a, x.b), fit!, d1, init=gg))
GroupBy: Int64 => Mean
├─ 1
│  └─ Mean: n=34 | value=0.491379
├─ 2
│  └─ Mean: n=33 | value=0.555258
└─ 0
   └─ Mean: n=33 | value=0.470984

julia> d2 = DTable((;a1=abs.(rand(Int, 100).%2), [Symbol("a$(i)") => rand(100) for i in 2:3]...), 25);

julia> gb = GroupBy(Int, Group([Series(Mean(), Variance(), Extrema()) for _ in 1:3]...));

julia> fetch(Dagger.mapreduce(r -> (r.a1, tuple(r...)), fit!, d2, init = gb))
GroupBy: Int64 => Group
├─ 1
│  └─ Group
│     ├─ Series
│     │  ├─ Mean: n=57 | value=1.0
│     │  ├─ Variance: n=57 | value=0.0
│     │  └─ Extrema: n=57 | value=(min = 1.0, max = 1.0, nmin = 57, nmax = 57)
│     ├─ Series
│     │  ├─ Mean: n=57 | value=0.540256
│     │  ├─ Variance: n=57 | value=0.0767802
│     │  └─ Extrema: n=57 | value=(min = 0.0132545, max = 0.996059, nmin = 1, nmax = 1)
│     └─ Series
│        ├─ Mean: n=57 | value=0.536187
│        ├─ Variance: n=57 | value=0.0981499
│        └─ Extrema: n=57 | value=(min = 0.0112471, max = 0.991461, nmin = 1, nmax = 1)
└─ 0
   └─ Group
      ├─ Series
      │  ├─ Mean: n=43 | value=0.0
      │  ├─ Variance: n=43 | value=0.0
      │  └─ Extrema: n=43 | value=(min = 0.0, max = 0.0, nmin = 43, nmax = 43)
      ├─ Series
      │  ├─ Mean: n=43 | value=0.459732
      │  ├─ Variance: n=43 | value=0.0911548
      │  └─ Extrema: n=43 | value=(min = 0.000925526, max = 0.962072, nmin = 1, nmax = 1)
      └─ Series
         ├─ Mean: n=43 | value=0.490613
         ├─ Variance: n=43 | value=0.0850503
         └─ Extrema: n=43 | value=(min = 0.0450505, max = 0.981091, nmin = 1, nmax = 1)
```
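
The memory argument can be illustrated without any distributed machinery. The following is a minimal sketch using only Base Julia's `mapreduce` on a plain vector of `NamedTuple` rows. It shows the idea of fusing the transform and the reduction, not how `Dagger.mapreduce` is implemented; the `rows` data and all variable names are made up for the illustration.

```julia
# Rows as NamedTuples, standing in for what a row-wise function receives.
rows = [(a = i % 3, b = i / 100) for i in 1:100]

# Two-step version: `map` materializes a full intermediate vector first.
intermediate = map(r -> r.a + r.b, rows)
two_step = reduce(+, intermediate; init = 0.0)

# Fused version: each row is transformed and folded into the result
# immediately, so no intermediate vector is stored.
fused = mapreduce(r -> r.a + r.b, +, rows; init = 0.0)

println(two_step ≈ fused)
```

Both versions compute the same sum; the difference is only that the two-step version holds one extra value per row in memory before reducing.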

# Dagger.groupby interface

A `DTable` can be grouped which will result in creation of a `GDTable`.
@@ -319,4 +381,18 @@ julia> pp(innerjoin(dt, d2, on=:a))
3, 3, -3
4, 4, -4
5, 5, -5
```

# DataFrames.jl minilanguage and operations support (experimental)

Support for the `DataFrames.jl` minilanguage and operations is planned for the `DTable`
to enable a seamless transition between in-memory and distributed data processing.

As of today, `select` is available, with more operations to come in the future.

The goal is to provide exactly the same output as DataFrames when given the same `args`.
Even though the output should be the same, the `DTable` may require users to modify their input in order to achieve optimal distributed performance.

One known tactic is to avoid functions that require access to the full column at once.
Users should prefer `ByRow` equivalents or `reduce` instead.
A complete performance guide is expected to become part of the documentation at some point.
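
The difference between row-wise and full-column functions can be sketched with the `DataFrames.jl` minilanguage on an ordinary in-memory `DataFrame`. This is only an illustration under the assumption, taken from the text above, that `select` on a `DTable` accepts the same `args`; the sample data and column names are hypothetical, and no `DTable` is constructed here.

```julia
using DataFrames
using Statistics: mean

# Hypothetical sample data.
df = DataFrame(a = 1:6, b = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5])

# Row-wise: each output value depends only on its own row, so separate
# partitions of a distributed table could be processed independently.
rowwise = select(df, [:a, :b] => ByRow((a, b) -> a + b) => :total)

# Column-wise: `mean(col)` needs the whole column before any output row
# can be produced, which is the pattern the text advises against.
colwise = select(df, :b => (col -> col .- mean(col)) => :b_centered)
```

The first call is the preferred style: a `ByRow` transformation never needs to see more than one row at a time.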
