add missing docs

krynju · krynju · commit 0bb388bd8ba8 · 2022-06-19T13:42:02.000+02:00
diff --git a/docs/src/dtable.md b/docs/src/dtable.md
@@ -146,6 +146,68 @@ julia> fetch(r)
 (v = 5500,)
 ```
 
+## `mapreduce` usage
+
+The `mapreduce` operation helps you make full use of `OnlineStats`.
+It lets you transform each row into the required format before applying the reduce function.
+This should save a lot of memory, because there is no intermediate `map` step that allocates a full column.
+
+```julia
+julia> using Dagger, OnlineStats
+
+julia> d1 = DTable((a=collect(1:100).%3, b=rand(100)), 25);
+
+julia> fetch(Dagger.mapreduce(sum, fit!, d1, init = Mean()))
+Mean: n=100 | value=1.50573
+
+julia> gg = GroupBy(Int, Mean());
+
+julia> fetch(Dagger.mapreduce(x-> (x.a, x.b), fit!, d1, init=gg))
+GroupBy: Int64 => Mean
+├─ 1
+│  └─ Mean: n=34 | value=0.491379
+├─ 2
+│  └─ Mean: n=33 | value=0.555258
+└─ 0
+   └─ Mean: n=33 | value=0.470984
+
+julia> d2 = DTable((;a1=abs.(rand(Int, 100).%2), [Symbol("a$(i)") => rand(100) for i in 2:3]...), 25);
+
+julia> gb = GroupBy(Int, Group([Series(Mean(), Variance(), Extrema()) for _ in 1:3]...));
+
+julia> fetch(Dagger.mapreduce(r -> (r.a1, tuple(r...)), fit!, d2, init = gb))
+GroupBy: Int64 => Group
+├─ 1
+│  └─ Group
+│     ├─ Series
+│     │  ├─ Mean: n=57 | value=1.0
+│     │  ├─ Variance: n=57 | value=0.0
+│     │  └─ Extrema: n=57 | value=(min = 1.0, max = 1.0, nmin = 57, nmax = 57)
+│     ├─ Series
+│     │  ├─ Mean: n=57 | value=0.540256
+│     │  ├─ Variance: n=57 | value=0.0767802
+│     │  └─ Extrema: n=57 | value=(min = 0.0132545, max = 0.996059, nmin = 1, nmax = 1)
+│     └─ Series
+│        ├─ Mean: n=57 | value=0.536187
+│        ├─ Variance: n=57 | value=0.0981499
+│        └─ Extrema: n=57 | value=(min = 0.0112471, max = 0.991461, nmin = 1, nmax = 1)
+└─ 0
+   └─ Group
+      ├─ Series
+      │  ├─ Mean: n=43 | value=0.0
+      │  ├─ Variance: n=43 | value=0.0
+      │  └─ Extrema: n=43 | value=(min = 0.0, max = 0.0, nmin = 43, nmax = 43)
+      ├─ Series
+      │  ├─ Mean: n=43 | value=0.459732
+      │  ├─ Variance: n=43 | value=0.0911548
+      │  └─ Extrema: n=43 | value=(min = 0.000925526, max = 0.962072, nmin = 1, nmax = 1)
+      └─ Series
+         ├─ Mean: n=43 | value=0.490613
+         ├─ Variance: n=43 | value=0.0850503
+         └─ Extrema: n=43 | value=(min = 0.0450505, max = 0.981091, nmin = 1, nmax = 1)
+```
+
+
 # Dagger.groupby interface
 
 A `DTable` can be grouped which will result in creation of a `GDTable`.
@@ -319,4 +381,18 @@ julia> pp(innerjoin(dt, d2, on=:a))
 3, 3, -3
 4, 4, -4
 5, 5, -5
-```
+```
+
+# DataFrames.jl minilanguage and operations support (experimental)
+
+Support for the `DataFrames.jl` minilanguage and operations is planned for the `DTable`
+to enable a seamless transition between in-memory and distributed data processing.
+
+As of this writing only `select` is available; more operations will follow.
+
+The goal is to produce exactly the same output as `DataFrames.jl` when given the same `args`.
+Even so, the `DTable` may require changes to user input in order to achieve optimal distributed performance.
+
+One known tactic is to avoid functions that need access to a full column at once.
+Prefer `ByRow` equivalents or `reduce` instead.
+A complete performance guide is expected to be added to the documentation later.