Commit a1bb240

Merge pull request #13 from piever/dev
Parallel docs

Authored by Pietro Vertechi
2 parents 7cbdda5 + 8e5900b, commit a1bb240

6 files changed: +103 -9 lines changed

NEWS.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# JuliaDBMeta.jl NEWS
+
+## 0.2 current version
+
+### 0.2
+
+- add `cols` to select columns programmatically
+- add out-of-core support
+- **breaking** `@groupby` no longer flattens by default
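
To make the breaking change concrete, here is a minimal plain-Julia sketch of what flattening means for a grouped summary. `toy_groupby`, `plus_one`, and the row values are hypothetical stand-ins chosen for illustration; this is not the IndexedTables implementation, and it assumes Julia 1.x named-tuple syntax.

```julia
# Toy rows with `x` and `y` fields (values are assumed, for illustration only).
rows = [(x = 1, y = 4), (x = 1, y = 6), (x = 2, y = 5), (x = 2, y = 7)]

# Hypothetical stand-in for a grouped summary: apply `f` to the rows of each group.
function toy_groupby(f, rows, key; flatten = false)
    out = []
    for k in sort(unique(getproperty.(rows, key)))
        result = f(filter(r -> getproperty(r, key) == k, rows))
        if flatten
            # One output entry per element of the iterable returned by `f`.
            append!(out, [(k, v) for v in result])
        else
            # One output entry per group, holding the whole result.
            push!(out, (k, result))
        end
    end
    return out
end

plus_one(group) = [r.y + 1 for r in group]

nested = toy_groupby(plus_one, rows, :x)                  # new default: not flattened
flat   = toy_groupby(plus_one, rows, :x, flatten = true)  # opt in to flattening
```

With the default, each group keeps its summary as a single value; with `flatten = true`, every element of the summary becomes its own entry.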

docs/src/column_macros.md

Lines changed: 9 additions & 0 deletions
@@ -2,6 +2,15 @@
 
 Column-wise macros allow using symbols instead of columns. The order of the arguments is always the same: the first argument is the table and the last argument is the expression (can be a `begin ... end` block). If the table is omitted, the macro is automatically curried (useful for piping).
 
+Shared features across all column-wise macros:
+
+- Symbols refer to columns of the table.
+- `_` refers to the whole table.
+- To use actual symbols, escape them with `^`, as in `^(:a)`.
+- Use `cols(c)` to refer to field `c`, where `c` is a variable that evaluates to a symbol. `c` must be available in the scope where the macro is called.
+- An optional grouping argument is allowed: see [Column-wise macros with grouping argument](@ref).
+- Out-of-core tables are not supported out of the box, except when grouping.
+
 ## Replace symbols with columns
 
 ```@docs

docs/src/out_of_core.md

Lines changed: 50 additions & 1 deletion
@@ -1,9 +1,58 @@
 # Out-of-core support
 
+JuliaDBMeta supports out-of-core operations in several different ways. The following examples assume the REPL was started with `julia -p 4`.
+
+## Row-wise macros parallelize out of the box
+
 [Row-wise macros](@ref) can be trivially implemented in parallel and will work out of the box with out-of-core tables.
 
+```jldoctest distributed
+julia> iris = loadtable(Pkg.dir("JuliaDBMeta", "test", "tables", "iris.csv"));
+
+julia> iris5 = table(iris, chunks = 5);
+
+julia> @where iris5 :SepalLength == 4.9 && :Species == "setosa"
+Distributed Table with 4 rows in 2 chunks:
+SepalLength SepalWidth PetalLength PetalWidth Species
+──────────────────────────────────────────────────────────
+4.9 3.0 1.4 0.2 "setosa"
+4.9 3.1 1.5 0.1 "setosa"
+4.9 3.1 1.5 0.2 "setosa"
+4.9 3.6 1.4 0.1 "setosa"
+```
+
+## Grouping operations parallelize with some data shuffling
+
 [Grouping operations](@ref) will work on out-of-core data tables, but may involve some data shuffling because grouping requires data belonging to the same group to be on the same processor.
 
+```jldoctest distributed
+julia> @groupby iris5 :Species {mean(:SepalLength)}
+Distributed Table with 3 rows in 3 chunks:
+Species mean(SepalLength)
+───────────────────────────────
+"setosa" 5.006
+"versicolor" 5.936
+"virginica" 6.588
+```
+
+## Apply a pipeline to your data in chunks
+
 [`@applychunked`](@ref) will apply the analysis pipeline separately to each chunk of data in parallel and collect the result as a distributed table.
 
-[Column-wise macros](@ref) do not have a parallel implementation yet (they require working on the whole column at the same time which makes it difficult to parallelize them).
+```jldoctest distributed
+julia> @applychunked iris5 begin
+           @where :Species == "setosa" && :SepalLength == 4.9
+           @transform {Ratio = :SepalLength / :SepalWidth}
+       end
+Distributed Table with 4 rows in 2 chunks:
+SepalLength SepalWidth PetalLength PetalWidth Species Ratio
+───────────────────────────────────────────────────────────────────
+4.9 3.0 1.4 0.2 "setosa" 1.63333
+4.9 3.1 1.5 0.1 "setosa" 1.58065
+4.9 3.1 1.5 0.2 "setosa" 1.58065
+4.9 3.6 1.4 0.1 "setosa" 1.36111
+```
+
+## Column-wise macros do not parallelize yet
+
+[Column-wise macros](@ref) do not have a parallel implementation yet, except when grouping: they require working on the whole column at the same time, which makes them difficult to parallelize.
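
The distinction drawn in these docs can be sketched in plain Julia, without JuliaDB. The example below is only an illustration under stated assumptions: `data`, `chunks`, `keep`, and `center` are hypothetical names, the chunks are ordinary vectors rather than distributed tables, and `Statistics` is loaded as on Julia 1.x.

```julia
using Statistics  # provides `mean` on Julia 1.x

data   = [1.0, 2.0, 3.0, 10.0]
chunks = [data[1:2], data[3:4]]   # pretend each chunk lives on a different worker

# Row-wise: a predicate on each element gives the same answer chunk by chunk.
keep(x) = x > 1.5
rowwise_whole   = filter(keep, data)
rowwise_chunked = vcat((filter(keep, c) for c in chunks)...)

# Column-wise: centering needs the mean of the entire column,
# so applying it per chunk silently computes something else.
center(col) = col .- mean(col)
colwise_whole   = center(data)
colwise_chunked = vcat((center(c) for c in chunks)...)
```

The row-wise results agree, while the column-wise results differ because each chunk is centered on its own mean. Grouping avoids the problem only after shuffling brings every group onto one processor.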

docs/src/row_macros.md

Lines changed: 8 additions & 0 deletions
@@ -2,6 +2,14 @@
 
 Row-wise macros allow using symbols to refer to fields of a row. The order of the arguments is always the same: the first argument is the table and the last argument is the expression (can be a `begin ... end` block). If the table is omitted, the macro is automatically curried (useful for piping).
 
+Shared features across all row-wise macros:
+
+- Symbols refer to fields of the row.
+- `_` refers to the whole row.
+- To use actual symbols, escape them with `^`, as in `^(:a)`.
+- Use `cols(c)` to refer to field `c`, where `c` is a variable that evaluates to a symbol. `c` must be available in the scope where the macro is called.
+- Out-of-core tables are supported out of the box.
+
 ## Modify data in place
 
 ```@docs

src/groupby.jl

Lines changed: 17 additions & 4 deletions
@@ -1,15 +1,15 @@
-_groupby(f, d::AbstractDataset, args...; kwargs...) =
-    IndexedTables.groupby(f, d, args...; flatten = true, usekey = true, kwargs...)
+_groupby(f, d::AbstractDataset, args...; kwargs...) =
+    IndexedTables.groupby(f, d, args...; usekey = true, kwargs...)
 
 _groupby(f, args...; kwargs...) = d::AbstractDataset -> _groupby(f, d, args...; kwargs...)
 
 function groupby_helper(args...)
     anon_func, syms = extract_anonymous_function(last(args), replace_column, usekey = true)
     if !isempty(syms) && !(:(_) in syms)
         fields = Expr(:call, :(JuliaDBMeta.All), syms...)
-        Expr(:call, :(JuliaDBMeta._groupby), anon_func, args[1:end-1]..., Expr(:kw, :select, fields))
+        Expr(:call, :(JuliaDBMeta._groupby), anon_func, Expr(:kw, :select, fields), replace_keywords(args[1:end-1])...)
     else
-        Expr(:call, :(JuliaDBMeta._groupby), anon_func, args[1:end-1]...)
+        Expr(:call, :(JuliaDBMeta._groupby), anon_func, replace_keywords(args[1:end-1])...)
     end
 end
 
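
Reviewer's note on the hunk above: the body of `replace_keywords` is not part of this diff, so the following is only a guess at why such a step is needed. A macro receives `flatten = true` as an expression with head `=`, whereas a function call expression represents keyword arguments with head `:kw`. The sketch uses a hypothetical helper `to_kw` and a hypothetical stand-in function `summarize`; neither is claimed to match the package's code.

```julia
# Hypothetical helper (not the package's `replace_keywords`): turn `name = value`
# expressions, as a macro receives them, into `:kw` nodes that a call accepts.
to_kw(ex) = ex isa Expr && ex.head == :(=) ? Expr(:kw, ex.args...) : ex

# Stand-in for a function such as `_groupby`; names are illustrative only.
function summarize(f, xs; flatten = false)
    results = map(f, xs)
    return flatten ? collect(Iterators.flatten(results)) : results
end

macro summarize(args...)
    # The last argument is the function; everything before it is forwarded.
    f = last(args)
    rest = map(to_kw, args[1:end-1])
    esc(Expr(:call, :summarize, f, rest...))
end

nested = @summarize([1, 2], x -> [x, 10x])
flat   = @summarize([1, 2], flatten = true, x -> [x, 10x])
```

Without the conversion, the generated call would contain a plain assignment expression in argument position instead of passing `flatten` as a keyword.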

@@ -45,6 +45,19 @@ x m
 1 5.7
 2 3.3
 ```
+
+When the summary function returns an iterable, use `flatten=true` to flatten the result:
+
+```jldoctest groupby
+julia> @groupby(t, :x, flatten = true, select = {:y+1})
+Table with 4 rows, 2 columns:
+x y + 1
+────────
+1 5
+1 7
+2 6
+2 8
+```
 """
 macro groupby(args...)
     esc(groupby_helper(args...))

test/runtests.jl

Lines changed: 10 additions & 4 deletions
@@ -1,7 +1,10 @@
-using JuliaDBMeta, Compat, NamedTuples
-using Compat.Test
+addprocs(4)
 
-iris1 = loadtable(joinpath(@__DIR__, "tables", "iris.csv"))
+@everywhere using JuliaDBMeta, Compat, NamedTuples
+@everywhere using JuliaDB, Dagger
+@everywhere using Compat.Test
+
+iris1 = collect(loadtable(joinpath(@__DIR__, "tables", "iris.csv")))
 iris2 = table(iris1, chunks = 5)
 
 @testset "utils" begin
@@ -76,14 +79,15 @@ end
     @test (@where_vec t (:x .< 3) .& (:z .== 0.2)) == view(t, [2])
     @test @where_vec(t, 1:2) == view(t, 1:2)
     @test @where_vec(rows(t), 1:2) == view(t, 1:2)
+    @test JuliaDBMeta._view(rows(t), 1:2) == view(rows(t), 1:2)
     @test @where_vec((:x .< 3) .& (:z .== 0.2))(t) == view(t, [2])
     @test (@where t (:x < 3) .& (:z == 0.2)) == view(t, [2])
     @test @where((:x < 3) .& (:z == 0.2))(t) == view(t, [2])
 
     t = table([1,1,3], [4,5,6], [0.1, 0.2, 0.3], names = [:x, :y, :z])
     grp = groupby(@map(@NT(z = :z))@where(:y != 5), t, :x, flatten = true)
     @test grp == table([1, 3], [0.1, 0.3], names = [:x, :z], pkey = :x)
-    collect(@where iris2 :SepalLength > 4) == @where iris1 :SepalLength > 4
+    @test collect(@where iris2 :SepalLength > 4) == @where iris1 :SepalLength > 4
 end
 
 @testset "apply" begin
@@ -145,4 +149,6 @@ end
     @test @groupby({m = maximum(:y - :z) / _.key.x})(reindex(t, :x)) == outcome
     @test @groupby(t, :x, {l = length(_)}) == table([1,2], [2,2], names = [:x, :l], pkey = :l)
     @test @groupby(t, :x, {l = length(_)}) == t |> @groupby(:x, {l = length(_)})
+    @test @groupby(t, :x, flatten = true, _) == reindex(t, :x)
+    @test @groupby(t, :x, {identity = _}) == groupby(identity, t, :x)
 end
