Commit 91576e1

Constructor interpartition_merges & fixes/small improvements (#13)
1 parent c9e0698 commit 91576e1

10 files changed: +385 -116 lines changed

docs/src/dtable.md

Lines changed: 69 additions & 15 deletions
@@ -10,7 +10,7 @@ Provide a `Tables.jl` compatible source, as well as a `chunksize`, which is the
 maximum number of rows of each partition:
 
 ```julia
-julia> using Dagger
+julia> using DTables
 
 julia> table = (a=[1, 2, 3, 4, 5], b=[6, 7, 8, 9, 10]);
 
@@ -28,7 +28,7 @@ Provide a `loader_function` and a list of filenames, which are parts of the
 full table:
 
 ```julia
-julia> using Dagger, CSV
+julia> using DTables, CSV
 
 julia> files = ["1.csv", "2.csv", "3.csv"];
 
@@ -87,6 +87,60 @@ julia> fetch(d, NamedTuple)
 (a = [1, 2, 3, 4, 5], b = [6, 7, 8, 9, 10])
 ```
 
+## Behavior of the `interpartition_merges` kwarg
+
+If a source supports the `Tables.partitions` interface, the DTable
+will adopt the partition sizes of the source. However, if you specify
+an exact chunk size, the DTable will attempt to create chunks of exactly
+that size, even if that means merging data between partitions.
+This behavior is controlled by the `interpartition_merges` kwarg (`true` by default)
+and is best seen in the following example.
+
+```julia
+julia> using DTables, CSV
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4)) |> DTables.chunk_lengths
+4-element Vector{Int64}:
+ 249995
+ 250005
+ 249995
+ 250005
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 200_000) |> DTables.chunk_lengths
+5-element Vector{Int64}:
+ 200000
+ 200000
+ 200000
+ 200000
+ 200000
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 200_000, interpartition_merges=false) |> DTables.chunk_lengths
+8-element Vector{Int64}:
+ 200000
+ 49995
+ 200000
+ 50005
+ 200000
+ 49995
+ 200000
+ 50005
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 300_000) |> DTables.chunk_lengths
+4-element Vector{Int64}:
+ 300000
+ 300000
+ 300000
+ 100000
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 300_000, interpartition_merges=false) |> DTables.chunk_lengths
+4-element Vector{Int64}:
+ 249995
+ 250005
+ 249995
+ 250005
+
+```
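The chunk lengths in the example added above are consistent with a simple model. With `interpartition_merges=true`, the rows of all source partitions behave like one stream that is cut into pieces of `chunksize`. With `false`, each source partition is cut on its own and is never combined with its neighbor. The sketch below reproduces the documented lengths under that assumption. `simulated_chunk_lengths` is a hypothetical helper written for illustration only. It is not part of DTables, and it models just the resulting lengths, not how DTables moves data.

```julia
# Hypothetical helper (not part of DTables): models the chunk lengths that the
# documented examples report, given the lengths of the source partitions.
function simulated_chunk_lengths(partition_lengths::AbstractVector{<:Integer},
                                 chunksize::Integer;
                                 interpartition_merges::Bool=true)
    # Cut `n` rows into full chunks of `chunksize` plus one remainder chunk, if any.
    function split_rows(n)
        full, rest = divrem(n, chunksize)
        lengths = fill(Int(chunksize), full)
        rest > 0 && push!(lengths, rest)
        return lengths
    end

    if interpartition_merges
        # Assumed model: all partitions form one row stream before chunking.
        return split_rows(sum(partition_lengths))
    else
        # Assumed model: each partition is chunked independently of the others.
        return reduce(vcat, map(split_rows, partition_lengths); init=Int[])
    end
end

# Partition lengths taken from the first documented example (no chunksize given).
source = [249_995, 250_005, 249_995, 250_005]

merged_200k   = simulated_chunk_lengths(source, 200_000)
unmerged_200k = simulated_chunk_lengths(source, 200_000; interpartition_merges=false)
merged_300k   = simulated_chunk_lengths(source, 300_000)
unmerged_300k = simulated_chunk_lengths(source, 300_000; interpartition_merges=false)
```

Under this model, a `chunksize` larger than every source partition leaves the partitions untouched when merging is disabled, which matches the last documented case (`300_000` with `interpartition_merges=false`).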
+
 # Table operations
 
 **Warning: this interface is experimental and may change at any time**
@@ -99,7 +153,7 @@ Below is an example of their usage.
 For more information please refer to the API documentation and unit tests.
 
 ```julia
-julia> using Dagger
+julia> using DTables
 
 julia> d = DTable((k = repeat(['a', 'b'], 500), v = repeat(1:10, 100)), 100)
 DTable with 10 partitions
@@ -153,16 +207,16 @@ It lets you transform a row to the required format before applying the reduce fu
 In consequence a lot of memory usage should be saved due to the lack of an intermediate `map` step that allocates a full column.
 
 ```julia
-julia> using Dagger, OnlineStats
+julia> using DTables, OnlineStats
 
-julia> fetch(Dagger.mapreduce(sum, fit!, d1, init = Mean()))
+julia> fetch(DTables.mapreduce(sum, fit!, d1, init = Mean()))
 Mean: n=100 | value=1.50573
 
 julia> d1 = DTable((a=collect(1:100).%3, b=rand(100)), 25);
 
 julia> gg = GroupBy(Int, Mean());
 
-julia> fetch(Dagger.mapreduce(x-> (x.a, x.b), fit!, d1, init=gg))
+julia> fetch(DTables.mapreduce(x-> (x.a, x.b), fit!, d1, init=gg))
 GroupBy: Int64 => Mean
 ├─ 1
 │ └─ Mean: n=34 | value=0.491379
@@ -175,7 +229,7 @@ julia> d2 = DTable((;a1=abs.(rand(Int, 100).%2), [Symbol("a\$(i)") => rand(100)
 
 julia> gb = GroupBy(Int, Group([Series(Mean(), Variance(), Extrema()) for _ in 1:3]...));
 
-julia> fetch(Dagger.mapreduce(r -> (r.a1, tuple(r...)), fit!, d2, init = gb))
+julia> fetch(DTables.mapreduce(r -> (r.a1, tuple(r...)), fit!, d2, init = gb))
 GroupBy: Int64 => Group
 ├─ 1
 │ └─ Group
@@ -208,7 +262,7 @@ GroupBy: Int64 => Group
 ```
 
 
-# Dagger.groupby interface
+# DTables.groupby interface
 
 A `DTable` can be grouped which will result in creation of a `GDTable`.
 A distinct set of values contained in a single or multiple columns can be used as grouping keys.
@@ -224,22 +278,22 @@ julia> d = DTable((a=shuffle(repeat('a':'d', inner=4, outer=4)),b=repeat(1:4, 16
 DTable with 16 partitions
 Tabletype: NamedTuple
 
-julia> Dagger.groupby(d, :a)
+julia> DTables.groupby(d, :a)
 GDTable with 4 partitions and 4 keys
 Tabletype: NamedTuple
 Grouped by: [:a]
 
-julia> Dagger.groupby(d, [:a, :b])
+julia> DTables.groupby(d, [:a, :b])
 GDTable with 16 partitions and 16 keys
 Tabletype: NamedTuple
 Grouped by: [:a, :b]
 
-julia> Dagger.groupby(d, row -> row.a + row.b)
+julia> DTables.groupby(d, row -> row.a + row.b)
 GDTable with 7 partitions and 7 keys
 Tabletype: NamedTuple
 Grouped by: #5
 
-julia> g = Dagger.groupby(d, :a); keys(g)
+julia> g = DTables.groupby(d, :a); keys(g)
 KeySet for a Dict{Char, Vector{UInt64}} with 4 entries. Keys:
 'c'
 'd'
@@ -256,7 +310,7 @@ Tabletype: NamedTuple
 Operations such as `map`, `filter`, `reduce` can be performed on a `GDTable`
 
 ```julia
-julia> g = Dagger.groupby(d, [:a, :b])
+julia> g = DTables.groupby(d, [:a, :b])
 GDTable with 16 partitions and 16 keys
 Tabletype: NamedTuple
 Grouped by: [:a, :b]
@@ -308,7 +362,7 @@ julia> d = DTable((a=repeat('a':'b', inner=2),b=1:4), 2)
 DTable with 2 partitions
 Tabletype: NamedTuple
 
-julia> g = Dagger.groupby(d, :a)
+julia> g = DTables.groupby(d, :a)
 GDTable with 2 partitions and 2 keys
 Tabletype: NamedTuple
 Grouped by: [:a]
@@ -355,7 +409,7 @@ the join functions coming from the `DataFrames.jl` package for the per chunk joi
 In the future this behavior will be expanded to any type that implements its own join methods, but for now is limited to `DataFrame` only.
 
 Please note that the usage of any of the keyword arguments described above will always result in the usage of generic join methods
-defined in `Dagger` regardless of the availability of specialized methods.
+defined in `DTables` regardless of the availability of specialized methods.
 
 ```julia
 julia> using Tables; pp = d -> for x in Tables.rows(d) println("$(x.a), $(x.b), $(x.c)") end;

src/table/dataframes_interface_utils.jl

Lines changed: 1 addition & 1 deletion
@@ -101,4 +101,4 @@ function fillcolumns(
 end
 
 ncol(d::DTable) = length(Tables.columns(d))
-index(df::DTable) = Index(_columnnames_svector(df))
+index(df::DTable) = Index(columnnames_svector(df))
