@@ -10,7 +10,7 @@ Provide a `Tables.jl` compatible source, as well as a `chunksize`, which is the
 maximum number of rows of each partition:

 ```julia
-julia> using Dagger
+julia> using DTables

 julia> table = (a=[1, 2, 3, 4, 5], b=[6, 7, 8, 9, 10]);

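For context, a minimal sketch of how the example above would continue, assuming the 5-row `table` and a `chunksize` of 2; with at most 2 rows per partition the last partition holds the single remaining row, and the displayed summary follows the `DTable` printing shown later in this document:

```julia
julia> d = DTable(table, 2)  # partitions cover rows 1:2, 3:4, and 5:5
DTable with 3 partitions
Tabletype: NamedTuple
```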
@@ -28,7 +28,7 @@ Provide a `loader_function` and a list of filenames, which are parts of the
 full table:

 ```julia
-julia> using Dagger, CSV
+julia> using DTables, CSV

 julia> files = ["1.csv", "2.csv", "3.csv"];

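For context, a minimal sketch of the constructor call this example leads into, assuming the signature `DTable(loader_function, files)` from the surrounding text and `CSV.File` as the loader; each file then becomes one lazily loaded partition:

```julia
julia> d = DTable(CSV.File, files);  # one lazily loaded partition per CSV file

julia> fetch(d, NamedTuple);  # materialize the full table, as in the fetch example below
```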
@@ -87,6 +87,60 @@ julia> fetch(d, NamedTuple)
 (a = [1, 2, 3, 4, 5], b = [6, 7, 8, 9, 10])
 ```

+## Behavior of the `interpartition_merges` kwarg
+
+If a source supports the `Tables.partitions` interface, the `DTable`
+will adopt the partition sizes provided by the source. However, if you
+specify an exact chunk size, the `DTable` will attempt to create chunks
+of exactly that size, even if that means merging data across source partitions.
+This behavior is controlled by the `interpartition_merges` kwarg (`true` by default)
+and is best illustrated by the following example.
+
+```julia
+julia> using DTables, CSV
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4)) |> DTables.chunk_lengths
+4-element Vector{Int64}:
+ 249995
+ 250005
+ 249995
+ 250005
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 200_000) |> DTables.chunk_lengths
+5-element Vector{Int64}:
+ 200000
+ 200000
+ 200000
+ 200000
+ 200000
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 200_000, interpartition_merges=false) |> DTables.chunk_lengths
+8-element Vector{Int64}:
+ 200000
+  49995
+ 200000
+  50005
+ 200000
+  49995
+ 200000
+  50005
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 300_000) |> DTables.chunk_lengths
+4-element Vector{Int64}:
+ 300000
+ 300000
+ 300000
+ 100000
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 300_000, interpartition_merges=false) |> DTables.chunk_lengths
+4-element Vector{Int64}:
+ 249995
+ 250005
+ 249995
+ 250005
+
+```
+
 # Table operations

 **Warning: this interface is experimental and may change at any time**
@@ -99,7 +153,7 @@ Below is an example of their usage.
 For more information please refer to the API documentation and unit tests.

 ```julia
-julia> using Dagger
+julia> using DTables

 julia> d = DTable((k = repeat(['a', 'b'], 500), v = repeat(1:10, 100)), 100)
 DTable with 10 partitions
@@ -153,16 +207,16 @@ It lets you transform a row to the required format before applying the reduce fu
 In consequence a lot of memory usage should be saved due to the lack of an intermediate `map` step that allocates a full column.

 ```julia
-julia> using Dagger, OnlineStats
+julia> using DTables, OnlineStats

-julia> fetch(Dagger.mapreduce(sum, fit!, d1, init = Mean()))
+julia> fetch(DTables.mapreduce(sum, fit!, d1, init = Mean()))
 Mean: n=100 | value=1.50573

 julia> d1 = DTable((a=collect(1:100) .% 3, b=rand(100)), 25);

 julia> gg = GroupBy(Int, Mean());

-julia> fetch(Dagger.mapreduce(x -> (x.a, x.b), fit!, d1, init=gg))
+julia> fetch(DTables.mapreduce(x -> (x.a, x.b), fit!, d1, init=gg))
 GroupBy: Int64 => Mean
 ├─ 1
 │  └─ Mean: n=34 | value=0.491379
@@ -175,7 +229,7 @@ julia> d2 = DTable((;a1=abs.(rand(Int, 100).%2), [Symbol("a\$(i)") => rand(100)

 julia> gb = GroupBy(Int, Group([Series(Mean(), Variance(), Extrema()) for _ in 1:3]...));

-julia> fetch(Dagger.mapreduce(r -> (r.a1, tuple(r...)), fit!, d2, init = gb))
+julia> fetch(DTables.mapreduce(r -> (r.a1, tuple(r...)), fit!, d2, init = gb))
 GroupBy: Int64 => Group
 ├─ 1
 │  └─ Group
@@ -208,7 +262,7 @@ GroupBy: Int64 => Group
 ```


-# Dagger.groupby interface
+# DTables.groupby interface

 A `DTable` can be grouped, which results in the creation of a `GDTable`.
 A distinct set of values contained in a single or multiple columns can be used as grouping keys.
@@ -224,22 +278,22 @@ julia> d = DTable((a=shuffle(repeat('a':'d', inner=4, outer=4)),b=repeat(1:4, 16
 DTable with 16 partitions
 Tabletype: NamedTuple

-julia> Dagger.groupby(d, :a)
+julia> DTables.groupby(d, :a)
 GDTable with 4 partitions and 4 keys
 Tabletype: NamedTuple
 Grouped by: [:a]

-julia> Dagger.groupby(d, [:a, :b])
+julia> DTables.groupby(d, [:a, :b])
 GDTable with 16 partitions and 16 keys
 Tabletype: NamedTuple
 Grouped by: [:a, :b]

-julia> Dagger.groupby(d, row -> row.a + row.b)
+julia> DTables.groupby(d, row -> row.a + row.b)
 GDTable with 7 partitions and 7 keys
 Tabletype: NamedTuple
 Grouped by: #5

-julia> g = Dagger.groupby(d, :a); keys(g)
+julia> g = DTables.groupby(d, :a); keys(g)
 KeySet for a Dict{Char, Vector{UInt64}} with 4 entries. Keys:
 'c'
 'd'
@@ -256,7 +310,7 @@ Tabletype: NamedTuple
 Operations such as `map`, `filter`, and `reduce` can be performed on a `GDTable`.

 ```julia
-julia> g = Dagger.groupby(d, [:a, :b])
+julia> g = DTables.groupby(d, [:a, :b])
 GDTable with 16 partitions and 16 keys
 Tabletype: NamedTuple
 Grouped by: [:a, :b]
@@ -308,7 +362,7 @@ julia> d = DTable((a=repeat('a':'b', inner=2),b=1:4), 2)
 DTable with 2 partitions
 Tabletype: NamedTuple

-julia> g = Dagger.groupby(d, :a)
+julia> g = DTables.groupby(d, :a)
 GDTable with 2 partitions and 2 keys
 Tabletype: NamedTuple
 Grouped by: [:a]
@@ -355,7 +409,7 @@ the join functions coming from the `DataFrames.jl` package for the per chunk joi
 In the future this behavior will be expanded to any type that implements its own join methods, but for now it is limited to `DataFrame` only.

 Please note that the usage of any of the keyword arguments described above will always result in the use of the generic join methods
-defined in `Dagger` regardless of the availability of specialized methods.
+defined in `DTables` regardless of the availability of specialized methods.

 ```julia
 julia> using Tables; pp = d -> for x in Tables.rows(d) println("$(x.a), $(x.b), $(x.c)") end;
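# A hedged sketch, not taken from this document: one way a join call might
# continue the snippet above. The `leftjoin` name and its `on` keyword are
# assumed to follow the DataFrames.jl conventions referenced in this section,
# and the tables `d` and `d2` are made up for illustration; output is omitted.
julia> d = DTable((a=[1, 2, 3, 4], b=[10, 20, 30, 40]), 2);

julia> d2 = (a=[1, 2, 3], c=[100, 200, 300]);

julia> pp(fetch(DTables.leftjoin(d, d2, on=:a), NamedTuple))  # print joined rows with the helper above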