@@ -10,7 +10,7 @@ Provide a `Tables.jl` compatible source, as well as a `chunksize`, which is the
 maximum number of rows of each partition:

 ```julia
-julia> using Dagger
+julia> using DTables

 julia> table = (a=[1, 2, 3, 4, 5], b=[6, 7, 8, 9, 10]);

@@ -28,7 +28,7 @@ Provide a `loader_function` and a list of filenames, which are parts of the
 full table:

 ```julia
-julia> using Dagger, CSV
+julia> using DTables, CSV

 julia> files = ["1.csv", "2.csv", "3.csv"];

@@ -87,6 +87,60 @@ julia> fetch(d, NamedTuple)
 (a = [1, 2, 3, 4, 5], b = [6, 7, 8, 9, 10])
 ```

+## Behavior of the `interpartition_merges` kwarg
+
+If a source supports the `Tables.partitions` interface, then the DTable
+will adopt the partition sizes of the source. However, if you
+specify an exact chunk size, the DTable will attempt to create
+chunks of exactly that size, even if that means merging data between partitions.
+This behavior is controlled by the `interpartition_merges` kwarg (`true` by default)
+and is best illustrated by the following example.
+
+```julia
+julia> using DTables, CSV
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4)) |> DTables.chunk_lengths
+4-element Vector{Int64}:
+ 249995
+ 250005
+ 249995
+ 250005
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 200_000) |> DTables.chunk_lengths
+5-element Vector{Int64}:
+ 200000
+ 200000
+ 200000
+ 200000
+ 200000
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 200_000, interpartition_merges=false) |> DTables.chunk_lengths
+8-element Vector{Int64}:
+ 200000
+ 49995
+ 200000
+ 50005
+ 200000
+ 49995
+ 200000
+ 50005
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 300_000) |> DTables.chunk_lengths
+4-element Vector{Int64}:
+ 300000
+ 300000
+ 300000
+ 100000
+
+julia> DTable(CSV.Chunks("test.csv", ntasks=4), 300_000, interpartition_merges=false) |> DTables.chunk_lengths
+4-element Vector{Int64}:
+ 249995
+ 250005
+ 249995
+ 250005
+
+```
+
 # Table operations

 **Warning: this interface is experimental and may change at any time**
@@ -99,7 +153,7 @@ Below is an example of their usage.
 For more information please refer to the API documentation and unit tests.

 ```julia
-julia> using Dagger
+julia> using DTables

 julia> d = DTable((k = repeat(['a', 'b'], 500), v = repeat(1:10, 100)), 100)
 DTable with 10 partitions
@@ -153,16 +207,16 @@ It lets you transform a row to the required format before applying the reduce fu
 In consequence a lot of memory usage should be saved due to the lack of an intermediate `map` step that allocates a full column.

 ```julia
-julia> using Dagger, OnlineStats
+julia> using DTables, OnlineStats

-julia> fetch(Dagger.mapreduce(sum, fit!, d1, init = Mean()))
+julia> fetch(DTables.mapreduce(sum, fit!, d1, init = Mean()))
 Mean: n=100 | value=1.50573

 julia> d1 = DTable((a=collect(1:100).%3, b=rand(100)), 25);

 julia> gg = GroupBy(Int, Mean());

-julia> fetch(Dagger.mapreduce(x->(x.a, x.b), fit!, d1, init=gg))
+julia> fetch(DTables.mapreduce(x->(x.a, x.b), fit!, d1, init=gg))
 GroupBy: Int64 => Mean
 ├─ 1
 │ └─ Mean: n=34 | value=0.491379
@@ -175,7 +229,7 @@ julia> d2 = DTable((;a1=abs.(rand(Int, 100).%2), [Symbol("a\$(i)") => rand(100)

 julia> gb = GroupBy(Int, Group([Series(Mean(), Variance(), Extrema()) for _ in 1:3]...));

-julia> fetch(Dagger.mapreduce(r -> (r.a1, tuple(r...)), fit!, d2, init = gb))
+julia> fetch(DTables.mapreduce(r -> (r.a1, tuple(r...)), fit!, d2, init = gb))
 GroupBy: Int64 => Group
 ├─ 1
 │ └─ Group
@@ -208,7 +262,7 @@ GroupBy: Int64 => Group
 ```


-# Dagger.groupby interface
+# DTables.groupby interface

 A `DTable` can be grouped which will result in creation of a `GDTable`.
 A distinct set of values contained in a single or multiple columns can be used as grouping keys.
@@ -224,22 +278,22 @@ julia> d = DTable((a=shuffle(repeat('a':'d', inner=4, outer=4)),b=repeat(1:4, 16
 DTable with 16 partitions
 Tabletype: NamedTuple

-julia> Dagger.groupby(d, :a)
+julia> DTables.groupby(d, :a)
 GDTable with 4 partitions and 4 keys
 Tabletype: NamedTuple
 Grouped by: [:a]

-julia> Dagger.groupby(d, [:a, :b])
+julia> DTables.groupby(d, [:a, :b])
 GDTable with 16 partitions and 16 keys
 Tabletype: NamedTuple
 Grouped by: [:a, :b]

-julia> Dagger.groupby(d, row -> row.a + row.b)
+julia> DTables.groupby(d, row -> row.a + row.b)
 GDTable with 7 partitions and 7 keys
 Tabletype: NamedTuple
 Grouped by: #5

-julia> g = Dagger.groupby(d, :a); keys(g)
+julia> g = DTables.groupby(d, :a); keys(g)
 KeySet for a Dict{Char, Vector{UInt64}} with 4 entries. Keys:
   'c'
   'd'
@@ -256,7 +310,7 @@ Tabletype: NamedTuple
 Operations such as `map`, `filter`, `reduce` can be performed on a `GDTable`

 ```julia
-julia> g = Dagger.groupby(d, [:a, :b])
+julia> g = DTables.groupby(d, [:a, :b])
 GDTable with 16 partitions and 16 keys
 Tabletype: NamedTuple
 Grouped by: [:a, :b]
@@ -308,7 +362,7 @@ julia> d = DTable((a=repeat('a':'b', inner=2),b=1:4), 2)
 DTable with 2 partitions
 Tabletype: NamedTuple

-julia> g = Dagger.groupby(d, :a)
+julia> g = DTables.groupby(d, :a)
 GDTable with 2 partitions and 2 keys
 Tabletype: NamedTuple
 Grouped by: [:a]
@@ -355,7 +409,7 @@ the join functions coming from the `DataFrames.jl` package for the per chunk joi
 In the future this behavior will be expanded to any type that implements its own join methods, but for now is limited to `DataFrame` only.

 Please note that the usage of any of the keyword arguments described above will always result in the usage of generic join methods
-defined in `Dagger` regardless of the availability of specialized methods.
+defined in `DTables` regardless of the availability of specialized methods.

 ```julia
 julia> using Tables; pp = d -> for x in Tables.rows(d) println("$(x.a), $(x.b), $(x.c)") end;
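
Note on the `interpartition_merges` hunk in this diff: the chunk lengths it lists follow a simple arithmetic pattern. With merging enabled, all rows appear to be pooled and cut into chunks of the requested size. With merging disabled, each source partition appears to be split on its own. The sketch below models only that arithmetic in plain Julia. `simulated_chunk_lengths` is a hypothetical helper introduced for illustration, not part of DTables, and the real implementation may differ.

```julia
# Hypothetical model of the chunk lengths shown in the example output.
# It does not call DTables; it only reproduces the arithmetic.
function simulated_chunk_lengths(partition_lengths, chunksize; interpartition_merges=true)
    # Split `n` rows into full chunks of `chunksize` plus one remainder chunk.
    function split_rows(n)
        nfull, rest = divrem(n, chunksize)
        lengths = fill(chunksize, nfull)
        rest > 0 && push!(lengths, rest)
        return lengths
    end

    if interpartition_merges
        # Rows may cross partition boundaries, so only the total matters.
        return split_rows(sum(partition_lengths))
    else
        # Each source partition is split independently; nothing is merged.
        return reduce(vcat, (split_rows(n) for n in partition_lengths); init=Int[])
    end
end

parts = [249995, 250005, 249995, 250005]  # partition sizes from the example

simulated_chunk_lengths(parts, 200_000)                               # five chunks of 200000
simulated_chunk_lengths(parts, 200_000; interpartition_merges=false)  # eight chunks
simulated_chunk_lengths(parts, 300_000; interpartition_merges=false)  # same as `parts`
```

Under this model, a chunk size larger than every source partition leaves the partitions untouched when merging is disabled, which matches the last example in the hunk.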