Pull request #42: Adding `first`, `last` and `describe` convenience functions
```diff
@@ -28,6 +28,7 @@ using Tables:
     materializer,
     partitioner,
     rows,
+    rowtable,
     schema,
     Schema
```
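Background on the new import (an aside, not part of the PR): `Tables.rowtable` materializes any Tables.jl source as a `Vector` of `NamedTuple`s, which is what lets the new `first` below slice individual rows out of a fetched chunk:

```julia
using Tables

# rowtable returns a Vector of NamedTuples, so plain indexing selects rows
rt = Tables.rowtable((a = [1, 2, 3], b = [4.0, 5.0, 6.0]))
rt[1:2]  # [(a = 1, b = 4.0), (a = 2, b = 5.0)]
```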
src/table/dtable.jl:

```diff
@@ -258,6 +258,27 @@ function length(table::DTable)
     return sum(chunk_lengths(table))
 end
 
+function first(table::DTable, rows::UInt)
+    if nrow(table) == 0
+        return table
+    end
+
+    chunk_length = chunk_lengths(table)[1]
+    num_full_chunks = Int(floor(rows / chunk_length)) # number of required chunks
+    sink = materializer(table.tabletype)
+    if num_full_chunks * chunk_length == rows
+        required_chunks = table.chunks[1:num_full_chunks]
+    else
+        # take only the needed rows from the extra chunk
+        needed_rows = rows - num_full_chunks * chunk_length
+        extra_chunk = table.chunks[num_full_chunks + 1]
+        extra_chunk_rows = rowtable(fetch(extra_chunk))
+        new_chunk = Dagger.tochunk(sink(extra_chunk_rows[1:needed_rows]))
+        required_chunks = vcat(table.chunks[1:num_full_chunks], [new_chunk])
+    end
+    return DTable(required_chunks, table.tabletype)
+end
+
 function columnnames_svector(d::DTable)
     colnames_tuple = determine_columnnames(d)
     return colnames_tuple !== nothing ? [sym for sym in colnames_tuple] : nothing
```
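For reference, a minimal usage sketch of the method as written in this hunk (assuming the public `DTable(table, chunksize)` constructor; the `UInt` argument matches the signature above):

```julia
using DTables

d = DTable((a = collect(1:100), b = rand(100)), 30)  # chunk lengths: [30, 30, 30, 10]
f = first(d, UInt(50))  # 1 full chunk (30 rows) + a 20-row cut of the next chunk
fetch(f)                # materializes the first 50 rows as a NamedTuple table
```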
Inline review comments on this hunk:

krynju: it's better to do this with […]

Author: @krynju how does it look now? I've used the maximum among all chunk lengths.

krynju: We should use the actual chunk lengths and not a maximum of them. When you call `first(d, 50)` it should go something like this:

```julia
s = 50
csum = 0
chunks = []
for (cl, chunk) in zip(chunk_lengths(d), d.chunks)
    if csum + cl > s
        # do the thing with spawn: this is the last chunk we need,
        # so make a thunk from it and cut it down to the remaining rows
        push!(chunks, the_cut_thunk)
        break
    else
        csum += cl
        push!(chunks, chunk)
    end
end
return DTable(chunks)
```
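A hedged sketch of how that suggestion might be completed (not the merged implementation; `take_rows` and `first_by_chunk_lengths` are invented names, and it reuses `rowtable`/`materializer` from the hunk above plus the same internal `DTable(chunks, tabletype)` constructor, which may need an exact vector type in practice):

```julia
using Dagger
using DTables
using DTables: chunk_lengths
using Tables: materializer, rowtable

# Invented helper: keep the first n rows of a chunk's table. Dagger unwraps
# Chunk arguments, so `part` arrives here as the underlying in-memory table.
take_rows(part, n, sink) = sink(rowtable(part)[1:n])

function first_by_chunk_lengths(table::DTable, s::Integer)
    sink = materializer(table.tabletype)
    csum = 0
    chunks = Any[]
    for (cl, chunk) in zip(chunk_lengths(table), table.chunks)
        if csum + cl >= s
            # the last chunk we need: spawn a lazy cut instead of fetching
            # eagerly on the caller (cuts the whole chunk when s - csum == cl)
            push!(chunks, Dagger.@spawn take_rows(chunk, s - csum, sink))
            break
        end
        csum += cl
        push!(chunks, chunk)
    end
    return DTable(chunks, table.tabletype)
end
```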
krynju: Chunk lengths are not guaranteed to be equal; some may even be empty.
Author: Hi @krynju. If this is the case, is there any way to retrieve the original `chunksize`? If I'm not wrong, it's not stored as a property of `DTable`s.

On another note: suppose for a `DTable` I have a `chunksize` greater than the number of rows in the table. In that case, won't I lose information about what `chunksize` I passed?
krynju: Yeah, I think it was an early design decision to make `chunksize` an argument of the constructor for the initial partitioning and to ignore it later (and for that reason not store it either).

I think including the original `chunksize` in the logic would also be a bit confusing and would make it more complex, but if we have any use case for that then we can revisit this.

I did think of caching the current chunk sizes, because generally that information doesn't change in a `DTable` (after you manipulate a `DTable` it becomes a new `DTable`). We already cache the schema, so a similar mechanism could be used.
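A hypothetical sketch of what that caching could look like (the struct, field, and function names here are invented for illustration; DTables.jl does not currently do this, and a real change would extend the existing `DTable` struct rather than define a new one):

```julia
using Dagger
using Tables

# Hypothetical: lazily cache chunk lengths the way the schema is cached.
mutable struct DTableWithLengthCache
    chunks::Vector{Any}
    tabletype::Any
    chunk_lengths_cache::Union{Nothing,Vector{Int}}
end

function cached_chunk_lengths!(d::DTableWithLengthCache)
    if d.chunk_lengths_cache === nothing
        # computing once is safe: operations on a DTable return a new DTable,
        # so this table's chunk set never changes after construction
        d.chunk_lengths_cache = [length(Tables.rowtable(fetch(c))) for c in d.chunks]
    end
    return d.chunk_lengths_cache
end
```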
krynju: And for this you can just use `chunk_lengths`, as you did: https://github.com/JuliaParallel/DTables.jl/blob/9fcbe237e0c6ddd6b6f2880f33347efe99a76fdd/src/table/dtable.jl#L252C10-L255
krynju: You will, and you will only get one partition in the `DTable`.
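To illustrate that point (a sketch; it assumes `chunk_lengths` is accessible as it is used throughout this thread):

```julia
using DTables
using DTables: chunk_lengths

# chunksize (100) exceeds the number of rows (10), so everything lands in
# one partition and the chunksize that was passed is not recoverable later
d = DTable((a = collect(1:10),), 100)
chunk_lengths(d)  # expected: [10]
```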
Author: @krynju how about this: to get the `chunksize`, can I get the maximum value from `chunk_lengths`? Certainly this maximum should be the original `chunksize`, except for a boundary case where `chunksize` is greater than the number of rows.
krynju: Again, not guaranteed. Why do you need the original `chunksize`?
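One way to see why the maximum is not guaranteed to equal the original `chunksize` (a sketch; `filter` on a `DTable` is part of the DTables.jl API):

```julia
using DTables
using DTables: chunk_lengths

d = DTable((a = collect(1:100),), 30)  # fresh table: chunk lengths [30, 30, 30, 10]
f = filter(r -> r.a % 2 == 0, d)       # every chunk shrinks after filtering
chunk_lengths(f)                       # expected: [15, 15, 15, 5] -- the maximum (15)
                                       # no longer matches the original chunksize (30)
```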