1 change: 1 addition & 0 deletions src/DTables.jl
@@ -28,6 +28,7 @@ using Tables:
materializer,
partitioner,
rows,
rowtable,
schema,
Schema

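For context, Tables.rowtable materializes any Tables.jl-compatible source as a Vector of NamedTuples, one per row, which is what makes the row slicing in the new first method below possible. A quick illustration (a minimal sketch; expected results shown as comments):

using Tables

rows = Tables.rowtable((a = [1, 2, 3], b = [4, 5, 6]))
rows[1:2]  # [(a = 1, b = 4), (a = 2, b = 5)] -- rows slice like any vector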
21 changes: 21 additions & 0 deletions src/table/dtable.jl
@@ -258,6 +258,27 @@ function length(table::DTable)
return sum(chunk_lengths(table))
end

function first(table::DTable, rows::UInt)
if nrow(table) == 0
return table
end

chunk_length = chunk_lengths(table)[1]
Member:

Chunk lengths are not guaranteed to be equal; some may even be empty.

Author:

Hi @krynju. If this is the case, is there any way to retrieve the original chunksize? If I'm not mistaken, it's not stored as a property of the DTable.

On another note: suppose for a DTable I pass a chunksize greater than the number of rows in the table. In that case, won't I lose information about what chunksize I passed?

Member:

Yeah, I think it was an early design decision to make chunksize an argument of the constructor for the initial partitioning, and later to ignore it (and for that reason not store it either).

I think including the original chunksize in the logic would also be a bit confusing and would make it more complex, but if we have any use case for that we can revisit this.

I did think of caching the current chunk sizes, because generally that information doesn't change in a dtable (after you manipulate a dtable it becomes a new dtable). We already cache the schema, so a similar mechanism could be used.
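A minimal sketch of what such a cache could look like, mirroring the schema cache (the chunk_lengths_cache field is hypothetical and does not exist in the current DTable struct):

# Hypothetical memoization of chunk lengths; `chunk_lengths_cache` is an
# assumed field, used here for illustration only.
function cached_chunk_lengths(d::DTable)
    if d.chunk_lengths_cache === nothing
        # compute once; a DTable's chunks don't change after construction
        d.chunk_lengths_cache = chunk_lengths(d)
    end
    return d.chunk_lengths_cache
end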

Member:

> On another note: suppose for a DTable I pass a chunksize greater than the number of rows in the table. In that case, won't I lose information about what chunksize I passed?

You will, and you will get only one partition in the dtable.
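For example (a hedged sketch; expected results shown as comments, not verified output):

using DTables

d = DTable((a = 1:3,), 100)  # chunksize 100, but only 3 rows
chunk_lengths(d)             # expected: [3] -- a single partition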

Author (@codetalker7, Aug 10, 2023):

> Yeah, I think it was an early design decision to make chunksize an argument of the constructor for the initial partitioning, and later to ignore it (and for that reason not store it either).
>
> I think including the original chunksize in the logic would also be a bit confusing and would make it more complex, but if we have any use case for that we can revisit this.
>
> I did think of caching the current chunk sizes, because generally that information doesn't change in a dtable (after you manipulate a dtable it becomes a new dtable). We already cache the schema, so a similar mechanism could be used.

@krynju, how about this: to get the chunksize, can I take the maximum value from chunk_lengths? That maximum should be the original chunksize, except for the boundary case where the chunksize is greater than the number of rows.

Member:

Again, not guaranteed. Why do you need the original chunksize?
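As an illustration of why the maximum is unreliable: per-chunk operations such as filter can leave uneven or even empty partitions (a hedged sketch; expected results shown as comments):

using DTables

d = DTable((a = 1:10,), 4)   # initial chunk lengths: [4, 4, 2]
f = filter(r -> r.a > 6, d)  # filter runs per chunk, keeping the layout
chunk_lengths(f)             # expected: [0, 2, 2] -- uneven, one empty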

num_full_chunks = Int(floor(rows / chunk_length)) # number of required chunks
sink = materializer(table.tabletype)
if num_full_chunks * chunk_length == rows
required_chunks = table.chunks[1:num_full_chunks]
else
# take only the needed rows from extra chunk
needed_rows = rows - num_full_chunks * chunk_length
extra_chunk = table.chunks[num_full_chunks + 1]
extra_chunk_rows = rowtable(fetch(extra_chunk))
new_chunk = Dagger.tochunk(sink(extra_chunk_rows[1:needed_rows]))
required_chunks = vcat(table.chunks[1:num_full_chunks], [new_chunk])
Member (@krynju, Aug 9, 2023):

It's better to do this with Dagger.@spawn and make the last dtable chunk just a thunk (i.e., the result of Dagger.@spawn).
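For reference, a minimal Dagger.@spawn example showing that the result is a thunk which is only computed and fetched on demand (a sketch, independent of this PR):

using Dagger

t = Dagger.@spawn sum(1:10)  # returns a thunk, not the value
fetch(t)                     # 55, computed by the scheduler on demand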

Author:

@krynju how does it look now? I've used the maximum among all chunk_lengths to get the original chunk size, and made the last chunk a thunk.

Member (@krynju, Sep 1, 2023):

We should use the actual chunk lengths, not their maximum.

When you call first(d, 50) it should go something like this (not valid code, just writing the idea; a runnable elaboration appears after the diff below):

s = 50
csum = 0
chunks = []
for (cl, chunk) in zip(chunk_lengths(d), d.chunks)
    if csum + cl >= s
        # do the thing with spawn: this is the last chunk, so make a
        # thunk from it and cut it down to the rows we still need
        push!(chunks, the_cut_thunk)
        break
    else
        csum += cl
        push!(chunks, chunk)
    end
end
return DTable(chunks)

end
return DTable(required_chunks, table.tabletype)
end
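Below is a hedged, runnable elaboration of the reviewer's per-chunk idea, using only names already present in this diff (chunk_lengths, materializer, rowtable, Dagger.@spawn, and the DTable(chunks, tabletype) constructor). The helper cut_chunk and the name first_sketch are illustrative, not part of the PR; treat this as a sketch, not the final implementation:

# Hypothetical helper: materialize the first `n` rows of a chunk's data.
cut_chunk(chunk_data, n, sink) = sink(rowtable(chunk_data)[1:n])

function first_sketch(table::DTable, s::Integer)
    chunks = Any[]
    csum = 0
    sink = materializer(table.tabletype)
    for (cl, chunk) in zip(chunk_lengths(table), table.chunks)
        if csum + cl < s
            push!(chunks, chunk)  # the whole chunk fits; take it as-is
            csum += cl
        else
            # last chunk needed: cut it lazily so it stays a thunk
            needed = s - csum
            push!(chunks, Dagger.@spawn cut_chunk(chunk, needed, sink))
            break
        end
    end
    return DTable(chunks, table.tabletype)
end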

function columnnames_svector(d::DTable)
colnames_tuple = determine_columnnames(d)
return colnames_tuple !== nothing ? [sym for sym in colnames_tuple] : nothing