Update dask docs #1532
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
davanstrien left a comment
Looks great! If you don't have a chance, I can also add something similar for Polars. I had one question about how Dask does filtering for the downloads. I probably misunderstood something here.
# Dask will skip the files or row groups that don't
# match the query without downloading them.
df = df[df.dump >= "CC-MAIN-2023"]
Does Dask not still need to download the data to check whether the values in this column match the filter? From what I understood in the Polars case, predicate pushdown is usually used to skip reading a column (i.e. if you drop it later, it doesn't bother to load it) and/or to do a filtering step early on. Is Dask able to do this directly, before loading?
So it will skip the row groups that don't contain any row matching the query, using the row group metadata.
Then, on the remaining row groups, it will download the column used for filtering in order to apply the filter.
The other columns will be downloaded or not, depending on the other operations done on the dataset.
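The three steps above can be sketched with a toy model in plain Python. This is only an illustration of the idea, not how Dask or Parquet readers are actually implemented: `RowGroup`, `read_filtered`, and the sample values are hypothetical names and data made up for this example, and the min/max statistics are computed from the values here, whereas a real Parquet file stores them in its metadata.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group."""
    columns: dict  # column name -> list of values

    def stats(self, column):
        # Assumption: in a real file these min/max values come from the
        # row group metadata, so no column data has to be downloaded.
        values = self.columns[column]
        return min(values), max(values)


def read_filtered(row_groups, filter_column, lower_bound, wanted_columns):
    """Return matching rows plus a log of (row group index, column) fetches."""
    fetched = []  # every column chunk that would have been downloaded
    rows = []
    for index, group in enumerate(row_groups):
        # Step 1: metadata only. If max < bound, no row can match: skip.
        _, col_max = group.stats(filter_column)
        if col_max < lower_bound:
            continue
        # Step 2: fetch the filter column and build the row mask.
        fetched.append((index, filter_column))
        mask = [value >= lower_bound for value in group.columns[filter_column]]
        # Step 3: fetch only the other columns the computation needs.
        for column in wanted_columns:
            if column != filter_column:
                fetched.append((index, column))
        for position, keep in enumerate(mask):
            if keep:
                rows.append({c: group.columns[c][position] for c in wanted_columns})
    return rows, fetched


groups = [
    RowGroup({"dump": ["CC-MAIN-2021-04", "CC-MAIN-2022-05"],
              "text": ["a", "b"], "url": ["u1", "u2"]}),
    RowGroup({"dump": ["CC-MAIN-2022-49", "CC-MAIN-2023-06"],
              "text": ["c", "d"], "url": ["u3", "u4"]}),
]
rows, fetched = read_filtered(groups, "dump", "CC-MAIN-2023", ["text"])
print(rows)
print(fetched)
```

In this toy run the first row group is skipped from its statistics alone, and the `url` column is never fetched because nothing downstream asks for it.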
> if you drop it later it doesn't bother to load it and/or doing a filtering step early on. Is Dask directly able to do this before loading?
Yes, correct! It would be cool to explain that here as well.
That's super cool!! For some datasets the download time does seem to end up becoming a blocker so this is very neat!
Co-authored-by: Daniel van Strien <[email protected]>
Thanks for the review! Merging this one for now, but let me know if you have more comments.