[Data] Add polars usage instruction to docs #60029
peterxcli wants to merge 12 commits into ray-project:master
… section Signed-off-by: peterxcli <peterxcli@gmail.com>
Force-pushed from f3b3875 to 5cbd0ef
/gemini review
Code Review
This pull request adds documentation on how to enable Polars operations in Ray Data via DataContext. The added section is clear and helpful. I've found a minor issue in one sentence which contains a typo and a grammatical error. I've provided a suggestion to fix it for better clarity.
Code Review
This pull request adds documentation on how to enable Polars-based operations in Ray Data, specifically for sorting, by setting use_polars_sort in the DataContext. The change is clear and useful. I've found a minor grammatical issue in the new documentation and suggested a correction to improve clarity.
@peterxcli please fix the suggestion from gemini, thanks
Also, if we want to use a glossary term like "Polars", please wrap it in backticks.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Peter Lee <peterxcli@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
@owenowenisme please take another look. thanks!
@owenowenisme Thanks for the review🙏
If you encounter OOM errors, try decreasing your ``batch_size``.

Enabling ``Polars`` operations
Nit: Here and elsewhere -- I think it makes more sense for Polars to not be code text, especially since it's not part of the glossary or anything.
ctx = ray.data.DataContext.get_current()
ctx.use_polars_sort = True

When you enable this flag, Ray Data automatically uses ``Polars`` for tabular dataset sorting operations, which can significantly improve performance for certain workloads. This doesn't affect your UDF code; you can still use any batch format in :meth:`~ray.data.Dataset.map_batches`.
What're the user-facing Ray Data APIs that benefit from the polars feature?
IIUC it doesn't improve performance for most UDFs except for map_groups, and that's because of an implementation detail where we perform a sort.
Would this information be more appropriate in a different user guide(s)?
I moved it to performance-tips, WDYT?
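The reviewer's point above (the flag matters for sorting, and for map_groups only because of an internal sort) can be sketched roughly as follows. This is a minimal illustration and not code from the PR: the column names `key` and `value`, the helper names `summarize_group` and `run_with_polars_sort`, and the sample rows are all hypothetical, and the second function assumes both Ray and Polars are installed.

```python
def summarize_group(batch):
    """Collapse one group into a single row: its key and the sum of its values.

    `batch` is assumed to be a dict of column name -> sequence, which is
    what map_groups passes with its default batch format.
    """
    return {"key": [batch["key"][0]], "total": [sum(batch["value"])]}


def run_with_polars_sort(rows):
    """Sort and group `rows` with the opt-in flag set (needs Ray and Polars)."""
    import ray  # lazy import so summarize_group stays usable without Ray

    ctx = ray.data.DataContext.get_current()
    ctx.use_polars_sort = True  # set before the dataset is built

    ds = ray.data.from_items(rows)
    sorted_rows = ds.sort("value").take_all()
    # Per the review comment, map_groups sorts internally, so it may also
    # be affected by the flag; that is the reviewer's reading, not verified here.
    totals = ds.groupby("key").map_groups(summarize_group).take_all()
    return sorted_rows, totals
```

Calling `run_with_polars_sort([{"key": "a", "value": 2}, {"key": "a", "value": 1}])` would exercise both code paths; whether either one is measurably faster depends on the workload.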
Hi @peterxcli, are you still working on this?
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Pull request overview
Adds documentation to the Ray Data “Transforming data” guide describing how to enable polars-backed optimizations via DataContext, in response to #59224.
Changes:
- Add a new “Enabling Polars operations” subsection explaining DataContext.use_polars_sort.
- Minor whitespace cleanup in existing .. testcode:: blocks.
You can enable ``Polars`` globally to optimize certain Ray Data operations. Ray Data uses ``Polars`` internally for better performance when processing data.

To enable ``Polars`` operations, configure the :class:`~ray.data.DataContext`:

.. testcode::

    import ray

    ctx = ray.data.DataContext.get_current()
    ctx.use_polars_sort = True

When you enable this flag, Ray Data automatically uses ``Polars`` for tabular dataset sorting operations, which can significantly improve performance for certain workloads. This doesn't affect your UDF code. You can still use any batch format in :meth:`~ray.data.Dataset.map_batches`.
The text implies Ray Data generally uses Polars internally, but the code path is only selected when DataContext.use_polars_sort (or deprecated use_polars) is enabled, and it currently applies to sort-related transforms. Consider rewording to clarify the scope (sorting only) and that it’s opt-in via this flag.
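To make the opt-in nature of the flag concrete, here is a small sketch of a context manager that turns `use_polars_sort` on for a limited scope and then restores the previous value. The helper name `polars_sort` is hypothetical and not part of Ray; the demo uses a `SimpleNamespace` stand-in so it runs without Ray installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def polars_sort(ctx, enabled=True):
    """Set ctx.use_polars_sort for the duration of the block, then restore it."""
    previous = getattr(ctx, "use_polars_sort", False)
    ctx.use_polars_sort = enabled
    try:
        yield ctx
    finally:
        # Restore the earlier value even if the block raises.
        ctx.use_polars_sort = previous


# Stand-in for ray.data.DataContext.get_current(), used only for this demo.
fake_ctx = SimpleNamespace(use_polars_sort=False)
with polars_sort(fake_ctx):
    inside = fake_ctx.use_polars_sort
after = fake_ctx.use_polars_sort
```

With Ray, the object passed in would be `ray.data.DataContext.get_current()`, and the dataset would be both created and executed inside the `with` block. Whether a dataset created before the flag was set picks up the change is not established in this thread, so the sketch does not rely on it.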
Force-pushed from 2a239fa to 86082dc
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Description
We can use polars to make operations more efficient by:
- using polars inside a map_batches UDF (this is already covered at: https://docs.ray.io/en/master/data/transforming-data.html#choosing-the-right-batch-format)
- setting use_polars or use_polars_sort in DataContext to enable built-in polars ops (the use_polars flag is deprecated, so I only add use_polars_sort at this time)
Related issues
Closes: #59224
Additional information
No
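For the first approach listed in the description (Polars inside a map_batches UDF), a rough sketch might look like the following. It is an illustration under assumptions, not code from the PR or the linked guide: the column name `value`, the new column `doubled`, and the function names are hypothetical, and it assumes Ray, Polars, and PyArrow are installed.

```python
def double_values(table):
    """Take a pyarrow.Table batch, transform it with Polars, return a pyarrow.Table."""
    import polars as pl  # lazy import: only needed when the UDF actually runs

    df = pl.from_arrow(table)
    df = df.with_columns((pl.col("value") * 2).alias("doubled"))
    return df.to_arrow()


def run(rows):
    """Apply the UDF through Ray Data, requesting Arrow batches."""
    import ray

    ds = ray.data.from_items(rows)
    return ds.map_batches(double_values, batch_format="pyarrow").take_all()
```

Requesting `batch_format="pyarrow"` keeps the hand-off to Polars cheap, since Polars can usually wrap Arrow data without copying. This path is independent of the `use_polars_sort` flag.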