Skip to content

Conversation

scott-routledge2
Copy link
Contributor

@scott-routledge2 scott-routledge2 commented Apr 16, 2025

Changes included in this PR

Subclass BaseExecutionEngine and implement BodoExecutionEngine, add as an attribute to bodo.jit / return_wrapped_fn.
See pandas-dev/pandas#61032 for context.

As a followup we can also implement kwargs (but not key word only arguments, basically do something similar to numba where they fold keyword arguments into args using the function signature.)

Also need to do Series.map support, however that cannot be tested right now because it hasn't been implemented in Pandas.

Testing strategy

In order to test, I needed to build Pandas from source. Here are the steps I took to test:

  1. Remove pandas from environment i.e. pixi remove pandas. Also need to update pyproject.toml. Note pyproject.toml and pixi.lock/toml will be restored before merging.
  2. Uninstall bodo, etc: pixi run clean
  3. Clone pandas and install from source: cd pandas; pip install .
  4. Build bodo: pixi run build-bodo

Ran PR CI for changes to jit decorator.

Run tests locally in spawn mode.

User facing changes

Once Pandas 3 is released, users can use Bodo to accelerate their UDFs in Pandas.
e.g.

df.apply(func, axis=1, engine=bodo.jit, args=args)

Checklist

  • Pipelines passed before requesting review. To run CI you must include [run CI] in your commit message.
  • I am familiar with the Contributing Guide
  • I have installed + ran pre-commit hooks.


class BodoExecutionEngine(BaseExecutionEngine):
@staticmethod
def map(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not implemented yet in Pandas so there is no way to test. Leaving as a followup

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open a PR adding engine to Series.map shortly. I have the implementation already, but map and apply tests are structured very differently, and I need to see how to implement the common fixtures without making the mess bigger or refactoring all the tests.

Copy link

codecov bot commented Apr 16, 2025

Codecov Report

Attention: Patch coverage is 18.96552% with 47 lines in your changes missing coverage. Please review.

Project coverage is 66.52%. Comparing base (76705aa) to head (e9cc70d).
Report is 11 commits behind head on main.

❌ Your project check has failed because the head coverage (66.52%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #410      +/-   ##
==========================================
- Coverage   66.59%   66.52%   -0.07%     
==========================================
  Files         176      176              
  Lines       63177    63288     +111     
  Branches     8835     8856      +21     
==========================================
+ Hits        42072    42103      +31     
- Misses      18483    18561      +78     
- Partials     2622     2624       +2     

pixi.toml Outdated
@@ -70,7 +70,6 @@ pip = "*"
# Core Python Deps
numba = "==0.61.0"
numpy = ">=1.24,<2.2"
pandas = ">=2.2,<2.3"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For testing locally, will restore before merging.

@scott-routledge2 scott-routledge2 marked this pull request as ready for review April 17, 2025 17:15
@scott-routledge2 scott-routledge2 changed the title [WIP] Implement interface for using bodo as engine in Pandas UDFs. [BSE-4737] Implement interface for using bodo as engine in Pandas UDFs. Apr 17, 2025
Copy link
Collaborator

@DrTodd13 DrTodd13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

args_str += ","
else:
# Add dummy value for args for spawn mode compatibility.
# TODO: fix in spawn mode.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a PR to fix this here: #414

Copy link
Collaborator

@ehsantn ehsantn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @scott-routledge2 ! Looks good to me.

pyproject.toml Outdated
"pandas>=2.2,<2.3",
"pandas>=2.2,<3.1",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert?

def f(func_text, glbls, loc_vars, __name__):
bodo_exec(func_text, glbls, loc_vars, __name__)

spawner.submit_func_to_workers(f, [], apply_func_text, glbls, {}, __name__)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong indentation looks like. Should be inside if spawn_mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I just realized this codepath is taken even when engine=bodo.jit(spawn=False, distributed=False). Would there be a way to also case on the decorator?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will exec the func text on the workers even when spawn=False. Discussed offline it would be better to just avoid func_text altogether.


def test_udf_cache():
"""Tests that we can call the same UDF multiple times with cache flag on
without any errors. TODO: check cache."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you verified caching manually at least?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@scott-routledge2 scott-routledge2 merged commit 30b2bc1 into main Apr 17, 2025
10 checks passed
@scott-routledge2 scott-routledge2 deleted the scott/pandas_bodo_executor branch April 17, 2025 19:14
scott-routledge2 added a commit that referenced this pull request Apr 23, 2025
…s. (#410)

Subclass BaseExecutionEngine and implement BodoExecutionEngine, add as an attribute to bodo.jit / return_wrapped_fn.

As a followup we can also implement kwargs (but not key word only arguments, basically do something similar to numba where they fold keyword arguments into args using the function signature.)

Also need to do Series.map support, however that cannot be tested right now because it hasn't been implemented in Pandas.

(cherry picked from commit 30b2bc1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants