Add support for new pandas UDF engine #418

datapythonista · 2025-05-24T11:14:30Z

Closes #383

With these changes, it'll be possible to do things like this with pandas and blosc2:

import numpy as np
import pandas
import blosc2

def my_func(x):
    return np.sin(x * 2)

s = pandas.Series([1, 2, 3], index=list('abc'), name='sample')

# blosc2 will jit the function
print(s.map(my_func, engine=blosc2.jit))

In this PR I'm assuming that blosc2.jit is intended to be used with vectorized numpy operations. If that's not the case, and you want to support scalar functions and jit Python loops (see example below), let me know and we can implement this for map:

@blosc2.jit
def my_func(array):
    for item in array:
        scalar_udf(item)

Also, apply is designed in pandas to call the udf for each column or row in a dataframe. I'm passing the whole array since it's a numpy array and I think operations should be vectorized, but I can make it work by columns or rows.
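To make the two calling strategies concrete, here is a minimal sketch using plain NumPy only (blosc2 is not involved, and `vectorized_udf` and the sample array are illustrative names, not code from this PR). It compares calling a vectorized function once on the whole 2D array with calling it once per column, which is how pandas `apply` behaves by default.

```python
import numpy as np

def vectorized_udf(x):
    # Works element-wise on an array of any shape
    return np.sin(x * 2)

data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Strategy 1: one call on the whole array
whole = vectorized_udf(data)

# Strategy 2: one call per column, results stacked back as columns
by_column = np.column_stack(
    [vectorized_udf(data[:, col]) for col in range(data.shape[1])]
)

# For purely element-wise functions both strategies agree
print(np.allclose(whole, by_column))
```

The two strategies only differ for functions that are not element-wise (reductions, for example), which is where the choice between whole-array and per-column calls starts to matter.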

FrancescAlted

Good job. I just have a small comment about dimensionality of NumPy arrays, although I don't think this is going to be too important for pandas.

FrancescAlted · 2025-05-25T08:38:43Z

src/blosc2/proxy.py

+        """
+        data = cls._ensure_numpy_data(data)
+        func = decorator(func)
+        if data.ndim in (1, 2):


Why this check? I am pretty sure that Blosc2 can handle arrays up to 8 dims (can be made larger by recompiling the underlying C-Blosc2 library). If that is not the case, this is a bug.

Good point. pandas should never send data with more than two dimensions. I added this check when I had separate if branches for 1D and 2D and wanted to raise for unknown cases, but it is indeed not needed. I have removed it now.

FrancescAlted · 2025-05-25T08:41:54Z

BTW, do you have some preliminary benchmarks to check speed-ups? Thanks for your time!

datapythonista · 2025-05-25T09:13:27Z

> BTW, do you have some preliminary benchmarks to check speed-ups? Thanks for your time!

Not at this point. I'm not sure about implementing proper benchmarking, since from the pandas side we'll just allow anyone to run the map and apply. What I would like to do is identify the best use cases for running pandas with Blosc2 (and for other known engines), and have them in the docs, with an execution time comparison. Is there any real world case you can think of where Blosc2 would be a good option to JIT a pandas Series or DataFrame?

FrancescAlted · 2025-05-25T09:51:14Z

Yes. Blosc2 uses numexpr behind the scenes, so it works well with any mathematical function (sin, log, exp), conditionals, and where (in the sense of numpy.where); see https://www.blosc.org/python-blosc2/reference/array_operations.html for the full list. It also supports most of the ufuncs implemented by NumPy, although their performance may not be as stellar as that of the operations supported by numexpr (listed above).
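As an illustration of the kind of function described here, the following sketch combines math functions with a conditional expressed through `numpy.where`. It runs on plain NumPy; `piecewise_signal` and its inputs are made-up names, and whether `blosc2.jit` accelerates this exact expression has not been checked here.

```python
import numpy as np

def piecewise_signal(x, threshold):
    # Math functions (log, sin, exp) plus a conditional via numpy.where
    return np.where(x > threshold, np.log(x) + np.sin(x), np.exp(-x))

x = np.linspace(0.1, 10.0, 5)
result = piecewise_signal(x, threshold=1.0)
print(result)
```

A function written this way takes whole arrays and returns a whole array, which is the vectorized style the engine is meant for.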

FrancescAlted · 2025-05-25T10:13:40Z

I forgot to mention that blosc2.jit is indeed intended to be used with vectorized numpy operations, so your implementation should work just fine.

datapythonista · 2025-05-25T10:25:00Z

Great, thanks for confirming. I'm wondering now if it'd be clearer for users if we just raise a NotImplementedError for map. I thought at first that it could make sense to run the vectorized function, but maybe that's confusing for users, if they try the same code with different engines. Maybe better to make engines always behave like the default engine, and raise if the exact behaviour is not supported. What do you think?

FrancescAlted · 2025-05-25T10:34:19Z

I think NotImplementedError for map in Blosc2 can make sense, as I expect pretty bad performance in that mode.

datapythonista · 2025-05-25T19:48:23Z

I updated the PR so the blosc2 engine passes the data to the UDF in the way pandas users expect, and map now raises NotImplementedError.
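A rough sketch of that behaviour, assuming a simplified engine interface: `ToyVectorEngine`, its method signatures, and the sample data are hypothetical and are not the code in src/blosc2/proxy.py or the pandas engine API. The point is only the shape of the logic: `apply` calls the function once per column or row, and `map` refuses to run.

```python
import numpy as np

class ToyVectorEngine:
    """Hypothetical engine for illustration, not the blosc2 implementation."""

    @staticmethod
    def map(data, func, decorator=None):
        # Element-wise mode is deliberately unsupported
        raise NotImplementedError("map is not supported by this engine")

    @staticmethod
    def apply(data, func, axis=0, decorator=None):
        data = np.asarray(data)
        if decorator is not None:
            func = decorator(func)
        if data.ndim == 1:
            return func(data)
        if axis == 0:
            # One call per column, like the default pandas engine
            return [func(data[:, i]) for i in range(data.shape[1])]
        # axis == 1: one call per row
        return [func(data[i, :]) for i in range(data.shape[0])]

data = np.array([[4.0, 9.0]] * 3)
column_sums = ToyVectorEngine.apply(data, np.sum, axis=0)
row_sums = ToyVectorEngine.apply(data, np.sum, axis=1)
print(column_sums, row_sums)
```

Raising in `map` rather than silently running a vectorized function keeps the same user code from producing different results under different engines.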

FrancescAlted

LGTM

FrancescAlted · 2025-05-29T04:59:53Z

Thank you Marc!

FrancescAlted · 2025-05-29T05:10:01Z

BTW, I see that pandas-dev/pandas#61467 has been merged into main. Does that mean we can start benchmarking with pandas main already? If so, can you suggest an existing benchmark that would be suitable for this? If there isn't one, I can do a quick benchmark with a largish dataframe in combination with some function. Thanks again!

datapythonista · 2025-05-30T19:57:11Z

I'm not sure about a particular benchmark. What I have in mind for the pandas docs is to find some use cases that are a good fit for the different JIT compilers, and benchmark those against not using a JIT compiler. I may be wrong, but I don't think it makes much sense to have a benchmark suite comparing Blosc2 with Numba or with Bodo.ai, since from what I know they solve very different use cases.

I'm not sure if it's easy, but personally I'd like to use real-world cases for that. Instead of using a sin, a multiplication, a square root..., find a formula that computes something, and some data where it makes sense to use it, and show that. For the pandas UDF docs I've been using the formula to convert Celsius to Fahrenheit. It's probably too simple to show the benefits of blosc2.jit, but it's a start. Maybe we can use it with a large temperature dataset, together with a more complex use case.
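For reference, the Celsius to Fahrenheit conversion mentioned above can be written as a vectorized function like the sketch below. The function name and sample values are illustrative, and no JIT engine is used here; it only shows the shape of the formula on a pandas Series.

```python
import numpy as np
import pandas

def celsius_to_fahrenheit(celsius):
    # F = C * 9 / 5 + 32, applied element-wise
    return celsius * 9 / 5 + 32

temperatures = pandas.Series(np.array([-40.0, 0.0, 37.0, 100.0]), name="celsius")
fahrenheit = celsius_to_fahrenheit(temperatures)
print(fahrenheit.tolist())
```

With only one multiplication, one division and one addition per element, this leaves a JIT little to optimize, which is consistent with the remark that a more complex formula would make a better showcase.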

FrancescAlted · 2025-09-20T11:19:54Z

@datapythonista I am giving this UDF engine implementation a try, and I am seeing a strange regression with the forthcoming pandas 3.0.

With the following script:

import numpy as np
import pandas
import blosc2

df = pandas.DataFrame([[4., 9.]] * 3, columns=["A", "B"])

res = df.apply(np.sqrt, engine=blosc2.jit)
print(res)

the output for pandas 2.3.2 is:

     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

but for pandas-3.0.0.dev0+2416.g10a53051e7:

[[2. 3.]
 [2. 3.]
 [2. 3.]]

So current main in pandas returns a NumPy array instead of a DataFrame. Is this expected?
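Until that is clarified, one possible stopgap is to put the labels back by hand when a bare array comes out. The sketch below assumes the array has the same shape as the input; `as_dataframe` is a hypothetical helper, and the array is produced with `np.sqrt` directly rather than through any engine.

```python
import numpy as np
import pandas

def as_dataframe(result, like):
    # Rebuild labels only when a bare array of matching shape comes back
    if isinstance(result, np.ndarray) and result.shape == like.shape:
        return pandas.DataFrame(result, index=like.index, columns=like.columns)
    return result

df = pandas.DataFrame([[4.0, 9.0]] * 3, columns=["A", "B"])

# Stand-in for the bare array reported above; no engine is involved here
raw = np.sqrt(df.to_numpy())
res = as_dataframe(raw, like=df)
print(res)
```

This only papers over the symptom on the caller side; it does not say anything about which return type pandas intends.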

FrancescAlted · 2025-09-20T11:25:13Z

Also, I am trying to combine two columns of the dataframe in an expression, but I cannot figure out how to do it. For example, this does not work:

import numpy as np
import pandas
import blosc2

df = pandas.DataFrame([[4., 9.]] * 3, columns=["A", "B"])

# res = df.apply(np.sqrt, engine=blosc2.jit)
res = df.apply(lambda x: x.A + x.B, engine=blosc2.jit)
print(res)

which errors (using 3.0.0 main branch here) with:

/Users/faltet/miniforge3/envs/blosc2/bin/python /Users/faltet/blosc/python-blosc2/prova-pandas2.py 
Traceback (most recent call last):
  File "/Users/faltet/blosc/python-blosc2/prova-pandas2.py", line 8, in <module>
    res = df.apply(lambda x: x.A + x.B, engine=blosc2.jit)
  File "/Users/faltet/software/pandas/pandas/core/frame.py", line 10784, in apply
    result = engine.__pandas_udf__.apply(
        data=data,
    ...<4 lines>...
        axis=axis,
    )
  File "/Users/faltet/blosc/python-blosc2/src/blosc2/proxy.py", line 751, in apply
    result = [func(data[:, row_idx], *args, **kwargs) for row_idx in range(data.shape[1])]
              ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/faltet/blosc/python-blosc2/src/blosc2/proxy.py", line 682, in wrapper
    retval = func(*new_args, **func_kwargs)
  File "/Users/faltet/blosc/python-blosc2/prova-pandas2.py", line 8, in <lambda>
    res = df.apply(lambda x: x.A + x.B, engine=blosc2.jit)
                             ^^^
AttributeError: 'SimpleProxy' object has no attribute 'A'

Process finished with exit code 1

Can you suggest the way (if any) to send different columns to the UDF engine?

datapythonista · 2025-09-21T00:08:47Z

I don't think returning a numpy array from apply is expected. But I'm not working on pandas anymore, since there was a lot of pushback in the project on anything I was trying to work on, including this API. I guess it's a bug, but it's better to open an issue in case it's a new expected behavior I don't understand. As for the exception, the API is quite badly designed (my work on trying to make it clearer was also often blocked, and I discontinued it), but I think you need axis=1 on apply to get the data row by row.
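For comparison, this is how the same expression behaves with the default pandas engine (no `engine` argument), plus a vectorized alternative that takes one array per column. It is a sketch only: `add_columns` is an illustrative name, and nothing here goes through blosc2.

```python
import pandas

df = pandas.DataFrame([[4.0, 9.0]] * 3, columns=["A", "B"])

# Default engine, axis=1: each call receives a row as a Series,
# so attribute access by column label works
by_row = df.apply(lambda row: row.A + row.B, axis=1)

# Vectorized alternative: a function of two arrays, one per column
def add_columns(a, b):
    return a + b

vectorized = add_columns(df["A"].to_numpy(), df["B"].to_numpy())

print(by_row.tolist(), vectorized.tolist())
```

With axis=0 the function receives one column at a time, so there is no `A` or `B` to look up on it, which matches the AttributeError in the traceback above.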


FrancescAlted · 2025-09-21T08:51:20Z

Sad to hear that you discontinued your work on pandas. OK, I'll open a ticket so that this doesn't fall through the cracks. For the record, specifying axis=1 raises an error in numba (?):

/Users/faltet/miniforge3/envs/blosc2/bin/python /Users/faltet/blosc/python-blosc2/prova-pandas2.py 
Traceback (most recent call last):
  File "/Users/faltet/blosc/python-blosc2/prova-pandas2.py", line 8, in <module>
    res = df.apply(lambda x: x.A + x.B, axis=1, engine=blosc2.jit)
  File "/Users/faltet/miniforge3/envs/blosc2/lib/python3.13/site-packages/pandas/core/frame.py", line 10381, in apply
    return op.apply().__finalize__(self, method="apply")
           ~~~~~~~~^^
  File "/Users/faltet/miniforge3/envs/blosc2/lib/python3.13/site-packages/pandas/core/apply.py", line 916, in apply
    return self.apply_standard()
           ~~~~~~~~~~~~~~~~~~~^^
  File "/Users/faltet/miniforge3/envs/blosc2/lib/python3.13/site-packages/pandas/core/apply.py", line 1065, in apply_standard
    results, res_index = self.apply_series_numba()
                         ~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/faltet/miniforge3/envs/blosc2/lib/python3.13/site-packages/pandas/core/apply.py", line 1099, in apply_series_numba
    results = self.apply_with_numba()
  File "/Users/faltet/miniforge3/envs/blosc2/lib/python3.13/site-packages/pandas/core/apply.py", line 1320, in apply_with_numba
    res = dict(nb_func(self.values, columns, index))
               ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/faltet/miniforge3/envs/blosc2/lib/python3.13/site-packages/numba/core/dispatcher.py", line 424, in _compile_for_args
    error_rewrite(e, 'typing')
    ~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/Users/faltet/miniforge3/envs/blosc2/lib/python3.13/site-packages/numba/core/dispatcher.py", line 365, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function <lambda> at 0x105570a40>) found for signature:
 
 >>> <lambda>(series(float64, index([unichr x 1], C), int64))
 
There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'register_jitable.<locals>.wrap.<locals>.ov_wrap': File: numba/core/extending.py: Line 161.
    With argument(s): '(series(float64, index([unichr x 1], C), int64))':
   Rejected as the implementation raised a specific error:
     TypingError: Failed in nopython mode pipeline (step: nopython frontend)
   Unknown attribute 'A' of type series(float64, index([unichr x 1], C), int64)
   
   File "prova-pandas2.py", line 8:
   # res = df.apply(np.sqrt, engine=blosc2.jit)
   res = df.apply(lambda x: x.A + x.B, axis=1, engine=blosc2.jit)
   ^
   
   During: typing of get attribute at /Users/faltet/blosc/python-blosc2/prova-pandas2.py (8)
   
   File "prova-pandas2.py", line 8:
   # res = df.apply(np.sqrt, engine=blosc2.jit)
   res = df.apply(lambda x: x.A + x.B, axis=1, engine=blosc2.jit)
   ^
   
   During: Pass nopython_type_inference
  raised from /Users/faltet/miniforge3/envs/blosc2/lib/python3.13/site-packages/numba/core/typeinfer.py:1074

During: resolving callee type: Function(<function <lambda> at 0x105570a40>)
During: typing of call at /Users/faltet/miniforge3/envs/blosc2/lib/python3.13/site-packages/pandas/core/apply.py (1302)


File "../../miniforge3/envs/blosc2/lib/python3.13/site-packages/pandas/core/apply.py", line 1302:
        def numba_func(values, col_names_index, index):
            <source elided>
                )
                results[i] = jitted_udf(ser)
                ^

During: Pass nopython_type_inference

Process finished with exit code 1

Add support for new pandas UDF engine

bc153c5

FrancescAlted reviewed May 25, 2025

Removing unneeded check

b334981

Making the blosc2 engine respect the default pandas udf signatures

5913fba

Fix typo breaking tests

c0e9602

FrancescAlted approved these changes May 29, 2025

FrancescAlted merged commit 173c438 into Blosc:main May 29, 2025
10 checks passed
