Commit c5283dc

pandas.eval() for large dataframes, cpython function call overhead, np.int vs int (#226)
* pandas eval() with link, and func overhead

  Mention pandas.eval() with link, and small talk discussion on python function call overhead (especially cpython < 3.11). Using pypy is not supported by many libraries, but it would probably be even faster for fibonacci. Short mention about max numpy int and max python int

* Update pandas.rst obj pandas.eval()

* Update pandas.rst
1 parent 70685de commit c5283dc

1 file changed (+46, -0 lines changed)

content/pandas.rst

Lines changed: 46 additions & 0 deletions
@@ -431,11 +431,51 @@ Exercises 3
 Beyond the basics
 -----------------

+Operations on large DataFrames can be faster with :func:`~pandas.eval`, which
+evaluates a string expression (see `the Python Data Science Handbook
+<https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html>`__)::
+
+    import numpy as np
+    import pandas as pd
+
+    nrows, ncols = 100000, 100
+    rng = np.random.RandomState(42)
+    # Four DataFrames of uniform random numbers, all of the same shape
+    df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
+                          for i in range(4))
+
+Adding the DataFrames with the ordinary ``+`` operator takes::
+
+    %timeit df1 + df2 + df3 + df4
+    # 80 ms
+
+The same sum through :func:`~pandas.eval` takes about half as long::
+
+    %timeit pd.eval('df1 + df2 + df3 + df4')
+    # 40 ms
+
+
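The timing comparison above is only meaningful if both forms produce the same values. The following is a minimal sketch of that check on much smaller frames; the helper name ``eval_matches_plain_sum`` and its default sizes are illustrative assumptions, not part of the lesson.

```python
import numpy as np
import pandas as pd


def eval_matches_plain_sum(nrows=1000, ncols=10, seed=42):
    """Compare `df1 + df2 + df3 + df4` with the pd.eval() version.

    Hypothetical helper for illustration; sizes are kept small on purpose.
    """
    rng = np.random.RandomState(seed)
    df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                          for _ in range(4))

    # Ordinary operator chain: each `+` produces an intermediate DataFrame.
    plain = df1 + df2 + df3 + df4

    # String expression: pd.eval() resolves df1..df4 from the calling scope
    # and evaluates the parsed expression with its selected engine
    # (numexpr if it is installed, plain Python otherwise).
    evaluated = pd.eval('df1 + df2 + df3 + df4')

    return np.allclose(plain.to_numpy(), evaluated.to_numpy())


print(eval_matches_plain_sum())
```

Any speed-up depends on the frame size, the machine, and whether ``numexpr`` is available, so it is worth re-running ``%timeit`` on your own data rather than relying on the figures quoted above.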
+A list returned by a function can be assigned directly as a DataFrame column::
+
+    def fibo(n):
+        """Return the Fibonacci numbers F(0)..F(n) as a list.
+
+        Filling a list iteratively avoids the overhead of recursive
+        function calls.
+        """
+        if n < 0:
+            raise NotImplementedError('Not defined for negative values')
+        elif n < 2:
+            # F(0) = 0 and F(1) = 1, so the list is [0] or [0, 1]
+            return list(range(n + 1))
+        memo = [0] * (n + 1)
+        memo[1] = 1
+        for i in range(2, n + 1):
+            memo[i] = memo[i - 1] + memo[i - 2]
+        return memo
+
+    df = pd.DataFrame({'Generation': np.arange(100)})
+    df['Number of Rabbits'] = fibo(99)
+
+
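To see what "overhead from recursive function calls" means in practice, the sketch below times a naive recursive Fibonacci against a list-based one with the standard ``timeit`` module. The names ``fib_recursive`` and ``fib_list`` are hypothetical and chosen only for this illustration; no timings are quoted because they depend on the machine and the Python version.

```python
import timeit


def fib_recursive(n):
    """Naive recursion: the number of calls grows exponentially with n."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_list(n):
    """List-based version: a single loop and no recursive calls."""
    memo = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        memo[i] = memo[i - 1] + memo[i - 2]
    return memo[n]


n = 20
# Both versions must agree before their timings are worth comparing.
assert fib_recursive(n) == fib_list(n)

t_rec = timeit.timeit(lambda: fib_recursive(n), number=10)
t_list = timeit.timeit(lambda: fib_list(n), number=10)
print(f"recursive: {t_rec:.4f} s, list-based: {t_list:.4f} s")
```

Most of the gap comes from the recursive version recomputing the same values many times, with each recomputation paying for a Python function call.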
 There is much more to Pandas than what we covered in this lesson. Whatever your
 needs are, chances are good there is a function somewhere in its `API
 <https://pandas.pydata.org/docs/>`__. And when there is not, you can always
 apply your own functions to the data using :obj:`~pandas.DataFrame.apply`::

+
     from functools import lru_cache

     @lru_cache
@@ -451,6 +491,12 @@ apply your own functions to the data using :obj:`~pandas.DataFrame.apply`::

     df = pd.DataFrame({'Generation': np.arange(100)})
     df['Number of Rabbits'] = df['Generation'].apply(fib)
+
+
+Note that NumPy integers have a fixed width of at most 64 bits (``int64``), while
+Python ints are unbounded (limited only by available memory). The 99th Fibonacci
+number does not fit in an ``int64``, so the result would be wrong if it were
+stored as NumPy integers. The type of ``df['Number of Rabbits'][99]`` produced by
+both functions above is in fact ``<class 'int'>``.
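The overflow can be reproduced without pandas. The sketch below uses a small helper, ``fib_python`` (a hypothetical name for this illustration), to locate where Fibonacci numbers stop fitting in an ``int64`` and shows that adding ``int64`` arrays wraps around instead of raising an error.

```python
import numpy as np


def fib_python(n):
    """n-th Fibonacci number using unbounded Python ints."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


INT64_MAX = np.iinfo(np.int64).max  # 2**63 - 1

# F(92) still fits in an int64, F(93) does not.
fits = fib_python(92) <= INT64_MAX
too_big = fib_python(93) > INT64_MAX

# Adding int64 arrays wraps around silently on overflow.
f91 = np.array([fib_python(91)], dtype=np.int64)
f92 = np.array([fib_python(92)], dtype=np.int64)
wrapped = int((f91 + f92)[0])

print(fits, too_big)
print("Python int F(93):", fib_python(93))
print("int64 result    :", wrapped)
```

The wrapped value differs from the true one by exactly ``2**64``, which is why keeping the column as Python ints gives the correct 99th value.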


 .. keypoints::
