Commit 7577d05

Merge pull request #324 from ashwinvis/update-pandas
Update pandas
2 parents 0163088 + 46d57bb commit 7577d05

1 file changed: +44 -16 lines changed

content/pandas.rst

Lines changed: 44 additions & 16 deletions
@@ -456,7 +456,7 @@ Exercises 3

 nobel.groupby(['bornCountry', 'category']).size()

-- (Optional) Create a pivot table to view a spreadsheet like structure, and view it
+- **(Optional)** Create a pivot table to view a spreadsheet like structure, and view it

 - First add a column “number” to the nobel dataframe containing 1’s
 (to enable the counting below). We need to make a copy of
@@ -467,15 +467,17 @@ Exercises 3

 - Then create the :meth:`~pandas.DataFrame.pivot_table`::

-table = subset.pivot_table(values="number", index="bornCountry", columns="category", aggfunc=np.sum)
+table = subset.pivot_table(
+    values="number", index="bornCountry", columns="category", aggfunc="sum"
+)

-- (Optional) Install the **seaborn** visualization library if you don't
+- **(Optional)** Install the ``seaborn`` visualization library if you don't
 already have it, and create a heatmap of your table::

 import seaborn as sns
 sns.heatmap(table,linewidths=.5);

-- Play around with other nice looking plots::
+- **(Optional)** Play around with other nice looking plots::

 sns.violinplot(y=subset["year"].dt.year, x="bornCountry", inner="stick", data=subset);
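The ``aggfunc`` change in the hunk above can be checked on a toy table. The sketch below assumes a small, made-up ``subset`` frame (the country and category values are hypothetical, not taken from the Nobel dataset) and shows the string alias ``"sum"`` counting rows per country and category. As far as I know, pandas 2.1 and later emit a ``FutureWarning`` when a NumPy callable such as ``np.sum`` is passed instead, which is presumably why the alias is preferred here.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's `subset` dataframe.
subset = pd.DataFrame(
    {
        "bornCountry": ["USA", "USA", "France", "France", "USA"],
        "category": ["physics", "chemistry", "physics", "physics", "physics"],
    }
)
# Helper column of ones, so that summing it counts the rows in each cell.
subset["number"] = 1

table = subset.pivot_table(
    values="number", index="bornCountry", columns="category", aggfunc="sum"
)
print(table)
```

Combinations that never occur (France and chemistry in this toy data) show up as ``NaN`` in the resulting table.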

@@ -485,8 +487,14 @@ Exercises 3

 ::

-subset_physchem = nobel.loc[nobel['bornCountry'].isin(countries) & (nobel['category'].isin(['physics']) | nobel['category'].isin(['chemistry']))]
-sns.catplot(x="bornCountry", y="year", col="category", data=subset_physchem, kind="swarm");
+subset_physchem = nobel.loc[
+    nobel['bornCountry'].isin(countries) & (
+        nobel['category'].isin(['physics']) | nobel['category'].isin(['chemistry'])
+    )
+]
+sns.catplot(
+    x="bornCountry", y="year", col="category", data=subset_physchem, kind="swarm"
+);

 ::

@@ -503,13 +511,13 @@ Exercises 3
 ::

 nobel.bornCountryCode.describe()
-# count 956
-# unique 81
+# count 969
+# unique 82
 # top US
-# freq 287
+# freq 292

 We see that the US has received the largest number of Nobel prizes,
-and 81 countries are represented.
+and 82 countries are represented.

 To calculate the age at which laureates receive their prize, we need
 to ensure that the "year" and "born" columns are in datetime format::
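The code that follows this sentence in the file is outside the hunk, so here is only a minimal sketch of the two steps the text describes, not the tutorial's own code. It assumes a tiny, made-up ``nobel`` frame with the same column names; the values are hypothetical, and the ``age`` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Made-up rows; only the column names mirror the tutorial's dataframe.
nobel = pd.DataFrame(
    {
        "bornCountryCode": ["US", "US", "FR", None],
        "born": ["1920-03-01", "1930-07-15", "1950-01-20", "1900-12-31"],
        "year": [1970, 1975, 1986, 1950],
    }
)

# For a text column, describe() reports count (non-null), unique, top and freq.
summary = nobel["bornCountryCode"].describe()
print(summary)

# Convert both columns to datetime before doing date arithmetic.
nobel["born"] = pd.to_datetime(nobel["born"])
nobel["year"] = pd.to_datetime(nobel["year"], format="%Y")

# Approximate age at the award as a difference of calendar years.
nobel["age"] = nobel["year"].dt.year - nobel["born"].dt.year
print(nobel["age"].tolist())
```

The year difference is a deliberate simplification; subtracting the two datetime columns and converting the resulting days to years would give a fractional age instead.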
@@ -530,10 +538,20 @@ Exercises 3
 Beyond the basics
 -----------------

-Larger DataFrame operations might be faster using :func:`~pandas.eval` with string expressions, `see
-<https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html>`__::
+Faster expression evaluation with :func:`~pandas.eval`
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Larger DataFrame operations might be faster using :func:`~pandas.eval` with string expressions (`see
+here <https://pandas.pydata.org/docs/user_guide/enhancingperf.html#eval-performance-comparison>`__).
+To do so, we start by installing ``numexpr``, a Python library which optimizes such expressions::
+
+%conda install numexpr
+
+You may need to restart the kernel in Jupyter for this to take effect. Then::

 import pandas as pd
+import numpy as np
+
 # Make some really big dataframes
 nrows, ncols = 100000, 100
 rng = np.random.RandomState(42)
@@ -547,9 +565,11 @@ Adding dataframes the pythonic way yields::

 And by using :func:`~pandas.eval`::

-%timeit pd.eval('df1 + df2 + df3 + df4')
+%timeit pd.eval('df1 + df2 + df3 + df4', engine='numexpr')
 # 40ms

+Assigning columns with :meth:`~pandas.DataFrame.apply`
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 We can assign function return lists as dataframe columns::
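On the ``engine='numexpr'`` change in the hunk above: that argument only works when ``numexpr`` is importable. The sketch below is one possible way to fall back to the pure-Python engine; the fallback logic is my assumption, not part of the commit, and the frames are much smaller than the tutorial's. It also checks that both routes give the same result. No timings are shown because they depend on the machine.

```python
import numpy as np
import pandas as pd

# Use numexpr when it is installed, otherwise fall back to the Python engine.
try:
    import numexpr  # noqa: F401
    engine = "numexpr"
except ImportError:
    engine = "python"

# Small frames for illustration; the tutorial uses 100000 x 100.
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(1000, 10)) for _ in range(4))

plain = df1 + df2 + df3 + df4
evaluated = pd.eval("df1 + df2 + df3 + df4", engine=engine)

print(engine, np.allclose(plain, evaluated))
```

Leaving ``engine`` unset has a similar effect, since :func:`~pandas.eval` then picks ``numexpr`` when available and the Python engine otherwise.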

@@ -597,9 +617,17 @@ apply your own functions to the data using :obj:`~pandas.DataFrame.apply`::


 Note that the numpy precision for integers caps at int64 while python ints are unbounded --
-limited by memory size. Thus, the result from fibonacci(99) would be erroneous when
-using numpy ints. The type of df['Number of Rabbits'][99] given by both functions above
-is in fact <class 'int'>.
+limited by memory size. Thus, the result from ``fibonacci(99)`` would be erroneous when
+using numpy ints. The type of ``df['Number of Rabbits'][99]`` given by both functions above
+is in fact ``<class 'int'>``.
+
+
+.. seealso::
+
+- `Modern Pandas <https://tomaugspurger.net/posts/modern-1-intro/>`__ (2020) -- a blog series
+on writing modern idiomatic pandas.
+- `Python Data Science Handbook <https://jakevdp.github.io/PythonDataScienceHandbook/index.html>`__ (2016) --
+which contains a chapter on `Data Manipulation with Pandas`.
+

 Alternatives to Pandas
 ----------------------
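The integer-precision note in the hunk above can be verified directly. The ``fibonacci`` function below is a hypothetical stand-in, since the tutorial's own definition is not part of this diff and may index the sequence differently. Under the convention F(0) = 0, F(1) = 1, the 99th value no longer fits in a signed 64-bit integer, while a Python ``int`` holds it exactly.

```python
import numpy as np


def fibonacci(n):
    """Return the n-th Fibonacci number (F(0) = 0, F(1) = 1) as a Python int."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


int64_max = np.iinfo(np.int64).max
exact = fibonacci(99)

print(type(exact))        # <class 'int'>
print(exact > int64_max)  # True

# Converting the exact value to a NumPy int64 is refused outright.
try:
    np.int64(exact)
    overflowed = False
except OverflowError as err:
    overflowed = True
    print("does not fit in int64:", err)
```

With this indexing the last value that still fits in ``int64`` is ``fibonacci(92)``.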
