@@ -456,7 +456,7 @@ Exercises 3
456456
457457 nobel.groupby(['bornCountry', 'category']).size()
458458
459- - (Optional) Create a pivot table to view a spreadsheet like structure, and view it
459+ - **(Optional)** Create a pivot table to view a spreadsheet like structure, and view it
460460
461461 - First add a column “number” to the nobel dataframe containing 1’s
462462 (to enable the counting below). We need to make a copy of
@@ -467,15 +467,17 @@ Exercises 3
467467
468468 - Then create the :meth:`~pandas.DataFrame.pivot_table`::
469469
470- table = subset.pivot_table(values="number", index="bornCountry", columns="category", aggfunc=np.sum)
470+ table = subset.pivot_table(
471+ values="number", index="bornCountry", columns="category", aggfunc="sum"
472+ )
471473
472- - (Optional) Install the **seaborn** visualization library if you don't
474+ - **(Optional)** Install the ``seaborn`` visualization library if you don't
473475 already have it, and create a heatmap of your table::
474476
475477 import seaborn as sns
476478 sns.heatmap(table,linewidths=.5);
477479
478- - Play around with other nice looking plots::
480+ - **(Optional)** Play around with other nice looking plots::
479481
480482 sns.violinplot(y=subset["year"].dt.year, x="bornCountry", inner="stick", data=subset);
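
The pivot-table step in the list above can be tried on a toy frame first. This is a minimal sketch, assuming a made-up ``toy`` dataframe in place of the real ``nobel`` subset; ``fill_value=0`` is an optional addition that replaces missing combinations with zero. It also checks that the pivot table matches the ``groupby(...).size()`` counts shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for the Nobel subset (not the real data).
toy = pd.DataFrame({
    "bornCountry": ["France", "France", "Germany", "Germany", "France"],
    "category": ["physics", "chemistry", "physics", "physics", "physics"],
})
toy["number"] = 1  # helper column of ones, so that summing counts rows

# Countries as rows, categories as columns, counts as cell values.
table = toy.pivot_table(
    values="number", index="bornCountry", columns="category",
    aggfunc="sum", fill_value=0,
)

# The same counts without the helper column.
counts = toy.groupby(["bornCountry", "category"]).size().unstack(fill_value=0)

print(table)
```

Passing the aggregation as the string ``"sum"`` avoids handing a NumPy function to pandas, which is the change this diff makes.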
481483
@@ -485,8 +487,14 @@ Exercises 3
485487
486488 ::
487489
488- subset_physchem = nobel.loc[nobel['bornCountry'].isin(countries) & (nobel['category'].isin(['physics']) | nobel['category'].isin(['chemistry']))]
489- sns.catplot(x="bornCountry", y="year", col="category", data=subset_physchem, kind="swarm");
490+ subset_physchem = nobel.loc[
491+ nobel['bornCountry'].isin(countries) & (
492+ nobel['category'].isin(['physics']) | nobel['category'].isin(['chemistry'])
493+ )
494+ ]
495+ sns.catplot(
496+ x="bornCountry", y="year", col="category", data=subset_physchem, kind="swarm"
497+ );
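
As a side note on the filter above, a single ``isin`` call with both categories selects the same rows as the two calls joined with ``|``. The sketch below illustrates this on a made-up frame; ``toy`` and the contents of ``countries`` are assumptions for illustration, not the tutorial's data.

```python
import pandas as pd

# Hypothetical data standing in for the nobel dataframe.
toy = pd.DataFrame({
    "bornCountry": ["France", "Japan", "France", "Russia"],
    "category": ["physics", "chemistry", "peace", "physics"],
})
countries = ["France", "Russia"]  # assumed list of countries of interest

# Two isin calls combined with | ...
two_calls = toy["category"].isin(["physics"]) | toy["category"].isin(["chemistry"])
# ... give the same mask as one isin call with both values.
one_call = toy["category"].isin(["physics", "chemistry"])

# Combine with the country mask; parentheses are not needed around one_call.
subset = toy.loc[toy["bornCountry"].isin(countries) & one_call]
print(subset)
```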
490498
491499 ::
492500
@@ -503,13 +511,13 @@ Exercises 3
503511 ::
504512
505513 nobel.bornCountryCode.describe()
506- # count 956
507- # unique 81
514+ # count 969
515+ # unique 82
508516 # top US
509- # freq 287
517+ # freq 292
510518
511519 We see that the US has received the largest number of Nobel prizes,
512- and 81 countries are represented.
520+ and 82 countries are represented.
513521
514522 To calculate the age at which laureates receive their prize, we need
515523 to ensure that the "year" and "born" columns are in datetime format::
@@ -530,10 +538,20 @@ Exercises 3
530538Beyond the basics
531539-----------------
532540
533- Larger DataFrame operations might be faster using :func:`~pandas.eval` with string expressions, `see
534- <https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html>`__::
541+ Faster expression evaluation with :func:`~pandas.eval`
542+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
543+
544+ Larger DataFrame operations might be faster using :func:`~pandas.eval` with string expressions (`see
545+ here <https://pandas.pydata.org/docs/user_guide/enhancingperf.html#eval-performance-comparison>`__).
546+ To do so, we start by installing ``numexpr``, a Python library that optimizes such expressions::
547+
548+ %conda install numexpr
549+
550+ You may need to restart the kernel in Jupyter for this to take effect. Then::
535551
536552 import pandas as pd
553+ import numpy as np
554+
537555 # Make some really big dataframes
538556 nrows, ncols = 100000, 100
539557 rng = np.random.RandomState(42)
@@ -547,9 +565,11 @@ Adding dataframes the pythonic way yields::
547565
548566And by using :func:`~pandas.eval`::
549567
550- %timeit pd.eval('df1 + df2 + df3 + df4')
568+ %timeit pd.eval('df1 + df2 + df3 + df4', engine='numexpr')
551569 # 40ms
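
Before comparing timings, it can be worth confirming that both approaches give the same result. The sketch below does this on much smaller frames than the ones above, so it runs quickly. It leaves ``engine`` at its default, which uses ``numexpr`` when it is installed and plain Python evaluation otherwise, and it passes the frames through ``local_dict`` rather than relying on variables in scope. The ``frames`` dictionary is an assumption made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Small frames so the check runs quickly; the tutorial uses 100000 x 100.
frames = {f"df{i}": pd.DataFrame(rng.rand(1000, 10)) for i in range(1, 5)}

# Names in the expression are looked up in local_dict.
via_eval = pd.eval("df1 + df2 + df3 + df4", local_dict=frames)
plain = frames["df1"] + frames["df2"] + frames["df3"] + frames["df4"]

same = np.allclose(via_eval, plain)
print(same)
```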
552570
571+ Assigning columns with :meth:`~pandas.DataFrame.apply`
572+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
553573
554574We can assign function return lists as dataframe columns::
555575
@@ -597,9 +617,17 @@ apply your own functions to the data using :obj:`~pandas.DataFrame.apply`::
597617
598618
599619Note that the numpy precision for integers caps at int64 while python ints are unbounded --
600- limited by memory size. Thus, the result from fibonacci(99) would be erroneous when
601- using numpy ints. The type of df['Number of Rabbits'][99] given by both functions above
602- is in fact <class 'int'>.
620+ limited by memory size. Thus, the result from ``fibonacci(99)`` would be erroneous when
621+ using numpy ints. The type of ``df['Number of Rabbits'][99]`` given by both functions above
622+ is in fact ``<class 'int'>``.
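
To make the int64 limit concrete, here is a minimal sketch comparing a Fibonacci computed with Python ints against one stored in an ``int64`` array. The helper names ``fib_python`` and ``fib_int64`` are hypothetical and are not the ``fibonacci`` function used above; the indexing convention (``F(0) = 0``, ``F(1) = 1``) is also an assumption.

```python
import numpy as np

def fib_python(n):
    """Fibonacci with Python ints, which grow as needed."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fib_int64(n):
    """Fibonacci stored in a fixed-width int64 array; wraps around on overflow."""
    seq = np.zeros(max(n + 1, 2), dtype=np.int64)
    seq[1] = 1
    for i in range(2, n + 1):
        seq[i] = seq[i - 2:i].sum()  # array arithmetic wraps silently
    return seq[n]

limit = np.iinfo(np.int64).max  # 2**63 - 1

print(fib_python(92) <= limit, int(fib_int64(92)) == fib_python(92))
print(fib_python(93) > limit, int(fib_int64(99)) == fib_python(99))
```

Under this convention the two versions agree up to index 92 and diverge from index 93 onwards, where the exact value no longer fits in 64 bits.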
623+
624+ .. seealso::
625+
626+ - `Modern Pandas <https://tomaugspurger.net/posts/modern-1-intro/>`__ (2020) -- a blog series
627+ on writing modern idiomatic pandas.
628+ - `Python Data Science Handbook <https://jakevdp.github.io/PythonDataScienceHandbook/index.html>`__ (2016) --
629+ which contains a chapter on `Data Manipulation with Pandas`.
630+
603631
604632Alternatives to Pandas
605633----------------------