Commit fe61dcf

Merge pull request #266 from AaltoSciComp/rkdarst/pandas-revisions
content/pandas: Basic revision, no major changes.
2 parents e83d0a9 + 4258754 commit fe61dcf

1 file changed

+49
-30
lines changed

content/pandas.rst

Lines changed: 49 additions & 30 deletions
@@ -30,8 +30,10 @@ material, including:
3030
- a `cheatsheet <https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf>`__
3131
- a `cookbook <https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook>`__.
3232

33-
Let's get a flavor of what we can do with pandas. We will be working with an
34-
example dataset containing the passenger list from the Titanic, which is often used in Kaggle competitions and data science tutorials. First step is to load pandas::
33+
A quick Pandas preview
34+
----------------------
35+
36+
Let's get a flavor of what we can do with pandas (you won't be able to follow everything yet). We will be working with an example dataset containing the passenger list from the Titanic, which is often used in Kaggle competitions and data science tutorials. First step is to load pandas::
3537

3638
import pandas as pd
3739

@@ -48,6 +50,8 @@ print some summary statistics of its numerical data::
4850
# print the first 5 lines of the dataframe
4951
titanic.head()
5052

53+
::
54+
5155
# print summary statistics for each column
5256
titanic.describe()
5357

@@ -85,6 +89,8 @@ Clearly, pandas dataframes allows us to do advanced analysis with very few comma
8589
- Write a function name followed by question mark and execute the cell, e.g.
8690
write ``titanic.hist?`` and hit ``SHIFT + ENTER``.
8791
- Write the function name and hit ``SHIFT + TAB``.
92+
- Right click and select "Show contextual help". This tab will
93+
update with help for anything you click.
8894

8995

9096
What's in a dataframe?
@@ -112,7 +118,10 @@ and reading the titanic.csv datafile into a dataframe if needed, see above)::
112118

113119
titanic["Age"]
114120
titanic.Age # same as above
115-
type(titanic["Age"])
121+
122+
::
123+
124+
type(titanic["Age"]) # a pandas Series object
116125

117126
The columns have names. Here's how to get them (:attr:`~pandas.DataFrame.columns`)::
118127

@@ -123,10 +132,11 @@ However, the rows also have names! This is what Pandas calls the :obj:`~pandas.D
123132
titanic.index
124133

125134
We saw above how to select a single column, but there are many ways of
126-
selecting (and setting) single or multiple rows, columns and values. We can
127-
refer to columns and rows either by number or by their name
128-
(:attr:`~pandas.DataFrame.loc`, :attr:`~pandas.DataFrame.iloc`,
129-
:attr:`~pandas.DataFrame.at`, :attr:`~pandas.DataFrame.iat`)::
135+
selecting (and setting) single or multiple rows, columns and
136+
values. We can refer to columns and rows either by their name
137+
(:attr:`~pandas.DataFrame.loc`, :attr:`~pandas.DataFrame.at`) or by
138+
their index (:attr:`~pandas.DataFrame.iloc`,
139+
:attr:`~pandas.DataFrame.iat`)::
130140

131141
titanic.loc['Lam, Mr. Ali',"Age"] # select single value by row and column
132142
titanic.loc[:'Lam, Mr. Ali',"Survived":"Age"] # slice the dataframe by row and column *names*
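
The name-versus-position distinction described in this hunk can be sketched on a small stand-in table. The toy data and its row labels below are hypothetical and are not the Titanic dataset; only ``loc``, ``iloc``, ``at`` and ``iat`` come from the lesson.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic table (not the real dataset).
df = pd.DataFrame(
    {"Survived": [0, 1, 1], "Age": [22.0, 38.0, 26.0]},
    index=["Braund", "Cumings", "Heikkinen"],
)

by_label = df.loc["Cumings", "Age"]      # row and column selected by name
by_position = df.iloc[1, 1]              # the same cell selected by integer position
scalar_label = df.at["Cumings", "Age"]   # scalar-only access by name
scalar_position = df.iat[1, 1]           # scalar-only access by position

# Label slices include the end label; positional slices exclude the end position.
label_slice = df.loc[:"Cumings", "Survived":"Age"]
position_slice = df.iloc[:1, 0:2]
print(label_slice.shape, position_slice.shape)
```

Note that the two slices look similar but select different numbers of rows, because ``loc`` slicing is inclusive of the end label.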
@@ -193,7 +203,7 @@ Exercises 1
193203

194204
and::
195205

196-
titanic[titanic["Age"] < titanic["Age"].mean()]["Survived"].mean()
206+
titanic[titanic["Age"] < titanic["Age"].mean()]["Survived"].mean()
197207

198208

199209
Tidy data
@@ -253,10 +263,12 @@ Pandas also understands multiple other formats, for example using :obj:`~pandas.
253263

254264
But sometimes you would want to create a dataframe from scratch. Also this can be done
255265
in multiple ways, for example starting with a numpy array (see
256-
:class:`~pandas.DataFrame` docs::
266+
:class:`~pandas.DataFrame` docs)::
257267

268+
import numpy as np
258269
dates = pd.date_range('20130101', periods=6)
259270
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
271+
df
260272

261273
or a dictionary (see same docs)::
262274

@@ -265,6 +277,7 @@ or a dictionary (see same docs)::
265277
'C': np.array([3] * 8, dtype='int32'),
266278
'D': np.random.randn(8),
267279
'E': np.random.randn(8)})
280+
df
268281

269282
There are many ways to operate on dataframes. Let's look at a
270283
few examples in order to get a feeling of what's possible
@@ -347,13 +360,13 @@ Exercises 2
347360
``read_csv``, so we use :attr:`pandas.DataFrame.index` to get
348361
the names. So, names of members of largest family(ies)::
349362

350-
titanic[titanic["SibSp"] == 8].index
363+
titanic[titanic["SibSp"] == 8].index
351364

352365
- Histogram of family size based on fare class::
353366

354-
titanic.hist("SibSp",
355-
lambda x: "Poor" if titanic["Fare"].loc[x] < titanic["Fare"].mean() else "Rich",
356-
rwidth=0.9)
367+
titanic.hist("SibSp",
368+
lambda x: "Poor" if titanic["Fare"].loc[x] < titanic["Fare"].mean() else "Rich",
369+
rwidth=0.9)
357370

358371

359372

@@ -458,7 +471,7 @@ Exercises 3
458471

459472
- Play around with other nice looking plots::
460473

461-
sns.violinplot(y="year", x="bornCountry", inner="stick", data=subset);
474+
sns.violinplot(y=subset["year"].dt.year, x="bornCountry", inner="stick", data=subset);
462475

463476
::
464477

@@ -476,12 +489,15 @@ Exercises 3
476489

477490
.. solution::
478491

492+
Below are solutions for the basic steps; solutions for the advanced steps are
493+
inline above.
494+
479495
We use the :meth:`describe` method:
480-
496+
481497
::
482498

483-
nobel.bornCountryCode.describe()
484-
# count 956
499+
nobel.bornCountryCode.describe()
500+
# count 956
485501
# unique 81
486502
# top US
487503
# freq 287
@@ -504,14 +520,15 @@ Exercises 3
504520
We can print names of all laureates from a given country, e.g.::
505521

506522
nobel[nobel["country"] == "Sweden"].loc[:, "firstname":"surname"]
507-
523+
508524
Beyond the basics
509525
-----------------
510526

511527
Larger DataFrame operations might be faster using :func:`~pandas.eval` with string expressions, `see
512528
<https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html>`__::
513529

514530
import pandas as pd
531+
# Make some really big dataframes
515532
nrows, ncols = 100000, 100
516533
rng = np.random.RandomState(42)
517534
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
@@ -521,17 +538,17 @@ Adding dataframes the pythonic way yields::
521538

522539
%timeit df1 + df2 + df3 + df4
523540
# 80ms
524-
541+
525542
And by using :func:`~pandas.eval`::
526543

527-
%timeit pd.eval('df1 + df2 + df3 + df4')
544+
%timeit pd.eval('df1 + df2 + df3 + df4')
528545
# 40ms
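
As a quick sanity check, the following sketch confirms that the string expression gives the same result as ordinary addition. It assumes much smaller frames than the lesson uses, so it is not a benchmark. Any speedup is likely to depend on whether the optional ``numexpr`` package is installed.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# Far smaller than the lesson's 100000 x 100 frames; only used to compare results.
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(1000, 10)) for _ in range(4))

plain = df1 + df2 + df3 + df4                   # ordinary operator chain
evaluated = pd.eval("df1 + df2 + df3 + df4")    # same expression passed as a string

# Compare with a tolerance, since the evaluation engine may differ.
same = np.allclose(plain, evaluated)
print(same)
```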
529546

530-
547+
531548
We can assign function return lists as dataframe columns::
532549

533550
def fibo(n):
534-
"""Compute Fibonacci numbers. Here we skip the overhead from the
551+
"""Compute Fibonacci numbers. Here we skip the overhead from the
535552
recursive function calls by using a list. """
536553
if n < 0:
537554
raise NotImplementedError('Not defined for negative values')
@@ -545,12 +562,14 @@ We can assign function return lists as dataframe columns::
545562
return memo
546563

547564
df = pd.DataFrame({'Generation': np.arange(100)})
548-
df['Number of Rabbits'] = fibo(99)
549-
550-
565+
df['Number of Rabbits'] = fibo(99) # Assigns list to column
566+
567+
551568
There is much more to Pandas than what we covered in this lesson. Whatever your
552569
needs are, chances are good there is a function somewhere in its `API
553-
<https://pandas.pydata.org/docs/>`__. And when there is not, you can always
570+
<https://pandas.pydata.org/docs/>`__. You should try to get good at
571+
searching the web for an example showing what you can do. And when
572+
there is not, you can always
554573
apply your own functions to the data using :obj:`~pandas.DataFrame.apply`::
555574

556575

@@ -569,10 +588,10 @@ apply your own functions to the data using :obj:`~pandas.DataFrame.apply`::
569588

570589
df = pd.DataFrame({'Generation': np.arange(100)})
571590
df['Number of Rabbits'] = df['Generation'].apply(fib)
572-
573-
574-
Note that the numpy precision for integers caps at int64 while python ints are unbounded --
575-
limited by memory size. Thus, the result from fibonacci(99) would be erroneous when
591+
592+
593+
Note that the numpy precision for integers caps at int64 while python ints are unbounded --
594+
limited by memory size. Thus, the result from fibonacci(99) would be erroneous when
576595
using numpy ints. The type of df['Number of Rabbits'][99] given by both functions above
577596
is in fact <class 'int'>.
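
A minimal sketch of the int64 limit mentioned above. The helper ``fib_list`` is a hypothetical stand-in for the lesson's functions, and it assumes the sequence starts at F(0) = 0 and F(1) = 1. It finds the first Fibonacci number that no longer fits in int64 and shows that pandas keeps exact Python ints for such a column.

```python
import numpy as np
import pandas as pd

def fib_list(n):
    """Return [F(0), ..., F(n)] as exact Python ints (illustrative helper)."""
    values = [0, 1]
    while len(values) <= n:
        values.append(values[-1] + values[-2])
    return values[: n + 1]

exact = fib_list(99)
int64_max = np.iinfo(np.int64).max

# Index of the first Fibonacci number that exceeds the int64 range.
first_too_big = next(i for i, v in enumerate(exact) if v > int64_max)

# The values do not fit in int64, so pandas stores them as Python objects.
column = pd.Series(exact)
print(first_too_big, column.dtype, type(column[99]))
```

Forcing such values into a fixed-width integer type would either fail or wrap around, which is why the exact Python ints matter here.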
578597
