Commit 472e232

Merge pull request #239 from AaltoSciComp/rkdarst/pandas-update
content/pandas: Misc updates, mainly formatting and linking
2 parents 804bac5 + 73d6d9e commit 472e232

1 file changed

+54
-20
lines changed

content/pandas.rst

Lines changed: 54 additions & 20 deletions
Original file line number | Diff line number | Diff line change
@@ -175,11 +175,25 @@ Exercises 1
175175

176176
.. solution::
177177

178-
- Mean age of the first 10 passengers: ``titanic.iloc[:10,:]["Age"].mean()``
179-
or ``titanic.loc[:"Nasser, Mrs. Nicholas (Adele Achem)","Age"].mean()`` or ``titanic.iloc[:10,4].mean()``.
180-
- Survival rate among passengers over and under average age:
181-
``titanic[titanic["Age"] > titanic["Age"].mean()]["Survived"].mean()`` and
182-
``titanic[titanic["Age"] < titanic["Age"].mean()]["Survived"].mean()``.
178+
- Mean age of the first 10 passengers::
179+
180+
titanic.iloc[:10,:]["Age"].mean()
181+
182+
or::
183+
184+
titanic.loc[:"Nasser, Mrs. Nicholas (Adele Achem)","Age"].mean()
185+
186+
or::
187+
188+
titanic.iloc[:10,4].mean()
189+
190+
- Survival rate among passengers over and under average age::
191+
192+
titanic[titanic["Age"] > titanic["Age"].mean()]["Survived"].mean()
193+
194+
and::
195+
196+
titanic[titanic["Age"] < titanic["Age"].mean()]["Survived"].mean()
183197

184198

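The solution above mixes position-based and label-based selection with boolean masks. A minimal sketch on a made-up four-row frame (the names and values in ``toy`` are hypothetical, not taken from the Titanic data) shows that the variants agree:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic dataframe, indexed by name
toy = pd.DataFrame(
    {"Age": [20.0, 30.0, 40.0, 50.0], "Survived": [1, 1, 1, 0]},
    index=["Ann", "Bob", "Cat", "Dan"],
)

# Position-based: first two rows, then the Age column
mean_iloc = toy.iloc[:2]["Age"].mean()
# Label-based: a loc slice includes its end label
mean_loc = toy.loc[:"Bob", "Age"].mean()

# Boolean masks split the rows around the average age
avg_age = toy["Age"].mean()
older = toy[toy["Age"] > avg_age]["Survived"].mean()
younger = toy[toy["Age"] < avg_age]["Survived"].mean()

print(mean_iloc, mean_loc, older, younger)
```

Note that ``iloc[:2]`` excludes position 2, while ``loc[:"Bob"]`` includes the row labelled ``"Bob"``, which is why both select the same two rows here.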
185199
Tidy data
@@ -238,12 +252,13 @@ Pandas also understands multiple other formats, for example using :obj:`~pandas.
238252
:obj:`~pandas.DataFrame.to_csv`, :obj:`~pandas.DataFrame.to_excel`, :obj:`~pandas.DataFrame.to_hdf`, :obj:`~pandas.DataFrame.to_json`, etc.)
239253

240254
But sometimes you want to create a dataframe from scratch. This can also be done
241-
in multiple ways, for example starting with a numpy array::
255+
in multiple ways, for example starting with a numpy array (see
256+
:class:`~pandas.DataFrame` docs)::
242257

243258
dates = pd.date_range('20130101', periods=6)
244259
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
245260

246-
or a dictionary::
261+
or a dictionary (see same docs)::
247262

248263
df = pd.DataFrame({'A': ['dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'dog'],
249264
'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
@@ -255,11 +270,10 @@ There are many ways to operate on dataframes. Let's look at a
255270
few examples in order to get a feeling of what's possible
256271
and what the use cases can be.
257272

258-
We can easily split and concatenate or append dataframes::
273+
We can easily split and :func:`concatenate <pandas.concat>` dataframes::
259274

260275
sub1, sub2, sub3 = df[:2], df[2:4], df[4:]
261276
pd.concat([sub1, sub2, sub3])
262-
sub1.append([sub2, sub3]) # same as above
263277

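The removed ``append`` line matches upstream pandas: ``DataFrame.append`` was deprecated in pandas 1.4 and removed in 2.0, so :func:`pandas.concat` is the way to rejoin pieces. A minimal sketch, using a small hypothetical frame rather than the lesson's ``df``:

```python
import pandas as pd

# Hypothetical frame standing in for the lesson's df
df = pd.DataFrame({"A": ["dog", "cat", "dog", "cat", "dog", "cat"],
                   "C": range(6)})

# Split into three pieces by row position
sub1, sub2, sub3 = df[:2], df[2:4], df[4:]

# concat keeps the original index labels, so the result matches df
rejoined = pd.concat([sub1, sub2, sub3])

# ignore_index=True renumbers the rows from 0 instead
renumbered = pd.concat([sub3, sub1], ignore_index=True)

print(rejoined.equals(df))
print(list(renumbered.index))
```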
264278
When pulling data from multiple dataframes, a powerful :obj:`pandas.DataFrame.merge` method is
265279
available that acts similarly to merging in SQL. Say we have a dataframe containing the age of some athletes::
@@ -313,20 +327,33 @@ Exercises 2
313327
In the Titanic passenger list dataset,
314328
investigate the family size of the passengers (i.e. the "SibSp" column).
315329

316-
- What different family sizes exist in the passenger list? Hint: try the :obj:`~pandas.Series.unique` method
330+
- What different family sizes exist in the passenger list? Hint: try the :meth:`~pandas.Series.unique` method
317331
- What are the names of the people in the largest family group?
318332
- (Advanced) Create histograms showing the distribution of family sizes for
319333
passengers split by the fare, i.e. one group of high-fare passengers (where
320334
the fare is above average) and one for low-fare passengers
321335
(Hint: instead of an existing column name, you can give a lambda function
322-
as a parameter to ``hist`` to compute a value on the fly. For example
336+
as a parameter to :meth:`~pandas.DataFrame.hist` to compute a value on the fly. For example
323337
``lambda x: "Poor" if df["Fare"].loc[x] < df["Fare"].mean() else "Rich"``).
324338

325-
.. solution::
326-
327-
- Existing family sizes: ``df["SibSp"].unique()``
328-
- Names of members of largest family(ies): ``df[df["SibSp"] == 8]["Name"]``
329-
- ``df.hist("SibSp", lambda x: "Poor" if df["Fare"].loc[x] < df["Fare"].mean() else "Rich", rwidth=0.9)``
339+
.. solution::
340+
341+
- Existing family sizes::
342+
343+
titanic["SibSp"].unique()
344+
345+
- The largest family size found above is 8. There is no ``Name`` column, since we
346+
made ``Name`` the index when we loaded the dataframe with
347+
``read_csv``, so we use :attr:`pandas.DataFrame.index` to get
348+
the names. So, names of members of largest family(ies)::
349+
350+
titanic[titanic["SibSp"] == 8].index
351+
352+
- Histogram of family size based on fare class::
353+
354+
titanic.hist("SibSp",
355+
lambda x: "Poor" if titanic["Fare"].loc[x] < titanic["Fare"].mean() else "Rich",
356+
rwidth=0.9)
330357

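The grouping lambda in the solution can be checked without plotting by passing the same function to ``groupby``, which calls it on each index label. This is a sketch under assumptions: ``toy`` and its values are hypothetical, and ``groupby`` is swapped in for ``hist`` purely to inspect the groups.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic dataframe, with Name as the index
toy = pd.DataFrame(
    {"SibSp": [0, 1, 8, 8], "Fare": [10.0, 20.0, 70.0, 100.0]},
    index=pd.Index(["Ann", "Bob", "Cat", "Dan"], name="Name"),
)

# Distinct family sizes and the names belonging to the largest one
sizes = toy["SibSp"].unique()
largest = toy["SibSp"].max()
names = toy[toy["SibSp"] == largest].index

# The function receives one index label at a time and returns a group key
fare_mean = toy["Fare"].mean()
groups = toy.groupby(
    lambda name: "Poor" if toy["Fare"].loc[name] < fare_mean else "Rich"
)
counts = groups["SibSp"].count()

print(sizes, list(names), counts.to_dict())
```

Computing ``fare_mean`` once outside the lambda avoids recalculating the mean for every row.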
331358

332359

@@ -347,19 +374,22 @@ a browser and download the file, or use the JupyterLab interface by clicking
347374

348375
We can then load and explore the data::
349376

377+
# File → Open from URL → enter https://api.nobelprize.org/v1/laureate.csv
378+
# This opens it in JupyterLab but also saves it as laureate.csv
350379
nobel = pd.read_csv("laureate.csv")
351380
nobel.head()
352381

353382
This dataset has three columns for time, "born"/"died" and "year".
354383
These are represented as strings and integers, respectively, and
355-
need to be converted to datetime format::
384+
need to be converted to datetime format. :func:`pandas.to_datetime`
385+
makes this easy::
356386

357387
# the errors='coerce' argument is needed because the dataset is a bit messy
358388
nobel["born"] = pd.to_datetime(nobel["born"], errors='coerce')
359389
nobel["died"] = pd.to_datetime(nobel["died"], errors='coerce')
360390
nobel["year"] = pd.to_datetime(nobel["year"], format="%Y")
361391

362-
Pandas knows a lot about dates::
392+
Pandas knows a lot about dates (using :attr:`~pandas.Series.dt`)::
363393

364394
print(nobel["born"].dt.day)
365395
print(nobel["born"].dt.year)
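# A minimal sketch of what errors='coerce' does, assuming a hypothetical
# series of date strings (not the real laureate data): values that cannot
# be parsed become NaT instead of raising an error.
import pandas as pd

raw = pd.Series(["1901-03-27", "0000-00-00", "1879-03-14"])
born = pd.to_datetime(raw, errors="coerce")

# The .dt accessor works on the converted column; NaT rows give NaN
years = born.dt.year
print(born.isna().tolist())
print(years.tolist())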
@@ -410,7 +440,11 @@ Exercises 3
410440
- (Optional) Create a pivot table to view a spreadsheet like structure, and view it
411441

412442
- First add a column “number” to the nobel dataframe containing 1’s
413-
(to enable the counting below).
443+
(to enable the counting below). We need to make a copy of
444+
``subset``, because right now it is only a view::
445+
446+
subset = subset.copy()
447+
subset.loc[:, 'number'] = 1
414448

415449
- Then create the :meth:`~pandas.DataFrame.pivot_table`::
416450

@@ -424,7 +458,7 @@ Exercises 3
424458

425459
- Play around with other nice looking plots::
426460

427-
sns.violinplot(y="year", x="bornCountry",inner="stick", data=subset);
461+
sns.violinplot(y="year", x="bornCountry", inner="stick", data=subset);
428462

429463
::
430464

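The copy-then-assign step added to Exercises 3 can be tried on a small frame. This is a sketch under assumptions: the rows, the country filter, and the ``pivot_table`` arguments below are illustrative and are not the lesson's own solution.

```python
import pandas as pd

# Hypothetical rows; the real laureate data has many more columns
nobel = pd.DataFrame({
    "bornCountry": ["France", "France", "Germany", "USA"],
    "category": ["physics", "chemistry", "physics", "physics"],
})

# An explicit copy makes the subset independent of the parent frame,
# so adding a column does not trigger SettingWithCopyWarning
subset = nobel[nobel["bornCountry"].isin(["France", "Germany"])].copy()
subset.loc[:, "number"] = 1

# Summing the 1's counts laureates per country and category
table = subset.pivot_table(values="number", index="bornCountry",
                           columns="category", aggfunc="sum")
print(table)
```

Combinations with no rows (here Germany and chemistry) show up as ``NaN`` in the table, and the parent ``nobel`` frame is left without the ``number`` column.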