@@ -175,11 +175,25 @@ Exercises 1
175175
176176 .. solution::
177177
178- - Mean age of the first 10 passengers: ``titanic.iloc[:10,:]["Age"].mean()``
179- or ``titanic.loc[:"Nasser, Mrs. Nicholas (Adele Achem)","Age"].mean()`` or ``titanic.iloc[:10,4].mean()``.
180- - Survival rate among passengers over and under average age:
181- ``titanic[titanic["Age"] > titanic["Age"].mean()]["Survived"].mean()`` and
182- ``titanic[titanic["Age"] < titanic["Age"].mean()]["Survived"].mean()``.
178+ - Mean age of the first 10 passengers::
179+
180+ titanic.iloc[:10,:]["Age"].mean()
181+
182+ or::
183+
184+ titanic.loc[:"Nasser, Mrs. Nicholas (Adele Achem)","Age"].mean()
185+
186+ or::
187+
188+ titanic.iloc[:10,4].mean()
189+
190+ - Survival rate among passengers over and under average age::
191+
192+ titanic[titanic["Age"] > titanic["Age"].mean()]["Survived"].mean()
193+
194+ and::
195+
196+ titanic[titanic["Age"] < titanic["Age"].mean()]["Survived"].mean()
183197
184198
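The three spellings in the solution above rely on the difference between position-based and label-based slicing. A minimal sketch on a hypothetical toy frame (the names, ages and column order below are made up for illustration and are not the Titanic data):

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic dataframe (names as index).
toy = pd.DataFrame(
    {"Survived": [0, 1, 1, 0], "Age": [22.0, 38.0, 26.0, 35.0]},
    index=["Braund", "Cumings", "Heikkinen", "Allen"],
)

# Position-based slice: the end position is excluded, so this takes 2 rows.
mean_iloc = toy.iloc[:2]["Age"].mean()

# Label-based slice: the end label is included, so this also takes 2 rows.
mean_loc = toy.loc[:"Cumings", "Age"].mean()

# Column picked by position (column 1 is "Age" in this toy frame only).
mean_pos = toy.iloc[:2, 1].mean()

print(mean_iloc, mean_loc, mean_pos)
```

Selecting a column by position, as in the last line, breaks silently if the column order changes, so the label-based forms are usually the safer choice.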
185199Tidy data
@@ -238,12 +252,13 @@ Pandas also understands multiple other formats, for example using :obj:`~pandas.
238252:obj:`~pandas.DataFrame.to_csv`, :obj:`~pandas.DataFrame.to_excel`, :obj:`~pandas.DataFrame.to_hdf`, :obj:`~pandas.DataFrame.to_json`, etc.)
239253
240254But sometimes you would want to create a dataframe from scratch. Also this can be done
241- in multiple ways, for example starting with a numpy array::
255+ in multiple ways, for example starting with a numpy array (see
256+ :class:`~pandas.DataFrame` docs)::
242257
243258 dates = pd.date_range('20130101', periods=6)
244259 df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
245260
246- or a dictionary::
261+ or a dictionary (see the same docs)::
247262
248263 df = pd.DataFrame({'A': ['dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'dog'],
249264 'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
@@ -255,11 +270,10 @@ There are many ways to operate on dataframes. Let's look at a
255270few examples in order to get a feeling of what's possible
256271and what the use cases can be.
257272
258- We can easily split and concatenate or append dataframes::
273+ We can easily split and :func:`concatenate <pandas.concat>` dataframes::
259274
260275 sub1, sub2, sub3 = df[:2], df[2:4], df[4:]
261276 pd.concat([sub1, sub2, sub3])
262- sub1.append([sub2, sub3]) # same as above
263277
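This change drops the ``sub1.append(...)`` line; ``DataFrame.append`` was deprecated in pandas 1.4 and removed in pandas 2.0, so :func:`pandas.concat` is the remaining way to stack dataframes. A minimal sketch, using a hypothetical cut-down version of ``df`` (the column values are made up for illustration):

```python
import pandas as pd

# Hypothetical small frame; only the shape of the operation matters here.
df = pd.DataFrame({"A": ["dog", "cat", "dog", "cat", "dog", "cat"],
                   "C": [0, 1, 2, 3, 4, 5]})

# Split by row position, then stack the pieces back together.
sub1, sub2, sub3 = df[:2], df[2:4], df[4:]
rebuilt = pd.concat([sub1, sub2, sub3])

# The pieces keep their original index labels; ignore_index=True renumbers
# the rows of the result from 0.
renumbered = pd.concat([sub3, sub1], ignore_index=True)

print(rebuilt.equals(df))
print(renumbered)
```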
264278When pulling data from multiple dataframes, a powerful :obj:`pandas.DataFrame.merge` method is
265279available that acts similarly to merging in SQL. Say we have a dataframe containing the age of some athletes::
@@ -313,20 +327,33 @@ Exercises 2
313327 In the Titanic passenger list dataset,
314328 investigate the family size of the passengers (i.e. the "SibSp" column).
315329
316- - What different family sizes exist in the passenger list? Hint: try the :obj:`~pandas.Series.unique` method
330+ - What different family sizes exist in the passenger list? Hint: try the :meth:`~pandas.Series.unique` method
317331 - What are the names of the people in the largest family group?
318332 - (Advanced) Create histograms showing the distribution of family sizes for
319333 passengers split by the fare, i.e. one group of high-fare passengers (where
320334 the fare is above average) and one for low-fare passengers
321335 (Hint: instead of an existing column name, you can give a lambda function
322- as a parameter to ``hist`` to compute a value on the fly. For example
336+ as a parameter to :meth:`~pandas.DataFrame.hist` to compute a value on the fly. For example
323337 ``lambda x: "Poor" if df["Fare"].loc[x] < df["Fare"].mean() else "Rich"``).
324338
325- .. solution::
326-
327- - Existing family sizes: ``df["SibSp"].unique()``
328- - Names of members of largest family(ies): ``df[df["SibSp"] == 8]["Name"]``
329- - ``df.hist("SibSp", lambda x: "Poor" if df["Fare"].loc[x] < df["Fare"].mean() else "Rich", rwidth=0.9)``
339+ .. solution::
340+
341+ - Existing family sizes::
342+
343+ titanic["SibSp"].unique()
344+
345+ - The largest value found above is 8. There is no ``Name`` column, since we
346+ made ``Name`` the index when we loaded the dataframe with
347+ ``read_csv``, so we use :attr:`pandas.DataFrame.index` to get
348+ the names. So, names of members of largest family(ies)::
349+
350+ titanic[titanic["SibSp"] == 8].index
351+
352+ - Histogram of family size based on fare class::
353+
354+ titanic.hist("SibSp",
355+ lambda x: "Poor" if titanic["Fare"].loc[x] < titanic["Fare"].mean() else "Rich",
356+ rwidth=0.9)
330357
331358
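The lambda in the histogram solution is called with each index label and returns the group that row belongs to. The same grouping logic can be sketched without plotting by passing a function to ``groupby``; the toy frame and the helper name ``fare_class`` below are hypothetical and only illustrate the mechanism:

```python
import pandas as pd

# Hypothetical toy data: passenger labels as index, made-up values.
toy = pd.DataFrame(
    {"SibSp": [1, 0, 8, 8, 0], "Fare": [7.0, 70.0, 69.0, 69.0, 10.0]},
    index=["A", "B", "C", "D", "E"],
)
mean_fare = toy["Fare"].mean()

def fare_class(name):
    # Receives one index label at a time and returns that row's group key.
    return "Poor" if toy.loc[name, "Fare"] < mean_fare else "Rich"

# Count how often each family size occurs within each fare class.
counts = toy.groupby(fare_class)["SibSp"].value_counts()

# With the names stored in the index, a boolean mask plus .index gives
# the members of the largest family group.
largest = toy[toy["SibSp"] == toy["SibSp"].max()].index

print(counts)
print(list(largest))
```

Computing the mean once outside the function avoids recomputing it for every row, which the one-line lambda does.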
332359
@@ -347,19 +374,22 @@ a browser and download the file, or use the JupyterLab interface by clicking
347374
348375We can then load and explore the data::
349376
377+ # File → Open from URL → enter https://api.nobelprize.org/v1/laureate.csv
378+ # This opens it in JupyterLab but also saves it as laureate.csv
350379 nobel = pd.read_csv("laureate.csv")
351380 nobel.head()
352381
353382This dataset has three columns for time, "born"/"died" and "year".
354383These are represented as strings and integers, respectively, and
355- need to be converted to datetime format::
384+ need to be converted to datetime format. :func:`pandas.to_datetime`
385+ makes this easy::
356386
357387 # the errors='coerce' argument is needed because the dataset is a bit messy
358388 nobel["born"] = pd.to_datetime(nobel["born"], errors='coerce')
359389 nobel["died"] = pd.to_datetime(nobel["died"], errors='coerce')
360390 nobel["year"] = pd.to_datetime(nobel["year"], format="%Y")
361391
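What ``errors='coerce'`` does can be seen on a small example. This is a sketch on hypothetical data (the date strings below are made up, including the unparseable one) rather than on the real ``laureate.csv``:

```python
import pandas as pd

# Hypothetical messy input: the second date cannot be parsed.
raw = pd.DataFrame({"born": ["1845-03-27", "0000-00-00", "1852-12-15"],
                    "year": [1901, 1902, 1903]})

# Unparseable entries become NaT instead of raising an error.
raw["born"] = pd.to_datetime(raw["born"], errors="coerce")

# Integer years are interpreted with an explicit format string.
raw["year"] = pd.to_datetime(raw["year"], format="%Y")

print(raw)
print(raw["born"].isna().sum())
```

Counting the ``NaT`` values afterwards is a quick check of how much of the column could not be converted.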
362- Pandas knows a lot about dates::
392+ Pandas knows a lot about dates (using :attr:`~pandas.Series.dt`)::
363393
364394 print(nobel["born"].dt.day)
365395 print(nobel["born"].dt.year)
@@ -410,7 +440,11 @@ Exercises 3
410440 - (Optional) Create a pivot table to view a spreadsheet like structure, and view it
411441
412442 - First add a column “number” to the nobel dataframe containing 1’s
413- (to enable the counting below).
443+ (to enable the counting below). We need to make a copy of
444+ ``subset``, because right now it is only a view::
445+
446+ subset = subset.copy()
447+ subset.loc[:, 'number'] = 1
414448
415449 - Then create the :meth:`~pandas.DataFrame.pivot_table`::
416450
@@ -424,7 +458,7 @@ Exercises 3
424458
425459 - Play around with other nice looking plots::
426460
427- sns.violinplot(y="year", x="bornCountry",inner="stick", data=subset);
461+ sns.violinplot(y="year", x="bornCountry", inner="stick", data=subset);
428462
429463 ::
430464