Commit be835d7

Ch3 fig cleanup (#103)
* figure polishing for ch3
* more ch3 figures
1 parent f284c6e commit be835d7

6 files changed: 27 additions, 35 deletions

source/img/code-figures.pptx

193 KB
Binary file not shown.
-160 KB
-32.7 KB
79.7 KB
-55.5 KB

source/wrangling.md

Lines changed: 27 additions & 35 deletions
@@ -138,8 +138,6 @@ region = pd.Series(["Toronto", "Montreal", "Vancouver", "Calgary", "Ottawa"])
138138
region
139139
```
140140

141-
<!-- **(FIGURE 14 NEEDS UPDATING: (a) ZERO-BASED INDEXING, (b) TYPE SHOULD BE STRING (NOT CHARACTER))** -->
142-
143141
+++ {"tags": []}
144142

145143
```{figure} img/wrangling/pandas_dataframe_series.png
@@ -194,7 +192,7 @@ It is important in Python to make sure you represent your data with the correct
194192
Many of the `pandas` functions we use in this book treat
195193
the various data types differently. You should use `int` and `float` types
196194
to represent numbers and perform arithmetic. The `int` type is for integers that have no decimal point,
197-
while the `float` type is for numbers that have a decimal point.
195+
while the `float` type is for numbers that have a decimal point.
198196
The `bool` type is for boolean variables that can only take on one of two values: `True` or `False`.
199197
The `string` type is used to represent data that should
200198
be thought of as "text", such as words, names, paths, URLs, and more.
@@ -480,8 +478,6 @@ in the `melt` function to accomplish this data transformation.
480478

481479
+++ {"tags": []}
482480

483-
**(FIGURE UPDATE NEEDED TO MATCH THE CODE BELOW)**
484-
485481
```{figure} img/wrangling/pandas_melt_args_labels.png
486482
:name: fig:img-pivot-longer
487483
:figclass: figure
@@ -986,7 +982,7 @@ with higher numbers of people who speak it as their primary language at home
986982
compared to French in Montréal, then we can use `[]` to obtain rows
987983
where the value of `most_at_home` is greater than
988984
{glue:text}`most_french`. We use the `>` symbol to look for values *above* a threshold,
989-
and the `<` symbol to look for values *below* a threshold. The `>=` and `<=`
985+
and the `<` symbol to look for values *below* a threshold. The `>=` and `<=`
990986
symbols similarly look for *equal to or above* a threshold and *equal to or below* a threshold.
991987

992988
```{code-cell} ipython3
@@ -1448,28 +1444,28 @@ so that we can convert them from `int64` to `int32`. We will use what is called
14481444
a `lambda` function in Python; `lambda` functions are just regular functions,
14491445
except that you don't need to give them a name.
14501446
That means you can pass them as an argument into `apply` easily!
1451-
Let's consider a simple example of a `lambda` function that
1447+
Let's consider a simple example of a `lambda` function that
14521448
multiplies a number by two.
14531449
```{code-cell} ipython3
14541450
lambda x: 2*x
14551451
```
1456-
We define a `lambda` function in the following way. We start with the syntax `lambda`, which is a special word
1452+
We define a `lambda` function in the following way. We start with the syntax `lambda`, which is a special word
14571453
that tells Python "what follows is
1458-
a function." Following this, we then state the name of the arguments of the function.
1454+
a function." Following this, we then state the name of the arguments of the function.
14591455
In this case, we just have one argument named `x`. After the list of arguments, we put a
14601456
colon `:`. And finally after the colon are the instructions: take the value provided and multiply it by 2.
14611457
Let's call our shiny new `lambda` function with the argument `2` (so the output should be `4`).
14621458
Just like a regular function, we pass its argument between parentheses `()`.
14631459
```{code-cell} ipython3
14641460
(lambda x: 2*x)(2)
14651461
```
1466-
> **Note:** Because we didn't give the `lambda` function a name, we have to surround it with
1462+
> **Note:** Because we didn't give the `lambda` function a name, we have to surround it with
14671463
> parentheses too if we want to call it. Otherwise, if we wrote something like `lambda x: 2*x(2)`, Python would get confused
14681464
> and think that `(2)` was part of the instructions that comprise the `lambda` function.
14691465
> As long as we don't want to call the `lambda` function ourselves, we don't need those parentheses. For example,
1470-
> we can pass a `lambda` function as an argument to `apply` without any parentheses.
1466+
> we can pass a `lambda` function as an argument to `apply` without any parentheses.
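To make the note above concrete, here is a minimal sketch in plain Python. The names `double` and `not_called` are introduced only for illustration and are not part of the chapter's code.

```python
# Giving the lambda a name (here `double`, a hypothetical name) lets us call
# it like any regular function.
double = lambda x: 2 * x
named_result = double(2)

# Calling an unnamed lambda requires wrapping it in parentheses first.
inline_result = (lambda x: 2 * x)(2)

# Without the outer parentheses, `(2)` becomes part of the function body:
# the body is `2 * x(2)`, so this line only defines a function and calls nothing.
not_called = lambda x: 2 * x(2)

print(named_result, inline_result, callable(not_called))
```

The first two results are both `4`, while `not_called` is still a function object rather than a number.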
14711467
1472-
Returning to our example, let's use `apply` to convert the columns `"mother_tongue":"lang_known"`
1468+
Returning to our example, let's use `apply` to convert the columns `"mother_tongue":"lang_known"`
14731469
to `int32`. To accomplish this we create a `lambda` function that takes one argument---a single column
14741470
of the data frame, which we will name `col`---and apply the `astype` method to it.
14751471
Then the `apply` method will use that `lambda` function on every column we specify via `loc[]`.
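A minimal sketch of this column-wise conversion, assuming a small made-up data frame `toy` in place of `region_lang` (the values are invented for illustration and are not census data):

```python
import pandas as pd

# Made-up stand-in for `region_lang`; the numbers are not census data.
toy = pd.DataFrame({
    "language": ["English", "French"],
    "mother_tongue": [100, 200],
    "most_at_home": [80, 150],
    "lang_known": [300, 400],
})

# Select the columns from "mother_tongue" through "lang_known" with `loc[]`,
# then let `apply` call the lambda once per selected column.
toy_int32 = toy.loc[:, "mother_tongue":"lang_known"].apply(
    lambda col: col.astype("int32")
)
print(toy_int32.dtypes)
```

Because the `language` column falls outside the selected slice, it is not part of `toy_int32`; only the three numeric columns are converted.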
@@ -1514,8 +1510,8 @@ region_lang_nums.apply(max, axis=1)
15141510

15151511
We see that we get a column, which is the maximum value between `mother_tongue`,
15161512
`most_at_home`, `most_at_work` and `lang_known` for each language
1517-
and region. It is often the case that we want to include a column result
1518-
from using `apply` row-wise as a new column in the data frame, so that we can make
1513+
and region. It is often the case that we want to include a column result
1514+
from using `apply` row-wise as a new column in the data frame, so that we can make
15191515
plots or continue our analysis. To make this happen,
15201516
we will use `assign` to create a new column. This is discussed in the next section.
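The difference between column-wise and row-wise `apply` can be sketched on a tiny made-up data frame. Here `toy_nums` is a hypothetical stand-in for `region_lang_nums`, and its values are invented:

```python
import pandas as pd

# Hypothetical stand-in for `region_lang_nums`; values are invented.
toy_nums = pd.DataFrame({
    "mother_tongue": [100, 200],
    "most_at_home": [80, 250],
    "most_at_work": [60, 90],
    "lang_known": [300, 220],
})

# Default (axis=0): `max` is applied to each column, giving one value per column.
col_max = toy_nums.apply(max)

# axis=1: `max` is applied to each row, giving one value per row.
row_max = toy_nums.apply(max, axis=1)

print(col_max.tolist())
print(row_max.tolist())
```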
15211517

@@ -1540,7 +1536,7 @@ with the new column added to it.
15401536

15411537
To use the `assign` method, we specify one argument for each column we want to create.
15421538
In this case we want to create one new column named `maximum`, so the argument
1543-
to `assign` begins with `maximum = `.
1539+
to `assign` begins with `maximum = `.
15441540
Then after the `=`, we specify what the contents of that new column
15451541
should be. In this case we use `apply` just as we did in the previous section to give us the maximum values.
15461542
Remember to specify `axis=1` in the `apply` method so that we compute the row-wise maximum value.
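As a hedged sketch of this pattern, the example below adds a `maximum` column to a small invented data frame and checks that the original is left unchanged. The names `toy`, `toy_nums`, and `with_max` and all values are assumptions for illustration, not the census data.

```python
import pandas as pd

# Invented data; `toy` plays the role of `region_lang`.
toy = pd.DataFrame({
    "language": ["English", "French"],
    "mother_tongue": [100, 200],
    "most_at_home": [80, 250],
})

# Numeric columns only, playing the role of `region_lang_nums`.
toy_nums = toy[["mother_tongue", "most_at_home"]]

# `assign` returns a new data frame with the extra column.
with_max = toy.assign(maximum=toy_nums.apply(max, axis=1))

print(list(with_max.columns))
print("maximum" in toy.columns)  # the original `toy` is not modified
```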
@@ -1550,7 +1546,7 @@ region_lang.assign(
15501546
maximum = region_lang_nums.apply(max, axis=1)
15511547
)
15521548
```
1553-
This gives us a new data frame that looks like the `region_lang` data frame,
1549+
This gives us a new data frame that looks like the `region_lang` data frame,
15541550
except that it has an additional column named `maximum`.
15551551
The `maximum` column contains
15561552
the maximum value between `mother_tongue`,
@@ -1575,7 +1571,7 @@ glue("toronto_popn", "{0:,.0f}".format(toronto_popn))
15751571
glue("prop_eng_tor", "{0:.2f}".format(number_most_home / toronto_popn))
15761572
```
15771573

1578-
As another example, we might ask the question: "What proportion of
1574+
As another example, we might ask the question: "What proportion of
15791575
the population reported English as their primary language at home in the 2016 census?"
15801576
For example, in Toronto, {glue:text}`number_most_home` people reported
15811577
speaking English as their primary language at home, and the
@@ -1597,43 +1593,39 @@ and name the new data frame `english_lang`.
15971593
```{code-cell} ipython3
15981594
:tags: ["output_scroll"]
15991595
english_lang = region_lang[
1600-
(region_lang["language"] == "English") &
1601-
(region_lang["region"].isin(five_cities["region"]))
1602-
]
1596+
(region_lang["language"] == "English") &
1597+
(region_lang["region"].isin(five_cities["region"]))
1598+
]
16031599
english_lang
16041600
```
16051601

16061602
Okay, now we have a data frame that pertains only to the English language
1607-
and the five cities mentioned earlier.
1603+
and the five cities mentioned earlier.
16081604
In order to compute the proportion of the population speaking English in each of these cities,
16091605
we need to add the population data from the `five_cities` data frame.
16101606
```{code-cell} ipython3
16111607
five_cities
16121608
```
1613-
The data frame above shows that the populations of the five cities in 2016 were
1609+
The data frame above shows that the populations of the five cities in 2016 were
16141610
5928040 (Toronto), 4098927 (Montréal), 2463431 (Vancouver), 1392609 (Calgary), and 1321426 (Edmonton).
1615-
We will add this information to our data frame in a new column named `city_pops` by using `assign`.
1616-
Once again we specify the new column name (`city_pops`) as the argument, followed by the equal symbol `=`,
1611+
We will add this information to our data frame in a new column named `city_pops` by using `assign`.
1612+
Once again we specify the new column name (`city_pops`) as the argument, followed by the equal symbol `=`,
16171613
and finally the data in the column.
16181614
Note that the order of the rows in the `english_lang` data frame is Montréal, Toronto, Calgary, Edmonton, Vancouver.
16191615
So we will create a column called `city_pops` where we list the populations of those cities in that
16201616
order, and add it to our data frame.
16211617
Also note that we write `english_lang = ` on the left so that the newly created data frame overwrites our
1622-
old `english_lang` data frame; remember that by default, like other `pandas` functions, `assign` does not
1618+
old `english_lang` data frame; remember that by default, like other `pandas` functions, `assign` does not
16231619
modify the original data frame directly!
16241620
```{code-cell} ipython3
16251621
:tags: ["output_scroll"]
16261622
english_lang = english_lang.assign(
1627-
city_pops=[4098927,
1628-
5928040,
1629-
1392609,
1630-
1321426,
1631-
2463431
1632-
])
1623+
city_pops=[4098927, 5928040, 1392609, 1321426, 2463431]
1624+
)
16331625
english_lang
16341626
```
16351627
> **Note**: Inserting data manually in this way is generally very error-prone and is not recommended.
1636-
> We do it here to demonstrate another usage of `assign` that does not involve `apply`.
1628+
> We do it here to demonstrate another usage of `assign` that does not involve `apply`.
16371629
> But in more advanced data wrangling,
16381630
> one would solve this problem in a less error-prone way using
16391631
> the `merge` function, which lets you combine two data frames. We will show you an
@@ -1645,8 +1637,8 @@ proportion of people who speak English the most at home by taking the ratio of t
16451637
```{code-cell} ipython3
16461638
:tags: ["output_scroll"]
16471639
english_lang.assign(
1648-
proportion=english_lang["most_at_home"]/english_lang["city_pops"]
1649-
)
1640+
proportion=english_lang["most_at_home"]/english_lang["city_pops"]
1641+
)
16501642
```
16511643

16521644

@@ -1737,7 +1729,7 @@ right order, and it could be easy to make a mistake this way. An alternative app
17371729
is to (1) create a new, empty data frame, (2) use `assign` to assign the city names and populations in that
17381730
data frame, and (3) use `merge` to combine the two data frames, recognizing that the "regions" are the same.
17391731

1740-
We create a new, empty data frame by calling `pd.DataFrame` with no arguments.
1732+
We create a new, empty data frame by calling `pd.DataFrame` with no arguments.
17411733
We then use `assign` to add the city names in a column called `"region"`
17421734
and their populations in a column called `"population"`.
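These three steps can be sketched as follows. The populations are the 2016 figures quoted in this chapter. The data frame `english_toy` is a hypothetical stand-in for `english_lang`, and its `most_at_home` counts are invented placeholders.

```python
import pandas as pd

# Hypothetical stand-in for `english_lang`; the counts are placeholders.
english_toy = pd.DataFrame({
    "region": ["Montréal", "Toronto", "Calgary", "Edmonton", "Vancouver"],
    "most_at_home": [1_000_000, 2_000_000, 500_000, 400_000, 800_000],
})

# (1) Start from an empty data frame, (2) add city names and populations.
city_data = pd.DataFrame().assign(
    region=["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"],
    population=[5928040, 4098927, 2463431, 1392609, 1321426],
)

# (3) Match rows on the shared "region" column, so the order in which the
# cities are listed in `city_data` no longer matters.
merged = english_toy.merge(city_data, on="region")
print(merged)
```

Note that the cities appear in a different order in the two data frames; `merge` pairs them by name rather than by position, which is what removes the risk of a manual ordering mistake.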
17431735
```{code-cell} ipython3
