@@ -138,8 +138,6 @@ region = pd.Series(["Toronto", "Montreal", "Vancouver", "Calgary", "Ottawa"])
 region
 ```

-<!-- **(FIGURE 14 NEEDS UPDATING: (a) ZERO-BASED INDEXING, (b) TYPE SHOULD BE STRING (NOT CHARACTER))** -->
-
 +++ {"tags": []}

 ```{figure} img/wrangling/pandas_dataframe_series.png
@@ -194,7 +192,7 @@ It is important in Python to make sure you represent your data with the correct
 Many of the `pandas` functions we use in this book treat
 the various data types differently. You should use `int` and `float` types
 to represent numbers and perform arithmetic. The `int` type is for integers that have no decimal point,
-while the `float` type is for numbers that have a decimal point.
+while the `float` type is for numbers that have a decimal point.
 The `bool` type is for boolean variables that can only take on one of two values: `True` or `False`.
 The `string` type is used to represent data that should
 be thought of as "text", such as words, names, paths, URLs, and more.
@@ -480,8 +478,6 @@ in the `melt` function to accomplish this data transformation.

 +++ {"tags": []}

-**(FIGURE UPDATE NEEDED TO MATCH THE CODE BELOW)**
-
 ```{figure} img/wrangling/pandas_melt_args_labels.png
 :name: fig:img-pivot-longer
 :figclass: figure
@@ -986,7 +982,7 @@ with higher numbers of people who speak it as their primary language at home
 compared to French in Montréal, then we can use `[]` to obtain rows
 where the value of `most_at_home` is greater than
 {glue:text}`most_french`. We use the `>` symbol to look for values *above* a threshold,
-and the `<` symbol to look for values *below* a threshold. The `>=` and `<=`
+and the `<` symbol to look for values *below* a threshold. The `>=` and `<=`
 symbols similarly look for *equal to or above* a threshold and *equal to or below* a threshold.


 ```{code-cell} ipython3
@@ -1448,28 +1444,28 @@ so that we can convert them from `int64` to `int32`. We will use what is called
 a `lambda` function in Python; `lambda` functions are just regular functions,
 except that you don't need to give them a name.
 That means you can pass them as an argument into `apply` easily!
-Let's consider a simple example of a `lambda` function that
+Let's consider a simple example of a `lambda` function that
 multiplies a number by two.
 ```{code-cell} ipython3
 lambda x: 2*x
 ```
-We define a `lambda` function in the following way. We start with the syntax `lambda`, which is a special word
+We define a `lambda` function in the following way. We start with the syntax `lambda`, which is a special word
 that tells Python "what follows is
-a function." Following this, we then state the name of the arguments of the function.
+a function." Following this, we then state the name of the arguments of the function.
 In this case, we just have one argument named `x`. After the list of arguments, we put a
 colon `:`. And finally after the colon are the instructions: take the value provided and multiply it by 2.
 Let's call our shiny new `lambda` function with the argument `2` (so the output should be `4`).
 Just like a regular function, we pass its argument between parentheses `()` symbols.
 ```{code-cell} ipython3
 (lambda x: 2*x)(2)
 ```
-> **Note:** Because we didn't give the `lambda` function a name, we have to surround it with
+> **Note:** Because we didn't give the `lambda` function a name, we have to surround it with
 > parentheses too if we want to call it. Otherwise, if we wrote something like `lambda x: 2*x(2)`, Python would get confused
 > and think that `(2)` was part of the instructions that comprise the `lambda` function.
 > As long as we don't want to call the `lambda` function ourselves, we don't need those parentheses. For example,
-> we can pass a `lambda` function as an argument to `apply` without any parentheses.
+> we can pass a `lambda` function as an argument to `apply` without any parentheses.

-Returning to our example, let's use `apply` to convert the columns `"mother_tongue":"lang_known"`
+Returning to our example, let's use `apply` to convert the columns `"mother_tongue":"lang_known"`
 to `int32`. To accomplish this we create a `lambda` function that takes one argument---a single column
 of the data frame, which we will name `col`---and apply the `astype` method to it.
 Then the `apply` method will use that `lambda` function on every column we specify via `loc[]`.
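
The conversion described in the hunk above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame `toy` and all of its values are made up, and only the column names are taken from the text.

```python
import pandas as pd

# Hypothetical stand-in for the book's region_lang data frame (values are made up).
toy = pd.DataFrame({
    "language": ["English", "French"],
    "mother_tongue": [100, 200],
    "most_at_home": [80, 150],
    "most_at_work": [60, 120],
    "lang_known": [300, 400],
})

# Select the columns by label range with loc[], then let apply call the
# lambda once per column; each call converts that column to int32.
converted = toy.loc[:, "mother_tongue":"lang_known"].apply(
    lambda col: col.astype("int32")
)
print(converted.dtypes)
```

Note that the `lambda` is passed to `apply` without surrounding parentheses, because `apply` is the one that calls it.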
@@ -1514,8 +1510,8 @@ region_lang_nums.apply(max, axis=1)

 We see that we get a column, which is the maximum value between `mother_tongue`,
 `most_at_home`, `most_at_work` and `lang_known` for each language
-and region. It is often the case that we want to include a column result
-from using `apply` row-wise as a new column in the data frame, so that we can make
+and region. It is often the case that we want to include a column result
+from using `apply` row-wise as a new column in the data frame, so that we can make
 plots or continue our analysis. To make this happen,
 we will use `assign` to create a new column. This is discussed in the next section.

@@ -1540,7 +1536,7 @@ with the new column added to it.

 To use the `assign` method, we specify one argument for each column we want to create.
 In this case we want to create one new column named `maximum`, so the argument
-to `assign` begins with `maximum =`.
+to `assign` begins with `maximum =`.
 Then after the `=`, we specify what the contents of that new column
 should be. In this case we use `apply` just as we did in the previous section to give us the maximum values.
 Remember to specify `axis=1` in the `apply` method so that we compute the row-wise maximum value.
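
As a small sketch of the `assign` plus row-wise `apply` pattern described above, assuming a made-up data frame (the names `toy` and `toy_nums` and all values are hypothetical, not the book's data):

```python
import pandas as pd

# Hypothetical miniature version of region_lang (values are made up).
toy = pd.DataFrame({
    "language": ["English", "French"],
    "mother_tongue": [100, 200],
    "most_at_home": [80, 250],
})

# Keep only the numeric columns, mirroring the role of region_lang_nums.
toy_nums = toy[["mother_tongue", "most_at_home"]]

# axis=1 makes apply pass each row to max, giving one maximum per row;
# assign returns a new data frame and leaves toy itself unchanged.
with_max = toy.assign(maximum=toy_nums.apply(max, axis=1))
print(with_max)
```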
@@ -1550,7 +1546,7 @@ region_lang.assign(
 maximum = region_lang_nums.apply(max, axis=1)
 )
 ```
-This gives us a new data frame that looks like the `region_lang` data frame,
+This gives us a new data frame that looks like the `region_lang` data frame,
 except that it has an additional column named `maximum`.
 The `maximum` column contains
 the maximum value between `mother_tongue`,
@@ -1575,7 +1571,7 @@ glue("toronto_popn", "{0:,.0f}".format(toronto_popn))
 glue("prop_eng_tor", "{0:.2f}".format(number_most_home / toronto_popn))
 ```

-As another example, we might ask the question: "What proportion of
+As another example, we might ask the question: "What proportion of
 the population reported English as their primary language at home in the 2016 census?"
 For example, in Toronto, {glue:text}`number_most_home` people reported
 speaking English as their primary language at home, and the
@@ -1597,43 +1593,39 @@ and name the new data frame `english_langs`.
 ```{code-cell} ipython3
 :tags: ["output_scroll"]
 english_lang = region_lang[
-(region_lang["language"] == "English") &
-(region_lang["region"].isin(five_cities["region"]))
-]
+(region_lang["language"] == "English") &
+(region_lang["region"].isin(five_cities["region"]))
+]
 english_lang
 ```

 Okay, now we have a data frame that pertains only to the English language
-and the five cities mentioned earlier.
+and the five cities mentioned earlier.
 In order to compute the proportion of the population speaking English in each of these cities,
 we need to add the population data from the `five_cities` data frame.
 ```{code-cell} ipython3
 five_cities
 ```
-The data frame above shows that the populations of the five cities in 2016 were
+The data frame above shows that the populations of the five cities in 2016 were
 5928040 (Toronto), 4098927 (Montréal), 2463431 (Vancouver), 1392609 (Calgary), and 1321426 (Edmonton).
-We will add this information to our data frame in a new column named `city_pops` by using `assign`.
-Once again we specify the new column name (`city_pops`) as the argument, followed by the equal symbol `=`,
+We will add this information to our data frame in a new column named `city_pops` by using `assign`.
+Once again we specify the new column name (`city_pops`) as the argument, followed by the equal symbol `=`,
 and finally the data in the column.
 Note that the order of the rows in the `english_lang` data frame is Montréal, Toronto, Calgary, Edmonton, Vancouver.
 So we will create a column called `city_pops` where we list the populations of those cities in that
 order, and add it to our data frame.
 Also note that we write `english_lang =` on the left so that the newly created data frame overwrites our
-old `english_lang` data frame; remember that by default, like other `pandas` functions, `assign` does not
+old `english_lang` data frame; remember that by default, like other `pandas` functions, `assign` does not
 modify the original data frame directly!
 ```{code-cell} ipython3
 :tags: ["output_scroll"]
 english_lang = english_lang.assign(
-city_pops=[4098927,
-5928040,
-1392609,
-1321426,
-2463431
-])
+city_pops=[4098927, 5928040, 1392609, 1321426, 2463431]
+)
 english_lang
 ```
 > **Note**: Inserting data manually in this way is generally very error-prone and is not recommended.
-> We do it here to demonstrate another usage of `assign` that does not involve `apply`.
+> We do it here to demonstrate another usage of `assign` that does not involve `apply`.
 > But in more advanced data wrangling,
 > one would solve this problem in a less error-prone way using
 > the `merge` function, which lets you combine two data frames. We will show you an
@@ -1645,8 +1637,8 @@ proportion of people who speak English the most at home by taking the ratio of t
 ```{code-cell} ipython3
 :tags: ["output_scroll"]
 english_lang.assign(
-proportion=english_lang["most_at_home"]/english_lang["city_pops"]
-)
+proportion=english_lang["most_at_home"]/english_lang["city_pops"]
+)
 ```


@@ -1737,7 +1729,7 @@ right order, and it could be easy to make a mistake this way. An alternative app
 is to (1) create a new, empty data frame, (2) use `assign` to assign the city names and populations in that
 data frame, and (3) use `merge` to combine the two data frames, recognizing that the "regions" are the same.

-We create a new, empty data frame by calling `pd.DataFrame` with no arguments.
+We create a new, empty data frame by calling `pd.DataFrame` with no arguments.
 We then use `assign` to add the city names in a column called `"region"`
 and their populations in a column called `"population"`.
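
The three steps described above might look roughly like the following sketch. It is an illustration under assumptions, not the chapter's own code: `toy_lang` and its `most_at_home` values are made up, only two of the five cities are included, and the two populations are the ones quoted earlier in the text.

```python
import pandas as pd

# Steps (1) and (2): start from an empty data frame and assign the two columns.
city_data = pd.DataFrame().assign(
    region=["Toronto", "Montréal"],
    population=[5928040, 4098927],
)

# Hypothetical language table whose rows are in a different order
# (the most_at_home values are made up).
toy_lang = pd.DataFrame({
    "region": ["Montréal", "Toronto"],
    "most_at_home": [10, 20],
})

# Step (3): merge matches rows on the shared "region" values, so the
# populations line up correctly regardless of row order.
merged = toy_lang.merge(city_data, on="region")
print(merged)
```

Because the rows are matched by region name, nothing depends on typing the populations in the same order as the rows of the other data frame.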
 ```{code-cell} ipython3