@@ -747,13 +747,15 @@ on the slash character `"/"`.
 ```
 
 The `pandas` package provides similar functions that we can access
-by using the `str` method. So, to split all of the entries for an entire
-column in a data frame, we would use the `str.split` method.
-Once we use this method,
-one column will contain only the counts of Canadians
+by using the `str` method. So to split all of the entries for an entire
+column in a data frame, we will use the `str.split` method.
+The output of this method is a data frame with two columns:
+one containing only the counts of Canadians
 that speak each language most at home,
-and the other will contain the counts of Canadians
+and the other containing only the counts of Canadians
 that speak each language most at work for each region.
+We then drop the no-longer-needed `value` column from the `lang_messy_longer`
+data frame, and assign the two columns from `str.split` to two new columns.
 {numref}`fig:img-separate`
 outlines what we need to specify to use `str.split`.
 
@@ -766,45 +768,12 @@ outlines what we need to specify to use `str.split`.
 Syntax for the `str.split` function.
 ```
 
-We will do this in multiple steps. First, we create a new object
-that contains two columns. We will set the `expand` argument to `True`
-to tell `pandas` that we want to expand the output into two columns.
-
-```{code-cell} ipython3
-split_counts = lang_messy_longer["value"].str.split("/", expand=True)
-split_counts
-```
-Since we only operated on the `value` column, the `split_counts` data frame
-doesn't have the rest of the columns (`language`, `region`, etc.)
-that were in our original data frame. We don't want to lose this information, so
-we will contatenate (combine) the original data frame with `split_counts` using
-the `concat` function from `pandas`. The `concat` function *concatenates* data frames
-along an axis. By default, it concatenates the data frames vertically along `axis=0` yielding a single
-*taller* data frame. Since we want to concatenate our old columns to our
-new `split_counts` data frame (to obtain a *wider* data frame), we will specify `axis=1`.
-
 ```{code-cell} ipython3
-:tags: ["output_scroll"]
-tidy_lang = pd.concat(
-    [lang_messy_longer, split_counts],
-    axis=1,
-)
+tidy_lang = lang_messy_longer.drop(columns=["value"])
+tidy_lang[["most_at_home", "most_at_work"]] = lang_messy_longer["value"].str.split("/", expand=True)
 tidy_lang
 ```
 
-Next, we will rename our newly created columns (currently called
-`0` and `1`) to the more meaningful names `"most_at_home"` and `"most_at_work"`,
-and drop the `value` column from our data frame using the `drop` method.
-
-```{code-cell} ipython3
-:tags: ["output_scroll"]
-tidy_lang = (
-    tidy_lang.rename(columns={0: "most_at_home", 1: "most_at_work"})
-    .drop(columns=["value"])
-)
-tidy_lang
-```
-Note that we could have chained these steps together to make our code more compact.
 Is this data set now tidy? If we recall the three criteria for tidy data:
 
 - each row is a single observation,
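
To check the two-line replacement outside the book's build, the pattern can be run on a small stand-in frame. This is only a sketch: the rows below are made up for illustration and are not taken from the real `lang_messy_longer` data; only the column names `value`, `most_at_home`, and `most_at_work` come from the diff.

```python
import pandas as pd

# Hypothetical stand-in for lang_messy_longer; the values are invented.
lang_messy_longer = pd.DataFrame(
    {
        "language": ["French", "French", "Cree", "Cree"],
        "region": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
        "value": ["100/20", "3000/1500", "5/0", "10/2"],
    }
)

# Same two steps as the added lines in the diff:
# 1. copy the frame without the combined "value" column,
# 2. split "value" on "/" and assign the two resulting columns by name.
tidy_lang = lang_messy_longer.drop(columns=["value"])
tidy_lang[["most_at_home", "most_at_work"]] = lang_messy_longer["value"].str.split(
    "/", expand=True
)

print(tidy_lang)
```

The assignment replaces the earlier `concat` and `rename` steps, because the columns produced by `str.split` are named directly on assignment. The split values are still strings at this point, so a later conversion (for example with `astype`) would be needed before doing arithmetic on the counts.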