Commit 4bfe4d0

more streamlined str.split, and moving concat intro later
1 parent c3351b4 commit 4bfe4d0

2 files changed: 14 additions and 42 deletions

source/classification1.md

Lines changed: 5 additions & 2 deletions
@@ -1415,9 +1415,12 @@ what the data would look like if the cancer was rare. We will do this by
 picking only 3 observations from the malignant group, and keeping all
 of the benign observations. We choose these 3 observations using the `.head()`
 method, which takes the number of rows to select from the top (`n`).
-We use the [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
+We will then use the [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
 function from `pandas` to glue the two resulting filtered
-data frames back together by passing them together in a sequence.
+data frames back together. The `concat` function *concatenates* data frames
+along an axis. By default, it concatenates the data frames vertically along `axis=0` yielding a single
+*taller* data frame, which is what we want to do here. If we instead wanted to concatenate horizontally
+to produce a *wider* data frame, we would specify `axis=1`.
 The new imbalanced data is shown in {numref}`fig:05-unbalanced`,
 and we print the counts of the classes using the `value_counts` function.
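The `concat` behaviour described in the added lines can be illustrated with a small sketch. The data frames `malignant` and `benign` and their `Class` and `Area` values below are hypothetical stand-ins, not the chapter's actual data; only `head`, `pd.concat`, and `value_counts` come from the text above.

```python
import pandas as pd

# Hypothetical stand-ins for the two filtered data frames (not the real data set)
malignant = pd.DataFrame(
    {"Class": ["Malignant"] * 5, "Area": [1.2, 0.8, 1.5, 0.9, 1.1]}
)
benign = pd.DataFrame(
    {"Class": ["Benign"] * 4, "Area": [-0.3, -0.5, -0.1, -0.4]}
)

# Keep the first 3 malignant rows and all benign rows, then stack them
# vertically; axis=0 is the default, so the result is a taller data frame
rare = pd.concat([malignant.head(3), benign])
class_counts = rare["Class"].value_counts()

# With axis=1 the frames are placed side by side instead, aligned on the
# row index, which yields a wider data frame
wider = pd.concat([malignant.head(3), benign], axis=1)

print(rare.shape, wider.shape)
print(class_counts)
```

In this toy case the vertical result keeps the two original columns and gains rows, while the horizontal result gains columns and fills unmatched rows with missing values.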

source/wrangling.md

Lines changed: 9 additions & 40 deletions
@@ -747,13 +747,15 @@ on the slash character `"/"`.
 ```
 
 The `pandas` package provides similar functions that we can access
-by using the `str` method. So, to split all of the entries for an entire
-column in a data frame, we would use the `str.split` method.
-Once we use this method,
-one column will contain only the counts of Canadians
+by using the `str` method. So to split all of the entries for an entire
+column in a data frame, we will use the `str.split` method.
+The output of this method is a data frame with two columns:
+one containing only the counts of Canadians
 that speak each language most at home,
-and the other will contain the counts of Canadians
+and the other containing only the counts of Canadians
 that speak each language most at work for each region.
+We then drop the no-longer-needed `value` column from the `lang_messy_longer`
+data frame, and assign the two columns from `str.split` to two new columns.
 {numref}`fig:img-separate`
 outlines what we need to specify to use `str.split`.

@@ -766,45 +768,12 @@ outlines what we need to specify to use `str.split`.
 Syntax for the `str.split` function.
 ```
 
-We will do this in multiple steps. First, we create a new object
-that contains two columns. We will set the `expand` argument to `True`
-to tell `pandas` that we want to expand the output into two columns.
-
-```{code-cell} ipython3
-split_counts = lang_messy_longer["value"].str.split("/", expand=True)
-split_counts
-```
-Since we only operated on the `value` column, the `split_counts` data frame
-doesn't have the rest of the columns (`language`, `region`, etc.)
-that were in our original data frame. We don't want to lose this information, so
-we will contatenate (combine) the original data frame with `split_counts` using
-the `concat` function from `pandas`. The `concat` function *concatenates* data frames
-along an axis. By default, it concatenates the data frames vertically along `axis=0` yielding a single
-*taller* data frame. Since we want to concatenate our old columns to our
-new `split_counts` data frame (to obtain a *wider* data frame), we will specify `axis=1`.
-
 ```{code-cell} ipython3
-:tags: ["output_scroll"]
-tidy_lang = pd.concat(
-    [lang_messy_longer, split_counts],
-    axis=1,
-)
+tidy_lang = lang_messy_longer.drop(columns=["value"])
+tidy_lang[["most_at_home", "most_at_work"]] = lang_messy_longer["value"].str.split("/", expand=True)
 tidy_lang
 ```
 
-Next, we will rename our newly created columns (currently called
-`0` and `1`) to the more meaningful names `"most_at_home"` and `"most_at_work"`,
-and drop the `value` column from our data frame using the `drop` method.
-
-```{code-cell} ipython3
-:tags: ["output_scroll"]
-tidy_lang = (
-    tidy_lang.rename(columns={0: "most_at_home", 1: "most_at_work"})
-    .drop(columns=["value"])
-)
-tidy_lang
-```
-Note that we could have chained these steps together to make our code more compact.
 Is this data set now tidy? If we recall the three criteria for tidy data:
 
 - each row is a single observation,
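
As a rough sanity check on this change, the sketch below runs the streamlined two-line version next to the removed multi-step version (`concat` with `axis=1`, then `rename` and `drop`). The miniature `lang_messy_longer` defined here is a hypothetical stand-in with made-up values, and `old_tidy_lang` and `same_result` are names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical miniature of lang_messy_longer (values are made up)
lang_messy_longer = pd.DataFrame(
    {
        "language": ["English", "French", "English"],
        "region": ["Toronto", "Toronto", "Montréal"],
        "value": ["10/20", "3/4", "5/6"],
    }
)

# Streamlined version added in this commit: drop `value`, then assign the
# two columns produced by str.split(..., expand=True) to two new columns
tidy_lang = lang_messy_longer.drop(columns=["value"])
tidy_lang[["most_at_home", "most_at_work"]] = (
    lang_messy_longer["value"].str.split("/", expand=True)
)

# Multi-step version removed in this commit: split, concatenate side by
# side, rename the columns labelled 0 and 1, then drop `value`
split_counts = lang_messy_longer["value"].str.split("/", expand=True)
old_tidy_lang = (
    pd.concat([lang_messy_longer, split_counts], axis=1)
    .rename(columns={0: "most_at_home", 1: "most_at_work"})
    .drop(columns=["value"])
)

# Compare column order and cell values of the two results
same_result = (
    list(tidy_lang.columns) == list(old_tidy_lang.columns)
    and tidy_lang.to_dict("list") == old_tidy_lang.to_dict("list")
)
print(same_result)
```

On this toy input both versions should produce the same columns and values. In either version the split pieces are still strings, so any numeric conversion remains a separate step.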
