
Commit 9d72fa8

Merge pull request #301 from UBC-DSCI/assign-apply-lambda
Handling assign, apply, and lambdas
2 parents ee599ab + 721b047 commit 9d72fa8

File tree

9 files changed: +253 -317 lines changed

source/classification1.md

Lines changed: 20 additions & 17 deletions
@@ -628,15 +628,16 @@ Scatter plot of concavity versus perimeter with new observation represented as a
 ```{code-cell} ipython3
 new_obs_Perimeter = 0
 new_obs_Concavity = 3.5
-(
-    cancer
-    [["Perimeter", "Concavity", "Class"]]
-    .assign(dist_from_new = (
+cancer["dist_from_new"] = (
     (cancer["Perimeter"] - new_obs_Perimeter) ** 2
     + (cancer["Concavity"] - new_obs_Concavity) ** 2
-    )**(1/2))
-    .nsmallest(5, "dist_from_new")
-)
+)**(1/2)
+cancer.nsmallest(5, "dist_from_new")[[
+    "Perimeter",
+    "Concavity",
+    "Class",
+    "dist_from_new"
+]]
 ```

 ```{code-cell} ipython3
@@ -751,16 +752,18 @@ three predictors.
 new_obs_Perimeter = 0
 new_obs_Concavity = 3.5
 new_obs_Symmetry = 1
-(
-    cancer
-    [["Perimeter", "Concavity", "Symmetry", "Class"]]
-    .assign(dist_from_new = (
-        (cancer["Perimeter"] - new_obs_Perimeter) ** 2
-        + (cancer["Concavity"] - new_obs_Concavity) ** 2
-        + (cancer["Symmetry"] - new_obs_Symmetry) ** 2
-    )**(1/2))
-    .nsmallest(5, "dist_from_new")
-)
+cancer["dist_from_new"] = (
+    (cancer["Perimeter"] - new_obs_Perimeter) ** 2
+    + (cancer["Concavity"] - new_obs_Concavity) ** 2
+    + (cancer["Symmetry"] - new_obs_Symmetry) ** 2
+)**(1/2)
+cancer.nsmallest(5, "dist_from_new")[[
+    "Perimeter",
+    "Concavity",
+    "Symmetry",
+    "Class",
+    "dist_from_new"
+]]
 ```

 Based on $K=5$ nearest neighbors with these three predictors we would classify
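The refactor in this file replaces an `.assign(...)` chain with a direct column assignment followed by `nsmallest`. A minimal sketch of that pattern, using a small invented data frame rather than the book's `cancer` data:

```python
import pandas as pd

# Invented stand-in for the book's `cancer` data frame
cancer = pd.DataFrame({
    "Perimeter": [0.2, -1.0, 2.5, 0.1, 3.0, -0.5],
    "Concavity": [3.0, 1.0, 4.0, 3.6, 0.5, 2.0],
    "Class": ["Malignant", "Benign", "Malignant", "Malignant", "Benign", "Benign"],
})
new_obs_Perimeter = 0
new_obs_Concavity = 3.5

# Direct column assignment, as in the updated book text
cancer["dist_from_new"] = (
    (cancer["Perimeter"] - new_obs_Perimeter) ** 2
    + (cancer["Concavity"] - new_obs_Concavity) ** 2
) ** (1 / 2)

# The 5 observations closest to the new point
nearest = cancer.nsmallest(5, "dist_from_new")[[
    "Perimeter", "Concavity", "Class", "dist_from_new"
]]
```

The direct assignment mutates `cancer` in place, which is why the updated text no longer needs the intermediate chained data frame that `.assign` produced.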

source/classification2.md

Lines changed: 14 additions & 17 deletions
@@ -606,18 +606,16 @@ knn_pipeline
 ```

 Now that we have a $K$-nearest neighbors classifier object, we can use it to
-predict the class labels for our test set. We will use the `assign` method to
-augment the original test data with a column of predictions, creating the
-`cancer_test_predictions` data frame. The `Class` variable contains the actual
+predict the class labels for our test set and
+augment the original test data with a column of predictions.
+The `Class` variable contains the actual
 diagnoses, while the `predicted` contains the predicted diagnoses from the
 classifier. Note that below we print out just the `ID`, `Class`, and `predicted`
 variables in the output data frame.

 ```{code-cell} ipython3
-cancer_test_predictions = cancer_test.assign(
-    predicted = knn_pipeline.predict(cancer_test[["Smoothness", "Concavity"]])
-)
-cancer_test_predictions[["ID", "Class", "predicted"]]
+cancer_test["predicted"] = knn_pipeline.predict(cancer_test[["Smoothness", "Concavity"]])
+cancer_test[["ID", "Class", "predicted"]]
 ```

 ### Evaluate performance
@@ -632,11 +630,11 @@ number of predictions. First we filter the rows to find the number of correct pr
 and then divide the number of rows with correct predictions by the total number of rows
 using the `shape` attribute.
 ```{code-cell} ipython3
-correct_preds = cancer_test_predictions[
-    cancer_test_predictions["Class"] == cancer_test_predictions["predicted"]
+correct_preds = cancer_test[
+    cancer_test["Class"] == cancer_test["predicted"]
 ]

-correct_preds.shape[0] / cancer_test_predictions.shape[0]
+correct_preds.shape[0] / cancer_test.shape[0]
 ```

 The `scikit-learn` package also provides a more convenient way to do this using
@@ -669,15 +667,15 @@ arguments: the actual labels first, then the predicted labels second.

 ```{code-cell} ipython3
 pd.crosstab(
-    cancer_test_predictions["Class"],
-    cancer_test_predictions["predicted"]
+    cancer_test["Class"],
+    cancer_test["predicted"]
 )
 ```

 ```{code-cell} ipython3
 :tags: [remove-cell]
-_ctab = pd.crosstab(cancer_test_predictions["Class"],
-    cancer_test_predictions["predicted"]
+_ctab = pd.crosstab(cancer_test["Class"],
+    cancer_test["predicted"]
 )

 c11 = _ctab["Malignant"]["Malignant"]
@@ -1205,15 +1203,14 @@ We will also rename the parameter name column to be a bit more readable,
 and drop the now unused `std_test_score` column.

 ```{code-cell} ipython3
+accuracies_grid["sem_test_score"] = accuracies_grid["std_test_score"] / 10**(1/2)
 accuracies_grid = (
     accuracies_grid[[
         "param_kneighborsclassifier__n_neighbors",
         "mean_test_score",
-        "std_test_score"
+        "sem_test_score"
     ]]
-    .assign(sem_test_score=accuracies_grid["std_test_score"] / 10**(1/2))
     .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
-    .drop(columns=["std_test_score"])
 )
 accuracies_grid
 ```
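The accuracy and confusion-matrix computations this file switches over to `cancer_test` reduce to the following sketch (toy labels, not the book's actual test set):

```python
import pandas as pd

# Invented stand-in for the book's `cancer_test` data frame with predictions
cancer_test = pd.DataFrame({
    "Class":     ["Malignant", "Benign", "Malignant", "Benign", "Benign"],
    "predicted": ["Malignant", "Benign", "Benign", "Benign", "Malignant"],
})

# Accuracy: fraction of rows where the prediction matches the true label
correct_preds = cancer_test[cancer_test["Class"] == cancer_test["predicted"]]
accuracy = correct_preds.shape[0] / cancer_test.shape[0]

# Confusion matrix: actual labels first, predicted labels second
ctab = pd.crosstab(cancer_test["Class"], cancer_test["predicted"])
```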

source/clustering.md

Lines changed: 5 additions & 5 deletions
@@ -856,14 +856,14 @@ order to do that, we first need to augment our
 original `penguins` data frame with the cluster assignments.
 We can access these using the `labels_` attribute of the clustering object
 ("labels" is a common alternative term to "assignments" in clustering), and
-add them to the data frame using `assign`.
+add them to the data frame.

 ```{code-cell} ipython3
-clustered_data = penguins.assign(cluster = penguin_clust[1].labels_)
-clustered_data
+penguins["cluster"] = penguin_clust[1].labels_
+penguins
 ```

-Now that we have the cluster assignments included in the `clustered_data` data frame, we can
+Now that we have the cluster assignments included in the `penguins` data frame, we can
 visualize them as shown in {numref}`cluster_plot`.
 Note that we are plotting the *un-standardized* data here; if we for some reason wanted to
 visualize the *standardized* data, we would need to use the `fit` and `transform` functions
@@ -874,7 +874,7 @@ will treat the `cluster` variable as a nominal/categorical variable, and
 hence use a discrete color map for the visualization.

 ```{code-cell} ipython3
-cluster_plot=alt.Chart(clustered_data).mark_circle().encode(
+cluster_plot=alt.Chart(penguins).mark_circle().encode(
     x=alt.X("flipper_length_mm").title("Flipper Length").scale(zero=False),
     y=alt.Y("bill_length_mm").title("Bill Length").scale(zero=False),
     color=alt.Color("cluster:N").title("Cluster"),
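The clustering change attaches the fitted model's `labels_` directly as a column. A sketch with hard-coded labels standing in for the output of the book's fitted clustering step (`penguin_clust[1].labels_`):

```python
import numpy as np
import pandas as pd

# Invented stand-in for the book's `penguins` data frame
penguins = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, 217.0, 230.0],
    "bill_length_mm": [39.1, 39.5, 49.3, 59.6],
})

# Stand-in for `penguin_clust[1].labels_` (one integer label per row);
# in the book these come from a fitted scikit-learn clustering object
labels_ = np.array([0, 0, 1, 1])

# Attach the cluster assignments directly, replacing the old `.assign` call
penguins["cluster"] = labels_
```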

source/inference.md

Lines changed: 5 additions & 5 deletions
@@ -250,11 +250,11 @@ expect our sample proportions from this population to vary for samples of size 4

 We again use the `sample` method to take samples of size 40 from our
 population of Airbnb listings. But this time we use a list comprehension
-to repeat an operation multiple time (as in the previous chapter).
-In this case we are taking 20,000 samples of size 40
-and to make it clear which rows in the data frame come
-which of the 20,000 samples,
-we also add a column called `replicate` with this information.
+to repeat the operation multiple times (as we did previously in {numref}`Chapter %s <clustering>`).
+In this case we repeat the operation 20,000 times to obtain 20,000 samples of size 40.
+To make it clear which rows in the data frame come from
+which of the 20,000 samples, we also add a column called `replicate` with this information using the `assign` function,
+introduced previously in {numref}`Chapter %s <wrangling>`.
 The call to `concat` concatenates all the 20,000 data frames
 returned from the list comprehension into a single big data frame.

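The sampling prose above describes a list comprehension whose per-replicate data frames are tagged via `assign` and stacked with `concat`. A downsized sketch (3 replicates of size 4 instead of 20,000 of size 40, on an invented population):

```python
import pandas as pd

# Invented stand-in for the Airbnb listings population
population = pd.DataFrame({"price": [50, 75, 100, 125, 150, 175, 200, 225]})

# One data frame per replicate, each tagged with a `replicate` column,
# then concatenated into a single big data frame
samples = pd.concat([
    population.sample(4, random_state=i).assign(replicate=i)
    for i in range(3)
])
```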
source/intro.md

Lines changed: 13 additions & 4 deletions
@@ -646,7 +646,8 @@ ten_lang = arranged_lang.head(10)
 ten_lang
 ```

-## Adding and modifying columns using `assign`
+(ch1-adding-modifying)=
+## Adding and modifying columns

 ```{index} assign
 ```
@@ -663,7 +664,7 @@ column by the total Canadian population according to the 2016
 census&mdash;i.e., 35,151,728&mdash;and multiply it by 100. We can perform
 this computation using the code `100 * ten_lang["mother_tongue"] / canadian_population`.
 Then to store the result in a new column (or
-overwrite an existing column), we use the `assign` method. We specify the name of the new
+overwrite an existing column), we specify the name of the new
 column to create (or old column to modify), then the assignment symbol `=`,
 and then the computation to store in that column. In this case, we will opt to
 create a new column called `mother_tongue_percent`.
@@ -676,10 +677,18 @@ and do not affect how Python interprets the number. In other words,
 although the latter is much clearer!
 ```

+```{code-cell} ipython3
+:tags: [remove-cell]
+# disable the setting-with-copy warning
+# it's not important for this chapter and just distracting
+# only occurs here because we did a much earlier .loc operation that is being picked up below by the column assignment
+pd.options.mode.chained_assignment = None
+```
+
 ```{code-cell} ipython3
 canadian_population = 35_151_728
-ten_lang_percent = ten_lang.assign(mother_tongue_percent=100 * ten_lang["mother_tongue"] / canadian_population)
-ten_lang_percent
+ten_lang["mother_tongue_percent"] = 100 * ten_lang["mother_tongue"] / canadian_population
+ten_lang
 ```

 The `ten_lang` data frame shows that
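The intro-chapter change stores the percentage with a direct column assignment. A sketch with invented counts standing in for the census data in `ten_lang`:

```python
import pandas as pd

# Invented stand-in for the book's `ten_lang` data frame
ten_lang = pd.DataFrame({
    "language": ["English", "French"],
    "mother_tongue": [19_460_855, 7_166_700],
})
canadian_population = 35_151_728

# Direct column assignment replacing the old `.assign(...)` call
ten_lang["mother_tongue_percent"] = (
    100 * ten_lang["mother_tongue"] / canadian_population
)
```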

source/regression1.md

Lines changed: 30 additions & 37 deletions
@@ -294,17 +294,15 @@ of a house that is 2,000 square feet.
 ```

 ```{code-cell} ipython3
-nearest_neighbors = (
-    small_sacramento.assign(diff=(2000 - small_sacramento["sqft"]).abs())
-    .nsmallest(5, "diff")
-)
-
+small_sacramento["dist"] = (2000 - small_sacramento["sqft"]).abs()
+nearest_neighbors = small_sacramento.nsmallest(5, "dist")
 nearest_neighbors
 ```

 ```{code-cell} ipython3
 :tags: [remove-cell]

+
 nn_plot = small_plot + rule

 # plot horizontal lines perpendicular to x=2000
@@ -609,16 +607,15 @@ sacr_gridsearch.fit(
 )

 # Retrieve the CV scores
-sacr_results = pd.DataFrame(sacr_gridsearch.cv_results_)[[
-    "param_kneighborsregressor__n_neighbors",
-    "mean_test_score",
-    "std_test_score"
-]]
+sacr_results = pd.DataFrame(sacr_gridsearch.cv_results_)
+sacr_results["sem_test_score"] = sacr_results["std_test_score"] / 5**(1/2)
 sacr_results = (
-    sacr_results
-    .assign(sem_test_score=sacr_results["std_test_score"] / 5**(1/2))
+    sacr_results[[
+        "param_kneighborsregressor__n_neighbors",
+        "mean_test_score",
+        "sem_test_score"
+    ]]
     .rename(columns={"param_kneighborsregressor__n_neighbors": "n_neighbors"})
-    .drop(columns=["std_test_score"])
 )
 sacr_results
 ```
@@ -834,12 +831,10 @@ model uses a different default scoring metric than the RMSPE.
 ```{code-cell} ipython3
 from sklearn.metrics import mean_squared_error

-sacr_preds = sacramento_test.assign(
-    predicted = sacr_gridsearch.predict(sacramento_test)
-)
+sacramento_test["predicted"] = sacr_gridsearch.predict(sacramento_test)
 RMSPE = mean_squared_error(
-    y_true = sacr_preds["price"],
-    y_pred=sacr_preds["predicted"]
+    y_true = sacramento_test["price"],
+    y_pred = sacramento_test["predicted"]
 )**(1/2)
 RMSPE
 ```
@@ -890,9 +885,7 @@ sqft_prediction_grid = pd.DataFrame({
     "sqft": np.arange(sacramento["sqft"].min(), sacramento["sqft"].max(), 10)
 })
 # Predict the price for each of the sqft values in the grid
-sacr_preds = sqft_prediction_grid.assign(
-    predicted = sacr_gridsearch.predict(sqft_prediction_grid)
-)
+sqft_prediction_grid["predicted"] = sacr_gridsearch.predict(sqft_prediction_grid)

 # Plot all the houses
 base_plot = alt.Chart(sacramento).mark_circle(opacity=0.4).encode(
@@ -905,7 +898,10 @@ base_plot = alt.Chart(sacramento).mark_circle(opacity=0.4).encode(
 )

 # Add the predictions as a line
-sacr_preds_plot = base_plot + alt.Chart(sacr_preds, title=f"K = {best_k_sacr}").mark_line(
+sacr_preds_plot = base_plot + alt.Chart(
+    sqft_prediction_grid,
+    title=f"K = {best_k_sacr}"
+).mark_line(
     color="#ff7f0e"
 ).encode(
     x="sqft",
@@ -1018,25 +1014,24 @@ sacr_gridsearch = GridSearchCV(
     cv=5,
     scoring="neg_root_mean_squared_error"
 )
+
 sacr_gridsearch.fit(
     sacramento_train[["sqft", "beds"]],
     sacramento_train["price"]
 )

 # retrieve the CV scores
-sacr_results = pd.DataFrame(sacr_gridsearch.cv_results_)[[
-    "param_kneighborsregressor__n_neighbors",
-    "mean_test_score",
-    "std_test_score"
-]]
-
+sacr_results = pd.DataFrame(sacr_gridsearch.cv_results_)
+sacr_results["sem_test_score"] = sacr_results["std_test_score"] / 5**(1/2)
+sacr_results["mean_test_score"] = -sacr_results["mean_test_score"]
 sacr_results = (
-    sacr_results
-    .assign(sem_test_score=sacr_results["std_test_score"] / 5**(1/2))
+    sacr_results[[
+        "param_kneighborsregressor__n_neighbors",
+        "mean_test_score",
+        "sem_test_score"
+    ]]
     .rename(columns={"param_kneighborsregressor__n_neighbors" : "n_neighbors"})
-    .drop(columns=["std_test_score"])
 )
-sacr_results["mean_test_score"] = -sacr_results["mean_test_score"]

 # show only the row of minimum RMSPE
 sacr_results.nsmallest(1, "mean_test_score")
@@ -1069,12 +1064,10 @@ via the `predict` method of the fit `GridSearchCV` object. Finally, we will use
 to compute the RMSPE.

 ```{code-cell} ipython3
-sacr_preds = sacramento_test.assign(
-    predicted = sacr_gridsearch.predict(sacramento_test)
-)
+sacramento_test["predicted"] = sacr_gridsearch.predict(sacramento_test)
 RMSPE_mult = mean_squared_error(
-    y_true = sacr_preds["price"],
-    y_pred=sacr_preds["predicted"]
+    y_true = sacramento_test["price"],
+    y_pred = sacramento_test["predicted"]
 )**(1/2)
 RMSPE_mult

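The grid-search post-processing in this file now computes `sem_test_score` (and negates the score) up front, then selects and renames columns. A sketch on invented `cv_results_`-style data in place of the real `GridSearchCV` output:

```python
import pandas as pd

# Invented stand-in for pd.DataFrame(sacr_gridsearch.cv_results_)
sacr_results = pd.DataFrame({
    "param_kneighborsregressor__n_neighbors": [1, 2, 3],
    "mean_test_score": [-85000.0, -90000.0, -95000.0],  # negated RMSPE scores
    "std_test_score": [5000.0, 4000.0, 3000.0],
})

# Standard error of the mean over 5 CV folds; flip the sign of the score
sacr_results["sem_test_score"] = sacr_results["std_test_score"] / 5 ** (1 / 2)
sacr_results["mean_test_score"] = -sacr_results["mean_test_score"]

# Select, then rename, as in the updated code cell
sacr_results = (
    sacr_results[[
        "param_kneighborsregressor__n_neighbors",
        "mean_test_score",
        "sem_test_score"
    ]]
    .rename(columns={"param_kneighborsregressor__n_neighbors": "n_neighbors"})
)

# Row of minimum RMSPE
best = sacr_results.nsmallest(1, "mean_test_score")
```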