
Commit ae56d01

Merge pull request #337 from UBC-DSCI/bug-hunt
Bug Hunt
2 parents ef26fbf + 64432bc commit ae56d01


54 files changed (+52,526 / -1,165 lines)

source/acknowledgements.md

Lines changed: 3 additions & 5 deletions
@@ -51,14 +51,12 @@ is reflected in the content of this book.
 
 We'd like to thank everyone that has contributed to the development of
 [*Data Science: A First Introduction (Python Edition)*](https://python.datasciencebook.ca).
-This is an open source Python translation of the original [*Data Science: A First Introduction*](https://datasciencebook.ca)
+This is an open source Python translation of the original
 book, which focused on the R programming language. Both of these books are
 used to teach DSCI 100 at the University of British Columbia (UBC).
 We would like to give special thanks to Navya Dahiya and Gloria Ye
 for completing the first round of translation of the R material to Python,
 and to Philip Austin for his leadership and guidance throughout the translation process.
-We also gratefully acknowledge the UBC Open Educational Resources Fund
-and the UBC Department of Statistics for supporting the translation of
+We also gratefully acknowledge the UBC Open Educational Resources Fund, the UBC Department of Statistics,
+and the UBC Department of Earth, Ocean, and Atmospheric Sciences for supporting the translation of
 the original R textbook and exercises to the Python programming language.
-
-

source/classification1.md

Lines changed: 4 additions & 4 deletions
@@ -1359,7 +1359,7 @@ glue(
 :::{glue:figure} fig:05-scaling-plt
 :name: fig:05-scaling-plt
 
-Comparison of K = 3 nearest neighbors with standardized and unstandardized data.
+Comparison of K = 3 nearest neighbors with unstandardized and standardized data.
 :::
 
 ```{code-cell} ipython3
@@ -1421,7 +1421,7 @@ To better illustrate the problem, let's revisit the scaled breast cancer data,
 what the data would look like if the cancer was rare. We will do this by
 picking only 3 observations from the malignant group, and keeping all
 of the benign observations. We choose these 3 observations using the `.head()`
-method, which takes the number of rows to select from the top (`n`).
+method, which takes the number of rows to select from the top.
 We will then use the [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
 function from `pandas` to glue the two resulting filtered
 data frames back together. The `concat` function *concatenates* data frames
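For context, the downsampling described in this hunk can be sketched as follows; the `cancer` data frame and its column names are assumed from the chapter and are not part of this commit:

```python
import pandas as pd

# Toy stand-in for the chapter's breast cancer data (column names assumed).
cancer = pd.DataFrame({
    "Class": ["Malignant"] * 5 + ["Benign"] * 10,
    "Smoothness": range(15),
})

rare_malignant = cancer[cancer["Class"] == "Malignant"].head(3)  # keep only 3 malignant rows
benign = cancer[cancer["Class"] == "Benign"]                     # keep all benign rows
rare_cancer = pd.concat((rare_malignant, benign))                # glue the two filtered frames back together
rare_cancer["Class"].value_counts()
```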
@@ -1532,8 +1532,8 @@ Imbalanced data with 7 nearest neighbors to a new observation highlighted.
 +++
 
 {numref}`fig:05-upsample-2` shows what happens if we set the background color of
-each area of the plot to the predictions the K-nearest neighbors
-classifier would make. We can see that the decision is
+each area of the plot to the prediction the K-nearest neighbors
+classifier would make for a new observation at that location. We can see that the decision is
 always "benign," corresponding to the blue color.
 
 ```{code-cell} ipython3
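The reworded sentence describes the usual trick for such plots: predict the class at every point of a fine grid over the two predictors and use those predictions as the background color. A minimal sketch with made-up data and predictor names (nothing below comes from the commit itself):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data with two hypothetical predictors and a class label.
train = pd.DataFrame({
    "x1": [0.1, 0.2, 0.8, 0.9],
    "x2": [0.2, 0.1, 0.9, 0.8],
    "label": ["Benign", "Benign", "Malignant", "Malignant"],
})
knn = KNeighborsClassifier(n_neighbors=3).fit(train[["x1", "x2"]], train["label"])

# Predict the class at every point of a fine grid; each grid point's prediction
# determines the background color at that location in the plot.
xs, ys = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
grid = pd.DataFrame({"x1": xs.ravel(), "x2": ys.ravel()})
grid["prediction"] = knn.predict(grid[["x1", "x2"]])
```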

source/classification2.md

Lines changed: 26 additions & 23 deletions
@@ -159,7 +159,7 @@ it classified 3 malignant observations as benign, and 4 benign observations as
 malignant. The accuracy of this classifier is roughly
 89%, given by the formula
 
-$$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{1+57}{1+57+4+3} = 0.892$$
+$$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{1+57}{1+57+4+3} = 0.892.$$
 
 But we can also see that the classifier only identified 1 out of 4 total malignant
 tumors; in other words, it misclassified 75% of the malignant cases present in the
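As a quick check of the corrected formula, the arithmetic with the counts quoted in this hunk is:

```python
# 1 + 57 correct predictions, 4 + 3 incorrect ones (counts quoted in the text above).
correct = 1 + 57
total = 1 + 57 + 4 + 3
accuracy = correct / total
accuracy  # 0.8923..., i.e. roughly 89%
```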
@@ -279,7 +279,7 @@ are completely determined by a
 but is actually totally reproducible. As long as you pick the same seed
 value, you get the same result!
 
-```{index} sample; numpy.random.choice
+```{index} sample, to_list
 ```
 
 Let's use an example to investigate how randomness works in Python. Say we
@@ -291,6 +291,8 @@ Below we use the seed number `1`. At
 that point, Python will keep track of the randomness that occurs throughout the code.
 For example, we can call the `sample` method
 on the series of numbers, passing the argument `n=10` to indicate that we want 10 samples.
+The `to_list` method converts the resulting series into a basic Python list to make
+the output easier to read.
 
 ```{code-cell} ipython3
 import numpy as np
@@ -300,7 +302,7 @@ np.random.seed(1)
 
 nums_0_to_9 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
 
-random_numbers1 = nums_0_to_9.sample(n=10).to_numpy()
+random_numbers1 = nums_0_to_9.sample(n=10).to_list()
 random_numbers1
 ```
 You can see that `random_numbers1` is a list of 10 numbers
@@ -309,7 +311,7 @@ we run the `sample` method again,
 we will get a fresh batch of 10 numbers that also look random.
 
 ```{code-cell} ipython3
-random_numbers2 = nums_0_to_9.sample(n=10).to_numpy()
+random_numbers2 = nums_0_to_9.sample(n=10).to_list()
 random_numbers2
 ```
 
@@ -319,12 +321,12 @@ as before---and then call the `sample` method again.
 
 ```{code-cell} ipython3
 np.random.seed(1)
-random_numbers1_again = nums_0_to_9.sample(n=10).to_numpy()
+random_numbers1_again = nums_0_to_9.sample(n=10).to_list()
 random_numbers1_again
 ```
 
 ```{code-cell} ipython3
-random_numbers2_again = nums_0_to_9.sample(n=10).to_numpy()
+random_numbers2_again = nums_0_to_9.sample(n=10).to_list()
 random_numbers2_again
 ```
 
@@ -336,21 +338,21 @@ obtain a different sequence of random numbers.
 
 ```{code-cell} ipython3
 np.random.seed(4235)
-random_numbers = nums_0_to_9.sample(n=10).to_numpy()
-random_numbers
+random_numbers1_different = nums_0_to_9.sample(n=10).to_list()
+random_numbers1_different
 ```
 
 ```{code-cell} ipython3
-random_numbers = nums_0_to_9.sample(n=10).to_numpy()
-random_numbers
+random_numbers2_different = nums_0_to_9.sample(n=10).to_list()
+random_numbers2_different
 ```
 
 In other words, even though the sequences of numbers that Python is generating *look*
 random, they are totally determined when we set a seed value!
 
 So what does this mean for data analysis? Well, `sample` is certainly not the
-only data frame method that uses randomness in Python. Many of the functions
-that we use in `scikit-learn`, `pandas`, and beyond use randomness—many
+only place where randomness is used in Python. Many of the functions
+that we use in `scikit-learn` and beyond use randomness—some
 of them without even telling you about it. Also note that when Python starts
 up, it creates its own seed to use. So if you do not explicitly
 call the `np.random.seed` function, your results
@@ -387,22 +389,23 @@ reproducible.
 In this book, we will generally only use packages that play nicely with `numpy`'s
 default random number generator, so we will stick with `np.random.seed`.
 You can achieve more careful control over randomness in your analysis
-by creating a `numpy` [`RandomState` object](https://numpy.org/doc/1.16/reference/generated/numpy.random.RandomState.html)
+by creating a `numpy` [`Generator` object](https://numpy.org/doc/stable/reference/random/generator.html)
 once at the beginning of your analysis, and passing it to
 the `random_state` argument that is available in many `pandas` and `scikit-learn`
-functions. Those functions will then use your `RandomState` to generate random numbers instead of
-`numpy`'s default generator. For example, we can reproduce our earlier example by using a `RandomState`
+functions. Those functions will then use your `Generator` to generate random numbers instead of
+`numpy`'s default generator. For example, we can reproduce our earlier example by using a `Generator`
 object with the `seed` value set to 1; we get the same lists of numbers once again.
 ```{code}
-rnd = np.random.RandomState(seed=1)
-random_numbers1_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy()
+from numpy.random import Generator, PCG64
+rng = Generator(PCG64(seed=1))
+random_numbers1_third = nums_0_to_9.sample(n=10, random_state=rng).to_list()
 random_numbers1_third
 ```
 ```{code}
 array([2, 9, 6, 4, 0, 3, 1, 7, 8, 5])
 ```
 ```{code}
-random_numbers2_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy()
+random_numbers2_third = nums_0_to_9.sample(n=10, random_state=rng).to_list()
 random_numbers2_third
 ```
 ```{code}
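A side note, not part of this commit: `numpy` also offers `np.random.default_rng`, a convenience constructor that returns a `PCG64`-backed `Generator`, and `pandas` accepts it through the same `random_state` argument:

```python
import numpy as np
import pandas as pd

nums_0_to_9 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# default_rng(1) is shorthand for a Generator backed by PCG64 seeded with 1.
rng = np.random.default_rng(1)
nums_0_to_9.sample(n=10, random_state=rng).to_list()
```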
@@ -1830,7 +1833,7 @@ summary_df = pd.DataFrame(
 )
 plt_irrelevant_accuracies = (
     alt.Chart(summary_df)
-    .mark_line() #point=True
+    .mark_line(point=True)
     .encode(
         x=alt.X("ks", title="Number of Irrelevant Predictors"),
         y=alt.Y(
@@ -1864,12 +1867,12 @@ this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls of
 
 plt_irrelevant_nghbrs = (
     alt.Chart(summary_df)
-    .mark_line() # point=True
+    .mark_line(point=True)
     .encode(
         x=alt.X("ks", title="Number of Irrelevant Predictors"),
         y=alt.Y(
             "nghbrs",
-            title="Number of neighbors",
+            title="Tuned number of neighbors",
         ),
     )
 )
@@ -1894,7 +1897,7 @@ plt_irrelevant_nghbrs_fixed = (
     alt.Chart(
         melted_summary_df
     )
-    .mark_line() # point=True
+    .mark_line(point=True)
     .encode(
         x=alt.X("ks", title="Number of Irrelevant Predictors"),
         y=alt.Y(
@@ -2134,7 +2137,7 @@ where the elbow occurs, and whether adding a variable provides a meaningful incr
 
 fwd_sel_accuracies_plot = (
     alt.Chart(accuracies)
-    .mark_line() # point=True
+    .mark_line(point=True)
     .encode(
         x=alt.X("size", title="Number of Predictors"),
         y=alt.Y(
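The four chart hunks above all make the same change: the previously commented-out `point=True` option is now passed to `mark_line`, so each chart draws a marker at every data point on top of the connecting line. A tiny self-contained illustration with made-up values:

```python
import altair as alt
import pandas as pd

demo_df = pd.DataFrame({"ks": [0, 5, 10, 15], "accuracy": [0.92, 0.88, 0.85, 0.80]})  # made-up values

demo_chart = (
    alt.Chart(demo_df)
    .mark_line(point=True)  # line plus a visible point at each observation
    .encode(
        x=alt.X("ks", title="Number of Irrelevant Predictors"),
        y=alt.Y("accuracy", title="Estimated Accuracy"),
    )
)
```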

source/clustering.md

Lines changed: 3 additions & 3 deletions
@@ -320,7 +320,7 @@ improves it by making adjustments to the assignment of data
 to clusters until it cannot improve any further. But how do we measure
 the "quality" of a clustering, and what does it mean to improve it?
 In K-means clustering, we measure the quality of a cluster by its
-*within-cluster sum-of-squared-distances* (WSSD), also called *intertia*. Computing this involves two steps.
+*within-cluster sum-of-squared-distances* (WSSD), also called *inertia*. Computing this involves two steps.
 First, we find the cluster centers by computing the mean of each variable
 over data points in the cluster. For example, suppose we have a
 cluster containing four observations, and we are using two variables, $x$ and $y$, to cluster the data.
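The two steps described in this hunk (find the cluster center, then sum squared distances to it) can be written out directly; the four observations below are made up for illustration:

```python
import pandas as pd

# A made-up cluster of four observations measured on two variables, x and y.
cluster = pd.DataFrame({"x": [1.0, 2.0, 2.0, 3.0], "y": [1.0, 1.0, 3.0, 3.0]})

center = cluster.mean()                                    # step 1: cluster center (mean of each variable)
squared_distances = ((cluster - center) ** 2).sum(axis=1)  # squared distance of each observation to the center
wssd = squared_distances.sum()                             # step 2: within-cluster sum of squared distances
wssd  # 6.0 for this toy cluster
```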
@@ -608,7 +608,7 @@ First three iterations of K-means clustering on the `penguins_standardized` exam
 +++
 
 Note that at this point, we can terminate the algorithm since none of the assignments changed
-in the fourth iteration; both the centers and labels will remain the same from this point onward.
+in the third iteration; both the centers and labels will remain the same from this point onward.
 
 ```{index} K-means; termination
 ```
@@ -949,7 +949,7 @@ For example,
 we could square all the numbers from 1-4 and store them in a list:
 
 ```{code-cell} ipython3
-[number ** 2 for number in range(1, 5)]
+[number**2 for number in range(1, 5)]
 ```
 
 Next, we will use this approach to compute the WSSD for the K-values 1 through 9.
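One plausible shape for that follow-up computation uses scikit-learn's `KMeans`, whose `inertia_` attribute is the total WSSD; the data frame below is a made-up stand-in for the chapter's `penguins_standardized`, which is not part of this commit:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up stand-in for the standardized penguin measurements.
penguins_standardized = pd.DataFrame({
    "bill_length_standardized": [-1.1, -0.9, -1.0, 0.9, 1.0, 1.1],
    "flipper_length_standardized": [-1.0, -1.1, -0.9, 1.1, 0.9, 1.0],
})

# Collect the total WSSD for a range of K values with the list-comprehension
# pattern shown above (only K = 1 through 4 here, since the toy data are tiny).
wssds = [
    KMeans(n_clusters=k, n_init=10).fit(penguins_standardized).inertia_
    for k in range(1, 5)
]
wssds
```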

source/img/classification2/ML-paradigm-test.ai

Lines changed: 1121 additions & 1077 deletions
Large diff not rendered by default (15 KB).
