@@ -159,7 +159,7 @@ it classified 3 malignant observations as benign, and 4 benign observations as
malignant. The accuracy of this classifier is roughly
89%, given by the formula

- $$ \mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{1+57}{1+57+4+3} = 0.892 $$
+ $$ \mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{1+57}{1+57+4+3} = 0.892. $$

But we can also see that the classifier only identified 1 out of 4 total malignant
tumors; in other words, it misclassified 75% of the malignant cases present in the
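The arithmetic behind these two claims can be checked with a quick sketch (not part of the book's own code); the counts are the ones quoted in the surrounding text:

```python
# Confusion-matrix counts quoted in the text above.
tp, tn = 1, 57   # malignant / benign cases predicted correctly
fn, fp = 3, 4    # malignant predicted benign / benign predicted malignant

# Accuracy: correct predictions over all predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 3))  # 0.892

# Accuracy alone hides that only 1 of the 4 malignant tumors was found,
# i.e. 75% of the malignant cases were missed.
malignant_found = tp / (tp + fn)
print(malignant_found)  # 0.25
```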
@@ -279,7 +279,7 @@ are completely determined by a
but is actually totally reproducible. As long as you pick the same seed
value, you get the same result!

- ```{index} sample; numpy.random.choice
+ ```{index} sample, to_list
```

Let's use an example to investigate how randomness works in Python. Say we
@@ -291,6 +291,8 @@ Below we use the seed number `1`. At
that point, Python will keep track of the randomness that occurs throughout the code.
For example, we can call the `sample` method
on the series of numbers, passing the argument `n=10` to indicate that we want 10 samples.
+ The `to_list` method converts the resulting series into a basic Python list to make
+ the output easier to read.

```{code-cell} ipython3
import numpy as np
@@ -300,7 +302,7 @@ np.random.seed(1)

nums_0_to_9 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

- random_numbers1 = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers1 = nums_0_to_9.sample(n=10).to_list()
random_numbers1
```
You can see that `random_numbers1` is a list of 10 numbers
@@ -309,7 +311,7 @@ we run the `sample` method again,
we will get a fresh batch of 10 numbers that also look random.

```{code-cell} ipython3
- random_numbers2 = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers2 = nums_0_to_9.sample(n=10).to_list()
random_numbers2
```
@@ -319,12 +321,12 @@ as before---and then call the `sample` method again.

```{code-cell} ipython3
np.random.seed(1)
- random_numbers1_again = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers1_again = nums_0_to_9.sample(n=10).to_list()
random_numbers1_again
```

```{code-cell} ipython3
- random_numbers2_again = nums_0_to_9.sample(n=10).to_numpy()
+ random_numbers2_again = nums_0_to_9.sample(n=10).to_list()
random_numbers2_again
```
@@ -336,21 +338,21 @@ obtain a different sequence of random numbers.

```{code-cell} ipython3
np.random.seed(4235)
- random_numbers = nums_0_to_9.sample(n=10).to_numpy()
- random_numbers
+ random_numbers1_different = nums_0_to_9.sample(n=10).to_list()
+ random_numbers1_different
```

```{code-cell} ipython3
- random_numbers = nums_0_to_9.sample(n=10).to_numpy()
- random_numbers
+ random_numbers2_different = nums_0_to_9.sample(n=10).to_list()
+ random_numbers2_different
```

In other words, even though the sequences of numbers that Python is generating *look*
random, they are totally determined when we set a seed value!

So what does this mean for data analysis? Well, `sample` is certainly not the
- only data frame method that uses randomness in Python. Many of the functions
- that we use in `scikit-learn`, `pandas`, and beyond use randomness&mdash;many
+ only place where randomness is used in Python. Many of the functions
+ that we use in `scikit-learn` and beyond use randomness&mdash;some
of them without even telling you about it. Also note that when Python starts
up, it creates its own seed to use. So if you do not explicitly
call the `np.random.seed` function, your results
@@ -387,22 +389,23 @@ reproducible.
In this book, we will generally only use packages that play nicely with `numpy`'s
default random number generator, so we will stick with `np.random.seed`.
You can achieve more careful control over randomness in your analysis
- by creating a `numpy` [`RandomState` object](https://numpy.org/doc/1.16/reference/generated/numpy.random.RandomState.html)
+ by creating a `numpy` [`Generator` object](https://numpy.org/doc/stable/reference/random/generator.html)
once at the beginning of your analysis, and passing it to
the `random_state` argument that is available in many `pandas` and `scikit-learn`
- functions. Those functions will then use your `RandomState` to generate random numbers instead of
- `numpy`'s default generator. For example, we can reproduce our earlier example by using a `RandomState`
+ functions. Those functions will then use your `Generator` to generate random numbers instead of
+ `numpy`'s default generator. For example, we can reproduce our earlier example by using a `Generator`
object with the `seed` value set to 1; we get the same lists of numbers once again.

```{code}
- rnd = np.random.RandomState(seed=1)
- random_numbers1_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy()
+ from numpy.random import Generator, PCG64
+ rng = Generator(PCG64(seed=1))
+ random_numbers1_third = nums_0_to_9.sample(n=10, random_state=rng).to_list()
random_numbers1_third
```
```{code}
array([2, 9, 6, 4, 0, 3, 1, 7, 8, 5])
```
```{code}
- random_numbers2_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy()
+ random_numbers2_third = nums_0_to_9.sample(n=10, random_state=rng).to_list()
random_numbers2_third
```
```{code}
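A minimal sketch of the same idea, assuming a recent `numpy` and a `pandas` version that accepts a `Generator` as `random_state`: `np.random.default_rng(seed)` is a shorthand that constructs an equivalent `Generator`, and rebuilding one with the same seed replays the same draws.

```python
import numpy as np
import pandas as pd

nums_0_to_9 = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# default_rng(seed) builds the same kind of Generator object as
# Generator(PCG64(seed)), just with less typing.
rng = np.random.default_rng(seed=1)
sample_a = nums_0_to_9.sample(n=10, random_state=rng).to_list()

# A fresh Generator with the same seed produces the same sample.
rng = np.random.default_rng(seed=1)
sample_b = nums_0_to_9.sample(n=10, random_state=rng).to_list()

print(sample_a == sample_b)  # True
```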
@@ -1830,7 +1833,7 @@ summary_df = pd.DataFrame(
)
plt_irrelevant_accuracies = (
alt.Chart(summary_df)
- .mark_line() # point=True
+ .mark_line(point=True)
.encode(
x=alt.X("ks", title="Number of Irrelevant Predictors"),
y=alt.Y(
@@ -1864,12 +1867,12 @@ this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls of
plt_irrelevant_nghbrs = (
alt.Chart(summary_df)
- .mark_line() # point=True
+ .mark_line(point=True)
.encode(
x=alt.X("ks", title="Number of Irrelevant Predictors"),
y=alt.Y(
"nghbrs",
- title="Number of neighbors",
+ title="Tuned number of neighbors",
),
)
)
@@ -1894,7 +1897,7 @@ plt_irrelevant_nghbrs_fixed = (
alt.Chart(
melted_summary_df
)
- .mark_line() # point=True
+ .mark_line(point=True)
.encode(
x=alt.X("ks", title="Number of Irrelevant Predictors"),
y=alt.Y(
@@ -2134,7 +2137,7 @@ where the elbow occurs, and whether adding a variable provides a meaningful incr
fwd_sel_accuracies_plot = (
alt.Chart(accuracies)
- .mark_line() # point=True
+ .mark_line(point=True)
.encode(
x=alt.X("size", title="Number of Predictors"),
y=alt.Y(