Commit cf0d75c: Tom's edits of first wald_friedman lecture
1 parent: 7c9af26

1 file changed: lectures/wald_friedman.md (41 additions, 36 deletions)
@@ -44,9 +44,9 @@ In the spirit of {doc}`this earlier lecture <prob_meaning>`, the present lecture
 In this lecture, we describe Wald's formulation of the problem from the perspective of a statistician
 working within the Neyman-Pearson tradition of a frequentist statistician who thinks about testing hypotheses and consequently uses laws of large numbers to investigate limiting properties of particular statistics under a given **hypothesis**, i.e., a vector of **parameters** that pins down a particular member of a manifold of statistical models that interest the statistician.
 
-* From {doc}`this earlier lecture on frequentist and bayesian statistics<prob_meaning>`, please remember that a frequentist statistician routinely calculates functions of sequences of random variables, conditioning on a vector of parameters.
+* From {doc}`this lecture on frequentist and Bayesian statistics<prob_meaning>`, please remember that a frequentist statistician routinely calculates functions of sequences of random variables, conditioning on a vector of parameters.
 
-In {doc}`this sequel <wald_friedman_2>` we'll discuss another formulation that adopts the perspective of a **Bayesian statistician** who views parameters as vectors of random variables that are jointly distributed with observable variables that he is concerned about.
+In {doc}`this related lecture <wald_friedman_2>` we'll discuss another formulation that adopts the perspective of a **Bayesian statistician** who views parameters as random variables that are jointly distributed with observable variables that he is concerned about.
 
 Because we are taking a frequentist perspective that is concerned about relative frequencies conditioned on alternative parameter values, i.e.,
 alternative **hypotheses**, key ideas in this lecture
@@ -85,7 +85,7 @@ during World War II, when they worked at the US Government's
 Statistical Research Group at Columbia University.
 
 ```{note}
-See pages 25 and 26 of Allen Wallis's 1980 article {cite}`wallis1980statistical` about the Statistical Research Group at Columbia University during World War II for his account of the episode and for important contributions that Harold Hotelling made to formulating the problem. Also see chapter 5 of Jennifer Burns book about
+See pages 25 and 26 of Allen Wallis's 1980 article {cite}`wallis1980statistical` about the Statistical Research Group at Columbia University during World War II for his account of the episode and for important contributions that Harold Hotelling made to formulating the problem. Also see chapter 5 of Jennifer Burns' book about
 Milton Friedman {cite}`Burns_2023`.
 ```
 
@@ -117,16 +117,18 @@ Let's listen to Milton Friedman tell us what happened
 > because it is obviously superior beyond what was hoped for
 > $\ldots$.
 
-Friedman and Wallis worked on the problem but, after realizing that
-they were not able to solve it, they told Abraham Wald about the problem.
+Friedman and Wallis worked on the problem for a while but didn't completely solve it.
 
-That started Wald on the path that led him to *Sequential Analysis* {cite}`Wald47`.
+Realizing that, they told Abraham Wald about the problem.
+
+That set Wald on a path that led him to create *Sequential Analysis* {cite}`Wald47`.
 
 ## Neyman-Pearson Formulation
 
 It is useful to begin by describing the theory underlying the test
-that Navy Captain G. S. Schuyler had been told to use and that led him
-to approach Milton Friedman and Allan Wallis to convey his conjecture
+that the U.S. Navy told Captain G. S. Schuyler to use.
+
+Captain Schuyler's doubts motivated him to tell Milton Friedman and Allen Wallis his conjecture
 that superior practical procedures existed.
 
 Evidently, the Navy had told Captain Schuyler to use what was then a state-of-the-art
@@ -275,7 +277,7 @@ Here is how Wald introduces the notion of a sequential test
 
 ## Wald's Sequential Formulation
 
-In contradistinction to Neyman and Pearson's formulation of the problem, in Wald's formulation
+By way of contrast to Neyman and Pearson's formulation of the problem, in Wald's formulation
 
 
 - The sample size $n$ is not fixed but rather a random variable.
@@ -296,7 +298,7 @@ The density of a Beta probability distribution with parameters $a$ and $b$ is
 $$
 f(z; a, b) = \frac{\Gamma(a+b) z^{a-1} (1-z)^{b-1}}{\Gamma(a) \Gamma(b)}
 \quad \text{where} \quad
-\Gamma(t) := \int_{0}^{\infty} x^{t-1} e^{-x} dx
+\Gamma(p) := \int_{0}^{\infty} x^{p-1} e^{-x} dx
 $$
 
 The next figure shows two beta distributions.
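As a quick sanity check on this formula (a minimal sketch, not part of the lecture's code; it assumes `numpy` and `scipy` are available and the parameter values are purely illustrative), one can evaluate the density directly from the Gamma-function expression and compare it with `scipy.stats.beta.pdf`:

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import beta

def beta_density(z, a, b):
    # Beta density written exactly as in the formula above
    return gamma(a + b) * z**(a - 1) * (1 - z)**(b - 1) / (gamma(a) * gamma(b))

z = np.linspace(0.05, 0.95, 5)
a, b = 2.0, 5.0                      # illustrative parameters only
print(beta_density(z, a, b))
print(beta.pdf(z, a, b))             # agrees with the hand-rolled formula
```
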
@@ -363,19 +365,17 @@ chooses among three distinct actions:
 $z_{k+1}$
 
 
-Wald proceeds as follows.
-
-He defines
+Wald defines
 
 - $p_{0m} = f_0(z_1) \cdots f_0(z_m)$
 - $p_{1m} = f_1(z_1) \cdots f_1(z_m)$
 - $L_{m} = \frac{p_{1m}}{p_{0m}}$
 
 Here $\{L_m\}_{m=0}^\infty$ is a **likelihood ratio process**.
 
-One of Wald's sequential decision rule is parameterized by two real numbers $B < A$.
+Wald's sequential decision rule is parameterized by real numbers $B < A$.
 
-For a given pair $A, B$ the decision rule is
+For a given pair $A, B$, the decision rule is
 
 $$
 \begin{aligned}
@@ -429,12 +429,14 @@ In particular, Wald constructs a mathematical argument that leads him to conclude
 > the number of observations required by the test.
 
 
+We'll write some Python code to help us illustrate Wald's claims about how $\alpha$ and $\beta$ are related to the parameters $A$ and $B$
+that characterize his sequential probability ratio test.
 
 ## Simulations
 
-In this section, we experiment with different distributions $f_0$ and $f_1$ to examine how Wald's test performs under various conditions.
+We experiment with different distributions $f_0$ and $f_1$ to examine how Wald's test performs under various conditions.
 
-The goal of these simulations is to understand trade-offs between decision speed and accuracy associated with Wald's **sequential probability ratio test**.
+Our goal in conducting these simulations is to understand trade-offs between decision speed and accuracy associated with Wald's **sequential probability ratio test**.
 
 Specifically, we will watch how:
 
@@ -457,9 +459,9 @@ SPRTParams = namedtuple('SPRTParams',
 
 Now we can run the simulation following Wald's recommendation.
 
-We use the log-likelihood ratio and compare it to the logarithms of the thresholds $\log(A)$ and $\log(B)$.
+We'll compare the log-likelihood ratio to logarithms of the thresholds $\log(A)$ and $\log(B)$.
 
-Below is the algorithm for the simulation.
+The following algorithm underlies our simulations.
 
 1. Compute thresholds $A = \frac{1-\beta}{\alpha}$, $B = \frac{\beta}{1-\alpha}$ and work with $\log A$, $\log B$.
 
589591
590592
As anticipated in the passage above in which Wald discussed the quality of
591593
$a(\alpha, \beta), b(\alpha, \beta)$ given in approximation {eq}`eq:Waldrule`,
592-
we find that the algorithm "overshoots" the error rates by giving us a
593-
lower type I and type II error rates than the target values.
594+
we find that the algorithm actually gives
595+
**lower** type I and type II error rates than the target values.
594596
595597
```{note}
596598
For recent work on the quality of approximation {eq}`eq:Waldrule`, see, e.g., {cite}`fischer2024improving`.
@@ -624,11 +626,11 @@ axes[1].set_ylabel("frequency")
 plt.show()
 ```
 
-In this simple case, the stopping time stays below 10.
+In this example, the stopping time stays below 10.
 
-We can also examine a $2 \times 2$ "confusion matrix" whose diagonal elements
-show the number of times when Wald's rule results in correct acceptance and
-rejection of the null hypothesis.
+We can construct a $2 \times 2$ "confusion matrix" whose diagonal elements
+count the number of times that Wald's decision rule correctly accepts and
+rejects the null hypothesis.
 
 ```{code-cell} ipython3
 # Accept H0 when H0 is true (correct)
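The lecture's own code cell (truncated in this hunk) tallies those counts from the simulation output. As a stand-alone illustration with made-up decision vectors (the array names and values below are hypothetical, not the lecture's `results` object), the matrix can be assembled like this:

```python
import numpy as np

# hypothetical decisions coded as 0 = accept H0, 1 = reject H0
decisions_h0_true = np.array([0, 0, 1, 0, 0])   # data generated under H0
decisions_h1_true = np.array([1, 1, 0, 1, 1])   # data generated under H1

confusion = np.array([
    [(decisions_h0_true == 0).sum(), (decisions_h0_true == 1).sum()],  # correct accept, type I error
    [(decisions_h1_true == 0).sum(), (decisions_h1_true == 1).sum()],  # type II error, correct reject
])
print(confusion)   # diagonal entries count the correct decisions
```
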
@@ -769,21 +771,24 @@ plot_sprt_results(results_2, params_2)
 plot_sprt_results(results_3, params_3)
 ```
 
-We can see a clear pattern in the stopping times and how close "separated" the two distributions are.
+Notice that the stopping times are shorter when the two distributions are farther apart.
+
+This makes sense.
+
+When two distributions are "far apart", it should not take too long to decide which one is generating the data.
 
-We can link this to the discussion of [Kullback–Leibler divergence](rel_entropy) in {doc}`likelihood_ratio_process`.
+When two distributions are "close", it should take longer to decide which one is generating the data.
 
-Intuitively, KL divergence is large when the distribution from one distribution to another is
-large.
+It is tempting to link this pattern to our discussion of [Kullback–Leibler divergence](rel_entropy) in {doc}`likelihood_ratio_process`.
 
-When two distributions are "far apart", it should not take long to decide which one is generating the data.
+While KL divergence is larger when two distributions differ more, KL divergence is not symmetric, meaning that the KL divergence of distribution $f$ from distribution $g$ is not necessarily equal to the KL
+divergence of $g$ from $f$.
 
-When two distributions are "close" to each other, it takes longer to decide which one is generating the data.
+If we want a symmetric measure of divergence that is actually a metric, we can instead use [Jensen-Shannon distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html).
 
-However, KL divergence is not symmetric, meaning that the divergence from one distribution to another is not necessarily the same as the reverse.
+That is what we shall do now.
 
-To measure the discrepancy between two distributions, we use a metric
-called [Jensen-Shannon distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html) and plot it against the average stopping times.
+We shall compute Jensen-Shannon distance and plot it against the average stopping times.
 
 ```{code-cell} ipython3
 def kl_div(h, f):
@@ -843,7 +848,7 @@ plt.show()
 
 The plot demonstrates a clear negative correlation between relative entropy and mean stopping time.
 
-As the KL divergence increases (distributions become more separated), the mean stopping time decreases exponentially.
+As Jensen-Shannon divergence increases (distributions become more separated), the mean stopping time decreases exponentially.
 
 Below are sampled examples from the experiments above
@@ -972,7 +977,7 @@ plot_likelihood_paths(params_3, n_highlight=10, n_background=100)
 
 Next, let's adjust the decision thresholds $A$ and $B$ and examine how the mean stopping time and the type I and type II error rates change.
 
-In the code below, we break Wald's rule by adjusting the thresholds $A$ and $B$ using factors $A_f$ and $B_f$.
+In the code below, we adjust Wald's rule by modifying the thresholds $A$ and $B$ using factors $A_f$ and $B_f$.
 
 ```{code-cell} ipython3
 @njit(parallel=True)
